Blog Record
Purpose: digest excellent blog posts.
Spark
A step-by-step guide for debugging memory leaks in Spark Applications
A Spark memory-tuning blog, notable for how detailed it is. Disney had a Spark job that needed to be restarted every three days. Starting from the executor logs, they found two kinds of OOM. One was GC-related, which they fixed by enabling G1GC through a Spark configuration parameter:
spark.executor.extraJavaOptions: -XX:+UseG1GC
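For completeness, a minimal sketch of applying this setting programmatically when building the session (the app name is my placeholder, not from the post):

    // Hedged sketch: the same executor JVM flag set via the SparkSession
    // builder instead of a properties file. "streaming-job" is hypothetical.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("streaming-job")
      .config("spark.executor.extraJavaOptions", "-XX:+UseG1GC")
      .getOrCreate()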
This resolved one of the OOM problems. The post then details how they tracked down the other OOM, using Ganglia for memory monitoring, which left two possible conclusions (the first is sketched after the two quotes below):
This was a stateful job so maybe we were not clearing out the state over time.
A memory leak could have occurred.
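To make the first hypothesis concrete: in Structured Streaming, aggregation state is only evicted over time if the query defines a watermark. A hedged sketch of what "clearing out the state" looks like; the rate source and column names are my assumptions, not from the post:

    // Hedged sketch of hypothesis 1: without withWatermark, this stateful
    // aggregation keeps every window's state forever; the watermark lets
    // Spark drop state older than one hour.
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder().appName("state-demo").getOrCreate()
    import spark.implicits._

    // A built-in rate source stands in for the real input stream.
    val events = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "10")
      .load()
      .withColumnRenamed("timestamp", "eventTime")

    val counts = events
      .withWatermark("eventTime", "1 hour")
      .groupBy(window($"eventTime", "10 minutes"), ($"value" % 10).as("key"))
      .count()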
Then, by monitoring the Streaming Metrics, they reached a conclusion:
The conclusion: a memory leak occurred, and we needed to find it. To do so, we enabled the heap dump to see what is occupying so much memory.
They subsequently turned to heap dumps:
spark.executor.extraJavaOptions: -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/dbfs/heapDumps
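One practical note: spark.executor.extraJavaOptions is a single string, so the heap-dump flags have to be combined with the earlier GC flag rather than set in a second, overwriting config call. A minimal sketch (the app name is again a placeholder):

    // Hedged sketch: all executor JVM flags combined in one option string.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("streaming-job")
      .config("spark.executor.extraJavaOptions",
        "-XX:+UseG1GC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/dbfs/heapDumps")
      .getOrCreate()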
They then analyzed the heap dumps in both the Spark UI and YourKit, and found an inexplicably large number of HashMap instances, far more than the job's logic could account for. Digging into this problem:
A quick Google search and code analysis gave us our answer: we were not closing the connection correctly. The same issue has also been addressed on the aws-sdk Github issues.
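The post doesn't show the fix itself. As a hedged sketch of the general pattern with the Java AWS SDK v1 (S3 and the bucket/key are my assumptions; the actual service in the post may differ), the response object holds a pooled HTTP connection, and its internal maps can only be reclaimed once it is closed:

    // Hedged sketch of "closing the connection correctly": the S3 response
    // pins an HTTP connection until close() is called, so always close it
    // in a finally block (or read the stream fully). Names are illustrative.
    import com.amazonaws.services.s3.AmazonS3ClientBuilder
    import scala.io.Source

    val s3 = AmazonS3ClientBuilder.defaultClient()
    val obj = s3.getObject("example-bucket", "example-key")
    try {
      val body = Source.fromInputStream(obj.getObjectContent).mkString
      println(body.length)
    } finally {
      obj.close() // returns the underlying HTTP connection to the pool
    }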
Ha, so even Disney's engineers Googled their way to the answer; I was expecting some top-tier wizardry.