
Blog Note

Purpose: digest excellent blog posts

Spark

A step-by-step guide for debugging memory leaks in Spark Applications

A Spark memory-tuning blog post, notable for how detailed it is. Disney had a Spark job that needed to be restarted every three days. Starting from the executor logs, they found two different OOMs. One was caused by garbage collection and was fixed with the Spark setting spark.executor.extraJavaOptions: -XX:+UseG1GC (a minimal sketch of applying this follows the list below). The post then walks through debugging the other OOM, using Ganglia to monitor memory, which led to two conclusions:

  1. This was a stateful job so maybe we were not clearing out the state over time.

  2. A memory leak could have occurred.
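As referenced above, a minimal sketch of the first fix, assuming a Scala job that builds its own SparkSession (the app name is hypothetical). The executor option has to be set before the context is created, because it is applied when the executor JVMs are launched:

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession

    // Switch the executor JVMs to the G1 collector. This only takes
    // effect if set before the SparkContext/SparkSession is created.
    val conf = new SparkConf()
      .set("spark.executor.extraJavaOptions", "-XX:+UseG1GC")

    val spark = SparkSession.builder()
      .appName("memory-leak-demo") // hypothetical name
      .config(conf)
      .getOrCreate()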

Then, by monitoring the Streaming Metrics, they reached a conclusion:

The conclusion: a memory leak occurred, and we needed to find it. To do so, we enabled the heap dump to see what is occupying so much memory.

They then captured heap dumps by configuring:

spark.executor.extraJavaOptions: -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/dbfs/heapDumps
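Since spark.executor.extraJavaOptions is a single property, the heap-dump flags have to share one value with any GC flags. A sketch, keeping the /dbfs path (the Databricks mount) from the post:

    import org.apache.spark.SparkConf

    // A later .set() on the same key overwrites the earlier one, so the
    // GC and heap-dump flags must be combined into one string.
    val conf = new SparkConf()
      .set("spark.executor.extraJavaOptions",
        "-XX:+UseG1GC -XX:+HeapDumpOnOutOfMemoryError " +
        "-XX:HeapDumpPath=/dbfs/heapDumps")

Each executor that dies with an OutOfMemoryError then leaves a java_pid<pid>.hprof file under /dbfs/heapDumps for offline analysis.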

They then analyzed the heap dumps from both the Spark UI and YourKit, and found an inexplicably large number of HashMaps that made no sense for the job's logic. Investigating this:

A quick Google search and code analysis gave us our answer: we were not closing the connection correctly. The same issue has also been addressed on the aws-sdk Github issues.
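The post doesn't show the offending code, so here is a hypothetical Scala sketch of that class of bug, assuming an aws-sdk-java v1 S3 read (bucket and key names are made up): the response object wraps a pooled HTTP connection, and forgetting to close it leaks the connection and the buffers and bookkeeping structures hanging off it.

    import com.amazonaws.services.s3.AmazonS3ClientBuilder
    import scala.io.Source
    import scala.util.Using

    val s3 = AmazonS3ClientBuilder.defaultClient()

    // Leaky: the S3Object (and its S3ObjectInputStream) is never closed,
    // so the underlying HTTP connection is never released.
    // val leaked = Source.fromInputStream(
    //   s3.getObject("some-bucket", "some-key").getObjectContent).mkString

    // Fixed: S3Object is Closeable, and Using.resource closes it even
    // when the read throws (Scala 2.13+).
    val body = Using.resource(s3.getObject("some-bucket", "some-key")) { obj =>
      Source.fromInputStream(obj.getObjectContent).mkString
    }

In a long-running streaming job the client itself should also be reused across batches rather than rebuilt per task, and shut down with s3.shutdown() once it is no longer needed.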

Ha, so even Disney's engineers Googled their way to the answer; I had expected some top-tier wizardry.