Over the past couple of days I've been using Azure Databricks to create long-running clusters and execute lots of .NET for Spark jobs on them. The REST API that I use is the Runs Submit endpoint with a Spark JAR task; a rough sketch of the call is below.
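For reference, the submission looks roughly like this (the workspace URL, token, cluster ID, jar path, main class, and arguments are placeholders for my actual values):

```bash
# Rough sketch of the Jobs runs/submit call with a spark_jar_task.
# Everything in angle brackets is a placeholder for my real values.
curl -X POST "https://<databricks-instance>/api/2.0/jobs/runs/submit" \
  -H "Authorization: Bearer $DATABRICKS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
        "run_name": "dotnet-spark-run",
        "existing_cluster_id": "<cluster-id>",
        "libraries": [ { "jar": "dbfs:/path/to/my-app.jar" } ],
        "spark_jar_task": {
          "main_class_name": "<main-class>",
          "parameters": [ "<app-args>" ]
        }
      }'
```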
I notice that my .NET processes all seem to clean up after themselves, but there is a massive Java process on the driver node that grows bigger and bigger until the whole cluster becomes unstable. Here is an example of how I check it:

ps -o pid,user,vsz,rss,comm -p 2075

Notice that the RSS is 46 GB (this is hosted on a DS5_v2 with 56 GB of memory and 16 cores). jps identifies the process as:

2075 DriverDaemon

I'm running these commands through the %sh functionality in Databricks notebooks and have confirmed that they run on the driver node.

Is this a known issue? Is there a workaround that would allow me to free some of this memory? I already plan to cycle the cluster every half hour, but the memory leak I'm seeing is so fast that I may need to do it more often than that. Any help would be appreciated, either with the original problem or with a workaround that would let me detect a saturated cluster and cycle it prematurely (a rough sketch of what I have in mind is at the end of this post).

As an aside, I've noticed that Databricks clusters are different from the standalone ones I use on my local workstation. There is only one "application" in a Databricks all-purpose cluster, and it seems to be reused for all jobs/runs. I was contemplating a way to recreate this issue on my own workstation, but I think the Azure Databricks technology is substantially different; it is not very similar to a basic installation of Apache Spark, and I'm not confident that I could recreate this exact scenario. But perhaps it would be analogous to run my entire driver program in a loop 1,000 times within a single application on a standalone cluster. Would that be a reasonable comparison?
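For the detection workaround, this is roughly what I had in mind: a %sh cell (or a small script on the driver) that reads the DriverDaemon's resident set size and flags the cluster for cycling once it crosses a threshold. The 40 GB cutoff below is an arbitrary choice for a 56 GB DS5_v2 driver:

```bash
#!/bin/bash
# Sketch of a saturation check to run from a %sh cell on the driver node.
# The 40 GB threshold is an arbitrary cutoff for a 56 GB DS5_v2 driver.
THRESHOLD_KB=$((40 * 1024 * 1024))

# Find the DriverDaemon PID with jps, then read its resident set size in KB.
PID=$(jps | awk '/DriverDaemon/ {print $1}')
if [ -z "$PID" ]; then
  echo "DriverDaemon process not found"
  exit 0
fi
RSS_KB=$(ps -o rss= -p "$PID" | tr -d ' ')

if [ "$RSS_KB" -gt "$THRESHOLD_KB" ]; then
  echo "DriverDaemon RSS is ${RSS_KB} KB -- cluster looks saturated, time to cycle it"
  exit 1
fi
echo "DriverDaemon RSS is ${RSS_KB} KB -- still under the threshold"
```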
Replies: 1 comment 3 replies
What version of .NET for Spark are you using? There was a fix for a memory leak in JVMObjectTracker (#801) that was merged and included in v1.1.1 that may be related.
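If you're on an older release, moving to the patched package should just be a version bump (plus redeploying any matching worker binaries you ship), for example:

```bash
# Bump the Microsoft.Spark NuGet package to the release that includes the fix.
dotnet add package Microsoft.Spark --version 1.1.1
```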