Over the past couple of days I've been using Azure Databricks to create long-running clusters and execute lots of .NET for Spark jobs on them. The REST API that I use is the Runs Submit endpoint with a Spark JAR task; a rough sketch of the call is below.
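For reference, the submission looks roughly like this (the workspace URL, token, cluster ID, jar path, main class, and arguments are placeholders for my actual values):

```bash
# Rough sketch of the Jobs runs/submit call with a spark_jar_task.
# Everything in angle brackets is a placeholder for my real values.
curl -X POST "https://<databricks-instance>/api/2.0/jobs/runs/submit" \
  -H "Authorization: Bearer $DATABRICKS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
        "run_name": "dotnet-spark-run",
        "existing_cluster_id": "<cluster-id>",
        "libraries": [ { "jar": "dbfs:/path/to/my-app.jar" } ],
        "spark_jar_task": {
          "main_class_name": "<main-class>",
          "parameters": [ "<app-args>" ]
        }
      }'
```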
I notice that my .NET processes all seem to clean up after themselves, but there is a massive Java process on the driver node that grows bigger and bigger until the whole cluster becomes unstable. Here is an example of how I check it:

ps -o pid,user,vsz,rss,comm -p 2075

Notice that the RSS is 46 GB (this is hosted on a DS5_v2 with 56 GB of memory and 16 cores). jps identifies the process as:

2075 DriverDaemon

I'm running these commands through the %sh functionality in Databricks notebooks and have confirmed that they run on the driver node.

Is this a known issue? Is there a workaround that would allow me to free some of this memory? I already plan to cycle the cluster every half hour, but the memory leak I'm seeing is so fast that I may need to do it more often than that. Any help would be appreciated, either with the original problem or with a workaround that would let me detect a saturated cluster and cycle it prematurely (a rough sketch of what I have in mind is at the end of this post).

As an aside, I've noticed that Databricks clusters are different from the standalone ones I use on my local workstation. There is only one "application" in a Databricks all-purpose cluster, and it seems to be reused for all jobs/runs. I was contemplating a way to recreate this issue on my own workstation, but I think the Azure Databricks technology is substantially different; it is not very similar to a basic installation of Apache Spark, and I'm not confident that I could recreate this exact scenario. But perhaps it would be analogous to run my entire driver program in a loop 1,000 times within a single application on a standalone cluster. Would that be a reasonable comparison?
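For the detection workaround, this is roughly what I had in mind: a %sh cell (or a small script on the driver) that reads the DriverDaemon's resident set size and flags the cluster for cycling once it crosses a threshold. The 40 GB cutoff below is an arbitrary choice for a 56 GB DS5_v2 driver:

```bash
#!/bin/bash
# Sketch of a saturation check to run from a %sh cell on the driver node.
# The 40 GB threshold is an arbitrary cutoff for a 56 GB DS5_v2 driver.
THRESHOLD_KB=$((40 * 1024 * 1024))

# Find the DriverDaemon PID with jps, then read its resident set size in KB.
PID=$(jps | awk '/DriverDaemon/ {print $1}')
if [ -z "$PID" ]; then
  echo "DriverDaemon process not found"
  exit 0
fi
RSS_KB=$(ps -o rss= -p "$PID" | tr -d ' ')

if [ "$RSS_KB" -gt "$THRESHOLD_KB" ]; then
  echo "DriverDaemon RSS is ${RSS_KB} KB -- cluster looks saturated, time to cycle it"
  exit 1
fi
echo "DriverDaemon RSS is ${RSS_KB} KB -- still under the threshold"
```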
Replies: 1 comment 3 replies
What version of .NET for Spark are you using? There was a fix for a memory leak in JVMObjectTracker (#801) that was merged and included in v1.1.1 that may be related.
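If you're on an older release, moving to the patched package should just be a version bump (plus redeploying any matching worker binaries you ship), for example:

```bash
# Bump the Microsoft.Spark NuGet package to the release that includes the fix.
dotnet add package Microsoft.Spark --version 1.1.1
```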