Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages. #43

Open
IslandGod opened this issue Apr 6, 2017 · 3 comments


@IslandGod

I ran the command below (like the code from #38):
spark-submit --jars ~/GPUEnabler-master/gpu-enabler/target/gpu-enabler_2.11-1.0.0.jar --executor-cores=4 --total-executor-cores 60 --executor-memory 6g --class com.ibm.gpuenabler.SparkGPULR ~/GPUEnabler-master/examples/target/gpu-enabler-examples_2.11-1.0.0.jar spark://10.1.12.201:7077 60 1000000 400 5
and got an error when it reached GPU iteration 1; the following is my log.


WARN: This is a naive implementation of Logistic Regression and is given as an example!
Please use either org.apache.spark.mllib.classification.LogisticRegressionWithSGD or
org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
for more conventional use.
      
Data generation done                                                            
numSlices=60, N=1000000, D=400, ITERATIONS=5
GPU iteration 1
[Stage 1:=>                                                       (2 + 58) / 60]17/04/06 19:59:19 ERROR TaskSchedulerImpl: Lost executor 3 on 10.1.12.201: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
[Stage 1:=========>                                              (10 + 50) / 60]17/04/06 19:59:30 ERROR TaskSchedulerImpl: Lost executor 0 on 10.1.12.212: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/04/06 19:59:30 ERROR TaskSchedulerImpl: Lost executor 1 on 10.1.12.203: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/04/06 19:59:32 ERROR TaskSchedulerImpl: Lost executor 4 on 10.1.12.210: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
[Stage 1:=========>                                              (10 + 50) / 60]17/04/06 20:01:44 ERROR TaskSchedulerImpl: Lost executor 2 on 10.1.12.204: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/04/06 20:01:44 ERROR TaskSetManager: Task 17 in stage 1.0 failed 4 times; aborting job
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 17 in stage 1.0 failed 4 times, most recent failure: Lost task 17.3 in stage 1.0 (TID 152, 10.1.12.204): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1454)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1442)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1441)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1441)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1667)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1622)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1611)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1873)
	at com.ibm.gpuenabler.CUDARDDImplicits$CUDARDDFuncs.reduceExtFunc(CUDARDDUtils.scala:347)
	at com.ibm.gpuenabler.SparkGPULR$$anonfun$main$1.apply$mcVI$sp(SparkGPULR.scala:120)
	at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
	at com.ibm.gpuenabler.SparkGPULR$.main(SparkGPULR.scala:112)
	at com.ibm.gpuenabler.SparkGPULR.main(SparkGPULR.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:736)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Is it an out-of-memory error?
Another question: I am not sure whether it really runs on the cluster or just runs locally.
Please give me some suggestions or a solution. Any info will help, thank you!

@a-agrz

a-agrz commented Jul 13, 2017

Hi!
I had the same problem ("Exception in thread "main" ...") when I was trying to run Spark with 2 worker JVMs on the same machine.
I solved it by specifying the path to the GPUEnabler jar as described in the link below. So even when we are not working on a cluster, we need to specify the path of the GPUEnabler jar.

#38
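As a sketch of one way to do this (I am not certain this is exactly what #38 describes, and the jar paths below are placeholders to adjust to your own install):

spark-submit --jars /path/to/gpu-enabler_2.11-1.0.0.jar --conf spark.driver.extraClassPath=/path/to/gpu-enabler_2.11-1.0.0.jar --conf spark.executor.extraClassPath=/path/to/gpu-enabler_2.11-1.0.0.jar --class com.ibm.gpuenabler.SparkGPULR /path/to/gpu-enabler-examples_2.11-1.0.0.jar spark://<master-host>:7077 60 1000000 400 5

The point is that both the driver and every executor JVM can find the GPUEnabler classes, whether or not you are running on a real cluster.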

regards

Abdallah

@josiahsams
Member

@IslandGod, yes, it could be due to a CUDA memory error. You can check this in the executor logs from the Spark UI. Our implementation is limited by the GPU's memory. Reduce the input data size and try it again.
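For a rough, back-of-the-envelope sense of scale: N=1000000 points with D=400 double-precision features is about 1,000,000 × 400 × 8 bytes ≈ 3.2 GB of raw feature data, and how much of that ends up on each GPU depends on partitioning and caching. A quick way to test the memory theory is to rerun with a smaller N, for example (same command as in the original report, only N reduced from 1000000 to 100000):

spark-submit --jars ~/GPUEnabler-master/gpu-enabler/target/gpu-enabler_2.11-1.0.0.jar --executor-cores=4 --total-executor-cores 60 --executor-memory 6g --class com.ibm.gpuenabler.SparkGPULR ~/GPUEnabler-master/examples/target/gpu-enabler-examples_2.11-1.0.0.jar spark://10.1.12.201:7077 60 100000 400 5

If the smaller run completes, the failure is most likely the GPU running out of memory rather than a network issue.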

@sidtandon2014

I am also facing the same error. I have tried multiple things, like setting the property "spark.executor.memoryOverhead", but nothing works. Has anyone resolved this?
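For reference, setting it usually looks like this on the spark-submit command line (the 2g value is only an example):

spark-submit --conf spark.executor.memoryOverhead=2g ... --class com.ibm.gpuenabler.SparkGPULR ...

Note that, as far as I know, spark.executor.memoryOverhead only takes effect on YARN and Kubernetes deployments; on a standalone spark:// master like in the original report it is ignored, and if the executors are really dying from GPU out-of-memory, as suggested above, raising the JVM memory overhead would not be expected to help anyway.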
