Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages. #43

Open
IslandGod opened this issue Apr 6, 2017 · 3 comments


@IslandGod

I ran the command below (like the code from #38):
spark-submit --jars ~/GPUEnabler-master/gpu-enabler/target/gpu-enabler_2.11-1.0.0.jar --executor-cores=4 --total-executor-cores 60 --executor-memory 6g --class com.ibm.gpuenabler.SparkGPULR ~/GPUEnabler-master/examples/target/gpu-enabler-examples_2.11-1.0.0.jar spark://10.1.12.201:7077 60 1000000 400 5
and got an error when it reached GPU iteration 1; the following is my log.


WARN: This is a naive implementation of Logistic Regression and is given as an example!
Please use either org.apache.spark.mllib.classification.LogisticRegressionWithSGD or
org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
for more conventional use.
      
Data generation done                                                            
numSlices=60, N=1000000, D=400, ITERATIONS=5
GPU iteration 1
[Stage 1:=>                                                       (2 + 58) / 60]17/04/06 19:59:19 ERROR TaskSchedulerImpl: Lost executor 3 on 10.1.12.201: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
[Stage 1:=========>                                              (10 + 50) / 60]17/04/06 19:59:30 ERROR TaskSchedulerImpl: Lost executor 0 on 10.1.12.212: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/04/06 19:59:30 ERROR TaskSchedulerImpl: Lost executor 1 on 10.1.12.203: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/04/06 19:59:32 ERROR TaskSchedulerImpl: Lost executor 4 on 10.1.12.210: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
[Stage 1:=========>                                              (10 + 50) / 60]17/04/06 20:01:44 ERROR TaskSchedulerImpl: Lost executor 2 on 10.1.12.204: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/04/06 20:01:44 ERROR TaskSetManager: Task 17 in stage 1.0 failed 4 times; aborting job
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 17 in stage 1.0 failed 4 times, most recent failure: Lost task 17.3 in stage 1.0 (TID 152, 10.1.12.204): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1454)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1442)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1441)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1441)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1667)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1622)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1611)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1873)
	at com.ibm.gpuenabler.CUDARDDImplicits$CUDARDDFuncs.reduceExtFunc(CUDARDDUtils.scala:347)
	at com.ibm.gpuenabler.SparkGPULR$$anonfun$main$1.apply$mcVI$sp(SparkGPULR.scala:120)
	at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
	at com.ibm.gpuenabler.SparkGPULR$.main(SparkGPULR.scala:112)
	at com.ibm.gpuenabler.SparkGPULR.main(SparkGPULR.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:736)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Is it an out-of-memory error?
Another question: I am not sure whether it really runs on the cluster or just runs locally.
Please give me some suggestions or a solution. Any info will help, thank you!

@a-agrz

a-agrz commented Jul 13, 2017

Hi!
I had the same problem ("Exception in thread "main" ...") when I was trying to run Spark with 2 worker JVMs on the same machine.
I solved it by specifying the path to the GPUEnabler jar as described in the link below. So even when we are not working on a cluster, we need to specify the path of the GPUEnabler jar.

#38
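As a sketch of one way to do this (I am not certain this is exactly what #38 describes, and the jar paths below are placeholders to adjust to your own install):

spark-submit --jars /path/to/gpu-enabler_2.11-1.0.0.jar --conf spark.driver.extraClassPath=/path/to/gpu-enabler_2.11-1.0.0.jar --conf spark.executor.extraClassPath=/path/to/gpu-enabler_2.11-1.0.0.jar --class com.ibm.gpuenabler.SparkGPULR /path/to/gpu-enabler-examples_2.11-1.0.0.jar spark://<master-host>:7077 60 1000000 400 5

The point is that both the driver and every executor JVM can find the GPUEnabler classes, whether or not you are running on a real cluster.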

regards

Abdallah

@josiahsams
Member

@IslandGod, yes, it could be due to a CUDA memory error. You can check this in the executor logs from the Spark UI. Our implementation is limited by the GPU's memory. Reduce the input data size and try it again.
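For a rough, back-of-the-envelope sense of scale: N=1000000 points with D=400 double-precision features is about 1,000,000 × 400 × 8 bytes ≈ 3.2 GB of raw feature data, and how much of that ends up on each GPU depends on partitioning and caching. A quick way to test the memory theory is to rerun with a smaller N, for example (same command as in the original report, only N reduced from 1000000 to 100000):

spark-submit --jars ~/GPUEnabler-master/gpu-enabler/target/gpu-enabler_2.11-1.0.0.jar --executor-cores=4 --total-executor-cores 60 --executor-memory 6g --class com.ibm.gpuenabler.SparkGPULR ~/GPUEnabler-master/examples/target/gpu-enabler-examples_2.11-1.0.0.jar spark://10.1.12.201:7077 60 100000 400 5

If the smaller run completes, the failure is most likely the GPU running out of memory rather than a network issue.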

@sidtandon2014

I am also facing the same error. I have tried multiple things, like setting the property "spark.executor.memoryOverhead", but nothing works. Has anyone resolved this?
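For reference, setting it usually looks like this on the spark-submit command line (the 2g value is only an example):

spark-submit --conf spark.executor.memoryOverhead=2g ... --class com.ibm.gpuenabler.SparkGPULR ...

Note that, as far as I know, spark.executor.memoryOverhead only takes effect on YARN and Kubernetes deployments; on a standalone spark:// master like in the original report it is ignored, and if the executors are really dying from GPU out-of-memory, as suggested above, raising the JVM memory overhead would not be expected to help anyway.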
