I ran the command below (like the one from #38):
spark-submit --jars ~/GPUEnabler-master/gpu-enabler/target/gpu-enabler_2.11-1.0.0.jar --executor-cores=4 --total-executor-cores 60 --executor-memory 6g --class com.ibm.gpuenabler.SparkGPULR ~/GPUEnabler-master/examples/target/gpu-enabler-examples_2.11-1.0.0.jar spark://10.1.12.201:7077 60 1000000 400 5
and got an error when it ran GPU iteration 1; the following is my log.
WARN: This is a naive implementation of Logistic Regression and is given as an example!
Please use either org.apache.spark.mllib.classification.LogisticRegressionWithSGD or
org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
for more conventional use.
Data generation done
numSlices=60, N=1000000, D=400, ITERATIONS=5
GPU iteration 1
[Stage 1:=> (2 + 58) / 60]17/04/06 19:59:19 ERROR TaskSchedulerImpl: Lost executor 3 on 10.1.12.201: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
[Stage 1:=========> (10 + 50) / 60]17/04/06 19:59:30 ERROR TaskSchedulerImpl: Lost executor 0 on 10.1.12.212: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/04/06 19:59:30 ERROR TaskSchedulerImpl: Lost executor 1 on 10.1.12.203: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/04/06 19:59:32 ERROR TaskSchedulerImpl: Lost executor 4 on 10.1.12.210: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
[Stage 1:=========> (10 + 50) / 60]17/04/06 20:01:44 ERROR TaskSchedulerImpl: Lost executor 2 on 10.1.12.204: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/04/06 20:01:44 ERROR TaskSetManager: Task 17 in stage 1.0 failed 4 times; aborting job
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 17 in stage 1.0 failed 4 times, most recent failure: Lost task 17.3 in stage 1.0 (TID 152, 10.1.12.204): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1454)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1442)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1441)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1441)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1667)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1622)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1611)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1873)
at com.ibm.gpuenabler.CUDARDDImplicits$CUDARDDFuncs.reduceExtFunc(CUDARDDUtils.scala:347)
at com.ibm.gpuenabler.SparkGPULR$$anonfun$main$1.apply$mcVI$sp(SparkGPULR.scala:120)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
at com.ibm.gpuenabler.SparkGPULR$.main(SparkGPULR.scala:112)
at com.ibm.gpuenabler.SparkGPULR.main(SparkGPULR.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:736)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Is it an out-of-memory error?
Another question:
I am not quite sure whether it really runs on the cluster or just runs locally.
Please give me a suggestion or solution. Any info will help, thank you!
Hi!
I had the same problem ("Exception in thread "main" ...") when I was trying to run Spark with 2 worker JVMs on the same machine.
I solved it by specifying the path to the GPUEnabler jar as described in the link below. So even if we are not working on a cluster, we need to specify the path of the GPUEnabler jar.
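For concreteness, a sketch of a spark-submit invocation that makes the GPUEnabler jar visible on both the executors and the driver. The paths and master URL are the ones used earlier in this thread; adjust them for your machine. This is a configuration sketch, not a verified fix.

```shell
# Path to the GPUEnabler jar built from the repo (adjust as needed).
GPU_JAR=~/GPUEnabler-master/gpu-enabler/target/gpu-enabler_2.11-1.0.0.jar

# --jars ships the jar to every executor; --driver-class-path also puts it
# on the driver's classpath, which matters even on a single machine.
spark-submit \
  --jars "$GPU_JAR" \
  --driver-class-path "$GPU_JAR" \
  --class com.ibm.gpuenabler.SparkGPULR \
  ~/GPUEnabler-master/examples/target/gpu-enabler-examples_2.11-1.0.0.jar \
  spark://10.1.12.201:7077 60 1000000 400 5
```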
@IslandGod, yes, it could be due to a CUDA memory error. You can check for it in the executor logs from the Spark UI. Our implementation is limited by the GPU's memory. Reduce the input data size and try again.
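A back-of-envelope check helps decide whether plain data size can explain a GPU out-of-memory failure. This minimal sketch uses the N, D, and partition count from the command in this thread; the 8-bytes-per-double figure is an assumption about the example's data layout, and it ignores the extra buffers the kernel allocates.

```shell
# Parameters taken from the spark-submit command in this thread.
N=1000000            # number of points
D=400                # dimensions per point
BYTES_PER_DOUBLE=8   # assumed element size
PARTITIONS=60        # numSlices

TOTAL_BYTES=$((N * D * BYTES_PER_DOUBLE))      # whole dataset
PER_PARTITION=$((TOTAL_BYTES / PARTITIONS))    # roughly what one task moves to the GPU

echo "total: $((TOTAL_BYTES / 1024 / 1024)) MB, per partition: $((PER_PARTITION / 1024 / 1024)) MB"
# prints: total: 3051 MB, per partition: 50 MB
```

At ~50 MB per partition the raw input alone should fit on most GPUs, so if the executors still die, it is worth checking the executor stderr in the Spark UI for the actual failure before shrinking N further.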
I am also facing the same error. I tried multiple things, like setting the property "spark.executor.memoryOverhead", but nothing works. Has anyone resolved this?
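If the executors are being killed for exceeding memory limits, the knobs below are worth trying together rather than one at a time. This is a sketch only: the property names are standard Spark settings (on older Spark-on-YARN the overhead property was spark.yarn.executor.memoryOverhead), but the values are illustrative starting points, not tuned recommendations.

```shell
# Sketch: larger off-heap allowance, fewer concurrent tasks per executor
# (so fewer simultaneous CUDA allocations), and a smaller N with more
# partitions so each task's slice of the data shrinks.
spark-submit \
  --conf spark.executor.memoryOverhead=2g \
  --executor-memory 6g \
  --executor-cores 2 \
  --class com.ibm.gpuenabler.SparkGPULR \
  ~/GPUEnabler-master/examples/target/gpu-enabler-examples_2.11-1.0.0.jar \
  spark://10.1.12.201:7077 120 500000 400 5
```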