
Support for Heterogeneous environment #38

Merged
2 commits merged into IBMSparkGPU:master on Apr 21, 2017

Conversation

josiahsams
Member

This enhancement enables the GPUEnabler plugin to recognize the environment it runs in (feature #5). The environment can be any one of the following:

  1. Single GPU attached to the node.
  2. Multiple GPUs attached to the node.
  3. Multiple GPUs spread across the Spark cluster.
  4. A Spark cluster with a mix of nodes with and without GPUs attached.

By default, all Spark jobs submitted in local mode will use GPU device 0, if a GPU is attached to the node.

If the Spark job is submitted to a scheduler, then each executor will choose its GPU device based on the executor ID assigned to it by the scheduler. For example, on a node with 4 GPUs attached, a Spark job that demands 4 executors will spawn executors each assigned to an individual GPU and run the task to completion. If a Spark job demands 8 executors, each GPU will be assigned to 2 executors and the GPUs will be used in parallel.
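In effect, the chosen device index is the executor ID modulo the number of GPUs on the node. A minimal sketch of that mapping (illustrative only; the real selection happens inside GPUSparkEnv and may differ in detail):

    // Hypothetical sketch of the executor-to-GPU assignment described above.
    // `executorId` is the numeric ID handed out by the scheduler; `gpuCount`
    // would come from the CUDA driver (cuDeviceGetCount).
    def chooseGpuDevice(executorId: Int, gpuCount: Int): Int = {
      require(gpuCount > 0, "no GPU attached to this node")
      executorId % gpuCount // e.g. 8 executors over 4 GPUs -> 2 per GPU
    }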

To submit a job to a Spark cluster, make sure the GPUEnabler jar is placed on all the nodes and that the file ${SPARK_HOME}/conf/spark-defaults.conf on each node contains the location of the jar in the config parameter spark.executor.extraClassPath, as follows:

 spark.executor.extraClassPath    /home/joe/GPUEnabler/gpu-enabler/target/gpu-enabler_2.11-1.0.0.jar

Then spark-submit can be used to submit the job as follows:

~/spark/bin/spark-submit \
      --executor-cores=4 \
      --total-executor-cores 60 \
      --executor-memory=6g \
      --class com.ibm.gpuenabler.SparkGPULR \
      --jars ~/GPUEnabler/gpu-enabler/target/gpu-enabler_2.11-1.0.0.jar \
      ~/GPUEnabler/examples/target/gpu-enabler-examples_2.11-1.0.0.jar spark://soe15:7077 60 1000000 400 5
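The trailing arguments after the examples jar are the Spark master URL followed by the SparkGPULR parameters numSlices, N, D, and ITERATIONS (here 60 1000000 400 5).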

Note:
Make sure to give the executors sufficient memory when dealing with huge datasets, to avoid out-of-memory errors.

@a-agrz commented Apr 11, 2017

Hi!
I have a cluster with 4 nodes, one of which is attached to two GPUs. I tried to use Spark with GPUEnabler in order to exploit my GPUs. I used the same command, but it doesn't work. I'm getting these errors from one of the nodes without a GPU:
17/04/11 10:33:24 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 8.7 KB, free 366.3 MB)
17/04/11 10:33:26 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 2)
org.apache.spark.SparkException: Could not initialize CUDA because of unknown reason
at com.ibm.gpuenabler.CUDAManager.<init>(CUDAManager.scala:53)
at com.ibm.gpuenabler.GPUSparkEnv.<init>(GPUSparkEnv.scala:42)
at com.ibm.gpuenabler.GPUSparkEnv$.initalize(GPUSparkEnv.scala:58)
at com.ibm.gpuenabler.GPUSparkEnv$.get(GPUSparkEnv.scala:66)
at com.ibm.gpuenabler.HybridIterator.cachedGPUPointers(HybridIterator.scala:98)
at com.ibm.gpuenabler.HybridIterator$$anonfun$5.apply(HybridIterator.scala:255)
at com.ibm.gpuenabler.HybridIterator$$anonfun$5.apply(HybridIterator.scala:254)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at com.ibm.gpuenabler.HybridIterator.<init>(HybridIterator.scala:254)
at com.ibm.gpuenabler.MapGPUPartitionsRDD.compute(CUDARDDUtils.scala:102)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: jcuda.CudaException: CUDA_ERROR_NO_DEVICE
at jcuda.driver.JCudaDriver.checkResult(JCudaDriver.java:312)
at jcuda.driver.JCudaDriver.cuInit(JCudaDriver.java:439)
at com.ibm.gpuenabler.CUDAManager.<init>(CUDAManager.scala:43)
... 22 more
17/04/11 10:33:28 INFO CoarseGrainedExecutorBackend: Got assigned task 5
17/04/11 10:33:28 INFO Executor: Running task 0.3 in stage 1.0 (TID 5)
17/04/11 10:33:28 ERROR Executor: Exception in task 0.3 in stage 1.0 (TID 5)
java.lang.NullPointerException
at com.ibm.gpuenabler.HybridIterator.cachedGPUPointers(HybridIterator.scala:98)
at com.ibm.gpuenabler.HybridIterator$$anonfun$5.apply(HybridIterator.scala:255)
at com.ibm.gpuenabler.HybridIterator$$anonfun$5.apply(HybridIterator.scala:254)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at com.ibm.gpuenabler.HybridIterator.<init>(HybridIterator.scala:254)
at com.ibm.gpuenabler.MapGPUPartitionsRDD.compute(CUDARDDUtils.scala:102)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
17/04/11 10:33:28 INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown
17/04/11 10:33:28 ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM

PS: I tried the example in local mode on the node with two GPUs and it worked perfectly.
Does this mean your GPUEnabler doesn't support a heterogeneous environment?

Thanks in advance

AAgr

@kmadhugit (Member) commented Apr 11, 2017 via email

@kavanabhat (Collaborator)

Have you tried with the above commit?

@a-agrz commented Apr 12, 2017

Yes, I did, and it's not working either.

@josiahsams (Member, Author) commented Apr 17, 2017

@a-agrz, please make sure you adhere to the following steps mentioned in the PR:

To submit a job to a Spark cluster, make sure the GPUEnabler jar is placed on all the nodes and that the file ${SPARK_HOME}/conf/spark-defaults.conf on each node contains the location of the jar in the config parameter spark.executor.extraClassPath, as follows:

spark.executor.extraClassPath /home/joe/GPUEnabler/gpu-enabler/target/gpu-enabler_2.11-1.0.0.jar


If you continue to get an exception like the following, please do paste the stack:

org.apache.spark.SparkException: Could not initialize CUDA because of unknown reason

@a-agrz commented Apr 18, 2017

@josiahsams I did exactly what you said, but it's not working.
It seems the application starts well, but when a task starts executing on one of the nodes without a GPU, it aborts the execution. In my opinion, the library can't fall back to the normal version of the code (Spark's map & reduce) when the GPU is not available.
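To make my point concrete, here is a rough sketch of my own (not GPUEnabler's actual code) of the kind of fallback I mean; mapExtFunc/reduceExtFunc stand in for the plugin's GPU path, and mapFn/redFn/cudaMapFn/cudaRedFn are placeholders:

    // Hypothetical guard, for illustration only: probe CUDA once, and fall
    // back to plain Spark map/reduce when no device can be initialized.
    import jcuda.CudaException
    import jcuda.driver.JCudaDriver

    val gpuUsable: Boolean =
      try {
        JCudaDriver.setExceptionsEnabled(true)
        JCudaDriver.cuInit(0) // throws on CUDA_ERROR_NO_DEVICE
        true
      } catch {
        case _: CudaException        => false // no usable GPU on this node
        case _: UnsatisfiedLinkError => false // CUDA driver library missing
      }

    val result =
      if (gpuUsable) rdd.mapExtFunc(mapFn, cudaMapFn).reduceExtFunc(redFn, cudaRedFn)
      else rdd.map(mapFn).reduce(redFn)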
Here are the errors I got:

spark-submit --executor-cores=4 --total-executor-cores 60 --executor-memory=6g --class com.ibm.gpuenabler.SparkGPULR --jars /home_nfs/aguerzaa/gpu-enabler_hetergEnv/GPUEnabler-multi1/gpu-enabler/target/gpu-enabler_2.11-1.0.0.jar /home_nfs/aguerzaa/gpu-enabler_hetergEnv/GPUEnabler-multi1/examples/target/gpu-enabler-examples_2.11-1.0.0.jar spark://naboo19:7077 60 10000 400 5
WARN: This is a naive implementation of Logistic Regression and is given as an example!
Please use either org.apache.spark.mllib.classification.LogisticRegressionWithSGD or
org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
for more conventional use.

Data generation done
numSlices=60, N=10000, D=400, ITERATIONS=5
GPU iteration 1
[Stage 1:> (0 + 60) / 60]
17/04/18 16:52:53 ERROR TaskSetManager: Task 48 in stage 1.0 failed 4 times; aborting job
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 48 in stage 1.0 failed 4 times, most recent failure: Lost task 48.3 in stage 1.0 (TID 168, 10.0.0.202, executor 1): java.lang.NullPointerException
at com.ibm.gpuenabler.GPUSparkEnv$.get(GPUSparkEnv.scala:70)
at com.ibm.gpuenabler.MapGPUPartitionsRDD.compute(CUDARDDUtils.scala:88)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
at com.ibm.gpuenabler.CUDARDDImplicits$CUDARDDFuncs.reduceExtFunc(CUDARDDUtils.scala:366)
at com.ibm.gpuenabler.SparkGPULR$$anonfun$main$1.apply$mcVI$sp(SparkGPULR.scala:120)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
at com.ibm.gpuenabler.SparkGPULR$.main(SparkGPULR.scala:112)
at com.ibm.gpuenabler.SparkGPULR.main(SparkGPULR.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.NullPointerException
at com.ibm.gpuenabler.GPUSparkEnv$.get(GPUSparkEnv.scala:70)
at com.ibm.gpuenabler.MapGPUPartitionsRDD.compute(CUDARDDUtils.scala:88)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

@josiahsams (Member, Author)

@a-agrz, I ran a quick test on a Power server and a Mac, both without GPUs attached, and this plugin works fine.

I will continue with some more tests to see why you get an exception.

I've followed these steps to pull all the commits from this PR:

git clone git@github.com:IBMSparkGPU/GPUEnabler.git
cd GPUEnabler
git fetch origin pull/38/head:multi1
git checkout multi1
./compile.sh

To test it:
./bin/run-example SparkGPULR

@kmadhugit kmadhugit merged commit 05236aa into IBMSparkGPU:master Apr 21, 2017
@a-agrz commented Apr 21, 2017

@josiahsams I did exactly what you said, but I got the errors below.
I'm running a standalone Spark cluster on 3 nodes, none of which has a GPU.
But I don't understand how it's going to be launched on the cluster if we don't use spark-submit?

$ ./bin/run-example SparkGPULR
Executing : mvn -q scala:run -DmainClass=com.ibm.gpuenabler.SparkGPULR -DaddArgs="local[*]"
[debug] execute contextualize
[debug] execute contextualize
WARN: This is a naive implementation of Logistic Regression and is given as an example!
Please use either org.apache.spark.mllib.classification.LogisticRegressionWithSGD or
org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
for more conventional use.

java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org_scala_tools_maven_executions.MainHelper.runMain(MainHelper.java:161)
at org_scala_tools_maven_executions.MainWithArgsInFile.main(MainWithArgsInFile.java:26)
Caused by: org.apache.spark.SparkException: Could not initialize CUDA because of unknown reason
at com.ibm.gpuenabler.CUDAManager.<init>(CUDAManager.scala:55)
at com.ibm.gpuenabler.GPUSparkEnv.<init>(GPUSparkEnv.scala:42)
at com.ibm.gpuenabler.GPUSparkEnv$.initalize(GPUSparkEnv.scala:59)
at com.ibm.gpuenabler.GPUSparkEnv$.get(GPUSparkEnv.scala:67)
at com.ibm.gpuenabler.CUDAFunction.<init>(CUDAFunction.scala:219)
at com.ibm.gpuenabler.SparkGPULR$.main(SparkGPULR.scala:76)
at com.ibm.gpuenabler.SparkGPULR.main(SparkGPULR.scala)
... 6 more
Caused by: jcuda.CudaException: CUDA_ERROR_NO_DEVICE
at jcuda.driver.JCudaDriver.checkResult(JCudaDriver.java:312)
at jcuda.driver.JCudaDriver.cuInit(JCudaDriver.java:439)
at com.ibm.gpuenabler.CUDAManager.<init>(CUDAManager.scala:45)
... 12 more
[ERROR] Failed to execute goal org.scala-tools:maven-scala-plugin:2.15.2:run (default-cli) on project gpu-enabler-examples_2.11: wrap: org.apache.commons.exec.ExecuteException: Process exited with an error: 240(Exit value: 240) -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException

Regards

a-agrz
