Support for Heterogeneous environment #38
Conversation
a-agrz: Hi! I have a cluster with 4 nodes, one of which is attached to two GPUs. I tried to run Spark with GPUEnabler, but it fails on the nodes without a GPU (full question and stack trace quoted in the reply below). PS: I tried the example in local mode on the node with two GPUs and it worked perfectly. Thanks in advance, AAgr
Yes, that is correct. This will be supported in the future.
Thanks,
Madhu.
From: a-agrz <[email protected]>
To: IBMSparkGPU/GPUEnabler <[email protected]>
Cc: Subscribed <[email protected]>
Date: 04/11/2017 02:28 PM
Subject: Re: [IBMSparkGPU/GPUEnabler] Support for Heterogeneous environment (#38)
Hi!
I have a cluster with 4 nodes, one of which is attached to two GPUs. I tried to use Spark with GPUEnabler in order to exploit my GPUs. I used the same command but it doesn't work. I'm getting these errors from one of the nodes without a GPU:
17/04/11 10:33:24 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 8.7 KB, free 366.3 MB)
17/04/11 10:33:26 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 2)
org.apache.spark.SparkException: Could not initialize CUDA because of unknown reason
at com.ibm.gpuenabler.CUDAManager.<init>(CUDAManager.scala:53)
at com.ibm.gpuenabler.GPUSparkEnv.<init>(GPUSparkEnv.scala:42)
at com.ibm.gpuenabler.GPUSparkEnv$.initalize(GPUSparkEnv.scala:58)
at com.ibm.gpuenabler.GPUSparkEnv$.get(GPUSparkEnv.scala:66)
at com.ibm.gpuenabler.HybridIterator.cachedGPUPointers(HybridIterator.scala:98)
at com.ibm.gpuenabler.HybridIterator$$anonfun$5.apply(HybridIterator.scala:255)
at com.ibm.gpuenabler.HybridIterator$$anonfun$5.apply(HybridIterator.scala:254)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at com.ibm.gpuenabler.HybridIterator.<init>(HybridIterator.scala:254)
at com.ibm.gpuenabler.MapGPUPartitionsRDD.compute(CUDARDDUtils.scala:102)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: jcuda.CudaException: CUDA_ERROR_NO_DEVICE
at jcuda.driver.JCudaDriver.checkResult(JCudaDriver.java:312)
at jcuda.driver.JCudaDriver.cuInit(JCudaDriver.java:439)
at com.ibm.gpuenabler.CUDAManager.<init>(CUDAManager.scala:43)
... 22 more
17/04/11 10:33:28 INFO CoarseGrainedExecutorBackend: Got assigned task 5
17/04/11 10:33:28 INFO Executor: Running task 0.3 in stage 1.0 (TID 5)
17/04/11 10:33:28 ERROR Executor: Exception in task 0.3 in stage 1.0 (TID 5)
java.lang.NullPointerException
at com.ibm.gpuenabler.HybridIterator.cachedGPUPointers(HybridIterator.scala:98)
at com.ibm.gpuenabler.HybridIterator$$anonfun$5.apply(HybridIterator.scala:255)
at com.ibm.gpuenabler.HybridIterator$$anonfun$5.apply(HybridIterator.scala:254)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at com.ibm.gpuenabler.HybridIterator.<init>(HybridIterator.scala:254)
at com.ibm.gpuenabler.MapGPUPartitionsRDD.compute(CUDARDDUtils.scala:102)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
17/04/11 10:33:28 INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown
17/04/11 10:33:28 ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
PS: I tried the example in local mode on the node with two GPUs and it worked perfectly.
Does this mean your GPUEnabler doesn't support a heterogeneous environment?
Thanks in advance,
AAgr
Have you tried with the above commit?
Yes, I did, and it's not working either.
@a-agrz, please make sure you adhere to the steps mentioned in the PR: to submit a job to a Spark cluster, make sure the GPUEnabler jar is placed in all the nodes and the file ${SPARK_HOME}/conf/spark-defaults.conf in the respective nodes contains the location of the jar as part of the config parameter spark.executor.extraClassPath.
If you continue to get an exception, please do paste the stack.
@josiahsams I did exactly what you said, but it's not working.
spark-submit --executor-cores=4 --total-executor-cores 60 --executor-memory=6g --class com.ibm.gpuenabler.SparkGPULR --jars
Data generation done
Driver stacktrace:
@a-agrz, I ran a quick test on one Power server and a Mac, both of them without GPUs attached, and this plugin works fine. I will continue with some more tests to see why you get an exception. I've followed these steps to pull all the commits from this PR:
git clone git@github.com:IBMSparkGPU/GPUEnabler.git
To test it,
@josiahsams I did exactly what you said, but I got the errors below.
$ ./bin/run-example SparkGPULR
java.lang.reflect.InvocationTargetException
Regards,
a-agrz
This enhancement enables the GPUEnabler plugin to recognize the environment it runs in (feature #5). The environment can be any one of the following:
By default, all Spark jobs submitted in local mode will use GPU device 0, if a GPU is attached to the node.
If the Spark job is submitted to a scheduler, then each executor will choose a GPU device based on the executor ID assigned to it by the scheduler. For example, on a node with 4 GPUs attached, a submitted Spark job that demands 4 executors will spawn executors each assigned to an individual GPU, and they will run the task to completion. If a Spark job demands 8 executors, each GPU will be assigned to 2 executors and the GPUs will be used in parallel.
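The round-robin executor-to-GPU assignment described above can be modelled as a simple modulo over the executor ID. This is only an illustrative sketch (the object and method names are hypothetical, not the plugin's actual code):

```scala
// Hypothetical sketch of the assignment policy described above:
// executors wrap around the available GPUs on the node.
object GpuAssignmentSketch {
  // Returns the GPU device index for a given executor ID.
  def deviceFor(executorId: Int, numGpus: Int): Int = {
    require(numGpus > 0, "node must have at least one GPU")
    executorId % numGpus
  }

  def main(args: Array[String]): Unit = {
    // With 4 GPUs and 8 executors, each GPU serves exactly 2 executors.
    val assignments = (0 until 8).map(id => deviceFor(id, 4))
    println(assignments.mkString(","))  // 0,1,2,3,0,1,2,3
  }
}
```

With 4 executors on a 4-GPU node, each executor gets its own device; with 8, each device is shared by two executors, matching the behaviour described in the PR.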
To submit a job to a Spark cluster, make sure the GPUEnabler jar is placed in all the nodes and the file ${SPARK_HOME}/conf/spark-defaults.conf in the respective nodes contains the location of the jar as part of the config parameter spark.executor.extraClassPath. Then spark-submit can be used to submit the job.
Note: Make sure to give sufficient memory to the executors when dealing with huge datasets, to avoid Out of Memory errors.
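As an illustrative sketch of the deployment steps above (the jar path, master URL, and application jar below are placeholders, not values from this thread):

```
# ${SPARK_HOME}/conf/spark-defaults.conf on every node (placeholder path)
spark.executor.extraClassPath  /opt/gpuenabler/gpu-enabler_2.11-1.0.0.jar
```

```
${SPARK_HOME}/bin/spark-submit \
  --master spark://master:7077 \
  --class com.ibm.gpuenabler.SparkGPULR \
  --executor-memory 6g \
  --jars /opt/gpuenabler/gpu-enabler_2.11-1.0.0.jar \
  /opt/gpuenabler/examples.jar
```

The extraClassPath entry must point at the jar's location on each worker's local filesystem, which is why the jar has to be copied to all nodes first.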