-
Hello, I have 4 GPUs, but when I execute Spark Rapids, I only see GPU 0 being utilized. Could this be due to an error in my PySpark parameter settings?
python file:
# Initialize Spark session
spark = SparkSession.builder \
.appName(experiment_name) \
.config("spark.executor.memory", "80g") \
.config("spark.driver.memory", "80g") \
.config("spark.executor.cores", 4) \
.config("spark.executor.instances", 32) \
.config("spark.default.parallelism", 128) \
.config("spark.cores.max", 128) \
.config("spark.executor.resource.gpu.discoveryScript", gpu_script_path) \
.config("spark.sql.execution.arrow.maxRecordsPerBatch", 10000) \
.config("spark.plugins", "com.nvidia.spark.SQLPlugin") \
.config("spark.rapids.sql.enabled", "true") \
.config("spark.rapids.sql.explain", "ALL") \
.config("spark.executor.resource.gpu.amount", 4) \
.config("spark.rapids.sql.concurrentGpuTasks", 2) \
.config("spark.rapids.memory.gpu.maxAllocFraction", 1) \
.config("spark.rapids.memory.gpu.allocFraction", 0.2) \
.config("spark.rapids.memory.gpu.minAllocFraction", 0.1) \
.config("spark.rapids.sql.multiThreadedRead.numThreads", 128) \
.config("spark.executor.extraClassPath", rapids_jar_path) \
.config("spark.driver.extraClassPath", rapids_jar_path) \
.getOrCreate()
getGpusResources.sh:
NUM_GPUS=4
ADDRS=$(seq -s ',' 0 $((NUM_GPUS - 1)) | sed 's/,/","/g')
echo '{"name": "gpu", "addresses":["'"$ADDRS"'"]}'
output:
-
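spark.executor.resource.gpu.amount should be 1 instead of 4. We only support running with 1 GPU per executor. You also have not configured spark.task.resource.gpu.amount; see https://docs.nvidia.com/spark-rapids/user-guide/latest/getting-started/overview.html#spark-gpu-scheduling-overview
Also, just to be clear: we do not support using multiple GPUs in local mode. This is because we only support a single GPU per process right now, and in local mode everything runs in a single process.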
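To make the two settings above concrete, here is a minimal sketch of the GPU scheduling values this reply points to. The helper name `gpu_scheduling_conf` is hypothetical, and setting the task amount to 1 divided by the executor core count is an assumption (it lets every task slot on an executor share that executor's single GPU); the linked scheduling overview is the place to confirm the right value for a given setup.

```python
def gpu_scheduling_conf(executor_cores: int) -> dict:
    """Return GPU scheduling settings for one GPU per executor.

    Hypothetical helper for illustration; not part of Spark or the plugin.
    """
    if executor_cores < 1:
        raise ValueError("executor_cores must be at least 1")
    return {
        "spark.executor.cores": str(executor_cores),
        # One GPU per executor process, as stated in the reply above.
        "spark.executor.resource.gpu.amount": "1",
        # Assumption: split the GPU evenly so all task slots can run at once.
        "spark.task.resource.gpu.amount": str(1 / executor_cores),
    }


conf = gpu_scheduling_conf(executor_cores=4)
for key, value in conf.items():
    print(f"{key}={value}")
```

Each pair could then be passed to `SparkSession.builder.config(key, value)` when running against a cluster manager rather than local mode.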
-
@onefanwu sounds good. Let me know if you run into any other problems.
-
@revans2 Hello, I followed your previous advice and connected PySpark to the worker in Spark Standalone mode. The worker is equipped with 3 GPUs and 128 CPU cores. However, I noticed that RAPIDS is still only using GPU 0 when executing my SQL query, and GPUs 1 and 2 are not being utilized. Could you please advise on how to make the remaining two GPUs be used as well?
PySpark Configuration
max_threads = 128
vector_size = 10000
rapids_jar_path = "/workdir/AiQ-dev/spark-rapids-AiQ/dist/target/rapids-4-spark_2.12-24.06.0-cuda11.jar"
getGpusResources = '/workdir/AiQ-dev/AiQ-benchmark/baseline/spark-RAPIDS/getGpusResources.sh'
from pyspark.sql import SparkSession

# Function to stop the current Spark session
def stop_spark_session(spark):
    spark.stop()

# Function to create a new Spark session
def create_spark_session():
    return SparkSession.builder \
.appName(experiment_name) \
.master("spark://localhost:7077") \
.config("spark.executor.memory", "80g") \
.config("spark.driver.memory", "80g") \
.config("spark.worker.resource.gpu.amount", 3)\
.config("spark.executor.resource.gpu.amount", 1) \
.config("spark.task.resource.gpu.amount", 1/4)\
.config("spark.executor.cores", 4) \
.config("spark.executor.instances", 32) \
.config("spark.default.parallelism", max_threads) \
.config("spark.cores.max", max_threads) \
.config("spark.sql.execution.arrow.maxRecordsPerBatch", vector_size) \
.config("spark.plugins", "com.nvidia.spark.SQLPlugin") \
.config("spark.rapids.sql.enabled", "true") \
.config("spark.rapids.sql.explain", "ALL") \
.config("spark.dynamicAllocation.enabled", "false") \
.config("spark.sql.adaptive.enabled", "true") \
.config("spark.rapids.sql.concurrentGpuTasks", 2) \
.config("spark.rapids.memory.gpu.maxAllocFraction", 1) \
.config("spark.rapids.memory.gpu.allocFraction", 0.2) \
.config("spark.rapids.memory.gpu.minAllocFraction", 0.1) \
.config("spark.rapids.sql.multiThreadedRead.numThreads", max_threads) \
.config("spark.executor.extraClassPath", rapids_jar_path) \
.config("spark.driver.extraClassPath", rapids_jar_path) \
.config("spark.worker.resource.gpu.discoveryScript", getGpusResources) \
.getOrCreate()
My SQL Query
query = f"""
getGpusResources.sh:
# copy from https://github.com/apache/spark/blob/master/examples/src/main/scripts/getGpusResources.sh
ADDRS=`nvidia-smi --query-gpu=index --format=csv,noheader | sed -e ':a' -e 'N' -e'$!ba' -e 's/\n/","/g'`
echo {\"name\": \"gpu\", \"addresses\":[\"$ADDRS\"]}
output: Snapshot
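For reference, the discovery script above is expected to print a single JSON object with a resource name and a list of address strings. The sketch below rebuilds and parses that shape in Python so the output of a script can be sanity-checked by hand. The function names are hypothetical, and the three-GPU example simply mirrors the worker described in this post.

```python
import json


def gpu_discovery_json(gpu_indices):
    """Build the JSON line a GPU discovery script prints (addresses are strings)."""
    return json.dumps({"name": "gpu", "addresses": [str(i) for i in gpu_indices]})


def parse_discovery_output(line):
    """Parse one line of discovery script output and return the GPU addresses."""
    info = json.loads(line)
    if info.get("name") != "gpu":
        raise ValueError("expected resource name 'gpu'")
    return info["addresses"]


# Three GPUs, as on the worker in this post.
line = gpu_discovery_json(range(3))
print(line)
print(parse_discovery_output(line))
```

If the script reports all three addresses but only one GPU is busy, the script itself is probably not the limiting factor.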
-
Please look at the Spark UI while the application is running: not the job UI, which is typically on port 4040, but the master UI, which is on port 8080 by default. It should show which GPUs are assigned to your application along with which ones are free. That should help us see what the limiting factor is, because it looks like only a single executor process was launched, which might indicate that Spark thinks it is out of host memory or CPU cores, so it didn't launch more.
-
@revans2 Hello, thank you very much for your guidance and suggestions. Your analysis was very accurate. Initially, I had set the worker memory limit to 80 GB and the executor memory to 80 GB as well, which resulted in only one executor being launched and thus only one GPU being utilized, as shown in the first image. Then I changed the worker memory limit to 256 GB and kept the executor memory at 80 GB. Three executors launched successfully, and each executor used one GPU, so all of my GPUs are now in use, as shown in the second and third images. Thank you so much. You are truly very professional.
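The arithmetic behind this fix can be sketched as follows. This is a simplified model, not Spark's actual scheduler: it assumes a worker launches an executor only while it still has enough memory, cores, and GPUs for one, and it ignores any memory overhead. The function name is hypothetical; the numbers come from this thread.

```python
def executors_per_worker(worker_mem_gb, worker_cores, worker_gpus,
                         executor_mem_gb, executor_cores, executor_gpus=1):
    """Estimate how many executors fit on one worker.

    Simplified model (an assumption): the scarcest of memory, cores,
    and GPUs decides the count; overheads are ignored.
    """
    return min(
        worker_mem_gb // executor_mem_gb,
        worker_cores // executor_cores,
        worker_gpus // executor_gpus,
    )


# Values from this thread: 3 GPUs, 128 cores, 80 GB executors with 4 cores each.
before = executors_per_worker(80, 128, 3, executor_mem_gb=80, executor_cores=4)
after = executors_per_worker(256, 128, 3, executor_mem_gb=80, executor_cores=4)
print(before, after)
```

With an 80 GB worker limit, memory allows a single executor; at 256 GB, memory and the GPU count both allow three, which matches the three executors seen after the change.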
-
@onefanwu happy to help.