[BUG] OutOfMemoryError - Maximum pool size exceeded while running 24 day criteo ETL Transform stage #1274
Comments
@ericrife Could you please verify it again by changing spark.sql.autoBroadcastJoinThreshold to 256M?
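For reference, a minimal sketch of how that suggestion could be applied to the CMD_PARAM from the issue body (the 256m value comes from the comment above; everything else stays as originally posted):
# Replace the original 1GB broadcast threshold with 256MB
--conf spark.sql.autoBroadcastJoinThreshold=256m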
@GaryShen2008 Same error after changing the broadcast threshold to 256M.
Can you run "spark.version" to print the Spark version and let us know what it says? I don't know if we have tested on a Dataproc Spark 3.0.1 cluster; it should have failed if the version didn't match something in our shim layer, unless it's being overridden.
@tgravescs - I updated the comment. It is 3.0.0 in Dataproc; sorry for the misinformation. I have tried the following other things, each with no success: updating task.cpus to 4, and updating concurrentGpuTasks to 1. @GaryShen2008
In my testing, I was able to run it by setting "--conf spark.task.cpus=8", which means only one task per executor.
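For clarity, a sketch of the lines that change relative to the original CMD_PARAM, assuming the same 8-core executors; keeping spark.task.cpus equal to --executor-cores means each executor runs a single task at a time:
# One task per executor: task.cpus matches executor-cores
--executor-cores 8
--conf spark.task.cpus=8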
My guess then is that it has to do with the broadcasts. The best way to know for sure is to get a heap dump when we run out of memory; there is a config you can set for that. Part of the reason I suspect the broadcasts is that we did some more investigation and found that, with a performance fix we put into 0.3, the memory usage became proportional to the number of threads times the number of broadcasts. We have plans to fix it in 0.4, because the performance "fix" was a workaround for a cudf issue (rapidsai/cudf#6052). Until this is fixed, we also hope to make the broadcast data spillable (#836).
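The exact config referred to above is not shown in the thread. As an illustration only, one common way to request a heap dump from the executor JVMs is the standard HotSpot flag below, appended to the spark.executor.extraJavaOptions already in CMD_PARAM (the dump path is a hypothetical example, and the flag only fires for JVM-heap OOMs, so it may not capture an RMM pool failure):
# Keep the existing cudf flag and add heap-dump-on-OOM options (path is illustrative)
--conf spark.executor.extraJavaOptions='-Dai.rapids.cudf.prefer-pinned=true -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/executor-oom.hprof'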
@GaryShen2008 This works. I had to modify my numbers a bit differently than you did to get the full resources from YARN, but ultimately the correct answer is to set task.cpus = executor-cores. Thanks for the assistance.
Let's keep this open until rapidsai/cudf#6052 is resolved and we can remove the existing code to filter nulls, which is inefficient. @GaryShen2008, now that this is resolved with @ericrife, please reassign it to me so we can track it for 0.4.
@ericrife would you mind rerunning the query and seeing whether the recent fixes helped? |
@sameerz - I have run this again with the 0.3 jar and it succeeded with both task.cpus=1 and task.cpus=exec-cores. This indicates that the OOM issue has been resolved in the latest release.
Closing based on @ericrife's comment. Please reopen if we still see the problem.
Criteo benchmarking uses a three-stage ETL: generate models using all 24 days, transform 23 days to create training data, and transform 1 day to create testing data. The benchmark succeeds on model generation but fails during the first transform with the error below.
This job is being run using Google Dataproc. We are seeing this with two different clusters.
4x n1-standard-32 with each node containing 4x T4
1x a2-megagpu-16g
The dataset is located in Google Cloud Storage (not local).
Here is the code that is being run:
CMD_PARAM="--master $MASTER
--driver-memory 10G
--executor-cores 8
--conf spark.cores.max=128
--num-executors 16
--conf spark.task.cpus=1
--conf spark.task.resource.gpu.amount=.125
--conf spark.executor.extraJavaOptions='-Dai.rapids.cudf.prefer-pinned=true'
--conf spark.executor.heartbeatInterval=300s
--conf spark.storage.blockManagerSlaveTimeoutMs=3600s
--conf spark.executor.resource.gpu.amount=1
--conf spark.sql.extensions=com.nvidia.spark.rapids.SQLExecPlugin
--conf spark.plugins=com.nvidia.spark.SQLPlugin
--conf spark.rapids.sql.concurrentGpuTasks=2
--conf spark.rapids.sql.batchSizeRows=4000000
--conf spark.rapids.sql.reader.batchSizeRows=4000000
--conf spark.rapids.memory.pinnedPool.size=8g
--conf spark.sql.autoBroadcastJoinThreshold=1GB
--conf spark.rapids.sql.incompatibleOps.enabled=true
--conf spark.sql.files.maxPartitionBytes=1G
--conf spark.driver.maxResultSize=2G
--conf spark.locality.wait=0s
--conf spark.network.timeout=1800s
--conf spark.executor.extraClassPath=${SPARK_RAPIDS_PLUGIN_JAR}:${SPARK_CUDF_JAR}
--conf spark.driver.extraClassPath=${SPARK_RAPIDS_PLUGIN_JAR}:${SPARK_CUDF_JAR}
$S3PARAMS"
spark-submit $CMD_PARAM
--conf spark.sql.shuffle.partitions=600
$SCRIPT --mode generate_models
--input_folder $INPUT_PATH
--frequency_limit 15
--debug_mode
--days $STARTDAY-$ENDDAY
--model_folder $OUTPUT_PATH/models
--write_mode overwrite --low_mem 2>&1 | tee -a $LOGFILE
spark-submit $CMD_PARAM
--conf spark.sql.shuffle.partitions=600
$SCRIPT --mode transform
--input_folder $INPUT_PATH
--debug_mode
--days $STARTDAY-$TRANSENDDAY
--output_folder $OUTPUT_PATH/train
--model_folder $OUTPUT_PATH/models
--write_mode overwrite --low_mem 2>&1 | tee -a $LOGFILE
spark-submit $CMD_PARAM
--conf spark.sql.shuffle.partitions=30
$SCRIPT --mode transform
--input_folder $INPUT_PATH
--debug_mode
--days $ENDDAY-$ENDDAY
--output_folder $OUTPUT_PATH/test
--output_ordering input
--model_folder $OUTPUT_PATH/models
--write_mode overwrite --low_mem 2>&1 | tee -a $LOGFILE
20/11/30 16:46:04 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 9.0 in stage 104.0 (TID 417, t4-cluster-w-0.c.data-science-enterprise.internal, executor 7): java.lang.OutOfMemoryError: Could not allocate native memory: std::bad_alloc:
RMM failure at: /usr/local/rapids/include/rmm/mr/device/pool_memory_resource.hpp:167: Maximum pool size exceeded
at ai.rapids.cudf.Table.readCSV(Native Method)
at ai.rapids.cudf.Table.readCSV(Table.java:541)
at com.nvidia.spark.rapids.CSVPartitionReader.$anonfun$readToTable$2(GpuBatchScanExec.scala:465)
at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
at com.nvidia.spark.rapids.CSVPartitionReader.withResource(GpuBatchScanExec.scala:319)
at com.nvidia.spark.rapids.CSVPartitionReader.readToTable(GpuBatchScanExec.scala:464)
at com.nvidia.spark.rapids.CSVPartitionReader.$anonfun$readBatch$1(GpuBatchScanExec.scala:430)
at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
at com.nvidia.spark.rapids.CSVPartitionReader.withResource(GpuBatchScanExec.scala:319)
at com.nvidia.spark.rapids.CSVPartitionReader.readBatch(GpuBatchScanExec.scala:428)
at com.nvidia.spark.rapids.CSVPartitionReader.next(GpuBatchScanExec.scala:487)
at com.nvidia.spark.rapids.ColumnarPartitionReaderWithPartitionValues.next(ColumnarPartitionReaderWithPartitionValues.scala:35)
at com.nvidia.spark.rapids.PartitionReaderIterator.hasNext(PartitionReaderIterator.scala:37)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
at org.apache.spark.sql.rapids.GpuFileSourceScanExec$$anon$1.hasNext(GpuFileSourceScanExec.scala:373)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.shims.spark300.GpuHashJoin$$anon$1.hasNext(GpuHashJoin.scala:203)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.shims.spark300.GpuHashJoin$$anon$1.hasNext(GpuHashJoin.scala:203)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.shims.spark300.GpuHashJoin$$anon$1.hasNext(GpuHashJoin.scala:203)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.shims.spark300.GpuHashJoin$$anon$1.hasNext(GpuHashJoin.scala:203)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.shims.spark300.GpuHashJoin$$anon$1.hasNext(GpuHashJoin.scala:203)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.shims.spark300.GpuHashJoin$$anon$1.hasNext(GpuHashJoin.scala:203)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.shims.spark300.GpuHashJoin$$anon$1.hasNext(GpuHashJoin.scala:203)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.shims.spark300.GpuHashJoin$$anon$1.hasNext(GpuHashJoin.scala:203)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.shims.spark300.GpuHashJoin$$anon$1.hasNext(GpuHashJoin.scala:203)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.shims.spark300.GpuHashJoin$$anon$1.hasNext(GpuHashJoin.scala:203)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.shims.spark300.GpuHashJoin$$anon$1.hasNext(GpuHashJoin.scala:203)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.shims.spark300.GpuHashJoin$$anon$1.hasNext(GpuHashJoin.scala:203)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.shims.spark300.GpuHashJoin$$anon$1.hasNext(GpuHashJoin.scala:203)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.shims.spark300.GpuHashJoin$$anon$1.hasNext(GpuHashJoin.scala:203)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.shims.spark300.GpuHashJoin$$anon$1.hasNext(GpuHashJoin.scala:203)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.shims.spark300.GpuHashJoin$$anon$1.hasNext(GpuHashJoin.scala:203)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.shims.spark300.GpuHashJoin$$anon$1.hasNext(GpuHashJoin.scala:203)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.shims.spark300.GpuHashJoin$$anon$1.hasNext(GpuHashJoin.scala:203)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.shims.spark300.GpuHashJoin$$anon$1.hasNext(GpuHashJoin.scala:203)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.shims.spark300.GpuHashJoin$$anon$1.hasNext(GpuHashJoin.scala:203)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.shims.spark300.GpuHashJoin$$anon$1.hasNext(GpuHashJoin.scala:203)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.shims.spark300.GpuHashJoin$$anon$1.hasNext(GpuHashJoin.scala:203)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.shims.spark300.GpuHashJoin$$anon$1.hasNext(GpuHashJoin.scala:203)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.shims.spark300.GpuHashJoin$$anon$1.hasNext(GpuHashJoin.scala:203)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.shims.spark300.GpuHashJoin$$anon$1.hasNext(GpuHashJoin.scala:203)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExec$$anon$1.partNextBatch(GpuShuffleExchangeExec.scala:189)
at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExec$$anon$1.hasNext(GpuShuffleExchangeExec.scala:206)
at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:177)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:127)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Criteo OOM Logs.zip