
[BUG] OutOfMemoryError - Maximum pool size exceeded while running 24 day criteo ETL Transform stage #1274

Status: Closed
ericrife opened this issue Dec 4, 2020 · 12 comments
Labels: bug (Something isn't working), P1 (Nice to have for release)

ericrife commented Dec 4, 2020

Criteo benchmarking uses a 3-stage ETL: generate models using all 24 days, transform using 23 days to create training data, and transform using 1 day to create testing data. The benchmark succeeds on model generation but fails during the first transform with the error below.

This job is being run on Google Dataproc. We are seeing this with two different clusters:
- 4x n1-standard-32, with each node containing 4x T4
- 1x a2-megagpu-16g

The dataset is located in Google Cloud Storage (not local).

Here is the code that is being run:
CMD_PARAM="--master $MASTER
--driver-memory 10G
--executor-cores 8
--conf spark.cores.max=128
--num-executors 16
--conf spark.task.cpus=1
--conf spark.task.resource.gpu.amount=.125
--conf spark.executor.extraJavaOptions='-Dai.rapids.cudf.prefer-pinned=true'
--conf spark.executor.heartbeatInterval=300s
--conf spark.storage.blockManagerSlaveTimeoutMs=3600s
--conf spark.executor.resource.gpu.amount=1
--conf spark.sql.extensions=com.nvidia.spark.rapids.SQLExecPlugin
--conf spark.plugins=com.nvidia.spark.SQLPlugin
--conf spark.rapids.sql.concurrentGpuTasks=2
--conf spark.rapids.sql.batchSizeRows=4000000
--conf spark.rapids.sql.reader.batchSizeRows=4000000
--conf spark.rapids.memory.pinnedPool.size=8g
--conf spark.sql.autoBroadcastJoinThreshold=1GB
--conf spark.rapids.sql.incompatibleOps.enabled=true
--conf spark.sql.files.maxPartitionBytes=1G
--conf spark.driver.maxResultSize=2G
--conf spark.locality.wait=0s
--conf spark.network.timeout=1800s
--conf spark.executor.extraClassPath=${SPARK_RAPIDS_PLUGIN_JAR}:${SPARK_CUDF_JAR}
--conf spark.driver.extraClassPath=${SPARK_RAPIDS_PLUGIN_JAR}:${SPARK_CUDF_JAR}
$S3PARAMS"

spark-submit $CMD_PARAM \
--conf spark.sql.shuffle.partitions=600 \
$SCRIPT --mode generate_models \
--input_folder $INPUT_PATH \
--frequency_limit 15 \
--debug_mode \
--days $STARTDAY-$ENDDAY \
--model_folder $OUTPUT_PATH/models \
--write_mode overwrite --low_mem 2>&1 | tee -a $LOGFILE

spark-submit $CMD_PARAM \
--conf spark.sql.shuffle.partitions=600 \
$SCRIPT --mode transform \
--input_folder $INPUT_PATH \
--debug_mode \
--days $STARTDAY-$TRANSENDDAY \
--output_folder $OUTPUT_PATH/train \
--model_folder $OUTPUT_PATH/models \
--write_mode overwrite --low_mem 2>&1 | tee -a $LOGFILE

spark-submit $CMD_PARAM \
--conf spark.sql.shuffle.partitions=30 \
$SCRIPT --mode transform \
--input_folder $INPUT_PATH \
--debug_mode \
--days $ENDDAY-$ENDDAY \
--output_folder $OUTPUT_PATH/test \
--output_ordering input \
--model_folder $OUTPUT_PATH/models \
--write_mode overwrite --low_mem 2>&1 | tee -a $LOGFILE

20/11/30 16:46:04 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 9.0 in stage 104.0 (TID 417, t4-cluster-w-0.c.data-science-enterprise.internal, executor 7): java.lang.OutOfMemoryError: Could not allocate native memory: std::bad_alloc:
RMM failure at: /usr/local/rapids/include/rmm/mr/device/pool_memory_resource.hpp:167: Maximum pool size exceeded
at ai.rapids.cudf.Table.readCSV(Native Method)
at ai.rapids.cudf.Table.readCSV(Table.java:541)
at com.nvidia.spark.rapids.CSVPartitionReader.$anonfun$readToTable$2(GpuBatchScanExec.scala:465)
at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
at com.nvidia.spark.rapids.CSVPartitionReader.withResource(GpuBatchScanExec.scala:319)
at com.nvidia.spark.rapids.CSVPartitionReader.readToTable(GpuBatchScanExec.scala:464)
at com.nvidia.spark.rapids.CSVPartitionReader.$anonfun$readBatch$1(GpuBatchScanExec.scala:430)
at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
at com.nvidia.spark.rapids.CSVPartitionReader.withResource(GpuBatchScanExec.scala:319)
at com.nvidia.spark.rapids.CSVPartitionReader.readBatch(GpuBatchScanExec.scala:428)
at com.nvidia.spark.rapids.CSVPartitionReader.next(GpuBatchScanExec.scala:487)
at com.nvidia.spark.rapids.ColumnarPartitionReaderWithPartitionValues.next(ColumnarPartitionReaderWithPartitionValues.scala:35)
at com.nvidia.spark.rapids.PartitionReaderIterator.hasNext(PartitionReaderIterator.scala:37)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
at org.apache.spark.sql.rapids.GpuFileSourceScanExec$$anon$1.hasNext(GpuFileSourceScanExec.scala:373)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.shims.spark300.GpuHashJoin$$anon$1.hasNext(GpuHashJoin.scala:203)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.shims.spark300.GpuHashJoin$$anon$1.hasNext(GpuHashJoin.scala:203)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.shims.spark300.GpuHashJoin$$anon$1.hasNext(GpuHashJoin.scala:203)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.shims.spark300.GpuHashJoin$$anon$1.hasNext(GpuHashJoin.scala:203)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.shims.spark300.GpuHashJoin$$anon$1.hasNext(GpuHashJoin.scala:203)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.shims.spark300.GpuHashJoin$$anon$1.hasNext(GpuHashJoin.scala:203)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.shims.spark300.GpuHashJoin$$anon$1.hasNext(GpuHashJoin.scala:203)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.shims.spark300.GpuHashJoin$$anon$1.hasNext(GpuHashJoin.scala:203)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.shims.spark300.GpuHashJoin$$anon$1.hasNext(GpuHashJoin.scala:203)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.shims.spark300.GpuHashJoin$$anon$1.hasNext(GpuHashJoin.scala:203)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.shims.spark300.GpuHashJoin$$anon$1.hasNext(GpuHashJoin.scala:203)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.shims.spark300.GpuHashJoin$$anon$1.hasNext(GpuHashJoin.scala:203)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.shims.spark300.GpuHashJoin$$anon$1.hasNext(GpuHashJoin.scala:203)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.shims.spark300.GpuHashJoin$$anon$1.hasNext(GpuHashJoin.scala:203)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.shims.spark300.GpuHashJoin$$anon$1.hasNext(GpuHashJoin.scala:203)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.shims.spark300.GpuHashJoin$$anon$1.hasNext(GpuHashJoin.scala:203)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.shims.spark300.GpuHashJoin$$anon$1.hasNext(GpuHashJoin.scala:203)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.shims.spark300.GpuHashJoin$$anon$1.hasNext(GpuHashJoin.scala:203)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.shims.spark300.GpuHashJoin$$anon$1.hasNext(GpuHashJoin.scala:203)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.shims.spark300.GpuHashJoin$$anon$1.hasNext(GpuHashJoin.scala:203)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.shims.spark300.GpuHashJoin$$anon$1.hasNext(GpuHashJoin.scala:203)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.shims.spark300.GpuHashJoin$$anon$1.hasNext(GpuHashJoin.scala:203)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.shims.spark300.GpuHashJoin$$anon$1.hasNext(GpuHashJoin.scala:203)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.shims.spark300.GpuHashJoin$$anon$1.hasNext(GpuHashJoin.scala:203)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.shims.spark300.GpuHashJoin$$anon$1.hasNext(GpuHashJoin.scala:203)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExec$$anon$1.partNextBatch(GpuShuffleExchangeExec.scala:189)
at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExec$$anon$1.hasNext(GpuShuffleExchangeExec.scala:206)
at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:177)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:127)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Attachment: Criteo OOM Logs.zip

ericrife added the ? - Needs Triage (Need team to review and classify) and bug (Something isn't working) labels on Dec 4, 2020
GaryShen2008 (Collaborator) commented:

@ericrife Could you please verify it again by changing spark.sql.autoBroadcastJoinThreshold to 256M?
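
For reference, that change amounts to overriding one conf in the submit command; a minimal sketch, assuming the rest of CMD_PARAM stays as posted above:

--conf spark.sql.autoBroadcastJoinThreshold=256m   # lowered from 1GB; 256m is the value suggested here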

GaryShen2008 self-assigned this on Dec 6, 2020
ericrife (Author) commented Dec 7, 2020

@GaryShen2008 Same error after changing the broadcast threshold to 256M.

tgravescs (Collaborator) commented:

Can you run "spark.version" to print the Spark version and let us know what it says? I don't know if we have tested on a Dataproc Spark 3.0.1 cluster; it should have failed, though, if the version didn't match something in our shim layer, unless it's being overridden.
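
A quick way to check this, as a sketch (either form should print the running Spark version):

spark-submit --version
# or, from an interactive spark-shell / pyspark session on the cluster, evaluate:
#   spark.version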

ericrife (Author) commented Dec 7, 2020

@tgravescs - I updated the comment. It is 3.0.0 in Dataproc. Sorry for the misinformation.

I have tried the following other things, each with no success:

- Update task.cpus to 4, update concurrent GPU tasks (concurrentGpuTasks) to 1 (@GaryShen2008); see the sketch below.
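
As a sketch, those attempts correspond to overriding the following confs in the original submit command (illustrative only, matching the description above):

--conf spark.task.cpus=4
--conf spark.rapids.sql.concurrentGpuTasks=1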

GaryShen2008 (Collaborator) commented:

In my testing, I succeeded in running it with "--conf spark.task.cpus=8", which means only one task per executor.
@ericrife Could you please double-check and confirm?
Below are my parameters.

CMD_PARAM="--master $MASTER
--driver-memory ${DRIVER_MEMORY}G
--executor-cores 8
--num-executors 16
--executor-memory=40G
--conf spark.task.cpus=8
--conf spark.task.resource.gpu.amount=$RESOURCE_GPU_AMT
--conf spark.executor.extraJavaOptions='-Dai.rapids.cudf.prefer-pinned=true'
--conf spark.executor.heartbeatInterval=300s
--conf spark.storage.blockManagerSlaveTimeoutMs=3600s
--conf spark.executor.resource.gpu.amount=1
--conf spark.sql.extensions=com.nvidia.spark.rapids.SQLExecPlugin
--conf spark.plugins=com.nvidia.spark.SQLPlugin
--conf spark.rapids.sql.concurrentGpuTasks=1
--conf spark.rapids.sql.batchSizeRows=4000000
--conf spark.rapids.sql.reader.batchSizeRows=4000000
--conf spark.rapids.memory.pinnedPool.size=8g
--conf spark.sql.autoBroadcastJoinThreshold=1G
--conf spark.rapids.sql.incompatibleOps.enabled=true
--conf spark.sql.files.maxPartitionBytes=4G
--conf spark.driver.maxResultSize=2G
--conf spark.locality.wait=0s
--conf spark.network.timeout=1800s
--conf spark.executor.extraClassPath=${SPARK_RAPIDS_PLUGIN_JAR}:${SPARK_CUDF_JAR}
--conf spark.driver.extraClassPath=${SPARK_RAPIDS_PLUGIN_JAR}:${SPARK_CUDF_JAR}
$S3PARAMS"

revans2 (Collaborator) commented Dec 8, 2020

My guess, then, is that it has to do with the broadcasts. The best way to know for sure is to get a heap dump when we run out of memory. You can set the config spark.rapids.memory.gpu.oomDumpDir to a directory you want the dump written to, and the plugin should handle it for you. After that we can probably track down whether or not it is the broadcasts that we know leak.
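
A minimal sketch of that setting; the directory path below is only an example and needs to be writable on the executor nodes:

--conf spark.rapids.memory.gpu.oomDumpDir=/tmp/gpu_oom_dumps   # example path; the plugin writes a dump here on GPU OOM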

Part of the reason I suspect this is that we did some more investigation and found that, with a performance fix we put into 0.3, the memory usage became proportional to the number of threads times the number of broadcasts. We have plans to fix it in 0.4, because the performance "fix" was a workaround for a cudf issue (rapidsai/cudf#6052). Until this is fixed, we also hope to make the broadcast data spillable (#836).

ericrife (Author) commented Dec 8, 2020

@GaryShen2008 This works. I had to modify my numbers a bit differently than you did to get the full resources from YARN, but ultimately the correct answer is to make task.cpus equal to executor-cores.

Thanks for the assistance.
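
A sketch of the key relationship that worked here, with illustrative values (the essential point is spark.task.cpus equal to executor-cores, i.e. one task per executor at a time):

--executor-cores 8
--conf spark.task.cpus=8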

sameerz added the P1 (Nice to have for release) label and removed the ? - Needs Triage (Need team to review and classify) label on Dec 8, 2020
sameerz (Collaborator) commented Dec 8, 2020

Let's keep this open until rapidsai/cudf#6052 is resolved and we can remove the existing code that filters nulls, which is inefficient. @GaryShen2008, please reassign this to me once it is resolved with @ericrife so we can track it for 0.4.

chenrui17 commented:

@sameerz should this be closed? I see this improvement was merged in #754.

sameerz (Collaborator) commented Jan 14, 2021

@ericrife would you mind rerunning the query and seeing whether the recent fixes helped?

ericrife (Author) commented:

@sameerz - I have run this again with the 0.3 jar and was able to run it with both task.cpus=1 and task.cpus=exec-cores.

This would indicate that the OOM issue has been resolved in the latest release.

sameerz (Collaborator) commented Feb 1, 2021

Closing based on @ericrife 's comment. Please reopen if we still see the problem.

sameerz closed this as completed on Feb 1, 2021