
[BUG] OutOfMemoryError - Maximum pool size exceeded while running 24 day criteo ETL Transform stage #1274

Status: Closed
ericrife opened this issue Dec 4, 2020 · 12 comments
Labels: bug (Something isn't working), P1 (Nice to have for release)

ericrife commented Dec 4, 2020

Criteo benchmarking uses a 3-stage ETL: generate models using all 24 days, transform using 23 days to create training data, and transform using 1 day to create testing data. The benchmark succeeds on model generation but fails during the first transform with the error below.

This job is being run on Google Dataproc. We are seeing this with two different clusters:
- 4x n1-standard-32, with each node containing 4x T4
- 1x a2-megagpu-16g

The dataset is located in Google Cloud Storage (not local).

Here is the code that is being run:
CMD_PARAM="--master $MASTER
--driver-memory 10G
--executor-cores 8
--conf spark.cores.max=128
--num-executors 16
--conf spark.task.cpus=1
--conf spark.task.resource.gpu.amount=.125
--conf spark.executor.extraJavaOptions='-Dai.rapids.cudf.prefer-pinned=true'
--conf spark.executor.heartbeatInterval=300s
--conf spark.storage.blockManagerSlaveTimeoutMs=3600s
--conf spark.executor.resource.gpu.amount=1
--conf spark.sql.extensions=com.nvidia.spark.rapids.SQLExecPlugin
--conf spark.plugins=com.nvidia.spark.SQLPlugin
--conf spark.rapids.sql.concurrentGpuTasks=2
--conf spark.rapids.sql.batchSizeRows=4000000
--conf spark.rapids.sql.reader.batchSizeRows=4000000
--conf spark.rapids.memory.pinnedPool.size=8g
--conf spark.sql.autoBroadcastJoinThreshold=1GB
--conf spark.rapids.sql.incompatibleOps.enabled=true
--conf spark.sql.files.maxPartitionBytes=1G
--conf spark.driver.maxResultSize=2G
--conf spark.locality.wait=0s
--conf spark.network.timeout=1800s
--conf spark.executor.extraClassPath=${SPARK_RAPIDS_PLUGIN_JAR}:${SPARK_CUDF_JAR}
--conf spark.driver.extraClassPath=${SPARK_RAPIDS_PLUGIN_JAR}:${SPARK_CUDF_JAR}
$S3PARAMS"

spark-submit $CMD_PARAM \
--conf spark.sql.shuffle.partitions=600 \
$SCRIPT --mode generate_models \
--input_folder $INPUT_PATH \
--frequency_limit 15 \
--debug_mode \
--days $STARTDAY-$ENDDAY \
--model_folder $OUTPUT_PATH/models \
--write_mode overwrite --low_mem 2>&1 | tee -a $LOGFILE

spark-submit $CMD_PARAM \
--conf spark.sql.shuffle.partitions=600 \
$SCRIPT --mode transform \
--input_folder $INPUT_PATH \
--debug_mode \
--days $STARTDAY-$TRANSENDDAY \
--output_folder $OUTPUT_PATH/train \
--model_folder $OUTPUT_PATH/models \
--write_mode overwrite --low_mem 2>&1 | tee -a $LOGFILE

spark-submit $CMD_PARAM \
--conf spark.sql.shuffle.partitions=30 \
$SCRIPT --mode transform \
--input_folder $INPUT_PATH \
--debug_mode \
--days $ENDDAY-$ENDDAY \
--output_folder $OUTPUT_PATH/test \
--output_ordering input \
--model_folder $OUTPUT_PATH/models \
--write_mode overwrite --low_mem 2>&1 | tee -a $LOGFILE

20/11/30 16:46:04 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 9.0 in stage 104.0 (TID 417, t4-cluster-w-0.c.data-science-enterprise.internal, executor 7): java.lang.OutOfMemoryError: Could not allocate native memory: std::bad_alloc:
RMM failure at: /usr/local/rapids/include/rmm/mr/device/pool_memory_resource.hpp:167: Maximum pool size exceeded
at ai.rapids.cudf.Table.readCSV(Native Method)
at ai.rapids.cudf.Table.readCSV(Table.java:541)
at com.nvidia.spark.rapids.CSVPartitionReader.$anonfun$readToTable$2(GpuBatchScanExec.scala:465)
at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
at com.nvidia.spark.rapids.CSVPartitionReader.withResource(GpuBatchScanExec.scala:319)
at com.nvidia.spark.rapids.CSVPartitionReader.readToTable(GpuBatchScanExec.scala:464)
at com.nvidia.spark.rapids.CSVPartitionReader.$anonfun$readBatch$1(GpuBatchScanExec.scala:430)
at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
at com.nvidia.spark.rapids.CSVPartitionReader.withResource(GpuBatchScanExec.scala:319)
at com.nvidia.spark.rapids.CSVPartitionReader.readBatch(GpuBatchScanExec.scala:428)
at com.nvidia.spark.rapids.CSVPartitionReader.next(GpuBatchScanExec.scala:487)
at com.nvidia.spark.rapids.ColumnarPartitionReaderWithPartitionValues.next(ColumnarPartitionReaderWithPartitionValues.scala:35)
at com.nvidia.spark.rapids.PartitionReaderIterator.hasNext(PartitionReaderIterator.scala:37)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
at org.apache.spark.sql.rapids.GpuFileSourceScanExec$$anon$1.hasNext(GpuFileSourceScanExec.scala:373)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.shims.spark300.GpuHashJoin$$anon$1.hasNext(GpuHashJoin.scala:203)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.shims.spark300.GpuHashJoin$$anon$1.hasNext(GpuHashJoin.scala:203)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.shims.spark300.GpuHashJoin$$anon$1.hasNext(GpuHashJoin.scala:203)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.shims.spark300.GpuHashJoin$$anon$1.hasNext(GpuHashJoin.scala:203)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.shims.spark300.GpuHashJoin$$anon$1.hasNext(GpuHashJoin.scala:203)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.shims.spark300.GpuHashJoin$$anon$1.hasNext(GpuHashJoin.scala:203)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.shims.spark300.GpuHashJoin$$anon$1.hasNext(GpuHashJoin.scala:203)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.shims.spark300.GpuHashJoin$$anon$1.hasNext(GpuHashJoin.scala:203)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.shims.spark300.GpuHashJoin$$anon$1.hasNext(GpuHashJoin.scala:203)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.shims.spark300.GpuHashJoin$$anon$1.hasNext(GpuHashJoin.scala:203)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.shims.spark300.GpuHashJoin$$anon$1.hasNext(GpuHashJoin.scala:203)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.shims.spark300.GpuHashJoin$$anon$1.hasNext(GpuHashJoin.scala:203)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.shims.spark300.GpuHashJoin$$anon$1.hasNext(GpuHashJoin.scala:203)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.shims.spark300.GpuHashJoin$$anon$1.hasNext(GpuHashJoin.scala:203)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.shims.spark300.GpuHashJoin$$anon$1.hasNext(GpuHashJoin.scala:203)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.shims.spark300.GpuHashJoin$$anon$1.hasNext(GpuHashJoin.scala:203)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.shims.spark300.GpuHashJoin$$anon$1.hasNext(GpuHashJoin.scala:203)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.shims.spark300.GpuHashJoin$$anon$1.hasNext(GpuHashJoin.scala:203)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.shims.spark300.GpuHashJoin$$anon$1.hasNext(GpuHashJoin.scala:203)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.shims.spark300.GpuHashJoin$$anon$1.hasNext(GpuHashJoin.scala:203)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.shims.spark300.GpuHashJoin$$anon$1.hasNext(GpuHashJoin.scala:203)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.shims.spark300.GpuHashJoin$$anon$1.hasNext(GpuHashJoin.scala:203)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.shims.spark300.GpuHashJoin$$anon$1.hasNext(GpuHashJoin.scala:203)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.shims.spark300.GpuHashJoin$$anon$1.hasNext(GpuHashJoin.scala:203)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.shims.spark300.GpuHashJoin$$anon$1.hasNext(GpuHashJoin.scala:203)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExec$$anon$1.partNextBatch(GpuShuffleExchangeExec.scala:189)
at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExec$$anon$1.hasNext(GpuShuffleExchangeExec.scala:206)
at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:177)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:127)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Attachment: Criteo OOM Logs.zip

ericrife added the ? - Needs Triage (Need team to review and classify) and bug (Something isn't working) labels on Dec 4, 2020
GaryShen2008 (Collaborator) commented:

@ericrife Could you please verify it again by changing spark.sql.autoBroadcastJoinThreshold to 256M?
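
For reference, that change amounts to overriding one conf in the submit command; a minimal sketch, assuming the rest of CMD_PARAM stays as posted above:

--conf spark.sql.autoBroadcastJoinThreshold=256m   # lowered from 1GB; 256m is the value suggested here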

GaryShen2008 self-assigned this on Dec 6, 2020
ericrife (Author) commented Dec 7, 2020

@GaryShen2008 Same error after changing the broadcast threshold to 256M.

tgravescs (Collaborator) commented:

Can you run "spark.version" to print the Spark version and let us know what it says? I don't know if we have tested on a Dataproc Spark 3.0.1 cluster; it should have failed, though, if the version didn't match something in our shim layer, unless it's being overridden.
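
A quick way to check this, as a sketch (either form should print the running Spark version):

spark-submit --version
# or, from an interactive spark-shell / pyspark session on the cluster, evaluate:
#   spark.version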

ericrife (Author) commented Dec 7, 2020

@tgravescs - I updated the comment. It is 3.0.0 in Dataproc. Sorry for the misinformation.

I have tried the following other things, each with no success:

- Update task.cpus to 4, update concurrent GPU tasks (concurrentGpuTasks) to 1 (@GaryShen2008); see the sketch below.
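
As a sketch, those attempts correspond to overriding the following confs in the original submit command (illustrative only, matching the description above):

--conf spark.task.cpus=4
--conf spark.rapids.sql.concurrentGpuTasks=1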

GaryShen2008 (Collaborator) commented:

In my testing, I succeeded in running it with "--conf spark.task.cpus=8", which means only one task per executor.
@ericrife Could you please double-check and confirm?
Below are my parameters.

CMD_PARAM="--master $MASTER
--driver-memory ${DRIVER_MEMORY}G
--executor-cores 8
--num-executors 16
--executor-memory=40G
--conf spark.task.cpus=8
--conf spark.task.resource.gpu.amount=$RESOURCE_GPU_AMT
--conf spark.executor.extraJavaOptions='-Dai.rapids.cudf.prefer-pinned=true'
--conf spark.executor.heartbeatInterval=300s
--conf spark.storage.blockManagerSlaveTimeoutMs=3600s
--conf spark.executor.resource.gpu.amount=1
--conf spark.sql.extensions=com.nvidia.spark.rapids.SQLExecPlugin
--conf spark.plugins=com.nvidia.spark.SQLPlugin
--conf spark.rapids.sql.concurrentGpuTasks=1
--conf spark.rapids.sql.batchSizeRows=4000000
--conf spark.rapids.sql.reader.batchSizeRows=4000000
--conf spark.rapids.memory.pinnedPool.size=8g
--conf spark.sql.autoBroadcastJoinThreshold=1G
--conf spark.rapids.sql.incompatibleOps.enabled=true
--conf spark.sql.files.maxPartitionBytes=4G
--conf spark.driver.maxResultSize=2G
--conf spark.locality.wait=0s
--conf spark.network.timeout=1800s
--conf spark.executor.extraClassPath=${SPARK_RAPIDS_PLUGIN_JAR}:${SPARK_CUDF_JAR}
--conf spark.driver.extraClassPath=${SPARK_RAPIDS_PLUGIN_JAR}:${SPARK_CUDF_JAR}
$S3PARAMS"

revans2 (Collaborator) commented Dec 8, 2020

My guess, then, is that it has to do with the broadcasts. The best way to know for sure is to get a heap dump when we run out of memory. You can set the config spark.rapids.memory.gpu.oomDumpDir to a directory you want the dump written to, and the plugin should handle it for you. After that we can probably track down whether or not it is the broadcasts that we know leak.
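
A minimal sketch of that setting; the directory path below is only an example and needs to be writable on the executor nodes:

--conf spark.rapids.memory.gpu.oomDumpDir=/tmp/gpu_oom_dumps   # example path; the plugin writes a dump here on GPU OOM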

Part of the reason I suspect this is that we did some more investigation and found that, with a performance fix we put into 0.3, the memory usage became proportional to the number of threads times the number of broadcasts. We have plans to fix it in 0.4, because the performance "fix" was a workaround for a cudf issue (rapidsai/cudf#6052). Until this is fixed, we also hope to make the broadcast data spillable (#836).

ericrife (Author) commented Dec 8, 2020

@GaryShen2008 This works. I had to modify my numbers a bit differently than you did to get the full resources from YARN, but ultimately the correct answer is to make task.cpus equal to executor-cores.

Thanks for the assistance.
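
A sketch of the key relationship that worked here, with illustrative values (the essential point is spark.task.cpus equal to executor-cores, i.e. one task per executor at a time):

--executor-cores 8
--conf spark.task.cpus=8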

sameerz added the P1 (Nice to have for release) label and removed the ? - Needs Triage (Need team to review and classify) label on Dec 8, 2020
sameerz (Collaborator) commented Dec 8, 2020

Let's keep this open until rapidsai/cudf#6052 is resolved and we can remove the existing code that filters nulls, which is inefficient. @GaryShen2008, please reassign this to me once it is resolved with @ericrife so we can track it for 0.4.

chenrui17 commented:

@sameerz should this be closed? I see this improvement was merged in #754.

sameerz (Collaborator) commented Jan 14, 2021

@ericrife would you mind rerunning the query and seeing whether the recent fixes helped?

ericrife (Author) commented:

@sameerz - I have run this again with the 0.3 jar and was able to run it with both task.cpus=1 and task.cpus=exec-cores.

This would indicate that the OOM issue has been resolved in the latest release.

sameerz (Collaborator) commented Feb 1, 2021

Closing based on @ericrife 's comment. Please reopen if we still see the problem.

sameerz closed this as completed on Feb 1, 2021