[BUG] src/include/rmm/mr/device/limiting_resource_adaptor.hpp:143: Exceeded memory limit at ai.rapids.cudf.Table.concatenate(Native Method) #8021
Comments
When discussing this, we were a little confused about whether it fails randomly (as if GPU memory is near its limit, so it sometimes works and sometimes fails) or whether it looks more like a memory leak, where running X times always works but run X+1 always crashes. We are working on a way to mitigate situations like this: #7778. The goal is to have it in the 23.06 release. If you want to test it sooner, I can see if we can come up with a version you could try out.
I would appreciate it if you could provide me something to test sooner.
Regarding "if this fails randomly like the GPU memory is near the limit on what it can support and sometimes it works, while other times it fails, or if this looks more like a memory leak where running X times always works, but X+1 times crashes": in my case it always crashes at X+1. When I have 1.2 million rows in my dataset, everything works fine, but when I increase the data it crashes with this error; it also crashes at 1.5 million and 2 million rows, and at any size in between. I cannot say whether it is related to a memory leak, but from what I have observed the error occurs once the data exceeds a certain limit. PS: I would appreciate a jar prior to the release so I can test whether it works with our data.
After debugging and analysis, I found that this statement in my code: df = df.withColumn(self.output_col_name, concat_ws(col_sep, array(self.input_col_name_list))) was causing the error on larger data on GPUs. I think there is some bug in the GPU implementation of the concat function that needs to be addressed.
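For context, here is a minimal PySpark sketch of the pattern described in that comment. The column names, separator, output column, and sample data are placeholders for illustration, not values taken from the reporter's code:

from pyspark.sql import SparkSession
from pyspark.sql.functions import concat_ws, array

spark = SparkSession.builder.appName("concat-ws-repro-sketch").getOrCreate()

# Placeholder data; the real job reportedly started failing above roughly 1.2 million rows.
df = spark.createDataFrame([("a", "b", "c")] * 1000, ["col1", "col2", "col3"])

input_col_name_list = ["col1", "col2", "col3"]  # stands in for self.input_col_name_list
col_sep = "|"                                   # stands in for col_sep
output_col_name = "combined"                    # stands in for self.output_col_name

# The statement identified above as the trigger on larger data:
df = df.withColumn(output_col_name, concat_ws(col_sep, array(input_col_name_list)))
df.show(5)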
@mtsol thanks for the updated info. I am guessing that you just removed that line from your query, and because of that it dropped the total memory pressure at that point in time and for the data processed after it.
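A hedged sketch of the two mitigations implied by the exchange above, continuing the placeholder names from the previous sketch. Neither is confirmed in this thread as a fix; they are only assumptions about how one might lower memory pressure at that point in the plan:

# Option 1 (the guess above): drop the derived column entirely, removing
# the memory pressure it added at that point and for downstream processing.
# df = df.withColumn(output_col_name, concat_ws(col_sep, array(input_col_name_list)))

# Option 2 (an untested assumption): pass the columns straight to concat_ws,
# skipping the intermediate array column.
df = df.withColumn(output_col_name, concat_ws(col_sep, *input_col_name_list))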
@mtsol I have a snapshot jar that you can try: https://drive.google.com/file/d/15RyaI5OyeSJNEj5G-W4MnN8JeyQPq4ff/view?usp=sharing Be aware that there are some known bugs in it, specifically #8147, which is caused by rapidsai/cudf#13173. It should go without saying, but don't use this in production, and avoid the substring command if you can. If you want a better version, I can upload another one once the issue is fixed.
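For anyone trying the snapshot locally, here is a minimal sketch of one way to point a test session at a plugin jar. The jar path and app name are placeholders, and this thread does not show how the jar was actually deployed (the reproduce command below uses a prebuilt container image), so treat this only as an assumption:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("rapids-snapshot-test")                                 # placeholder app name
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")           # same plugin class as in the reproduce command
    .config("spark.jars", "/opt/jars/rapids-4-spark-SNAPSHOT.jar")   # hypothetical local path to the snapshot jar
    .getOrCreate()
)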
Describe the bug
This exception occurs after a certain number of executions:
2023-04-04 08:08:52 WARN DAGScheduler:69 - Broadcasting large task binary with size 1811.4 KiB
2023-04-04 08:11:05 WARN TaskSetManager:69 - Lost task 0.0 in stage 443.0 (TID 2841) (10.84.179.52 executor 2): java.lang.OutOfMemoryError: Could not allocate native memory: std::bad_alloc: out_of_memory: RMM failure at:/home/jenkins/agent/workspace/jenkins-spark-rapids-jni-release-1-cuda11/thirdparty/cudf/cpp/build/_deps/rmm-src/include/rmm/mr/device/limiting_resource_adaptor.hpp:143: Exceeded memory limit
at ai.rapids.cudf.Table.concatenate(Native Method)
at ai.rapids.cudf.Table.concatenate(Table.java:1635)
at com.nvidia.spark.rapids.GpuKeyBatchingIterator.$anonfun$concatPending$2(GpuKeyBatchingIterator.scala:138)
at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:64)
at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:62)
at com.nvidia.spark.rapids.GpuKeyBatchingIterator.withResource(GpuKeyBatchingIterator.scala:34)
at com.nvidia.spark.rapids.GpuKeyBatchingIterator.$anonfun$concatPending$1(GpuKeyBatchingIterator.scala:123)
at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
at com.nvidia.spark.rapids.GpuKeyBatchingIterator.withResource(GpuKeyBatchingIterator.scala:34)
at com.nvidia.spark.rapids.GpuKeyBatchingIterator.concatPending(GpuKeyBatchingIterator.scala:122)
at com.nvidia.spark.rapids.GpuKeyBatchingIterator.$anonfun$next$3(GpuKeyBatchingIterator.scala:166)
at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
at com.nvidia.spark.rapids.GpuKeyBatchingIterator.withResource(GpuKeyBatchingIterator.scala:34)
at com.nvidia.spark.rapids.GpuKeyBatchingIterator.$anonfun$next$2(GpuKeyBatchingIterator.scala:165)
at com.nvidia.spark.rapids.GpuKeyBatchingIterator.$anonfun$next$2$adapted(GpuKeyBatchingIterator.scala:162)
at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
Steps/Code to reproduce bug
/u/bin/spark-3.1.1-bin-hadoop2.7/bin/spark-submit \
  --master k8s://https:/k8s-master:6443 \
  --deploy-mode cluster \
  --name app-name \
  --conf spark.local.dir=/y/mcpdata \
  --conf spark.kubernetes.executor.request.cores=1 \
  --conf spark.executor.cores=1 \
  --conf spark.executor.instances=2 \
  --conf spark.executor.memory=120G \
  --conf spark.scheduler.mode=FAIR \
  --conf spark.scheduler.allocation.file=/opt/spark/bin/fair_example.xml \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.executor.heartbeatInterval=3600s \
  --conf spark.network.timeout=36000s \
  --conf spark.sql.broadcastTimeout=36000 \
  --conf spark.driver.memory=70G \
  --conf spark.kubernetes.namespace=default \
  --conf spark.driver.maxResultSize=50g \
  --conf spark.kubernetes.container.image.pullPolicy=Always \
  --conf spark.pyspark.driver.python=python3.8 \
  --conf spark.pyspark.python=python3.8 \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.container.image=repo/app:tag \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.fe-logs-volume.mount.path=/y/mcpdata \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.fe-pipeline-stages-volume.mount.path=/u/bin/pipeline_stages \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.fe-visualizations-volume.mount.path=/u/bin/evaluation_visualizations \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.fe-visualizations-volume.options.claimName=fe-visualizations-volume \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.fe-pipeline-stages-volume.options.claimName=fe-pipeline-stages-volume \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.fe-logs-volume.options.claimName=fe-logs-volume \
  --conf spark.kubernetes.driver.label.driver=driver \
  --conf spark.kubernetes.spec.driver.dnsConfig=default-subdomain \
  --conf spark.kubernetes.driverEnv.ENV_SERVER=QA \
  --conf spark.executorEnv.ENV_SERVER=QA \
  --conf spark.sql.adaptive.enabled=false \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.kubernetes.executor.podTemplateFile=/tmp/templates/gpu-template.yaml \
  --conf spark.executor.resource.gpu.amount=1 \
  --conf spark.task.resource.gpu.amount=1 \
  --conf spark.rapids.sql.rowBasedUDF.enabled=true \
  --conf spark.rapids.sql.concurrentGpuTasks=1 \
  --conf spark.executor.resource.gpu.vendor=nvidia.com \
  --conf spark.rapids.memory.gpu.oomDumpDir=/y/mcpdata \
  --conf spark.rapids.memory.pinnedPool.size=50g \
  --conf spark.executor.memoryOverhead=25g \
  --conf spark.rapids.sql.batchSizeBytes=32m \
  --conf spark.executor.resource.gpu.discoveryScript=/getGpusResources.sh \
  --conf spark.rapids.sql.explain=ALL \
  --conf spark.rapids.memory.host.spillStorageSize=20g \
  --conf spark.sql.shuffle.partitions=50 \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=/y/mcpdata/ \
  --conf spark.kubernetes.driver.volumes.hostPath.spark-local-dir-1.mount.path=/u/spark-tmp \
  --conf spark.kubernetes.driver.volumes.hostPath.spark-local-dir-1.mount.readOnly=false \
  --conf spark.kubernetes.driver.volumes.hostPath.spark-local-dir-1.options.path=/u/spark-tmp \
  --conf spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.mount.path=/u/spark-tmp \
  --conf spark.sql.cache.serializer=com.nvidia.spark.ParquetCachedBatchSerializer \
  --conf spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.mount.readOnly=false \
  --conf spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.options.path=/u/spark-tmp \
  local:///code.py &
Expected behavior
The job should complete without any cudf.Table running out of GPU memory.
Environment details (please complete the following information)