Encountered column data outside the range of input buffer #1861
@nvdbaranec fyi. I think this is probably not enough info, and the likely next step is for us to change the plugin to output metadata or the actual data before contig split is invoked.
@ericrife can you confirm that no other Exceptions/ERRORs were reported in the app?
With help from @ericrife to capture input data that reproduced the issue, @nvdbaranec tracked this down to an issue with cudf's contiguous_split.
I checked with @ericrife and for his particular case the problem can be avoided by specifying a lower value for spark.sql.files.maxPartitionBytes. The input data is probably well compressed, and trying to load 512m of compressed data in one task ends up with far more than 2GB of uncompressed input data, which can trigger the problem.
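For reference, a minimal PySpark sketch of that workaround (the job in this issue is submitted via PySpark); the app name and the 128m value, which is Spark's default, are placeholders rather than values taken from this issue:

from pyspark.sql import SparkSession

# Sketch only: lower the per-task input target so a single task reads less
# compressed data. The submit script below sets it to 2G, and the reporter
# later tried 512M, which can still expand well past 2GB once decompressed.
spark = (SparkSession.builder
         .appName("churn-etl")  # placeholder name
         .config("spark.sql.files.maxPartitionBytes", "128m")
         .getOrCreate())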
…7515) Fixes: #7514. Related: NVIDIA/spark-rapids#1861. There were a couple of places where 32-bit values were being used for buffer sizes that needed to be 64-bit. Authors: @nvdbaranec. Approvers: Vukasin Milovanovic (@vuule), Jake Hemstad (@jrhemstad). URL: #7515
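As a rough illustration of that root cause (a hypothetical Python sketch, not cudf code): once the uncompressed column data handled by one task grows past 2GiB, its size no longer fits in a signed 32-bit integer and wraps to a bogus value, which lines up with the out-of-range buffer error reported above.

import ctypes

# Hypothetical size: ~3 GiB of uncompressed column data handled by one task.
uncompressed_bytes = 3 * 1024**3

print(uncompressed_bytes > 2**31 - 1)            # True: exceeds a signed 32-bit size field
print(ctypes.c_int32(uncompressed_bytes).value)  # -1073741824: the wrapped, meaningless value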
This is fixed for legacy shuffle in 0.4 by disabling coalesce_batch. It is mitigated by using a smaller setting for spark.sql.files.maxPartitionBytes.
The cudf change has been merged, so this is fixed.
…apidsai#7515) Fixes: rapidsai#7514 Related: NVIDIA/spark-rapids#1861 There were a couple of places where 32 bit values were being used for buffer sizes that needed to be 64 bit. Authors: - @nvdbaranec Approvers: - Vukasin Milovanovic (@vuule) - Jake Hemstad (@jrhemstad) URL: rapidsai#7515
Describe the bug
While running an ETL job on a customer churn dataset, I get the following error and the job fails.
21/03/03 21:46:00 ERROR Executor: Exception in task 0.3 in stage 6.0 (TID 404)
ai.rapids.cudf.CudfException: cuDF failure at: /ansible-managed/jenkins-slave/slave2/workspace/spark/cudf18_nightly/cpp/src/copying/pack.cpp:113: Encountered column data outside the range of input buffer
at ai.rapids.cudf.Table.contiguousSplit(Native Method)
at ai.rapids.cudf.Table.contiguousSplit(Table.java:1625)
at com.nvidia.spark.rapids.SpillableColumnarBatch$.$anonfun$addBatch$3(SpillableColumnarBatch.scala:167)
at com.nvidia.spark.rapids.SpillableColumnarBatch$.$anonfun$addBatch$3$adapted(SpillableColumnarBatch.scala:166)
at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
at com.nvidia.spark.rapids.SpillableColumnarBatch$.withResource(SpillableColumnarBatch.scala:124)
at com.nvidia.spark.rapids.SpillableColumnarBatch$.$anonfun$addBatch$1(SpillableColumnarBatch.scala:166)
at com.nvidia.spark.rapids.SpillableColumnarBatch$.$anonfun$addBatch$1$adapted(SpillableColumnarBatch.scala:147)
at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
at com.nvidia.spark.rapids.SpillableColumnarBatch$.withResource(SpillableColumnarBatch.scala:124)
at com.nvidia.spark.rapids.SpillableColumnarBatch$.addBatch(SpillableColumnarBatch.scala:147)
at com.nvidia.spark.rapids.SpillableColumnarBatch$.apply(SpillableColumnarBatch.scala:138)
at com.nvidia.spark.rapids.GpuCoalesceIterator.saveOnDeck(GpuCoalesceBatches.scala:556)
at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.$anonfun$hasNext$2(GpuCoalesceBatches.scala:189)
at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.$anonfun$hasNext$2$adapted(GpuCoalesceBatches.scala:184)
at com.nvidia.spark.rapids.Arm.closeOnExcept(Arm.scala:67)
at com.nvidia.spark.rapids.Arm.closeOnExcept$(Arm.scala:65)
at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.closeOnExcept(GpuCoalesceBatches.scala:133)
at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.$anonfun$hasNext$1(GpuCoalesceBatches.scala:184)
at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.$anonfun$hasNext$1$adapted(GpuCoalesceBatches.scala:182)
at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.withResource(GpuCoalesceBatches.scala:133)
at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.hasNext(GpuCoalesceBatches.scala:182)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at com.nvidia.spark.rapids.ColumnarToRowIterator.loadNextBatch(GpuColumnarToRowExec.scala:182)
at com.nvidia.spark.rapids.ColumnarToRowIterator.hasNext(GpuColumnarToRowExec.scala:207)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:132)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:127)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:462)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:465)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Suppressed: java.lang.IllegalStateException: Close called too many times ColumnVector{rows=37480400, type=STRING, nullCount=Optional.empty, offHeap=(ID: 77 0)}
at ai.rapids.cudf.ColumnVector.close(ColumnVector.java:207)
at com.nvidia.spark.rapids.GpuColumnVector.close(GpuColumnVector.java:689)
at org.apache.spark.sql.vectorized.ColumnarBatch.close(ColumnarBatch.java:48)
at com.nvidia.spark.rapids.RapidsPluginImplicits$AutoCloseableColumn.safeClose(implicits.scala:51)
at com.nvidia.spark.rapids.Arm.closeOnExcept(Arm.scala:70)
... 24 more
This occurs every time I run the job, regardless of different Spark runtime options. Here are my runtime options:
SPARK_HOME=/opt/spark/spark-3.0.2-bin-hadoop3.2
PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
SPARK_RAPIDS_DIR=/opt/sparkRapidsPlugin
SPARK_CUDF_JAR=${SPARK_RAPIDS_DIR}/cudf-0.18-20210224.094644-72-cuda11.jar
SPARK_RAPIDS_PLUGIN_JAR=${SPARK_RAPIDS_DIR}/rapids-4-spark_2.12-0.4.0-20210224.032555-90.jar
JARS=${XGBOOST_JAR},${XGBOOST_SPARK_JAR},${SPARK_RAPIDS_PLUGIN_JAR},${SPARK_CUDF_JAR}
export PYTHONPATH=/usr/bin/python3:$PYTHONPATH
export PYSPARK_PYTHON=python3.6
LOG_SECOND=$(date +%s)
LOGFILE="logs/$0.txt.$LOG_SECOND"
mkdir -p logs
MASTER="spark://master.node.ip:7077"
TOTAL_CORES=320
NUM_EXECUTORS=16 # change to fit how many GPUs you have
NUM_EXECUTOR_CORES=$((${TOTAL_CORES}/${NUM_EXECUTORS}))
RESOURCE_GPU_AMT="0.05"
TOTAL_MEMORY=700 # unit: GB
DRIVER_MEMORY=10 # unit: GB
EXECUTOR_MEMORY=$(($((${TOTAL_MEMORY}-$((${DRIVER_MEMORY}*1000/1024))))/${NUM_EXECUTORS}))
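# (EXECUTOR_MEMORY above: total cluster memory minus the driver heap, split evenly across the executors, in GB)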
CMD_PARAMS="--master $MASTER
--driver-memory ${DRIVER_MEMORY}G
--executor-cores $NUM_EXECUTOR_CORES
--executor-memory ${EXECUTOR_MEMORY}G
--conf spark.cores.max=$TOTAL_CORES
--conf spark.task.cpus=1
--conf spark.task.resource.gpu.amount=$RESOURCE_GPU_AMT
--conf spark.rapids.sql.enabled=True
--conf spark.plugins=com.nvidia.spark.SQLPlugin
--conf spark.rapids.memory.pinnedPool.size=2G
--conf spark.sql.shuffle.partitions=1024
--conf spark.sql.files.maxPartitionBytes=2G
--conf spark.rapids.sql.concurrentGpuTasks=8
--conf spark.executor.resource.gpu.amount=1
--conf spark.sql.adaptive.enabled=True
--conf spark.rapids.sql.variableFloatAgg.enabled=True
--conf spark.rapids.sql.explain=NOT_ON_GPU
--conf spark.executor.extraClassPath=${SPARK_CUDF_JAR}:${SPARK_RAPIDS_PLUGIN_JAR}
--conf spark.driver.extraClassPath=${SPARK_CUDF_JAR}:${SPARK_RAPIDS_PLUGIN_JAR}
--jars $JARS"
This is on a Spark 3.0.2 cluster, with the dataset stored on Hadoop 3.2.1.
I have tried adjusting the partitions all the way down to 48 and dropping maxPartitionBytes down to 512M, but have not been able to get past this error. This is being run on a 200G dataset.
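Given the compression reasoning in the comments above, a rough, purely illustrative calculation of why 512M may still not have been small enough; the 5x compression ratio below is an assumption for illustration, not a measurement from this dataset:

# Assumed values for illustration only.
compressed_bytes = 512 * 1024**2   # maxPartitionBytes = 512M
assumed_ratio = 5                  # hypothetical compression ratio

uncompressed_bytes = compressed_bytes * assumed_ratio
print(uncompressed_bytes)              # 2684354560 (~2.5 GiB)
print(uncompressed_bytes > 2**31 - 1)  # True: still past the signed 32-bit limit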