Encountered column data outside the range of input buffer #1861

Closed

ericrife opened this issue Mar 3, 2021 · 6 comments
Labels
bug Something isn't working

Comments

ericrife commented Mar 3, 2021

While running an ETL job on a customer churn dataset, I get the following error and the job fails.

21/03/03 21:46:00 ERROR Executor: Exception in task 0.3 in stage 6.0 (TID 404)
ai.rapids.cudf.CudfException: cuDF failure at: /ansible-managed/jenkins-slave/slave2/workspace/spark/cudf18_nightly/cpp/src/copying/pack.cpp:113: Encountered column data outside the range of input buffer
at ai.rapids.cudf.Table.contiguousSplit(Native Method)
at ai.rapids.cudf.Table.contiguousSplit(Table.java:1625)
at com.nvidia.spark.rapids.SpillableColumnarBatch$.$anonfun$addBatch$3(SpillableColumnarBatch.scala:167)
at com.nvidia.spark.rapids.SpillableColumnarBatch$.$anonfun$addBatch$3$adapted(SpillableColumnarBatch.scala:166)
at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
at com.nvidia.spark.rapids.SpillableColumnarBatch$.withResource(SpillableColumnarBatch.scala:124)
at com.nvidia.spark.rapids.SpillableColumnarBatch$.$anonfun$addBatch$1(SpillableColumnarBatch.scala:166)
at com.nvidia.spark.rapids.SpillableColumnarBatch$.$anonfun$addBatch$1$adapted(SpillableColumnarBatch.scala:147)
at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
at com.nvidia.spark.rapids.SpillableColumnarBatch$.withResource(SpillableColumnarBatch.scala:124)
at com.nvidia.spark.rapids.SpillableColumnarBatch$.addBatch(SpillableColumnarBatch.scala:147)
at com.nvidia.spark.rapids.SpillableColumnarBatch$.apply(SpillableColumnarBatch.scala:138)
at com.nvidia.spark.rapids.GpuCoalesceIterator.saveOnDeck(GpuCoalesceBatches.scala:556)
at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.$anonfun$hasNext$2(GpuCoalesceBatches.scala:189)
at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.$anonfun$hasNext$2$adapted(GpuCoalesceBatches.scala:184)
at com.nvidia.spark.rapids.Arm.closeOnExcept(Arm.scala:67)
at com.nvidia.spark.rapids.Arm.closeOnExcept$(Arm.scala:65)
at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.closeOnExcept(GpuCoalesceBatches.scala:133)
at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.$anonfun$hasNext$1(GpuCoalesceBatches.scala:184)
at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.$anonfun$hasNext$1$adapted(GpuCoalesceBatches.scala:182)
at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.withResource(GpuCoalesceBatches.scala:133)
at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.hasNext(GpuCoalesceBatches.scala:182)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at com.nvidia.spark.rapids.ColumnarToRowIterator.loadNextBatch(GpuColumnarToRowExec.scala:182)
at com.nvidia.spark.rapids.ColumnarToRowIterator.hasNext(GpuColumnarToRowExec.scala:207)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:132)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:127)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:462)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:465)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Suppressed: java.lang.IllegalStateException: Close called too many times ColumnVector{rows=37480400, type=STRING, nullCount=Optional.empty, offHeap=(ID: 77 0)}
at ai.rapids.cudf.ColumnVector.close(ColumnVector.java:207)
at com.nvidia.spark.rapids.GpuColumnVector.close(GpuColumnVector.java:689)
at org.apache.spark.sql.vectorized.ColumnarBatch.close(ColumnarBatch.java:48)
at com.nvidia.spark.rapids.RapidsPluginImplicits$AutoCloseableColumn.safeClose(implicits.scala:51)
at com.nvidia.spark.rapids.Arm.closeOnExcept(Arm.scala:70)
... 24 more

Describe the bug

This occurs every time I run the job, regardless of the Spark runtime options I use. Here are my runtime options:

SPARK_HOME=/opt/spark/spark-3.0.2-bin-hadoop3.2
PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
SPARK_RAPIDS_DIR=/opt/sparkRapidsPlugin

SPARK_CUDF_JAR=${SPARK_RAPIDS_DIR}/cudf-0.18-20210224.094644-72-cuda11.jar
SPARK_RAPIDS_PLUGIN_JAR=${SPARK_RAPIDS_DIR}/rapids-4-spark_2.12-0.4.0-20210224.032555-90.jar

JARS=${XGBOOST_JAR},${XGBOOST_SPARK_JAR},${SPARK_RAPIDS_PLUGIN_JAR},${SPARK_CUDF_JAR}

export PYTHONPATH=/usr/bin/python3:$PYTHONPATH
export PYSPARK_PYTHON=python3.6

LOG_SECOND=$(date +%s)  # timestamp used to name the log file
LOGFILE="logs/$0.txt.$LOG_SECOND"
mkdir -p logs

MASTER="spark://master.node.ip:7077"

TOTAL_CORES=320
NUM_EXECUTORS=16 # change to fit how many GPUs you have
NUM_EXECUTOR_CORES=$((${TOTAL_CORES}/${NUM_EXECUTORS}))

RESOURCE_GPU_AMT="0.05"

TOTAL_MEMORY=700 # unit: GB
DRIVER_MEMORY=10 # unit: GB
EXECUTOR_MEMORY=$(($((${TOTAL_MEMORY}-$((${DRIVER_MEMORY}*1000/1024))))/${NUM_EXECUTORS}))

CMD_PARAMS="--master $MASTER
--driver-memory ${DRIVER_MEMORY}G
--executor-cores $NUM_EXECUTOR_CORES
--executor-memory ${EXECUTOR_MEMORY}G
--conf spark.cores.max=$TOTAL_CORES
--conf spark.task.cpus=1
--conf spark.task.resource.gpu.amount=$RESOURCE_GPU_AMT
--conf spark.rapids.sql.enabled=True
--conf spark.plugins=com.nvidia.spark.SQLPlugin
--conf spark.rapids.memory.pinnedPool.size=2G
--conf spark.sql.shuffle.partitions=1024
--conf spark.sql.files.maxPartitionBytes=2G
--conf spark.rapids.sql.concurrentGpuTasks=8
--conf spark.executor.resource.gpu.amount=1
--conf spark.sql.adaptive.enabled=True
--conf spark.rapids.sql.variableFloatAgg.enabled=True
--conf spark.rapids.sql.explain=NOT_ON_GPU
--conf spark.executor.extraClassPath=${SPARK_CUDF_JAR}:${SPARK_RAPIDS_PLUGIN_JAR}
--conf spark.driver.extraClassPath=${SPARK_CUDF_JAR}:${SPARK_RAPIDS_PLUGIN_JAR}
--jars $JARS"

This is on a Spark 3.0.2 cluster, with the dataset stored on Hadoop 3.2.1.

I have tried adjusting the partitions all the way down to 48 and dropping maxPartitionBytes to 512M, but have not been able to get past this error. The job runs on a 200 GB dataset.

ericrife added the "? - Needs Triage" and "bug" labels on Mar 3, 2021

abellina (Collaborator) commented Mar 3, 2021

@nvdbaranec FYI. I think this is probably not enough info, and the likely next step is for us to change the plugin to output the metadata or the actual data before contiguous split is invoked.
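
As a rough illustration of the kind of diagnostic suggested here, a sketch along these lines could log per-column metadata right before Table.contiguousSplit is invoked. The helper name and placement are hypothetical, and the cudf Java accessors used below are assumptions based on the 0.18-era API; this is not the actual plugin change:

import ai.rapids.cudf.Table

object ContigSplitDebug {
  // Hypothetical helper: dump basic table/column metadata so the failing
  // batch can be identified before contiguousSplit throws.
  def logTableMetadata(table: Table): Unit = {
    println(s"table rows=${table.getRowCount}, columns=${table.getNumberOfColumns}")
    (0 until table.getNumberOfColumns).foreach { i =>
      val col = table.getColumn(i)
      println(s"  column $i: type=${col.getType}, rows=${col.getRowCount}")
    }
  }
}

Calling something like logTableMetadata(table) just before table.contiguousSplit(...) would capture the shape of the batch that triggers the failure.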

abellina (Collaborator) commented Mar 3, 2021

@ericrife can you confirm that no other exceptions or ERROR messages were reported in the app?

jlowe (Member) commented Mar 4, 2021

With help from @ericrife to capture input data that reproduced the issue, @nvdbaranec tracked this down to an issue with cudf's contiguous_split not handling output partitions exceeding 2GB properly. I filed rapidsai/cudf#7514 to track the issue in cudf.

jlowe (Member) commented Mar 4, 2021

I checked with @ericrife and for his particular case the problem can be avoided by specifying a lower value for spark.sql.files.maxPartitionBytes. Originally it was failing when set to 512m but the query passed when it was lowered to 256m.

The input data is probably well compressed, and trying to load 512m of compressed data in one task ends up with far more than 2GB of uncompressed input data, which can trigger the contiguous_split bug.
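
For reference, a minimal sketch of applying this workaround when the session is built in application code rather than through the spark-submit flags shown earlier; the app name is illustrative, and spark.sql.files.maxPartitionBytes is the only setting actually discussed in this thread:

import org.apache.spark.sql.SparkSession

object MaxPartitionBytesWorkaround {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("churn-etl")  // hypothetical application name
      // Keep each file split small enough that the decompressed batch stays
      // well under the 2GB limit in cudf's contiguous_split.
      .config("spark.sql.files.maxPartitionBytes", "256m")
      .getOrCreate()
    // ... run the ETL job here ...
    spark.stop()
  }
}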

sameerz added this to the Mar 1 - Mar 12 milestone on Mar 4, 2021

rapids-bot pushed a commit to rapidsai/cudf that referenced this issue on Mar 10, 2021
…7515)

Fixes:
#7514

Related:
NVIDIA/spark-rapids#1861

There were a couple of places where 32 bit values were being used for buffer sizes that needed to be 64 bit.

Authors:
  - @nvdbaranec

Approvers:
  - Vukasin Milovanovic (@vuule)
  - Jake Hemstad (@jrhemstad)

URL: #7515
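
To illustrate the class of bug the commit message above describes (a sketch only; the actual fix is in cudf's C++ code, not Scala): a buffer size that fits comfortably in 64 bits wraps negative once it is narrowed to 32 bits, which is what happens when an output partition exceeds 2GB. The average string size below is a made-up number; the row count comes from the suppressed exception in the stack trace:

object SizeOverflowSketch {
  def main(args: Array[String]): Unit = {
    val rows     = 37480400L // rows in the STRING column from the stack trace
    val avgBytes = 64L       // hypothetical average string length in bytes
    val size64   = rows * avgBytes // ~2.4 GB, representable as a 64-bit value
    val size32   = size64.toInt    // narrowing to 32 bits overflows and goes negative
    println(s"64-bit size: $size64 bytes")
    println(s"32-bit size: $size32 bytes")
  }
}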

sameerz (Collaborator) commented Mar 11, 2021

This is fixed for legacy shuffle in 0.4 by disabling coalesce_batch, and it is mitigated by using a smaller setting for spark.sql.files.maxPartitionBytes. It could still occur with UCX shuffle in 0.4. The right fix will be in 0.5 with RAPIDS 0.19 and the fix for rapidsai/cudf#7515. Moving this to 0.5 to ensure the long-term fix is tracked there.

sameerz removed this from the Mar 1 - Mar 12 milestone on Mar 11, 2021

jlowe (Member) commented Mar 11, 2021

The cudf change has been merged, so this is fixed.

jlowe closed this as completed on Mar 11, 2021
hyperbolic2346 pushed a commit to hyperbolic2346/cudf that referenced this issue on Mar 25, 2021