Encountered column data outside the range of input buffer #1861

Closed

ericrife opened this issue Mar 3, 2021 · 6 comments
Labels
bug Something isn't working

Comments

ericrife commented Mar 3, 2021

While running an ETL job on a customer churn dataset, I get the following error and the job fails.

21/03/03 21:46:00 ERROR Executor: Exception in task 0.3 in stage 6.0 (TID 404)
ai.rapids.cudf.CudfException: cuDF failure at: /ansible-managed/jenkins-slave/slave2/workspace/spark/cudf18_nightly/cpp/src/copying/pack.cpp:113: Encountered column data outside the range of input buffer
at ai.rapids.cudf.Table.contiguousSplit(Native Method)
at ai.rapids.cudf.Table.contiguousSplit(Table.java:1625)
at com.nvidia.spark.rapids.SpillableColumnarBatch$.$anonfun$addBatch$3(SpillableColumnarBatch.scala:167)
at com.nvidia.spark.rapids.SpillableColumnarBatch$.$anonfun$addBatch$3$adapted(SpillableColumnarBatch.scala:166)
at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
at com.nvidia.spark.rapids.SpillableColumnarBatch$.withResource(SpillableColumnarBatch.scala:124)
at com.nvidia.spark.rapids.SpillableColumnarBatch$.$anonfun$addBatch$1(SpillableColumnarBatch.scala:166)
at com.nvidia.spark.rapids.SpillableColumnarBatch$.$anonfun$addBatch$1$adapted(SpillableColumnarBatch.scala:147)
at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
at com.nvidia.spark.rapids.SpillableColumnarBatch$.withResource(SpillableColumnarBatch.scala:124)
at com.nvidia.spark.rapids.SpillableColumnarBatch$.addBatch(SpillableColumnarBatch.scala:147)
at com.nvidia.spark.rapids.SpillableColumnarBatch$.apply(SpillableColumnarBatch.scala:138)
at com.nvidia.spark.rapids.GpuCoalesceIterator.saveOnDeck(GpuCoalesceBatches.scala:556)
at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.$anonfun$hasNext$2(GpuCoalesceBatches.scala:189)
at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.$anonfun$hasNext$2$adapted(GpuCoalesceBatches.scala:184)
at com.nvidia.spark.rapids.Arm.closeOnExcept(Arm.scala:67)
at com.nvidia.spark.rapids.Arm.closeOnExcept$(Arm.scala:65)
at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.closeOnExcept(GpuCoalesceBatches.scala:133)
at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.$anonfun$hasNext$1(GpuCoalesceBatches.scala:184)
at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.$anonfun$hasNext$1$adapted(GpuCoalesceBatches.scala:182)
at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.withResource(GpuCoalesceBatches.scala:133)
at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.hasNext(GpuCoalesceBatches.scala:182)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at com.nvidia.spark.rapids.ColumnarToRowIterator.loadNextBatch(GpuColumnarToRowExec.scala:182)
at com.nvidia.spark.rapids.ColumnarToRowIterator.hasNext(GpuColumnarToRowExec.scala:207)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:132)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:127)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:462)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:465)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Suppressed: java.lang.IllegalStateException: Close called too many times ColumnVector{rows=37480400, type=STRING, nullCount=Optional.empty, offHeap=(ID: 77 0)}
at ai.rapids.cudf.ColumnVector.close(ColumnVector.java:207)
at com.nvidia.spark.rapids.GpuColumnVector.close(GpuColumnVector.java:689)
at org.apache.spark.sql.vectorized.ColumnarBatch.close(ColumnarBatch.java:48)
at com.nvidia.spark.rapids.RapidsPluginImplicits$AutoCloseableColumn.safeClose(implicits.scala:51)
at com.nvidia.spark.rapids.Arm.closeOnExcept(Arm.scala:70)
... 24 more

Describe the bug

This occurs every time I run the job, regardless of the Spark runtime options I use. Here are my runtime options:

SPARK_HOME=/opt/spark/spark-3.0.2-bin-hadoop3.2
PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
SPARK_RAPIDS_DIR=/opt/sparkRapidsPlugin

SPARK_CUDF_JAR=${SPARK_RAPIDS_DIR}/cudf-0.18-20210224.094644-72-cuda11.jar
SPARK_RAPIDS_PLUGIN_JAR=${SPARK_RAPIDS_DIR}/rapids-4-spark_2.12-0.4.0-20210224.032555-90.jar

JARS=${XGBOOST_JAR},${XGBOOST_SPARK_JAR},${SPARK_RAPIDS_PLUGIN_JAR},${SPARK_CUDF_JAR}

export PYTHONPATH=/usr/bin/python3:$PYTHONPATH
export PYSPARK_PYTHON=python3.6

LOG_SECOND=$(date +%s)  # timestamp used to name the log file
LOGFILE="logs/$0.txt.$LOG_SECOND"
mkdir -p logs

MASTER="spark://master.node.ip:7077"

TOTAL_CORES=320
NUM_EXECUTORS=16 # change to fit how many GPUs you have
NUM_EXECUTOR_CORES=$((${TOTAL_CORES}/${NUM_EXECUTORS}))

RESOURCE_GPU_AMT="0.05"

TOTAL_MEMORY=700 # unit: GB
DRIVER_MEMORY=10 # unit: GB
EXECUTOR_MEMORY=$(($((${TOTAL_MEMORY}-$((${DRIVER_MEMORY}*1000/1024))))/${NUM_EXECUTORS}))

CMD_PARAMS="--master $MASTER
--driver-memory ${DRIVER_MEMORY}G
--executor-cores $NUM_EXECUTOR_CORES
--executor-memory ${EXECUTOR_MEMORY}G
--conf spark.cores.max=$TOTAL_CORES
--conf spark.task.cpus=1
--conf spark.task.resource.gpu.amount=$RESOURCE_GPU_AMT
--conf spark.rapids.sql.enabled=True
--conf spark.plugins=com.nvidia.spark.SQLPlugin
--conf spark.rapids.memory.pinnedPool.size=2G
--conf spark.sql.shuffle.partitions=1024
--conf spark.sql.files.maxPartitionBytes=2G
--conf spark.rapids.sql.concurrentGpuTasks=8
--conf spark.executor.resource.gpu.amount=1
--conf spark.sql.adaptive.enabled=True
--conf spark.rapids.sql.variableFloatAgg.enabled=True
--conf spark.rapids.sql.explain=NOT_ON_GPU
--conf spark.executor.extraClassPath=${SPARK_CUDF_JAR}:${SPARK_RAPIDS_PLUGIN_JAR}
--conf spark.driver.extraClassPath=${SPARK_CUDF_JAR}:${SPARK_RAPIDS_PLUGIN_JAR}
--jars $JARS"

This is on a Spark 3.0.2 cluster, with the dataset stored on Hadoop 3.2.1.

I have tried adjusting the partitions all the way down to 48 and dropping maxPartitionBytes to 512M, but have not been able to get past this error. The job runs on a 200 GB dataset.

ericrife added the "? - Needs Triage" and "bug" labels on Mar 3, 2021

abellina (Collaborator) commented Mar 3, 2021

@nvdbaranec FYI. I think this is probably not enough info, and the likely next step is for us to change the plugin to output the metadata or the actual data before contiguous split is invoked.
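
As a rough illustration of the kind of diagnostic suggested here, a sketch along these lines could log per-column metadata right before Table.contiguousSplit is invoked. The helper name and placement are hypothetical, and the cudf Java accessors used below are assumptions based on the 0.18-era API; this is not the actual plugin change:

import ai.rapids.cudf.Table

object ContigSplitDebug {
  // Hypothetical helper: dump basic table/column metadata so the failing
  // batch can be identified before contiguousSplit throws.
  def logTableMetadata(table: Table): Unit = {
    println(s"table rows=${table.getRowCount}, columns=${table.getNumberOfColumns}")
    (0 until table.getNumberOfColumns).foreach { i =>
      val col = table.getColumn(i)
      println(s"  column $i: type=${col.getType}, rows=${col.getRowCount}")
    }
  }
}

Calling something like logTableMetadata(table) just before table.contiguousSplit(...) would capture the shape of the batch that triggers the failure.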

abellina (Collaborator) commented Mar 3, 2021

@ericrife can you confirm that no other exceptions or ERROR messages were reported in the app?

jlowe (Member) commented Mar 4, 2021

With help from @ericrife to capture input data that reproduced the issue, @nvdbaranec tracked this down to an issue with cudf's contiguous_split not handling output partitions exceeding 2GB properly. I filed rapidsai/cudf#7514 to track the issue in cudf.

jlowe (Member) commented Mar 4, 2021

I checked with @ericrife and for his particular case the problem can be avoided by specifying a lower value for spark.sql.files.maxPartitionBytes. Originally it was failing when set to 512m but the query passed when it was lowered to 256m.

The input data is probably well compressed, and trying to load 512m of compressed data in one task ends up with far more than 2GB of uncompressed input data, which can trigger the contiguous_split bug.
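
For reference, a minimal sketch of applying this workaround when the session is built in application code rather than through the spark-submit flags shown earlier; the app name is illustrative, and spark.sql.files.maxPartitionBytes is the only setting actually discussed in this thread:

import org.apache.spark.sql.SparkSession

object MaxPartitionBytesWorkaround {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("churn-etl")  // hypothetical application name
      // Keep each file split small enough that the decompressed batch stays
      // well under the 2GB limit in cudf's contiguous_split.
      .config("spark.sql.files.maxPartitionBytes", "256m")
      .getOrCreate()
    // ... run the ETL job here ...
    spark.stop()
  }
}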

sameerz added this to the Mar 1 - Mar 12 milestone on Mar 4, 2021

rapids-bot pushed a commit to rapidsai/cudf that referenced this issue on Mar 10, 2021
…7515)

Fixes:
#7514

Related:
NVIDIA/spark-rapids#1861

There were a couple of places where 32 bit values were being used for buffer sizes that needed to be 64 bit.

Authors:
  - @nvdbaranec

Approvers:
  - Vukasin Milovanovic (@vuule)
  - Jake Hemstad (@jrhemstad)

URL: #7515
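
To illustrate the class of bug the commit message above describes (a sketch only; the actual fix is in cudf's C++ code, not Scala): a buffer size that fits comfortably in 64 bits wraps negative once it is narrowed to 32 bits, which is what happens when an output partition exceeds 2GB. The average string size below is a made-up number; the row count comes from the suppressed exception in the stack trace:

object SizeOverflowSketch {
  def main(args: Array[String]): Unit = {
    val rows     = 37480400L // rows in the STRING column from the stack trace
    val avgBytes = 64L       // hypothetical average string length in bytes
    val size64   = rows * avgBytes // ~2.4 GB, representable as a 64-bit value
    val size32   = size64.toInt    // narrowing to 32 bits overflows and goes negative
    println(s"64-bit size: $size64 bytes")
    println(s"32-bit size: $size32 bytes")
  }
}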

sameerz (Collaborator) commented Mar 11, 2021

This is fixed for legacy shuffle in 0.4 by disabling coalesce_batch, and it is mitigated by using a smaller setting for spark.sql.files.maxPartitionBytes. It could still occur with UCX shuffle in 0.4. The right fix will be in 0.5 with RAPIDS 0.19 and the fix for rapidsai/cudf#7515. Moving this to 0.5 to ensure the long-term fix is tracked there.

sameerz removed this from the Mar 1 - Mar 12 milestone on Mar 11, 2021

jlowe (Member) commented Mar 11, 2021

The cudf change has been merged, so this is fixed.

jlowe closed this as completed on Mar 11, 2021
hyperbolic2346 pushed a commit to hyperbolic2346/cudf that referenced this issue on Mar 25, 2021