[BUG] nvcomp usage for the UCX mode of the shuffle manager is broken #7850

abellina · 2023-03-06T22:54:29Z

While testing the coalesce code for compressed batches using LZ4, I got the following exception. I thought it was an issue with a branch I am working on, but this was on 23.04 without my change:

ai.rapids.cudf.nvcomp.NvcompException: nvcomp decompress output size mismatch
	at ai.rapids.cudf.nvcomp.NvcompJni.batchedLZ4DecompressAsync(Native Method)
	at ai.rapids.cudf.nvcomp.BatchedLZ4Decompressor.decompressAsync(BatchedLZ4Decompressor.java:79)
	at com.nvidia.spark.rapids.BatchedNvcompLZ4Decompressor.decompressAsync(NvcompLZ4CompressionCodec.scala:94)
	at com.nvidia.spark.rapids.BatchedBufferDecompressor.$anonfun$decompressBatch$1(TableCompressionCodec.scala:323)
	at com.nvidia.spark.rapids.BatchedBufferDecompressor.$anonfun$decompressBatch$1$adapted(TableCompressionCodec.scala:321)
	at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
	at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
	at com.nvidia.spark.rapids.BatchedBufferDecompressor.withResource(TableCompressionCodec.scala:258)
	at com.nvidia.spark.rapids.BatchedBufferDecompressor.decompressBatch(TableCompressionCodec.scala:321)
	at com.nvidia.spark.rapids.BatchedBufferDecompressor.finishAsync(TableCompressionCodec.scala:305)
	at com.nvidia.spark.rapids.GpuCompressionAwareCoalesceIterator.$anonfun$popAll$8(GpuCoalesceBatches.scala:639)
	at com.nvidia.spark.rapids.GpuCompressionAwareCoalesceIterator.$anonfun$popAll$8$adapted(GpuCoalesceBatches.scala:631)
	at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
	at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
	at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.withResource(GpuCoalesceBatches.scala:237)
	at com.nvidia.spark.rapids.GpuCompressionAwareCoalesceIterator.$anonfun$popAll$4(GpuCoalesceBatches.scala:631)
	at com.nvidia.spark.rapids.Arm.closeOnExcept(Arm.scala:109)
	at com.nvidia.spark.rapids.Arm.closeOnExcept$(Arm.scala:107)
	at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.closeOnExcept(GpuCoalesceBatches.scala:237)
	at com.nvidia.spark.rapids.GpuCompressionAwareCoalesceIterator.popAll(GpuCoalesceBatches.scala:615)
	at com.nvidia.spark.rapids.GpuCoalesceIterator.concatAllAndPutOnGPU(GpuCoalesceBatches.scala:543)
	at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.$anonfun$next$6(GpuCoalesceBatches.scala:478)

I also see this:

23/03/06 22:49:51 WARN TaskSetManager: Lost task 13.0 in stage 1.0 (TID 18) (127.0.0.1 executor 0): ai.rapids.cudf.CudaException: CUDA error encountered at: /home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-dev-363-cuda11/thirdparty/cudf/java/src/main/native/src/CudaJni.cpp:377: 1 cudaErrorInvalidValue invalid argument
	at ai.rapids.cudf.Cuda.asyncMemcpyOnStream(Native Method)
	at ai.rapids.cudf.Cuda.asyncMemcpy(Cuda.java:529)
	at ai.rapids.cudf.Cuda.multiBufferCopyAsync(Cuda.java:582)
	at ai.rapids.cudf.nvcomp.BatchedLZ4Decompressor.fetchMetadata(BatchedLZ4Decompressor.java:192)
	at ai.rapids.cudf.nvcomp.BatchedLZ4Decompressor.buildAddrsSizesBuffer(BatchedLZ4Decompressor.java:121)
	at ai.rapids.cudf.nvcomp.BatchedLZ4Decompressor.decompressAsync(BatchedLZ4Decompressor.java:67)
	at com.nvidia.spark.rapids.BatchedNvcompLZ4Decompressor.decompressAsync(NvcompLZ4CompressionCodec.scala:94)
	at com.nvidia.spark.rapids.BatchedBufferDecompressor.$anonfun$decompressBatch$1(TableCompressionCodec.scala:323)
	at com.nvidia.spark.rapids.BatchedBufferDecompressor.$anonfun$decompressBatch$1$adapted(TableCompressionCodec.scala:321)
	at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
	at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
	at com.nvidia.spark.rapids.BatchedBufferDecompressor.withResource(TableCompressionCodec.scala:258)
	at com.nvidia.spark.rapids.BatchedBufferDecompressor.decompressBatch(TableCompressionCodec.scala:321)
	at com.nvidia.spark.rapids.BatchedBufferDecompressor.finishAsync(TableCompressionCodec.scala:305)
	at com.nvidia.spark.rapids.GpuCompressionAwareCoalesceIterator.$anonfun$popAll$8(GpuCoalesceBatches.scala:639)
	at com.nvidia.spark.rapids.GpuCompressionAwareCoalesceIterator.$anonfun$popAll$8$adapted(GpuCoalesceBatches.scala:631)
	at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
	at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)

abellina · 2023-03-10T20:19:34Z

I can confirm that if I compress/decompress in the same stack, I am not seeing an issue. I've done this with LZ4 and with the copy codec. The problem I am seeing is after buffers are cached in the spill framework, so this is a spill framework issue with compressed vectors as far as I can tell.
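To illustrate the distinction drawn above, here is a minimal Python sketch, not the plugin's code. It assumes `zlib` as a stand-in for nvcomp LZ4, reads "same stack" as "same call stack", and uses a hypothetical `CachedBuffer` class in place of the spill framework. It shows one way a cached buffer can fail while a same-stack round trip succeeds: stale "compressed" metadata leads to a second decompression attempt, which is the kind of problem the title of the linked fix (#7926) describes.

```python
import zlib


def round_trip(payload: bytes) -> bytes:
    """Compress and immediately decompress within one call stack."""
    compressed = zlib.compress(payload)
    return zlib.decompress(compressed)


class CachedBuffer:
    """Hypothetical stand-in for a buffer tracked by a cache or spill store."""

    def __init__(self, data: bytes, is_compressed: bool):
        self.data = data
        self.is_compressed = is_compressed


def read(buf: CachedBuffer) -> bytes:
    """Return the payload, decompressing whenever the flag says so."""
    if buf.is_compressed:
        buf.data = zlib.decompress(buf.data)
        # Deliberate bug for illustration: the flag is never cleared,
        # so a later read tries to decompress already-plain data.
    return buf.data


if __name__ == "__main__":
    payload = b"shuffle batch " * 64

    # Same-stack round trip: works.
    assert round_trip(payload) == payload

    # Cached buffer: the first read works, the second one fails.
    buf = CachedBuffer(zlib.compress(payload), is_compressed=True)
    assert read(buf) == payload
    try:
        read(buf)
    except zlib.error as err:
        print("second read failed:", err)
```

In this sketch the codec itself is fine; the failure comes from the bookkeeping around the cached buffer, which matches the observation that both LZ4 and the copy codec round-trip correctly in isolation.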

abellina added labels bug (Something isn't working), ? - Needs Triage (Need team to review and classify), shuffle (things that impact the shuffle plugin) Mar 6, 2023

abellina mentioned this issue Mar 6, 2023

[BUG] decompressed batches corrupt if they are made spillable #7827

Open

mattahrens removed the ? - Needs Triage (Need team to review and classify) label Mar 7, 2023

mattahrens assigned abellina Mar 7, 2023

abellina mentioned this issue Mar 23, 2023

Fixes issue where UCX compressed tables would be decompressed multiple times #7926

Merged

abellina closed this as completed in #7926 Mar 24, 2023
