[BUG] nvcomp usage for the UCX mode of the shuffle manager is broken #7850

Closed
abellina opened this issue Mar 6, 2023 · 1 comment · Fixed by #7926
Labels: bug, shuffle

Comments

abellina commented Mar 6, 2023

While testing the coalesce code for compressed batches using LZ4, I got the following exception. I initially thought it was an issue with a branch I am working on, but it also happens with 23.04 without my change:

ai.rapids.cudf.nvcomp.NvcompException: nvcomp decompress output size mismatch
	at ai.rapids.cudf.nvcomp.NvcompJni.batchedLZ4DecompressAsync(Native Method)
	at ai.rapids.cudf.nvcomp.BatchedLZ4Decompressor.decompressAsync(BatchedLZ4Decompressor.java:79)
	at com.nvidia.spark.rapids.BatchedNvcompLZ4Decompressor.decompressAsync(NvcompLZ4CompressionCodec.scala:94)
	at com.nvidia.spark.rapids.BatchedBufferDecompressor.$anonfun$decompressBatch$1(TableCompressionCodec.scala:323)
	at com.nvidia.spark.rapids.BatchedBufferDecompressor.$anonfun$decompressBatch$1$adapted(TableCompressionCodec.scala:321)
	at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
	at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
	at com.nvidia.spark.rapids.BatchedBufferDecompressor.withResource(TableCompressionCodec.scala:258)
	at com.nvidia.spark.rapids.BatchedBufferDecompressor.decompressBatch(TableCompressionCodec.scala:321)
	at com.nvidia.spark.rapids.BatchedBufferDecompressor.finishAsync(TableCompressionCodec.scala:305)
	at com.nvidia.spark.rapids.GpuCompressionAwareCoalesceIterator.$anonfun$popAll$8(GpuCoalesceBatches.scala:639)
	at com.nvidia.spark.rapids.GpuCompressionAwareCoalesceIterator.$anonfun$popAll$8$adapted(GpuCoalesceBatches.scala:631)
	at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
	at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
	at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.withResource(GpuCoalesceBatches.scala:237)
	at com.nvidia.spark.rapids.GpuCompressionAwareCoalesceIterator.$anonfun$popAll$4(GpuCoalesceBatches.scala:631)
	at com.nvidia.spark.rapids.Arm.closeOnExcept(Arm.scala:109)
	at com.nvidia.spark.rapids.Arm.closeOnExcept$(Arm.scala:107)
	at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.closeOnExcept(GpuCoalesceBatches.scala:237)
	at com.nvidia.spark.rapids.GpuCompressionAwareCoalesceIterator.popAll(GpuCoalesceBatches.scala:615)
	at com.nvidia.spark.rapids.GpuCoalesceIterator.concatAllAndPutOnGPU(GpuCoalesceBatches.scala:543)
	at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.$anonfun$next$6(GpuCoalesceBatches.scala:478)

I also see this:

23/03/06 22:49:51 WARN TaskSetManager: Lost task 13.0 in stage 1.0 (TID 18) (127.0.0.1 executor 0): ai.rapids.cudf.CudaException: CUDA error encountered at: /home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-dev-363-cuda11/thirdparty/cudf/java/src/main/native/src/CudaJni.cpp:377: 1 cudaErrorInvalidValue invalid argument
	at ai.rapids.cudf.Cuda.asyncMemcpyOnStream(Native Method)
	at ai.rapids.cudf.Cuda.asyncMemcpy(Cuda.java:529)
	at ai.rapids.cudf.Cuda.multiBufferCopyAsync(Cuda.java:582)
	at ai.rapids.cudf.nvcomp.BatchedLZ4Decompressor.fetchMetadata(BatchedLZ4Decompressor.java:192)
	at ai.rapids.cudf.nvcomp.BatchedLZ4Decompressor.buildAddrsSizesBuffer(BatchedLZ4Decompressor.java:121)
	at ai.rapids.cudf.nvcomp.BatchedLZ4Decompressor.decompressAsync(BatchedLZ4Decompressor.java:67)
	at com.nvidia.spark.rapids.BatchedNvcompLZ4Decompressor.decompressAsync(NvcompLZ4CompressionCodec.scala:94)
	at com.nvidia.spark.rapids.BatchedBufferDecompressor.$anonfun$decompressBatch$1(TableCompressionCodec.scala:323)
	at com.nvidia.spark.rapids.BatchedBufferDecompressor.$anonfun$decompressBatch$1$adapted(TableCompressionCodec.scala:321)
	at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
	at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
	at com.nvidia.spark.rapids.BatchedBufferDecompressor.withResource(TableCompressionCodec.scala:258)
	at com.nvidia.spark.rapids.BatchedBufferDecompressor.decompressBatch(TableCompressionCodec.scala:321)
	at com.nvidia.spark.rapids.BatchedBufferDecompressor.finishAsync(TableCompressionCodec.scala:305)
	at com.nvidia.spark.rapids.GpuCompressionAwareCoalesceIterator.$anonfun$popAll$8(GpuCoalesceBatches.scala:639)
	at com.nvidia.spark.rapids.GpuCompressionAwareCoalesceIterator.$anonfun$popAll$8$adapted(GpuCoalesceBatches.scala:631)
	at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
	at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)

abellina added the bug, ? - Needs Triage, and shuffle labels Mar 6, 2023
mattahrens removed the ? - Needs Triage label Mar 7, 2023

abellina (Collaborator, Author) commented

I can confirm that compressing and decompressing in the same stack works without issue. I've done this with LZ4 and with the copy codec. The problem I am seeing occurs after buffers are cached in the spill framework, so as far as I can tell this is a spill framework issue with compressed vectors.
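
To illustrate the isolation approach described above, here is a minimal sketch that compares a direct round trip against one that passes through a caching layer. It is not the plugin's code: `RoundTripCheck`, `CompressedBuffer`, and `SpillStore` are hypothetical names, `java.util.zip` stands in for the nvcomp LZ4 codec, and a plain in-memory map stands in for the spill framework. The idea is only that the decompressor validates its output against the size recorded at compression time, so any layer that changes the bytes or that metadata would surface as a size mismatch.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Hypothetical harness; java.util.zip is a stand-in for the nvcomp LZ4 codec. */
class RoundTripCheck {

    /** Compressed bytes plus the uncompressed size recorded at compression time. */
    static final class CompressedBuffer {
        final byte[] data;
        final int uncompressedSize;

        CompressedBuffer(byte[] data, int uncompressedSize) {
            this.data = data;
            this.uncompressedSize = uncompressedSize;
        }
    }

    /** Stand-in for a spill/cache layer: stores a copy and hands back a copy. */
    static final class SpillStore {
        private final Map<Integer, CompressedBuffer> buffers = new HashMap<>();

        void put(int id, CompressedBuffer buf) {
            buffers.put(id, new CompressedBuffer(buf.data.clone(), buf.uncompressedSize));
        }

        CompressedBuffer get(int id) {
            CompressedBuffer buf = buffers.get(id);
            return new CompressedBuffer(buf.data.clone(), buf.uncompressedSize);
        }
    }

    static CompressedBuffer compress(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[1024];
        while (!deflater.finished()) {
            int n = deflater.deflate(chunk);
            out.write(chunk, 0, n);
        }
        deflater.end();
        return new CompressedBuffer(out.toByteArray(), input.length);
    }

    /** Decompresses and throws if the output size differs from the recorded size. */
    static byte[] decompress(CompressedBuffer buf) {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(buf.data);
            byte[] out = new byte[buf.uncompressedSize];
            int total = 0;
            while (total < out.length && !inflater.finished()) {
                int n = inflater.inflate(out, total, out.length - total);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    break; // truncated or unusable input
                }
                total += n;
            }
            // Probe for data beyond the recorded size.
            boolean extra = !inflater.finished() && inflater.inflate(new byte[1]) > 0;
            if (total != out.length || extra) {
                throw new IllegalStateException("decompress output size mismatch");
            }
            return out;
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt compressed data", e);
        } finally {
            inflater.end();
        }
    }

    public static void main(String[] args) {
        byte[] input = new byte[1 << 16];
        for (int i = 0; i < input.length; i++) {
            input[i] = (byte) (i % 13);
        }

        // Path 1: compress and decompress in the same stack.
        CompressedBuffer direct = compress(input);
        System.out.println("direct ok: " + Arrays.equals(input, decompress(direct)));

        // Path 2: pass the compressed buffer through the cache stand-in first.
        SpillStore store = new SpillStore();
        store.put(1, direct);
        System.out.println("cached ok: " + Arrays.equals(input, decompress(store.get(1))));
    }
}
```

If path 1 passes and path 2 fails, the caching layer is the place to look. In this sketch both paths pass because the stand-in store preserves the bytes and the recorded size.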
