Enable ZSTD compression in ORC and Parquet writers #11551
Conversation
Codecov Report

```
@@            Coverage Diff             @@
##    branch-22.10    #11551     +/-   ##
=========================================
  Coverage         ?    86.42%
=========================================
  Files            ?       145
  Lines            ?     23009
  Branches         ?         0
=========================================
  Hits             ?     19885
  Misses           ?      3124
  Partials         ?         0
=========================================
```

View full report at Codecov.
A few minor comments. Also need to add ZSTD (and it looks like a few others) to CompressionType.java. It should match the compression_type enum from types.hpp.
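For reference, a trimmed sketch of the kind of definition being mirrored (the member set and order here are assumptions for illustration; `types.hpp` is authoritative). The review's point is that the Java `CompressionType` must declare the same members in the same ordinal order as the native enum:

```
// Illustrative sketch only -- see cudf/io/types.hpp for the real definition.
// CompressionType.java must list the same members in the same order, since
// the two enums are matched by value across the JNI boundary.
enum class compression_type {
  NONE,    // no compression
  AUTO,    // infer the codec from the file
  SNAPPY,
  GZIP,
  // ...remaining members elided...
  ZSTD     // among the entries this PR needs mirrored on the Java side
};
```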
Java approval
+1 this looks good to me. Great work cleaning this up!
One question. As I understand it, if `LIBCUDF_NVCOMP_POLICY=STABLE`, choosing ZSTD compression will result in uncompressed output (as opposed to a failure), is that correct?
It will actually fail with "unsupported compression type". This is the behavior with DEFLATE (ZLIB) as well. I'm rethinking this approach, as users already "opt in" to the new feature by selecting ZSTD compression when writing. Any preference on your end?
Currently in the spark-rapids plugin we don't use the GPU for writing Parquet/ORC if the compression type is ZSTD. Once we enable that, any Spark job that selects ZSTD as the compressor will fail with "unsupported compression type" if they don't define `LIBCUDF_NVCOMP_POLICY`.

If we change this to silently write uncompressed data, then the job will succeed, but the data will be uncompressed. This seems worse, because there would be no indication that anything was wrong (other than the output size). So I think the failure is better. The question for the spark-rapids plugin is whether we wait for this to be stable before enabling it in the plugin, or document the need to define `LIBCUDF_NVCOMP_POLICY`.
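To make the behavior under discussion concrete, here is a minimal sketch. Only the `LIBCUDF_NVCOMP_POLICY` variable name and the `is_compression_enabled` utility name come from this PR; the signature, the `ALWAYS` opt-in value, and the surrounding logic are assumptions for illustration. The writer consults the policy and fails loudly for codecs the policy excludes, rather than silently emitting uncompressed data:

```
#include <cstdlib>
#include <stdexcept>
#include <string>

enum class feature_status { stable, experimental };

// Hypothetical helper mirroring the behavior described above: under the
// STABLE policy, experimental codecs such as ZSTD are rejected outright.
bool is_compression_enabled(feature_status status)
{
  char const* env        = std::getenv("LIBCUDF_NVCOMP_POLICY");
  std::string const mode = env ? env : "STABLE";
  if (mode == "ALWAYS") { return true; }  // assumption: the opt-in value
  return status == feature_status::stable;
}

void check_codec(feature_status status)
{
  if (!is_compression_enabled(status)) {
    // Failing loudly is preferred over silently writing uncompressed data.
    throw std::logic_error("unsupported compression type");
  }
}
```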
Minor not-blocking comments for the Python code. LGTM
@gpucibot merge
The recently merged PR (#11551) did not include the `<optional>` header, which may cause a compile error on some systems (in particular, CUDA 11.7 + gcc 11.2):

```
error: ‘std::optional’ has not been declared
error: ‘optional’ in namespace ‘std’ does not name a template type
```

This PR adds the missing header to fix the compile issue.

Authors:
- Nghia Truong (https://github.com/ttnghia)

Approvers:
- Vukasin Milovanovic (https://github.com/vuule)
- David Wendt (https://github.com/davidwendt)

URL: #11697
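The fix is a one-line include. A minimal illustration of why it must be explicit: whether `<optional>` is pulled in transitively by other standard headers varies across toolchains, which is exactly the CUDA 11.7 + gcc 11.2 breakage above:

```
#include <optional>  // must be included explicitly; transitive inclusion by
                     // other standard headers is not guaranteed

std::optional<int> maybe_value(bool has_value)
{
  if (has_value) { return 42; }
  return std::nullopt;
}
```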
This PR fixes an error that can occur when very small page sizes are used when writing Parquet files. #11551 changed from fixed 5000-row page fragments to a scaled value based on the requested max page size. For small page sizes, the number of fragments to process can exceed 64K. The number of fragments is used as the `y` dimension when calling `gpuInitPageFragments`, and when it exceeds 64K the kernel fails to launch, ultimately leading to an invalid memory access.

Authors:
- Ed Seidl (https://github.com/etseidl)

Approvers:
- Vukasin Milovanovic (https://github.com/vuule)
- Bradley Dice (https://github.com/bdice)
- Karthikeyan (https://github.com/karthikeyann)

URL: #11869
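The limit at play: CUDA caps `gridDim.y` and `gridDim.z` at 65535, while `gridDim.x` allows up to 2^31 - 1. A hedged sketch of the failure mode (the kernel name and launch shape here are illustrative, not the actual `gpuInitPageFragments` signature):

```
__global__ void init_fragments() { /* one grid row per page fragment */ }

void launch_init(unsigned num_columns, unsigned num_fragments)
{
  // gridDim.y tops out at 65535; with very small max page sizes the fragment
  // count can exceed that, so this launch fails and later reads of the
  // never-initialized fragment data become invalid memory accesses.
  dim3 grid{num_columns, num_fragments};
  init_fragments<<<grid, 128>>>();
}
```

Common remedies are to map the large count onto `gridDim.x` or to chunk the launch so no single dimension exceeds the cap.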
hi vuule,
Description
Closes #9058, #9056

- Expands the nvCOMP adapter to include ZSTD compression.
- Adds a centralized nvCOMP policy, `is_compression_enabled`.
- Adds a centralized nvCOMP alignment utility, `compress_input_alignment_bits`.
- Adds a centralized nvCOMP utility to get the maximum supported compression chunk size, `batched_compress_max_allowed_chunk_size`.
- Encoded ORC row groups are aligned based on compression requirements.
- Encoded Parquet pages are aligned based on compression requirements.
- Parquet fragment size now scales with the page size to better fit the default page size with ZSTD compression (see the sketch after this list).
- Small refactoring around `decompress_status` for improved type safety and, hopefully, naming.
- Replaced `snappy_compress` in the Parquet writer with the nvCOMP adapter call.
- Vectors of `compression_result`s are initialized before compression to avoid issues with random chunk skipping due to uninitialized memory.
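To make the fragment-size change concrete, a hedged sketch of the scaling idea (the constants, names, and formula are illustrative, not the writer's exact code): rather than a fixed 5000-row fragment, the fragment row count is derived from the requested max page size, so that pages can track the requested size even when it is small.

```
#include <algorithm>
#include <cstddef>

// Illustrative only: derive a fragment size (in rows) from the requested max
// page size, instead of using a fixed 5000-row fragment.
std::size_t scaled_fragment_size(std::size_t max_page_size_bytes,
                                 std::size_t avg_row_size_bytes)
{
  // Aim for several fragments per page so page boundaries can follow the
  // requested size; clamp to keep fragments from degenerating.
  std::size_t const rows =
    max_page_size_bytes / (4 * std::max<std::size_t>(avg_row_size_bytes, 1));
  return std::clamp<std::size_t>(rows, 32, 5000);
}
```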