
Enable ZSTD compression in ORC and Parquet writers #11551

Merged
merged 72 commits into from
Sep 12, 2022

Conversation

vuule
Contributor

@vuule vuule commented Aug 17, 2022

Description

Closes #9058, #9056

Expands the nvCOMP adapter to include ZSTD compression.
Adds a centralized nvCOMP policy check, `is_compression_enabled`.
Adds a centralized nvCOMP alignment utility, `compress_input_alignment_bits`.
Adds a centralized nvCOMP utility, `batched_compress_max_allowed_chunk_size`, that returns the maximum supported compression chunk size.
Encoded ORC row groups are aligned based on compression requirements.
Encoded Parquet pages are aligned based on compression requirements.
The Parquet fragment size now scales with the page size, to better fit the default page size with ZSTD compression.
Small refactoring around `decompress_status` for improved type safety and, hopefully, better naming.
Replaces `snappy_compress` in the Parquet writer with the nvCOMP adapter call.
Vectors of `compression_result` are initialized before compression to avoid random chunk skipping caused by uninitialized memory.
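The centralized policy check and alignment utility described above might look roughly like the sketch below. This is illustrative only: the function names follow the PR description, but the signatures, the codec set, and the exact alignment rules are assumptions, not libcudf's actual API.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>

// Hypothetical stand-in for libcudf's compression_type enum.
enum class compression_type { SNAPPY, ZSTD, DEFLATE };

// Centralized policy: the STABLE policy enables only production-ready
// codecs; ALWAYS also enables codecs still marked experimental in nvCOMP.
bool is_compression_enabled(compression_type t)
{
  char const* env          = std::getenv("LIBCUDF_NVCOMP_POLICY");
  std::string const policy = env != nullptr ? env : "STABLE";
  switch (t) {
    case compression_type::SNAPPY: return true;  // stable codec
    case compression_type::ZSTD:
    case compression_type::DEFLATE: return policy == "ALWAYS";  // experimental
  }
  return false;
}

// Centralized alignment utility: round a chunk size up to the alignment
// the codec requires, expressed in bits (e.g. 2 bits -> 4-byte alignment).
std::size_t align_input_size(std::size_t size, std::size_t alignment_bits)
{
  std::size_t const alignment = std::size_t{1} << alignment_bits;
  return (size + alignment - 1) & ~(alignment - 1);
}
```

Encoded ORC row groups and Parquet pages would then be padded to `align_input_size(...)` boundaries before being handed to the batched nvCOMP compression call.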

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@vuule vuule added feature request New feature or request cuIO cuIO issue non-breaking Non-breaking change labels Aug 17, 2022
@vuule vuule self-assigned this Aug 17, 2022
@github-actions github-actions bot added Python Affects Python cuDF API. libcudf Affects libcudf (C++/CUDA) code. labels Aug 17, 2022
@codecov

codecov bot commented Aug 17, 2022

Codecov Report

❗ No coverage uploaded for pull request base (branch-22.10@dca285b). Click here to learn what that means.
Patch has no changes to coverable lines.

❗ Current head 27ce95e differs from pull request most recent head 1f60695. Consider uploading reports for the commit 1f60695 to get more accurate results

Additional details and impacted files
@@               Coverage Diff               @@
##             branch-22.10   #11551   +/-   ##
===============================================
  Coverage                ?   86.42%           
===============================================
  Files                   ?      145           
  Lines                   ?    23009           
  Branches                ?        0           
===============================================
  Hits                    ?    19885           
  Misses                  ?     3124           
  Partials                ?        0           


@github-actions github-actions bot added the CMake CMake build issue label Aug 17, 2022
@github-actions github-actions bot added the Java Affects Java cuDF API. label Aug 18, 2022
@vuule vuule changed the base branch from branch-22.10 to branch-22.08 August 18, 2022 00:06
@vuule vuule changed the base branch from branch-22.08 to branch-22.10 August 18, 2022 00:06
@github-actions github-actions bot removed the Java Affects Java cuDF API. label Aug 18, 2022
@vuule vuule changed the title Enable ZSTD compression in ORC writer Enable ZSTD compression in ORC and Parquet writers Aug 18, 2022
Contributor

@jbrennan333 jbrennan333 left a comment


A few minor comments. Also need to add ZSTD (and it looks like a few others) to `CompressionType.java`. It should match the `compression_type` enum from `types.hpp`.

Review comments (resolved): cpp/cmake/thirdparty/get_nvcomp.cmake, cpp/src/io/comp/nvcomp_adapter.cpp
Member

@jlowe jlowe left a comment


Java approval

Contributor

@jbrennan333 jbrennan333 left a comment


+1 this looks good to me. Great work cleaning this up!

One question. As I understand it, if LIBCUDF_NVCOMP_POLICY=STABLE, choosing ZSTD compression will result in uncompressed output (as opposed to a failure), is that correct?

@vuule
Contributor Author

vuule commented Sep 6, 2022

+1 this looks good to me. Great work cleaning this up!

One question. As I understand it, if LIBCUDF_NVCOMP_POLICY=STABLE, choosing ZSTD compression will result in uncompressed output (as opposed to a failure), is that correct?

It will actually fail with "unsupported compression type". This is the behavior with DEFLATE (ZLIB) as well. I'm rethinking this approach, as users already "opt-in" to the new feature by selecting the ZSTD compression when writing. Any preference on your end?

@jbrennan333
Contributor

One question. As I understand it, if LIBCUDF_NVCOMP_POLICY=STABLE, choosing ZSTD compression will result in uncompressed output (as opposed to a failure), is that correct?

It will actually fail with "unsupported compression type". This is the behavior with DEFLATE (ZLIB) as well. I'm rethinking this approach, as users already "opt-in" to the new feature by selecting the ZSTD compression when writing. Any preference on your end?

Currently, the spark-rapids plugin does not use the GPU for writing Parquet/ORC if the compression type is ZSTD. Once we enable that, any Spark job that selects ZSTD as the compressor will fail with "unsupported compression type" unless `LIBCUDF_NVCOMP_POLICY=ALWAYS` is defined.

If we change this to silently write uncompressed data, the job will succeed, but the data will be uncompressed. This seems worse, because there would be no indication that anything was wrong (other than the output size).

So I think the failure is better. The question for the spark-rapids plugin is whether we wait for this to be stable before enabling it in the plugin, or document the need to define `LIBCUDF_NVCOMP_POLICY=ALWAYS` when using ZSTD compression.
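The "fail loudly" behavior agreed on in this thread can be sketched as below. This is an editorial illustration, not libcudf's actual code path: the function name and the exception type are assumptions; only the environment-variable name and the error message come from the discussion above.

```cpp
#include <cstdlib>
#include <stdexcept>
#include <string>

// Illustrative: reject an experimental codec unless the user has opted in
// via LIBCUDF_NVCOMP_POLICY=ALWAYS, rather than silently falling back to
// uncompressed output, so the caller gets a clear signal something is off.
void check_zstd_enabled()
{
  char const* env     = std::getenv("LIBCUDF_NVCOMP_POLICY");
  bool const opted_in = env != nullptr && std::string{env} == "ALWAYS";
  if (!opted_in) { throw std::logic_error("unsupported compression type"); }
}
```

A Spark job selecting ZSTD would then fail fast at write time with this error, instead of producing an unexpectedly large, uncompressed file.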

@vuule vuule requested a review from hyperbolic2346 September 6, 2022 21:04
Contributor

@mroeschke mroeschke left a comment


Minor not-blocking comments for the Python code. LGTM

@vuule vuule added the 5 - DO NOT MERGE Hold off on merging; see PR for details label Sep 9, 2022
@vuule vuule added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 5 - DO NOT MERGE Hold off on merging; see PR for details labels Sep 9, 2022
@vuule
Contributor Author

vuule commented Sep 12, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 578e65f into rapidsai:branch-22.10 Sep 12, 2022
rapids-bot bot pushed a commit that referenced this pull request Sep 13, 2022
The recently merged PR (#11551) did not include the `<optional>` header, which can cause compile errors on some systems (in particular, CUDA 11.7 + gcc 11.2):
```
error: ‘std::optional’ has not been declared
error: ‘optional’ in namespace ‘std’ does not name a template type
```

This PR adds that missing header to fix the compile issue.
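The fix is to include the header directly instead of relying on a transitive include, which some standard-library/compiler combinations do not provide. A minimal sketch (the function here is hypothetical, only the `#include` is the actual fix):

```cpp
// Including <optional> directly avoids "'std::optional' has not been
// declared" on toolchains that do not pull it in transitively.
#include <optional>

// Hypothetical example use of std::optional: return the first even
// argument, or an empty optional if neither is even.
std::optional<int> first_even(int a, int b)
{
  if (a % 2 == 0) { return a; }
  if (b % 2 == 0) { return b; }
  return std::nullopt;
}
```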

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - David Wendt (https://github.com/davidwendt)

URL: #11697
@vuule vuule deleted the fea-nvcomp-zstd-comp branch September 19, 2022 20:17
rapids-bot bot pushed a commit that referenced this pull request Oct 20, 2022
This PR fixes an error that can occur when very small page sizes are used when writing Parquet files. #11551 changed the page fragment size from a fixed 5000 rows to a value scaled by the requested max page size. For small page sizes, the number of fragments to process can exceed 64k. The number of fragments is used as the `y` grid dimension when launching `gpuInitPageFragments`; when it exceeds 64k the kernel fails to launch, ultimately leading to an invalid memory access.
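The arithmetic behind the failure can be illustrated as follows. CUDA caps a kernel grid's `y` dimension at 65535, so a launch that uses the fragment count directly as `gridDim.y` fails once the count exceeds that limit. The helper names below are illustrative, not the actual cuDF functions:

```cpp
#include <cstddef>

// CUDA limit on gridDim.y (and gridDim.z) for all current architectures.
constexpr std::size_t max_grid_dim_y = 65535;

// Number of page fragments for a table: one fragment per fragment_size rows.
std::size_t num_fragments(std::size_t num_rows, std::size_t fragment_size)
{
  return (num_rows + fragment_size - 1) / fragment_size;
}

// With a tiny max page size the fragment size shrinks, the fragment count
// can exceed max_grid_dim_y, and the work must be split across launches.
std::size_t num_launches(std::size_t fragments)
{
  return (fragments + max_grid_dim_y - 1) / max_grid_dim_y;
}
```

For example, a million rows with a fragment size of 10 yields 100,000 fragments, which no single launch can cover in `gridDim.y`.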

Authors:
  - Ed Seidl (https://github.com/etseidl)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Bradley Dice (https://github.com/bdice)
  - Karthikeyan (https://github.com/karthikeyann)

URL: #11869
@xingwenqiang

Hi @vuule,
I see that nvCOMP supports DEFLATE GPU encoding, and ORC supports ZLIB compression. Will Parquet support ZLIB GPU compression in the future?

Successfully merging this pull request may close these issues.

[FEA] ZStandard codec support for ORC writer