[FEA] ZStandard support for Parquet writer #9056

Closed
jlowe opened this issue Aug 17, 2021 · 4 comments
Labels
cuIO (cuIO issue), feature request (New feature or request), libcudf (Affects libcudf (C++/CUDA) code.), Spark (Functionality that helps Spark RAPIDS)

Comments

@jlowe
Member

jlowe commented Aug 17, 2021

Is your feature request related to a problem? Please describe.
Some users wish to write Parquet data using the ZStandard compression codec rather than the Snappy codec. RAPIDS is unable to accelerate writing of these files due to the lack of support for this codec on Parquet writes.

Describe the solution you'd like
The libcudf Parquet writer APIs should support specifying the ZStandard codec as one of the possible compression codecs to use when encoding the Parquet data for writing.

@jlowe added the feature request, Needs Triage, libcudf, cuIO, and Spark labels Aug 17, 2021
@beckernick removed the Needs Triage label Aug 23, 2021
@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@jlowe
Member Author

jlowe commented Nov 15, 2021

Still desired

@github-actions

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@jlowe
Member Author

jlowe commented Feb 14, 2022

Still desired

@vuule self-assigned this Sep 8, 2022
rapids-bot bot pushed a commit that referenced this issue Sep 12, 2022
Closes #9058, #9056

Expands nvCOMP adapter to include ZSTD compression.
Adds centralized nvCOMP policy, `is_compression_enabled`.
Adds centralized nvCOMP alignment utility, `compress_input_alignment_bits`.
Adds centralized nvCOMP utility to get the maximum supported compression chunk size - `batched_compress_max_allowed_chunk_size`.
Encoded ORC row groups are aligned based on compression requirements.
Encoded Parquet pages are aligned based on compression requirements.
Parquet fragment size now scales with the page size to better fit the default page size with ZSTD compression.
Small refactoring around `decompress_status` for improved type safety and, hopefully, clearer naming.
Replaced `snappy_compress` from the Parquet writer with the nvCOMP adapter call.
Vectors of `compression_result`s are initialized before compression to avoid issues with random chunk skipping due to uninitialized memory.

Authors:
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Jason Lowe (https://github.com/jlowe)
  - Jim Brennan (https://github.com/jbrennan333)
  - Mike Wilson (https://github.com/hyperbolic2346)
  - Tobias Ribizel (https://github.com/upsj)
  - Matthew Roeschke (https://github.com/mroeschke)

URL: #11551
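
For readers unfamiliar with the alignment change described in the commit above, here is a minimal sketch of what "aligned based on compression requirements" could look like for pages packed into one buffer. It is not the libcudf implementation: `align_up` and `aligned_page_offsets` are hypothetical names, and the alignment value (in bits) is simply assumed to come from something like `compress_input_alignment_bits`, whose actual return values are not given in this thread.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not a libcudf function): round `size` up to the next
// multiple of 2^alignment_bits.
constexpr std::size_t align_up(std::size_t size, unsigned alignment_bits)
{
  const std::size_t alignment = std::size_t{1} << alignment_bits;
  return (size + alignment - 1) & ~(alignment - 1);
}

// Hypothetical helper (not a libcudf function): given the sizes of encoded
// pages, compute the start offset of each page so that every page begins on
// a 2^alignment_bits boundary within a single contiguous buffer.
std::vector<std::size_t> aligned_page_offsets(const std::vector<std::size_t>& page_sizes,
                                              unsigned alignment_bits)
{
  std::vector<std::size_t> offsets;
  offsets.reserve(page_sizes.size());
  std::size_t cursor = 0;
  for (const std::size_t size : page_sizes) {
    offsets.push_back(cursor);                          // page starts at an aligned offset
    cursor = align_up(cursor + size, alignment_bits);   // pad to the next boundary
  }
  return offsets;
}
```

With an assumed alignment of 3 bits (8 bytes), pages of 5, 8, and 3 bytes would start at offsets 0, 8, and 16. The padding between pages is the cost of meeting the compressor's input alignment.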
@vuule closed this as completed Sep 14, 2022