Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change the default dictionary policy in Parquet writer from ALWAYS to ADAPTIVE #15570

Merged

Conversation

mhaseeb123
Copy link
Member

@mhaseeb123 mhaseeb123 commented Apr 19, 2024

Description

This PR changes the default dictionary policy in parquet from ALWAYS to ADAPTIVE and adds an argument max_dictionary_size to control the ADAPTIVE-ness of the dictionary policy. This change prevents a silent fallback to UNCOMPRESSED when writing parquet files with ZSTD compression leading to better performance for several use cases.

Partially closes #15501.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Apr 19, 2024
@mhaseeb123 mhaseeb123 added 1 - On Deck To be worked on next libcudf Affects libcudf (C++/CUDA) code. cuIO cuIO issue improvement Improvement / enhancement to an existing function non-breaking Non-breaking change and removed libcudf Affects libcudf (C++/CUDA) code. labels Apr 19, 2024
@mhaseeb123 mhaseeb123 changed the title Change the default dict policy from ALWAYS to ADAPTIVE Change the default dictionary policy in Parquet from ALWAYS to ADAPTIVE Apr 19, 2024
@mhaseeb123 mhaseeb123 changed the title Change the default dictionary policy in Parquet from ALWAYS to ADAPTIVE Change the default dictionary policy in Parquet writer from ALWAYS to ADAPTIVE Apr 19, 2024
@mhaseeb123
Copy link
Member Author

CC: @GregoryKimball

@vuule
Copy link
Contributor

vuule commented Apr 19, 2024

There's also the Python side to be changed.
Python writer has a bool use_dictionary which translates to either ALWAYS or NEVER. We probably need to change this part of the API. Not sure how it should be done (and maybe avoid a breaking change?).
If we just need a branch for testing, just a simple cudf_io_types.dictionary_policy.ALWAYS -> cudf_io_types.dictionary_policy.ADAPTIVE should do.

Copy link

copy-pr-bot bot commented Apr 19, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added the Python Affects Python cuDF API. label Apr 19, 2024
@mhaseeb123
Copy link
Member Author

mhaseeb123 commented Apr 19, 2024

There's also the Python side to be changed. Python writer has a bool use_dictionary which translates to either ALWAYS or NEVER. We probably need to change this part of the API. Not sure how it should be done (and maybe avoid a breaking change?). If we just need a branch for testing, just a simple cudf_io_types.dictionary_policy.ALWAYS -> cudf_io_types.dictionary_policy.ADAPTIVE should do.

Thanks, I have pushed the ^suggested change. This is only a testing branch as you guessed.

@GregoryKimball
Copy link
Contributor

Do you think this is would be a "breaking" change? @mhaseeb123 @vuule

@mhaseeb123
Copy link
Member Author

Do you think this is would be a "breaking" change? @mhaseeb123 @vuule

Probably not, functionality wise at least. 🤔

@mhaseeb123 mhaseeb123 requested a review from a team as a code owner May 8, 2024 00:26
@mhaseeb123
Copy link
Member Author

/ok to test

@mhaseeb123
Copy link
Member Author

/ok to test

@mhaseeb123 mhaseeb123 added 3 - Ready for Review Ready for review by team and removed 1 - On Deck To be worked on next labels May 8, 2024
Copy link
Contributor

@etseidl etseidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Just a few nits and a question.

cpp/include/cudf/io/parquet.hpp Show resolved Hide resolved
python/cudf/cudf/utils/ioutils.py Outdated Show resolved Hide resolved
python/cudf/cudf/utils/ioutils.py Outdated Show resolved Hide resolved
@mhaseeb123
Copy link
Member Author

/ok to test

python/cudf/cudf/_lib/parquet.pyx Outdated Show resolved Hide resolved
python/cudf/cudf/_lib/parquet.pyx Show resolved Hide resolved
@mhaseeb123
Copy link
Member Author

/ok to test

Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for iterating on this.
Looks great!

@mhaseeb123 mhaseeb123 requested a review from abellina May 9, 2024 17:01
python/cudf/cudf/utils/ioutils.py Outdated Show resolved Hide resolved
@vuule
Copy link
Contributor

vuule commented May 10, 2024

/ok to test

@vuule vuule added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels May 10, 2024
@vuule
Copy link
Contributor

vuule commented May 11, 2024

/merge

@rapids-bot rapids-bot bot merged commit ce1933f into rapidsai:branch-24.06 May 11, 2024
80 checks passed
@mhaseeb123 mhaseeb123 deleted the pq-dictpolicy-and-nvcomp-with-zstd branch May 14, 2024 02:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge cuIO cuIO issue improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Prevent silent fallback to uncompressed when writing parquet files with ZSTD compression
7 participants