Prepare dask_cudf test_parquet.py for upcoming API changes #10709

Merged Apr 28, 2022 (67 commits)

rjzamora
Member

This is a relatively simple PR to clean up dask_cudf's to_parquet/read_parquet tests. The changes are mostly meant to avoid test failures that would otherwise arise once impending changes land in upstream Dask. Those upstream changes include:

  • The default value for write_metadata_file will become False for to_parquet (because writing the _metadata file scales very poorly)
  • The default value for split_row_groups will become False (because this setting is typically optimal when the files are not too large). Users with larger-than-memory files will need to specify split_row_groups=True/int explicitly.
  • The gather_statistics argument will be removed in favor of a more descriptive calculate_divisions argument.
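
One way for a test suite to stay independent of these default changes is to pass the affected options explicitly instead of relying on whatever the installed Dask version defaults to. The sketch below is only an illustration: `pinned_write_kwargs` and `pinned_read_kwargs` are hypothetical helpers (not part of Dask or dask_cudf), and the default values chosen here are assumptions for the example, not the values the tests in this PR use. It also assumes `calculate_divisions` is already accepted by the installed version.

```python
# Hypothetical helpers (not part of dask or dask_cudf): they only build
# keyword-argument dicts so that a test states its intent explicitly
# rather than depending on library defaults that are about to change.


def pinned_write_kwargs(write_metadata_file=True):
    """Keyword arguments for to_parquet with the _metadata choice spelled out."""
    return {"write_metadata_file": write_metadata_file}


def pinned_read_kwargs(split_row_groups=True, calculate_divisions=False):
    """Keyword arguments for read_parquet with partitioning options spelled out.

    split_row_groups may be a bool or an int, as described in the list above.
    """
    return {
        "split_row_groups": split_row_groups,
        "calculate_divisions": calculate_divisions,
    }


# Intended use in a test (shown as comments, since it needs a GPU environment):
#   ddf.to_parquet(path, **pinned_write_kwargs())
#   dask_cudf.read_parquet(path, **pinned_read_kwargs(split_row_groups=2))
```

Keeping the options in one place also means a later default change only requires touching the helpers, not every test.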

This PR also removes the long-deprecated row_groups_per_part argument from dask_cudf.read_parquet (the established replacement is split_row_groups).
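
For call sites that still pass the old argument names, the renames described above could be handled with a small translation step. The following is a minimal sketch, not code from this PR: `modernize_read_parquet_kwargs` is a hypothetical helper, and it assumes that a `row_groups_per_part` value carries over unchanged to `split_row_groups` and that a non-None `gather_statistics` value maps to the same boolean for `calculate_divisions`.

```python
import warnings


def modernize_read_parquet_kwargs(kwargs):
    """Return a copy of kwargs with deprecated read_parquet names replaced.

    Hypothetical helper for illustration only. The value mappings are
    assumptions, not behavior confirmed by dask or dask_cudf.
    """
    updated = dict(kwargs)  # never mutate the caller's dict

    if "row_groups_per_part" in updated:
        value = updated.pop("row_groups_per_part")
        warnings.warn(
            "row_groups_per_part is no longer supported; use split_row_groups",
            FutureWarning,
            stacklevel=2,
        )
        # An explicitly passed split_row_groups takes precedence.
        updated.setdefault("split_row_groups", value)

    if "gather_statistics" in updated:
        value = updated.pop("gather_statistics")
        warnings.warn(
            "gather_statistics is being replaced by calculate_divisions",
            FutureWarning,
            stacklevel=2,
        )
        # None meant "let the library decide", so only carry over real choices.
        if value is not None:
            updated.setdefault("calculate_divisions", bool(value))

    return updated
```

A caller would wrap its existing keyword arguments, for example `dask_cudf.read_parquet(path, **modernize_read_parquet_kwargs(old_kwargs))`, and remove the wrapper once all call sites use the new names.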

raydouglass and others added 30 commits March 30, 2020 11:03
Merge pull request rapidsai#5690 from ajschmidt8/phase2
[skip ci] Update master references for main branch
[RELEASE] Re-release v0.15 cudf [skip-ci]
[RELEASE] v0.18.2 `cudf` release [skip-ci]
@quasiben
Member

LGTM. I'll also ping @randerzander directly to get his comments on the changes.

@rjzamora rjzamora added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels Apr 26, 2022
python/dask_cudf/dask_cudf/io/parquet.py (3 review threads, outdated and resolved)
Contributor

@galipremsagar galipremsagar left a comment

Looks like auto-generated changelog modifications need to be reverted.

CHANGELOG.md (4 review threads, outdated and resolved)
@rjzamora
Member Author

Looks like auto-generated changelog modifications need to be reverted.

Oops, thanks for pointing this out, @galipremsagar!

@rjzamora rjzamora added 3 - Ready for Review Ready for review by team and removed 5 - Ready to Merge Testing and reviews complete, ready to merge labels Apr 27, 2022
@rjzamora
Member Author

@randerzander - It looks like this PR is now blocking cudf CI (the upstream changes have begun). So, let me know if the current changes are "ok" for now.

@rjzamora rjzamora added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels Apr 28, 2022
@rjzamora
Member Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 03d419d into rapidsai:branch-22.06 Apr 28, 2022
@rjzamora rjzamora deleted the remove-row_groups_per_part branch April 28, 2022 13:28
Labels

  • 5 - Ready to Merge: Testing and reviews complete, ready to merge
  • dask: Dask issue
  • improvement: Improvement / enhancement to an existing function
  • non-breaking: Non-breaking change
  • Python: Affects Python cuDF API.

10 participants