Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add RowGroupMetaData information to cudf.io.read_parquet_metadata #15320

Conversation

mhaseeb123
Copy link
Member

@mhaseeb123 mhaseeb123 commented Mar 15, 2024

Description

The cudf.io.read_parquet_metadata now also returns a list of RowGroupMetaData objects for all row groups in the dataset resolving #11214. Added additional tests and doc updates as well.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@mhaseeb123 mhaseeb123 requested a review from a team as a code owner March 15, 2024 17:51
@mhaseeb123 mhaseeb123 requested review from mroeschke and isVoid March 15, 2024 17:51
Copy link

copy-pr-bot bot commented Mar 15, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added the Python Affects Python cuDF API. label Mar 15, 2024
@mhaseeb123 mhaseeb123 added 3 - Ready for Review Ready for review by team cuIO cuIO issue breaking Breaking change improvement Improvement / enhancement to an existing function and removed 3 - Ready for Review Ready for review by team labels Mar 15, 2024
@galipremsagar
Copy link
Contributor

@mhaseeb123 Could you please re-target this to 24.06 branch?

@mhaseeb123 mhaseeb123 changed the base branch from branch-24.04 to branch-24.06 March 19, 2024 19:18
@mhaseeb123
Copy link
Member Author

@mhaseeb123 Could you please re-target this to 24.06 branch?

Done!

@galipremsagar
Copy link
Contributor

/okay to test

Copy link
Contributor

@galipremsagar galipremsagar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, couple of minor syntax comments..

python/cudf/cudf/tests/test_parquet.py Outdated Show resolved Hide resolved
python/cudf/cudf/tests/test_parquet.py Outdated Show resolved Hide resolved
python/cudf/cudf/tests/test_parquet.py Outdated Show resolved Hide resolved
python/cudf/cudf/tests/test_parquet.py Outdated Show resolved Hide resolved
python/cudf/cudf/tests/test_parquet.py Outdated Show resolved Hide resolved
@mhaseeb123
Copy link
Member Author

Looks good to me, couple of minor syntax comments..

Thanks for the suggestions. All applied to the PR.

@galipremsagar
Copy link
Contributor

/okay to test

Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW (I'm not authorized to review Python code)

@galipremsagar
Copy link
Contributor

@mhaseeb123 you will have to fix the following style check failures for CI to run all the jobs: https://github.com/rapidsai/cudf/actions/runs/8349021043/job/22852262213?pr=15320#step:6:177

@GregoryKimball
Copy link
Contributor

GregoryKimball commented Mar 19, 2024

Hello @mhaseeb123, please excuse the late feedback on this. I'm also hoping to expand our C++ parquet metadata tools as part of solving #11214. Please check out read_parquet_metadata and some of the metadata processing in src/io/parquet. Please refer to #13663 for the first pass on metadata tools that @karthikeyann contributed.

EDIT: this looks good! We can keep C++ changes in a follow-on PR

@galipremsagar
Copy link
Contributor

/okay to test

@galipremsagar
Copy link
Contributor

/merge

@mhaseeb123
Copy link
Member Author

/ok to test

@mhaseeb123 mhaseeb123 added 5 - DO NOT MERGE Hold off on merging; see PR for details and removed 3 - Ready for Review Ready for review by team labels Mar 27, 2024
@mhaseeb123
Copy link
Member Author

#15398 supersedes this PR and provides desired functionality directly from libcudf instead of relying on pyarrow.

@GregoryKimball
Copy link
Contributor

Superseded by #15398

rapids-bot bot pushed a commit that referenced this pull request Apr 17, 2024
…tract `RowGroup` information (#15398)

The `cudf.io.read_parquet_metadata` is now bound to corresponding libcudf API instead of relying on pyarrow. The libcudf API now also returns high level `RowGroup` metadata to solve #11214. Added additional tests and doc updates as well. 

More metadata information such `min, max` values for each column in each row group can also be extracted and returned if needed. Thoughts? 

Recommend: Closing #15320 without merging in favor of this PR.

Authors:
  - Muhammad Haseeb (https://github.com/mhaseeb123)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Bradley Dice (https://github.com/bdice)

URL: #15398
@mhaseeb123 mhaseeb123 deleted the features-for-parquet-metadata branch April 17, 2024 22:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - DO NOT MERGE Hold off on merging; see PR for details breaking Breaking change cuIO cuIO issue improvement Improvement / enhancement to an existing function Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants