[FEA] Add more functionality to cudf.io.read_parquet_metadata
API
#11214
Comments
@vuule Did you intend for this to follow the same structure as the pyarrow parquet metadata? Like:
(Pdb) x = pq.ParquetFile(fname)
(Pdb) x.metadata
<pyarrow._parquet.FileMetaData object at 0x7fbe0a172040>
  created_by: parquet-cpp-arrow version 8.0.0
  num_columns: 15
  num_rows: 0
  num_row_groups: 1
  format_version: 1.0
  serialized_size: 7481
(Pdb) x.metadata.row_group(0)
<pyarrow._parquet.RowGroupMetaData object at 0x7fbe0a172270>
  num_columns: 15
  num_rows: 0
  total_byte_size: 196
Or do we want to extract only the necessary details and return those?
I like the option to follow the pyarrow metadata structure here, if it's not a huge overhead to gather.
Yea, should not be an issue since we already tap into this API anyways: https://github.com/rapidsai/cudf/blob/branch-22.08/python/cudf/cudf/io/parquet.py#L199
cc: @rjzamora for visibility
Closes #11675
Adds `read_parquet_metadata` to libcudf. The metadata has the following information:
- schema (type, name, children)
- num_rows
- num_rowgroups
- key-value string metadata in file footer
To Reviewers: Request for adding more information in metadata. Refer #11214
Authors:
- Karthikeyan (https://github.com/karthikeyann)
Approvers:
- Vukasin Milovanovic (https://github.com/vuule)
- GALI PREM SAGAR (https://github.com/galipremsagar)
- Divye Gala (https://github.com/divyegala)
- Ray Douglass (https://github.com/raydouglass)
URL: #13663
Now that we have As I understand, the number of columns will be identical for all row groups. We could add row counts for each row group as a new vector, or perhaps to
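The idea in the comment above (one column count shared by all row groups, plus a vector of per-row-group row counts) could be sketched roughly as follows. This is a hypothetical container for illustration only: `ParquetMetadataSketch` and its field names are assumptions and are not the libcudf or cudf types.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ParquetMetadataSketch:
    """Hypothetical metadata container; not an actual libcudf or cudf type."""

    # The column count is stored once, since it is the same for every row group.
    num_columns: int
    # One entry per row group: the number of rows stored in that group.
    row_group_num_rows: List[int] = field(default_factory=list)

    @property
    def num_row_groups(self) -> int:
        return len(self.row_group_num_rows)

    @property
    def num_rows(self) -> int:
        # The file-level row count is the sum over all row groups.
        return sum(self.row_group_num_rows)

    def row_group(self, index: int) -> Dict[str, int]:
        """Return the per-row-group view a user would ask for."""
        return {
            "num_rows": self.row_group_num_rows[index],
            "num_columns": self.num_columns,
        }


meta = ParquetMetadataSketch(num_columns=15, row_group_num_rows=[4, 4, 2])
print(meta.num_row_groups, meta.num_rows, meta.row_group(2))
```

Storing only the row-count vector keeps the file-level totals derivable, so `num_rows` and `num_row_groups` cannot drift out of sync with the per-group data.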
…tract `RowGroup` information (#15398)
The `cudf.io.read_parquet_metadata` is now bound to the corresponding libcudf API instead of relying on pyarrow. The libcudf API now also returns high-level `RowGroup` metadata to solve #11214. Added additional tests and doc updates as well.
More metadata, such as `min` and `max` values for each column in each row group, can also be extracted and returned if needed. Thoughts?
Recommend: Closing #15320 without merging in favor of this PR.
Authors:
- Muhammad Haseeb (https://github.com/mhaseeb123)
- GALI PREM SAGAR (https://github.com/galipremsagar)
Approvers:
- GALI PREM SAGAR (https://github.com/galipremsagar)
- Vukasin Milovanovic (https://github.com/vuule)
- Bradley Dice (https://github.com/bdice)
URL: #15398
Closed by #15398
Is your feature request related to a problem? Please describe.
It would be nicer to have row-group-wise metadata returned instead of just the number of row groups. That way users can identify how many rows and columns are stored in each row group.