[FEA] Add more functionality to `cudf.io.read_parquet_metadata` API #11214

galipremsagar · 2022-07-07T15:03:06Z

Is your feature request related to a problem? Please describe.
It would be nicer to have row-group-wise metadata returned instead of just the number of row groups. That way users can identify how many rows & columns are stored in each row group.

galipremsagar · 2022-07-07T15:07:04Z

@vuule Did you intend to follow the same pyarrow parquet schema? Like:

(Pdb) x = pq.ParquetFile(fname)
(Pdb) x.metadata
<pyarrow._parquet.FileMetaData object at 0x7fbe0a172040>
  created_by: parquet-cpp-arrow version 8.0.0
  num_columns: 15
  num_rows: 0
  num_row_groups: 1
  format_version: 1.0
  serialized_size: 7481

(Pdb) x.metadata.row_group(0)
<pyarrow._parquet.RowGroupMetaData object at 0x7fbe0a172270>
  num_columns: 15
  num_rows: 0
  total_byte_size: 196

or do we want to extract only the necessary bit of details and return those?
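For illustration, the "extract only the necessary bits" option could look roughly like the sketch below. It relies only on the pyarrow attributes visible in the session above (`num_row_groups`, `row_group(i)`, `num_rows`, `num_columns`, `total_byte_size`). The helper names `summarize_row_groups` and `summarize_file` are hypothetical and are not part of cudf or pyarrow.

```python
def summarize_row_groups(metadata):
    """Return one dict per row group from a pyarrow-style FileMetaData object.

    Hypothetical helper for illustration only; not a cudf or pyarrow API.
    """
    summary = []
    for i in range(metadata.num_row_groups):
        rg = metadata.row_group(i)
        summary.append(
            {
                "row_group": i,
                "num_rows": rg.num_rows,
                "num_columns": rg.num_columns,
                "total_byte_size": rg.total_byte_size,
            }
        )
    return summary


def summarize_file(path):
    """Open a parquet file with pyarrow and summarize its row groups."""
    # Imported lazily so summarize_row_groups can be used without pyarrow installed.
    import pyarrow.parquet as pq

    return summarize_row_groups(pq.ParquetFile(path).metadata)
```

Calling `summarize_file(fname)` on a file would give a list with one entry per row group, which is the level of detail the request above asks for.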

vuule · 2022-07-07T15:17:58Z

I like the option to follow the pyarrow metadata structure here, if it's not a huge overhead to gather.

galipremsagar · 2022-07-07T15:35:07Z

Yea, should not be an issue since we already tap into this API anyways: https://github.com/rapidsai/cudf/blob/branch-22.08/python/cudf/cudf/io/parquet.py#L199

galipremsagar · 2022-07-07T15:35:36Z

cc: @rjzamora for visibility

github-actions · 2022-08-06T16:03:02Z

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

Closes #11675

Adds `read_parquet_metadata` to libcudf. The metadata has the following information:
- schema
  - (type, name, children)
- num_rows
- num_rowgroups
- key-value string metadata in file footer

To Reviewers: Request for adding more information in metadata. Refer to #11214.

Authors:
- Karthikeyan (https://github.com/karthikeyann)

Approvers:
- Vukasin Milovanovic (https://github.com/vuule)
- GALI PREM SAGAR (https://github.com/galipremsagar)
- Divye Gala (https://github.com/divyegala)
- Ray Douglass (https://github.com/raydouglass)

URL: #13663

GregoryKimball · 2024-02-16T22:44:40Z

> That way users can identify how many rows & columns are stored in each row-group.

Now that we have read_parquet_metadata from #13663, could we re-scope this issue to specify the changes we would like to see in the parquet_metadata class?

As I understand, the number of columns will be identical for all row groups. We could add row counts for each row group as a new vector, or perhaps to metadata. Are the row group min/max statistics stored as key-value pairs in parquet_metadata.metadata?
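For reference, in the Parquet format itself the min/max statistics live in the per-column-chunk metadata of each row group, separate from the file-level key-value metadata. The sketch below shows how the per-row-group row counts and min/max values can be gathered through pyarrow's metadata objects (`row_group(i).column(j).statistics`). It makes no claim about how libcudf's `parquet_metadata` exposes them, and the helper name `row_group_stats` is hypothetical.

```python
def row_group_stats(metadata):
    """Collect row counts and per-column (min, max) for every row group.

    `metadata` is expected to behave like pyarrow's FileMetaData.
    Hypothetical helper for illustration only.
    """
    out = []
    for i in range(metadata.num_row_groups):
        rg = metadata.row_group(i)
        min_max = {}
        for j in range(rg.num_columns):
            col = rg.column(j)
            stats = col.statistics  # None when the writer stored no statistics
            if stats is not None and stats.has_min_max:
                min_max[col.path_in_schema] = (stats.min, stats.max)
        out.append({"num_rows": rg.num_rows, "min_max": min_max})
    return out
```

With pyarrow installed, this could be called as `row_group_stats(pq.ParquetFile(fname).metadata)`; columns without statistics are simply left out of the `min_max` mapping.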

…tract `RowGroup` information (#15398)

The `cudf.io.read_parquet_metadata` is now bound to the corresponding libcudf API instead of relying on pyarrow. The libcudf API now also returns high-level `RowGroup` metadata to solve #11214. Added additional tests and doc updates as well.

More metadata information, such as `min, max` values for each column in each row group, can also be extracted and returned if needed. Thoughts?

Recommend: Closing #15320 without merging in favor of this PR.

Authors:
- Muhammad Haseeb (https://github.com/mhaseeb123)
- GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
- GALI PREM SAGAR (https://github.com/galipremsagar)
- Vukasin Milovanovic (https://github.com/vuule)
- Bradley Dice (https://github.com/bdice)

URL: #15398

mhaseeb123 · 2024-04-30T01:43:25Z

Closed by #15398

galipremsagar added the feature request and Python labels Jul 7, 2022

galipremsagar self-assigned this Jul 7, 2022

galipremsagar mentioned this issue Jul 7, 2022: [REVIEW] Deprecate skiprows & num_rows in parquet reader #11218 (Merged)

github-actions bot added the inactive-30d label Aug 6, 2022

GregoryKimball added this to the Parquet continuous improvement milestone Nov 19, 2022

GregoryKimball added the libcudf and cuIO labels and removed the inactive-30d label Apr 3, 2023

karthikeyann mentioned this issue Jul 5, 2023: Add read_parquet_metadata libcudf API #13663 (Merged)

GregoryKimball assigned mhaseeb123 and unassigned galipremsagar Mar 14, 2024

mhaseeb123 mentioned this issue Mar 15, 2024: Add RowGroupMetaData information to cudf.io.read_parquet_metadata #15320 (Closed)

mhaseeb123 mentioned this issue Mar 27, 2024: Bind read_parquet_metadata API to libcudf instead of pyarrow and extract RowGroup information #15398 (Merged)

mhaseeb123 closed this as completed Apr 30, 2024
