
[REVIEW] Deprecate skiprows & num_rows in parquet reader #11218

Merged: 9 commits merged into rapidsai:branch-22.08 on Jul 21, 2022

Conversation

@galipremsagar (Contributor) commented on Jul 7, 2022:

This PR:

  • Deprecates skiprows & num_rows in the cudf parquet reader (cudf.read_parquet), since these parameters add a lot of overhead in the case of nested types and are also not supported by pd.read_parquet (a deprecation sketch follows below)
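
A minimal sketch of how a parameter deprecation like this is typically surfaced; the warning text and function body here are illustrative only, not the PR's actual diff:

import warnings

def read_parquet(filepath_or_buffer, skiprows=None, num_rows=None, **kwargs):
    # Illustrative only: warn when a deprecated parameter is passed,
    # keeping the old behavior until the parameter is removed.
    for name, value in (("skiprows", skiprows), ("num_rows", num_rows)):
        if value is not None:
            warnings.warn(
                f"{name} is deprecated and will be removed in a future release.",
                FutureWarning,
            )
    ...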

@galipremsagar added the 3 - Ready for Review, Python, and 4 - Needs cuDF (Python) Reviewer labels on Jul 7, 2022
@galipremsagar requested a review from a team as a code owner on July 7, 2022 20:59
@galipremsagar self-assigned this on Jul 7, 2022
@galipremsagar added the 4 - Needs cuIO Reviewer, breaking, and improvement labels on Jul 7, 2022
@galipremsagar (Contributor, Author) commented:

cc: @rjzamora I marked this as a breaking change since I see that merlin relies on cudf.io.read_parquet_metadata

@galipremsagar changed the title from “[REVIEW] Return rowgroup metadata in cudf.io.read_parquet_metadata and deprecated skiprows & num_rows in parquet reader” to “[REVIEW] Return rowgroup metadata in cudf.io.read_parquet_metadata and deprecate skiprows & num_rows in parquet reader” on Jul 7, 2022
codecov bot commented on Jul 7, 2022:

Codecov Report

❗ No coverage uploaded for pull request base (branch-22.08@dd7e955).
The diff coverage is n/a.

@@               Coverage Diff               @@
##             branch-22.08   #11218   +/-   ##
===============================================
  Coverage                ?   86.37%           
===============================================
  Files                   ?      144           
  Lines                   ?    22830           
  Branches                ?        0           
===============================================
  Hits                    ?    19719           
  Misses                  ?     3111           
  Partials                ?        0           


@vuule (Contributor) left a comment:

Looks good!

@rjzamora (Member) left a comment:

Thanks @galipremsagar - I have a small suggestion and some comments, but this change makes sense to me.

Resolved (outdated) review threads on python/cudf/cudf/io/parquet.py and python/cudf/cudf/utils/ioutils.py
@galipremsagar galipremsagar requested a review from rjzamora July 18, 2022 20:02
@rjzamora (Member) commented:

This is looking good @galipremsagar - However, while looking things over just now, I realized that it may be nice to mirror the behavior of read_parquet a bit more and support the same types as filepath_or_buffer, rather than a single path-like argument for path. That is, we could use the same code that read_parquet uses to generate a list of buffers (using storage_options), and then return a single (consolidated) pyarrow.parquet.FileMetaData object.

Any thoughts on this idea? (I'm sorry to be indecisive about the API here, but I'd like to make absolutely sure we are happy with the new behavior before we introduce a "breaking" change).
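
For the sake of discussion, a rough sketch of that flow; the helper name consolidated_parquet_metadata is hypothetical, and this is not code from the PR:

import fsspec
import pyarrow.parquet as pq

def consolidated_parquet_metadata(paths, storage_options=None):
    # Hypothetical sketch: resolve paths the way read_parquet does,
    # then merge per-file metadata into one FileMetaData object.
    fs, _, paths = fsspec.core.get_fs_token_paths(
        paths, storage_options=storage_options or {}
    )
    metadata = None
    for path in paths:
        with fs.open(path, "rb") as f:
            file_md = pq.read_metadata(f)
        if metadata is None:
            metadata = file_md
        else:
            # append_row_groups requires matching schemas across files
            metadata.append_row_groups(file_md)
    return metadata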

@vyasr (Contributor) left a comment:

A couple of minor comments. I'll wait to approve pending the discussion of Rick's question.

Comment on lines 198 to 203
fs = (
    fs
    or fsspec.core.get_fs_token_paths(
        path, storage_options=storage_options or {}
    )[0]
)
Reviewer comment (Contributor):

This is a bit hard to read. Maybe better to use an explicit if statement here:

Suggested change:

-fs = (
-    fs
-    or fsspec.core.get_fs_token_paths(
-        path, storage_options=storage_options or {}
-    )[0]
-)
+if fs is None:
+    fs = fsspec.core.get_fs_token_paths(
+        path, storage_options=storage_options or {}
+    )[0]
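
One subtlety worth noting about this suggestion: the `fs or ...` form falls back whenever fs is falsy, while the `if fs is None` form only fills in a missing filesystem. The two are equivalent here, since fs is either None or a filesystem object, but the explicit check reads more clearly.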

@@ -439,11 +439,13 @@ def num_row_groups(rows, group_size):
     row_group_size = 5
     pdf.to_parquet(fname, compression="snappy", row_group_size=row_group_size)

-    num_rows, row_groups, col_names = cudf.io.read_parquet_metadata(fname)
+    parquet_metadata = cudf.io.read_parquet_metadata(fname)
Reviewer comment (Contributor):

Minor nit: you call this file_metadata in every other test.

-    Total number of rows
-    Number of row groups
-    List of column names
+    pyarrow.parquet.FileMetaData
Reviewer comment (Contributor):

Do we have precedent for returning pyarrow types? Probably not an issue, just checking.

Reply (Member):

This is a good question - There is no precedent that I know of, but we would need a new C++/Cython API to return a cudf-based object with the same kind of information. There may be some performance benefits to this (perhaps more efficient metadata aggregation across multiple files), but that overall effort may not gain us much beyond adding more code to maintain. Another alternative may be to return a pd/cudf.DataFrame summary of the Parquet metadata and encourage Pandas to adopt the same API.
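
As a rough illustration of that DataFrame-summary alternative (row_group_summary is a hypothetical name, not an existing cudf or pandas API):

import pandas as pd
import pyarrow.parquet as pq

def row_group_summary(path):
    # Hypothetical sketch: flatten per-row-group Parquet metadata
    # into a tabular summary.
    md = pq.read_metadata(path)
    records = []
    for i in range(md.num_row_groups):
        rg = md.row_group(i)
        records.append(
            {
                "row_group": i,
                "num_rows": rg.num_rows,
                "total_byte_size": rg.total_byte_size,
            }
        )
    return pd.DataFrame(records)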

On a related note, I am struggling a bit with the question of whether this API should exist in cudf at all, because (1) we are just returning a FileMetaData object without using cudf in any meaningful way, and (2) this API doesn't even exist in Pandas.

Reply (Contributor):

Right, I am not a cuIO person, so it's hard for me to say since I don't know the typical scope here, but it feels like this belongs in pyarrow or something rather than cudf. Maybe that's more of a long-term discussion, though. Not sure how @galipremsagar or @shwina feel.

@rjzamora (Member) commented:

@galipremsagar - I’d like to apologize for stalling this PR, but I'm still unsure if we should merge this particular version of the read_parquet_metadata API. I am completely on board with the read_parquet argument deprecations.

To expand on my read_parquet_metadata concern: Since this is a breaking change, I want to make sure we are not passing up the opportunity to introduce a more valuable API to cudf (and possibly Pandas). For example, @vyasr's question about returning a pyarrow object made me realize that we are not really adding any value to the existing pq.ParquetFile API (beyond letting the user avoid an explicit pyarrow import). Therefore, it may make sense to return something other than a FileMetaData object if we can come up with a “more-useful” way to organize the Parquet metadata from a pandas/cudf perspective.


That said, I am struggling a bit to come up with a good way to organize the metadata in a DataFrame-friendly way. Ideally the metadata summary would be intuitive and useful enough for Pandas to adopt.

Example Possibility

(I’m sure we can come up with something slightly cleaner, but here is an example for the sake of discussion)



source = "path/to/timeseries/data"
schema, row_groups, column_statistics = read_parquet_metadata(
    source, column_statistics=["timestamp"]
)

Here, schema is a summary of the columns and dtypes (similar to read_parquet(source).dtypes):

id                    int64
x                   float64
y                   float64
timestamp    datetime64[us]
dtype: object

row_groups is a summary of row-group sizes and relative paths:

   local_id  num_rows  byte_size            path
0         0     86400    2713837  part.0.parquet
1         0     86400    2713845  part.1.parquet
…

and column_statistics is a summary of column-chunk statistics:

                                           timestamp
0  {'distinct_count': 0, 'has_min_max': True, 'ma...
1  {'distinct_count': 0, 'has_min_max': True, 'ma...
…
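
For reference, a minimal sketch of where such per-chunk statistics come from in pyarrow; this assumes a local part.0.parquet whose first column is the one of interest:

import pyarrow.parquet as pq

md = pq.read_metadata("part.0.parquet")
for i in range(md.num_row_groups):
    # Statistics for the first column chunk of each row group
    stats = md.row_group(i).column(0).statistics
    if stats is not None and stats.has_min_max:
        print(i, stats.min, stats.max, stats.distinct_count)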

@galipremsagar changed the title from “[REVIEW] Return rowgroup metadata in cudf.io.read_parquet_metadata and deprecate skiprows & num_rows in parquet reader” to “[REVIEW] Deprecate skiprows & num_rows in parquet reader” on Jul 21, 2022
@galipremsagar (Contributor, Author) replied:

> @galipremsagar - I’d like to apologize for stalling this PR, but I'm still unsure if we should merge this particular version of the read_parquet_metadata API. I am completely on board with the read_parquet argument deprecations.
>
> […]

I think I like this idea; I will share my thoughts in #11214 too, for a broader API redesign along these lines. So I dropped the read_parquet_metadata changes from this PR.

@vyasr (Contributor) commented on Jul 21, 2022:

I'm happier with this as a pure deprecation PR while we figure out the appropriate scope and home for the metadata logic. It does seem like code that we should eventually aim to upstream into another library rather than keep in cuDF, but even if it does end up in cuDF, it could use some more consideration of exactly what data it includes, along the lines of what Rick is saying.

@rjzamora (Member) left a comment:

Sounds good, @galipremsagar and @vyasr!

@galipremsagar added the 5 - Ready to Merge label and removed the 3 - Ready for Review, 4 - Needs cuDF (Python) Reviewer, and breaking labels on Jul 21, 2022
@galipremsagar (Contributor, Author) commented:

@gpucibot merge

@galipremsagar added the non-breaking label on Jul 21, 2022
@rapids-bot rapids-bot bot merged commit 6a07e75 into rapidsai:branch-22.08 Jul 21, 2022
rapids-bot bot pushed a commit that referenced this pull request Aug 5, 2022
…1480)

This PR removes support for `skiprows` & `num_rows` in parquet reader. A continuation of #11218

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #11480
Labels: 5 - Ready to Merge, improvement, non-breaking, Python