
[REVIEW] Deprecate skiprows & num_rows in parquet reader #11218

Merged: 9 commits merged into rapidsai:branch-22.08 on Jul 21, 2022

Conversation

@galipremsagar (Contributor) commented on Jul 7, 2022:

This PR:

  • Deprecates skiprows & num_rows in the cudf parquet reader (cudf.read_parquet), since these parameters add a lot of overhead in the case of nested types and are also not supported by pd.read_parquet (a deprecation sketch follows below)
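
A minimal sketch of how a parameter deprecation like this is typically surfaced; the warning text and function body here are illustrative only, not the PR's actual diff:

import warnings

def read_parquet(filepath_or_buffer, skiprows=None, num_rows=None, **kwargs):
    # Illustrative only: warn when a deprecated parameter is passed,
    # keeping the old behavior until the parameter is removed.
    for name, value in (("skiprows", skiprows), ("num_rows", num_rows)):
        if value is not None:
            warnings.warn(
                f"{name} is deprecated and will be removed in a future release.",
                FutureWarning,
            )
    ...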

@galipremsagar added the 3 - Ready for Review, Python, and 4 - Needs cuDF (Python) Reviewer labels on Jul 7, 2022
@galipremsagar requested a review from a team as a code owner on July 7, 2022 20:59
@galipremsagar self-assigned this on Jul 7, 2022
@galipremsagar added the 4 - Needs cuIO Reviewer, breaking, and improvement labels on Jul 7, 2022
@galipremsagar (Contributor, Author) commented:

cc: @rjzamora I marked this as a breaking change since I see that merlin relies on cudf.io.read_parquet_metadata

@galipremsagar changed the title from “[REVIEW] Return rowgroup metadata in cudf.io.read_parquet_metadata and deprecated skiprows & num_rows in parquet reader” to “[REVIEW] Return rowgroup metadata in cudf.io.read_parquet_metadata and deprecate skiprows & num_rows in parquet reader” on Jul 7, 2022
codecov bot commented on Jul 7, 2022:

Codecov Report

❗ No coverage uploaded for pull request base (branch-22.08@dd7e955).
The diff coverage is n/a.

@@               Coverage Diff               @@
##             branch-22.08   #11218   +/-   ##
===============================================
  Coverage                ?   86.37%           
===============================================
  Files                   ?      144           
  Lines                   ?    22830           
  Branches                ?        0           
===============================================
  Hits                    ?    19719           
  Misses                  ?     3111           
  Partials                ?        0           


@vuule (Contributor) left a comment:

Looks good!

@rjzamora (Member) left a comment:

Thanks @galipremsagar - I have a small suggestion and some comments, but this change makes sense to me.

Resolved (outdated) review threads on python/cudf/cudf/io/parquet.py and python/cudf/cudf/utils/ioutils.py
@galipremsagar galipremsagar requested a review from rjzamora July 18, 2022 20:02
@rjzamora (Member) commented:

This is looking good @galipremsagar - However, while looking things over just now, I realized that it may be nice to mirror the behavior of read_parquet a bit more and support the same types as filepath_or_buffer, rather than a single path-like argument for path. That is, we could use the same code that read_parquet uses to generate a list of buffers (using storage_options), and then return a single (consolidated) pyarrow.parquet.FileMetaData object.

Any thoughts on this idea? (I'm sorry to be indecisive about the API here, but I'd like to make absolutely sure we are happy with the new behavior before we introduce a "breaking" change).
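
For the sake of discussion, a rough sketch of that flow; the helper name consolidated_parquet_metadata is hypothetical, and this is not code from the PR:

import fsspec
import pyarrow.parquet as pq

def consolidated_parquet_metadata(paths, storage_options=None):
    # Hypothetical sketch: resolve paths the way read_parquet does,
    # then merge per-file metadata into one FileMetaData object.
    fs, _, paths = fsspec.core.get_fs_token_paths(
        paths, storage_options=storage_options or {}
    )
    metadata = None
    for path in paths:
        with fs.open(path, "rb") as f:
            file_md = pq.read_metadata(f)
        if metadata is None:
            metadata = file_md
        else:
            # append_row_groups requires matching schemas across files
            metadata.append_row_groups(file_md)
    return metadata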

@vyasr (Contributor) left a comment:

A couple of minor comments. I'll wait to approve pending the discussion of Rick's question.

Comment on lines 198 to 203
fs = (
    fs
    or fsspec.core.get_fs_token_paths(
        path, storage_options=storage_options or {}
    )[0]
)
Reviewer comment (Contributor):

This is a bit hard to read. Maybe better to use an explicit if statement here:

Suggested change:

-fs = (
-    fs
-    or fsspec.core.get_fs_token_paths(
-        path, storage_options=storage_options or {}
-    )[0]
-)
+if fs is None:
+    fs = fsspec.core.get_fs_token_paths(
+        path, storage_options=storage_options or {}
+    )[0]
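
One subtlety worth noting about this suggestion: the `fs or ...` form falls back whenever fs is falsy, while the `if fs is None` form only fills in a missing filesystem. The two are equivalent here, since fs is either None or a filesystem object, but the explicit check reads more clearly.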

@@ -439,11 +439,13 @@ def num_row_groups(rows, group_size):
     row_group_size = 5
     pdf.to_parquet(fname, compression="snappy", row_group_size=row_group_size)

-    num_rows, row_groups, col_names = cudf.io.read_parquet_metadata(fname)
+    parquet_metadata = cudf.io.read_parquet_metadata(fname)
Reviewer comment (Contributor):

Minor nit: you call this file_metadata in every other test.

-    Total number of rows
-    Number of row groups
-    List of column names
+    pyarrow.parquet.FileMetaData
Reviewer comment (Contributor):

Do we have precedent for returning pyarrow types? Probably not an issue, just checking.

Reply (Member):

This is a good question - There is no precedent that I know of, but we would need a new C++/Cython API to return a cudf-based object with the same kind of information. There may be some performance benefits to this (perhaps more efficient metadata aggregation across multiple files), but that overall effort may not gain us much beyond adding more code to maintain. Another alternative may be to return a pd/cudf.DataFrame summary of the Parquet metadata and encourage Pandas to adopt the same API.
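
As a rough illustration of that DataFrame-summary alternative (row_group_summary is a hypothetical name, not an existing cudf or pandas API):

import pandas as pd
import pyarrow.parquet as pq

def row_group_summary(path):
    # Hypothetical sketch: flatten per-row-group Parquet metadata
    # into a tabular summary.
    md = pq.read_metadata(path)
    records = []
    for i in range(md.num_row_groups):
        rg = md.row_group(i)
        records.append(
            {
                "row_group": i,
                "num_rows": rg.num_rows,
                "total_byte_size": rg.total_byte_size,
            }
        )
    return pd.DataFrame(records)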

On a related note, I am struggling a bit with the question of whether this API should exist in cudf at all, because (1) we are just returning a FileMetaData object without using cudf in any meaningful way, and (2) this API doesn't even exist in Pandas.

Reply (Contributor):

Right, I am not a cuIO person, so it's hard for me to say since I don't know the typical scope here, but it feels like this belongs in pyarrow or something rather than cudf. Maybe that's more of a long-term discussion, though. Not sure how @galipremsagar or @shwina feel.

@rjzamora (Member) commented:

@galipremsagar - I’d like to apologize for stalling this PR, but I'm still unsure if we should merge this particular version of the read_parquet_metadata API. I am completely on board with the read_parquet argument deprecations.

To expand on my read_parquet_metadata concern: Since this is a breaking change, I want to make sure we are not passing up the opportunity to introduce a more valuable API to cudf (and possibly Pandas). For example, @vyasr's question about returning a pyarrow object made me realize that we are not really adding any value to the existing pq.ParquetFile API (beyond letting the user avoid an explicit pyarrow import). Therefore, it may make sense to return something other than a FileMetaData object if we can come up with a “more-useful” way to organize the Parquet metadata from a pandas/cudf perspective.


That said, I am struggling a bit to come up with a good way to organize the metadata in a DataFrame-friendly way. Ideally the metadata summary would be intuitive and useful enough for Pandas to adopt.

Example Possibility

(I’m sure we can come up with something slightly cleaner, but here is an example for the sake of discussion)



source = "path/to/timeseries/data"
schema, row_groups, column_statistics = read_parquet_metadata(
    source, column_statistics=["timestamp"]
)

Here, schema is a summary of the columns and dtypes (similar to read_parquet(source).dtypes):

id                    int64
x                   float64
y                   float64
timestamp    datetime64[us]
dtype: object

row_groups is a summary of row-group sizes and relative paths:

   local_id  num_rows  byte_size            path
0         0     86400    2713837  part.0.parquet
1         0     86400    2713845  part.1.parquet
…

and column_statistics is a summary of column-chunk statistics:

                                           timestamp
0  {'distinct_count': 0, 'has_min_max': True, 'ma...
1  {'distinct_count': 0, 'has_min_max': True, 'ma...
…
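
For reference, a minimal sketch of where such per-chunk statistics come from in pyarrow; this assumes a local part.0.parquet whose first column is the one of interest:

import pyarrow.parquet as pq

md = pq.read_metadata("part.0.parquet")
for i in range(md.num_row_groups):
    # Statistics for the first column chunk of each row group
    stats = md.row_group(i).column(0).statistics
    if stats is not None and stats.has_min_max:
        print(i, stats.min, stats.max, stats.distinct_count)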

@galipremsagar changed the title from “[REVIEW] Return rowgroup metadata in cudf.io.read_parquet_metadata and deprecate skiprows & num_rows in parquet reader” to “[REVIEW] Deprecate skiprows & num_rows in parquet reader” on Jul 21, 2022
@galipremsagar (Contributor, Author) replied:

> @galipremsagar - I’d like to apologize for stalling this PR, but I'm still unsure if we should merge this particular version of the read_parquet_metadata API. I am completely on board with the read_parquet argument deprecations.
>
> […]

I think I like this idea; I will share my thoughts in #11214 too, for a broader API redesign along these lines. So I dropped the read_parquet_metadata changes from this PR.

@vyasr (Contributor) commented on Jul 21, 2022:

I'm happier with this as a pure deprecation PR while we figure out the appropriate scope and home for the metadata logic. It does seem like code that we should eventually aim to upstream into another library rather than keep in cuDF, but even if it does end up in cuDF, it could use some more consideration of exactly what data it includes, along the lines of what Rick is saying.

@rjzamora (Member) left a comment:

Sounds good, @galipremsagar and @vyasr!

@galipremsagar added the 5 - Ready to Merge label and removed the 3 - Ready for Review, 4 - Needs cuDF (Python) Reviewer, and breaking labels on Jul 21, 2022
@galipremsagar (Contributor, Author) commented:

@gpucibot merge

@galipremsagar added the non-breaking label on Jul 21, 2022
@rapids-bot rapids-bot bot merged commit 6a07e75 into rapidsai:branch-22.08 Jul 21, 2022
rapids-bot bot pushed a commit that referenced this pull request Aug 5, 2022
…1480)

This PR removes support for `skiprows` & `num_rows` in parquet reader. A continuation of #11218

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #11480
Labels: 5 - Ready to Merge, improvement, non-breaking, Python