-
Notifications
You must be signed in to change notification settings - Fork 916
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[REVIEW] Deprecate skiprows
& num_rows
in parquet reader
#11218
Conversation
cc: @rjzamora I marked this as a breaking change since I see that merlin relies on |
cudf.io.read_parquet_metadata
and deprecated skiprows
& num_rows
in parquet readercudf.io.read_parquet_metadata
and deprecate skiprows
& num_rows
in parquet reader
Codecov Report
@@ Coverage Diff @@
## branch-22.08 #11218 +/- ##
===============================================
Coverage ? 86.37%
===============================================
Files ? 144
Lines ? 22830
Branches ? 0
===============================================
Hits ? 19719
Misses ? 3111
Partials ? 0 Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @galipremsagar - I have a small suggestion and some comments, but this change makes sense to me.
This is looking good @galipremsagar - However, while looking things over just now, I realized that it may be nice to mirror the behavior of Any thoughts on this idea? (I'm sorry to be indecisive about the API here, but I'd like to make absolutely sure we are happy with the new behavior before we introduce a "breaking" change). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couple of minor comments. Will wait to approve pending discussion of Rick's question.
python/cudf/cudf/io/parquet.py
Outdated
fs = ( | ||
fs | ||
or fsspec.core.get_fs_token_paths( | ||
path, storage_options=storage_options or {} | ||
)[0] | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bit hard to read. Maybe better to use an explicit if
statement here
fs = ( | |
fs | |
or fsspec.core.get_fs_token_paths( | |
path, storage_options=storage_options or {} | |
)[0] | |
) | |
if fs is None: | |
fs = fsspec.core.get_fs_token_paths( | |
path, storage_options=storage_options or {} | |
)[0] |
@@ -439,11 +439,13 @@ def num_row_groups(rows, group_size): | |||
row_group_size = 5 | |||
pdf.to_parquet(fname, compression="snappy", row_group_size=row_group_size) | |||
|
|||
num_rows, row_groups, col_names = cudf.io.read_parquet_metadata(fname) | |||
parquet_metadata = cudf.io.read_parquet_metadata(fname) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor nit, you call this file_metadata
in every other test.
python/cudf/cudf/utils/ioutils.py
Outdated
Total number of rows | ||
Number of row groups | ||
List of column names | ||
pyarrow.parquet.FileMetaData |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have precedent for returning pyarrow types? Probably not an issue, just checking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good question - There is no precedent that I know about, but we would need a new C++/Cython API to return a cudf-based object with the same kind of information. There may be some performance benefits to this (perhaps more-efficient metadata aggregation across multiple files), but that overall effort may not gain us much beyond adding more code to maintain. Another alternative may be to return a pd/cudf.DataFrame
summary of the Parqet metadata and encourage Pandas to adopt the same API.
On a related note, I am struggling a bit with the question of whether this API should exist in cudf at all, because (1) we are just returning a FileMetaData
object without using cudf in any meaningful way, and (2) this API doesn't even exist in Pandas.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, I am not a cuIO person so it's hard for me to say since I don't know the typical scope here, but it feels like this belongs in pyarrow or something rather than cudf. Maybe that's more of a long-term discussion, though. Not sure how @galipremsagar or @shwina feel.
@galipremsagar - I’d like to apologize for stalling this PR, but I'm still unsure if we should merge this particular version of the To expand on my That said, I am struggling a bit to come up with a good way to organize the metadata in a DataFrame-friendly way. Ideally the metadata summary would be intuitive and useful enough for Pandas to adopt. Example Possibility (I’m sure we can come up with something slightly cleaner, but here is an example for the sake of discussion) source = “path/to/timeseries/data”
schema, row_groups, column_statistics = read_parquet_metadata(source, column_statistics=[“timestamp”]) Here,
and
|
cudf.io.read_parquet_metadata
and deprecate skiprows
& num_rows
in parquet readerskiprows
& num_rows
in parquet reader
I think I like this idea, will share my thoughts in #11214 too for a broader API re-design on these lines. So dropped the |
I'm happier with this as a pure deprecation PR while we figure out the appropriate scope and home for the metadata logic. It does seem to me like code that we should be eventually aiming to upstream into another library rather than living in cuDF, but even if it does end up in cuDF it could use some more consideration of exactly what data it includes along the lines of what Rick is saying. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good @galipremsagar and @vyasr !
@gpucibot merge |
…1480) This PR removes support for `skiprows` & `num_rows` in parquet reader. A continuation of #11218 Authors: - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Vyas Ramasubramani (https://github.com/vyasr) - Vukasin Milovanovic (https://github.com/vuule) URL: #11480
This PR:
skiprows
&num_rows
from cudf parquet reader (cudf.read_parquet
) since these parameters are adding to a lot of overhead incase of nested types and also not supported inpd.read_parquet