Add skiprows and nrows to parquet reader #16214
Conversation
python/cudf/cudf/io/parquet.py
# TODO: is this still right?
# Also, do we still care?
# partition_keys uses pyarrow dataset
# (which we can't use anymore after pyarrow is gone)
nrows=nrows,
skip_rows=skip_rows,
No, I think this is wrong. These dfs are concatenated vertically, so after reading each df, one should do nrows = max(nrows - len(df), 0). Updating skip_rows is more complicated: if you skipped all the rows in a file, you can't know whether you should reduce skip_rows to zero or to skip_rows - num_rows_in_file.
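A minimal sketch of the bookkeeping described above, assuming the files are read one at a time into dataframes; read_one_file and paths are hypothetical stand-ins, not cudf API:

```python
# Hypothetical sketch: read files one by one, trimming the nrows budget as we go.
dfs = []
nrows_left = nrows
for path in paths:
    if nrows_left == 0:
        break
    df = read_one_file(path, skip_rows=skip_rows, nrows=nrows_left)
    dfs.append(df)
    # Shrink the remaining row budget by what this file contributed.
    nrows_left = max(nrows_left - len(df), 0)
    if len(df) > 0:
        # The reader returned rows, so the skip was fully consumed in this file.
        skip_rows = 0
    # If df came back empty, it is ambiguous: the file may have had fewer rows
    # than skip_rows (reduce skip_rows by the file's row count) or exactly
    # skip_rows rows (reduce it to zero) -- we can't tell without the file's
    # row-count metadata.
```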
Ah, so the nrows and skiprows are not per-file.
That makes sense, thanks for the clarification.
Multiple files are, in this regard, I believe, an implementation detail.
But let's cc @rjzamora to confirm
It feels a bit ugly/inefficient to use nrows and/or skiprows when we are reading from a partitioned dataset, but you are right that we wouldn't want to pass these parameters "as provided" to _read_parquet.
The "efficient" solution would probably have us read the row-count metadata for each element of key_paths
up front. This way we could avoid reading unnecessary files/rows altogether. Of course, if the user passes in a filter, we would need to read all data with the filter and perform the row-trimming after the fact.
which we can't use anymore after pyarrow is gone
My understanding is that the pyarrow-removal effort only applies to the cython/c++ level. We are still allowed to depend on pyarrow at the python level (removing pyarrow would be a nightmare for dask-cudf at this point).
It feels a bit ugly/inefficient to use nrows and/or skiprows when we are reading from a partitioned dataset, but you are right that we wouldn't want to pass these parameters "as provided" to _read_parquet. The "efficient" solution would probably have us read the row-count metadata for each element of key_paths up front. This way we could avoid reading unnecessary files/rows altogether.
I think this makes sense. I'm not too familiar with the parquet code or the format in general, but do we just call read_parquet_metadata on each file to get the row counts? Then we iterate through partitions until we satisfy nrows/skiprows.
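One possible shape of that up-front pass, sketched with pyarrow's parquet metadata reader (allowed at the Python level per the note above); the plan_reads helper and the per-file skip/limit arithmetic are illustrative, not what this PR implements:

```python
import pyarrow.parquet as pq

def plan_reads(key_paths, skiprows, nrows):
    """Illustrative helper: decide which files to read, and with what per-file
    skip/limit, using only the row counts stored in the parquet footers."""
    plan = []
    remaining_skip, remaining_rows = skiprows, nrows
    for path in key_paths:
        file_rows = pq.read_metadata(path).num_rows
        if remaining_skip >= file_rows:
            # The whole file lies inside the skipped region; don't read it.
            remaining_skip -= file_rows
            continue
        take = min(file_rows - remaining_skip, remaining_rows)
        plan.append((path, remaining_skip, take))
        remaining_skip = 0
        remaining_rows -= take
        if remaining_rows == 0:
            break
    return plan
```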
Of course, if the user passes in a filter, we would need to read all data with the filter and perform the row-trimming after the fact.
I think it might be possible to optimize further since row group row counts should be available for us.
(but we'd still have to do filtering post read)
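Row-group row counts are also available from the parquet footer; a tiny illustration with pyarrow (the file name is hypothetical):

```python
import pyarrow.parquet as pq

md = pq.read_metadata("part-0.parquet")  # hypothetical path
# Row counts are recorded per row group, so trimming could in principle
# happen at row-group granularity rather than per file.
row_group_sizes = [md.row_group(i).num_rows for i in range(md.num_row_groups)]
```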
At any rate, I think for now it's probably best to punt on this since we don't need this for anything at the moment.
(We can revisit if it turns out e.g. polars/someone else needs this)
At any rate, I think I might punt on this for now (since it doesn't look like making nrows/skiprows work for a partitioned dataset is high priority), and the rest of this PR is blocking nrows/skiprows support in the polars executor.
which we can't use anymore after pyarrow is gone
My understanding is that the pyarrow-removal effort only applies to the cython/c++ level. We are still allowed to depend on pyarrow at the python level (removing pyarrow would be a nightmare for dask-cudf at this point).
👍
At any rate, I think I might punt on this for now (since it doesn't look like making nrows/skiprows work for a partitioned dataset is high priority), and the rest of this PR is blocking nrows/skiprows support in the polars executor.
Yes - I don't see any reason to spend time optimizing nrows/skiprows for partitioned datasets. We can always revisit if necessary.
Hi @lithomas1, thank you for working on this PR. If not too much overhead, do you mind adding …
Sure, will give it a shot.
I'm hitting a bug in the chunked parquet reader for skip_rows > 0, so I don't think I can make further progress on this.
Thanks for working on this. If you are hitting #16186, then please don't discard your changes. The bug should go away once #16195 merges. If you like you can pull in the changes from it and retest on your end, but we can wait until the merge as well to go ahead with this.
Thanks for the quick fix! (I can't believe I missed that PR. The auto assigner assigned me to review that one too 😅 ) I currently have my changes stashed away in another branch, and I'll wait for your PR to land to merge that branch here.
Co-authored-by: Muhammad Haseeb <[email protected]>
Is this PR blocked on resolving #16186 or is there a partial version that we want to land in the interim before that issue is fully resolved?
Not blocked. We can merge this without adding bindings for …
Just minor changes needed and should be good to go!
python/cudf/cudf/_lib/parquet.pyx
@@ -362,6 +376,8 @@ cpdef read_parquet(filepaths_or_buffers, columns=None, row_groups=None,
filters,
convert_strings_to_categories = False,
use_pandas_metadata = use_pandas_metadata,
skip_rows = skip_rows,
num_rows = nrows,
Let's be consistent with either num_rows or nrows across the files. @galipremsagar I can't find the same option in pyarrow.read_table or pd.read_parquet, so I am not sure what should be preferred here. If arbitrary, my vote would be num_rows, to be consistent with the C++ counterpart, but it's not a blocker.
Yeah not sure which is better.
nrows would be consistent with read_csv, and num_rows would be consistent with libcudf.
Let's go with nrows then and move the PR forward to merge! 🙂
As per the conversation, please use nrows consistently on the Python side except when passing to libcudf. Looks good otherwise!
Some tests are still failing with the following log. Looks like some more …
Note that the …
(Will need #16442 in to get past the cudf-polars test failures)
/merge
Description
closes #15144