Add nested column selection to parquet reader #8933
columns should be a list of lists only for the cudf engine
With a large number of columns (n), each of the n selected names requires a top-level search through the vector of n children of the root schema, which makes this an O(n^2) operation. For 5000 columns with no selection, this took 52 ms, more than the actual reading time.
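A minimal sketch of the cost described above, written in Python rather than the reader's C++. Resolving each of n requested names with a linear scan over n root children is quadratic, while building a name-to-index map once makes each lookup constant time on average. The function names and the toy schema are hypothetical and are not taken from the PR.

```python
def resolve_linear(children, requested):
    """Find each requested name by scanning the child list: O(n * m)."""
    return [children.index(name) for name in requested]


def resolve_indexed(children, requested):
    """Build a name -> position map once, then look up each name: O(n + m).

    Assumes child names are unique, as top-level column names are here.
    """
    position = {name: i for i, name in enumerate(children)}
    return [position[name] for name in requested]


# Toy root schema with 5000 top-level columns, all of them selected.
children = [f"col{i}" for i in range(5000)]
requested = list(children)

indices = resolve_indexed(children, requested)
```

Both functions return the same positions; only the amount of work per lookup differs.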
Because parquet spec is flexible about it
cc @jlowe
We could use BytesIO instead of writing to disk.
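A rough sketch of that suggestion, assuming the writer and the reader under test both accept file-like objects. The helper name and the stand-in writer and reader are hypothetical; a real test would pass the actual parquet writer and reader instead.

```python
import io


def roundtrip_in_memory(write, read):
    """Write to an in-memory buffer, rewind it, and read the data back."""
    buffer = io.BytesIO()
    write(buffer)   # the writer receives a file-like object, not a path
    buffer.seek(0)  # rewind so the reader starts at the first byte
    return read(buffer)


# Stand-in writer and reader, used here only to keep the sketch runnable.
payload = b"example bytes"
result = roundtrip_in_memory(lambda f: f.write(payload), lambda f: f.read())
```

This avoids temporary files and their cleanup, at the cost of holding the whole file in memory.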
review part 1/2
Codecov Report
@@ Coverage Diff @@
## branch-21.10 #8933 +/- ##
===============================================
Coverage ? 10.71%
===============================================
Files ? 114
Lines ? 19103
Branches ? 0
===============================================
Hits ? 2046
Misses ? 17057
Partials ? 0
Looks good in general, got some questions/nitpicks.
Not 100% sure I fully understood the logic in the recursive function, will go over it again.
Just a few nitpicks, looks great overall!
rerun tests
rerun tests
@gpucibot merge
Closes #8850
Adds the ability to select specific children of a nested column. The Python API mimics pyarrow and the format is
The C++ API takes each path as a vector
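To illustrate the idea, here is a small pure-Python sketch. It is not the cudf implementation. It assumes pyarrow's convention of writing a nested field as a dotted name, splits each name into a list of path components (the per-path form mentioned for the C++ API), and prunes a toy schema accordingly. The schema-as-dict representation, the column names, and the function names are all hypothetical.

```python
def split_paths(columns):
    """Turn dotted names such as "a.b" into name lists such as ["a", "b"]."""
    return [name.split(".") for name in columns]


def select(schema, paths):
    """Keep only the parts of a nested schema that the paths reach.

    schema: dict mapping a column name to a dict of children (empty for a leaf).
    paths: list of name lists; a path that stops early keeps the whole subtree.
    """
    selected = {}
    for name, children in schema.items():
        matching = [p for p in paths if p and p[0] == name]
        if not matching:
            continue
        # A path ending at this node selects everything beneath it.
        if any(len(p) == 1 for p in matching):
            selected[name] = children
        else:
            selected[name] = select(children, [p[1:] for p in matching])
    return selected


# Hypothetical schema: one flat column and one struct with nested children.
schema = {
    "id": {},
    "struct1": {
        "child1": {"grandchild1": {}, "grandchild2": {}},
        "child2": {},
    },
}
paths = split_paths(["struct1.child1.grandchild2", "struct1.child2"])
pruned = select(schema, paths)
```

Here `pruned` keeps `struct1` with `child2` and only the `grandchild2` branch of `child1`. Splitting on dots means this sketch cannot express a column whose own name contains a dot.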