Forward-merge branch-23.08 to branch-23.10 #13784
Merged: ajschmidt8 merged 4 commits into rapidsai:branch-23.10 from bdice:branch-23.10-merge-23.08 on Jul 28, 2023
This PR is part of the plan to support AST-based filter predicate pushdown in Parquet. It adds predicate pushdown on row group filtering.

The statistics of the columns of each row group are loaded into a device column, and an AST filter is applied to the min and max of each column to select the row groups to read. The user-given AST must first be converted to another AST that can be applied to the min and max values of each column (the 'Statistics AST'). After the row groups are parsed, the user-given AST is applied to the output columns to filter out any remaining rows in the row groups.

A new `column_name_reference` is introduced to help users create ASTs that reference columns by name, since the user may or may not know the column indices before reading. Because the AST engine accepts only column index references, a transformation is applied to the user-given AST. Two new AST transformation classes are introduced for this:

1. `named_to_reference_converter` - Converts column name references to column index references.
2. `stats_expression_converter` - Converts the above output table filtering AST to the 'Statistics AST'.

Note: `column_name_reference` is supported only for predicate pushdown filtering; it is not supported for other AST operations such as transforms and joins.

- [x] rapidsai#13472
- [x] Convert column chunk min, max to cudf type column.
- [x] Add AST filter interface to parquet reader options
- [x] Convert AST to Statistics AST
- [x] Apply statistics AST on Stats values to get row_groups
- [x] Apply AST as filter on output columns.

Depends on rapidsai#13472

Authors:
- Karthikeyan (https://github.com/karthikeyann)

Approvers:
- Mike Wilson (https://github.com/hyperbolic2346)
- Bradley Dice (https://github.com/bdice)
- Ray Douglass (https://github.com/raydouglass)

URL: rapidsai#13348
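The two-stage filtering described in this commit can be illustrated with a small, library-free sketch. This is not cudf's implementation: `RowGroup`, `stats_predicate_greater_than`, and `read_with_pushdown` are hypothetical names, the data is a single integer column held in Python lists, and the only predicate handled is `col > threshold`. It assumes the 'Statistics AST' for that predicate reduces to `max > threshold`.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class RowGroup:
    """Hypothetical stand-in for one row group of a single integer column."""
    values: List[int]

    # A real Parquet file stores min/max in its metadata; here they are
    # computed from the values only to keep the sketch self-contained.
    @property
    def min_value(self) -> int:
        return min(self.values)

    @property
    def max_value(self) -> int:
        return max(self.values)


def stats_predicate_greater_than(threshold: int) -> Callable[[RowGroup], bool]:
    """Statistics form of `col > threshold`: a group can only contain a
    matching row if its maximum exceeds the threshold."""
    return lambda group: group.max_value > threshold


def read_with_pushdown(groups: List[RowGroup], threshold: int) -> Tuple[List[int], List[int]]:
    may_match = stats_predicate_greater_than(threshold)
    # Stage 1: use min/max statistics to pick the row groups worth reading.
    selected = [i for i, group in enumerate(groups) if may_match(group)]
    # Stage 2: apply the original predicate to the rows that were read,
    # because a selected row group may still contain non-matching rows.
    rows = [v for i in selected for v in groups[i].values if v > threshold]
    return selected, rows


groups = [
    RowGroup(list(range(0, 10))),
    RowGroup(list(range(10, 20))),
    RowGroup(list(range(20, 30))),
]
selected, rows = read_with_pushdown(groups, 15)
```

In this sketch the first row group is skipped without touching its values, while the second is read and then trimmed by the row-level filter, which is why both stages are needed.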
Closes rapidsai#11675

Adds `read_parquet_metadata` to libcudf. The metadata has the following information:

- schema - (type, name, children)
- num_rows
- num_rowgroups
- key-value string metadata in file footer

To Reviewers: Request for adding more information in metadata. Refer to rapidsai#11214

Authors:
- Karthikeyan (https://github.com/karthikeyann)

Approvers:
- Vukasin Milovanovic (https://github.com/vuule)
- GALI PREM SAGAR (https://github.com/galipremsagar)
- Divye Gala (https://github.com/divyegala)
- Ray Douglass (https://github.com/raydouglass)

URL: rapidsai#13663
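To make the shape of that metadata concrete, here is a minimal Python model of the four items listed above. It is only an illustration: `SchemaNode`, `FileMetadata`, and `leaf_paths` are hypothetical names and do not reflect the actual libcudf class names or signatures, and the sample schema and key-value entry are made up.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SchemaNode:
    """Hypothetical schema entry: (type, name, children)."""
    name: str
    type: str
    children: List["SchemaNode"] = field(default_factory=list)


@dataclass
class FileMetadata:
    """Hypothetical container mirroring the fields listed in the PR."""
    schema: SchemaNode
    num_rows: int
    num_rowgroups: int
    key_value_metadata: Dict[str, str] = field(default_factory=dict)


def leaf_paths(node: SchemaNode, prefix: str = "") -> List[str]:
    """Walk the schema tree and return dotted paths of all leaf columns."""
    path = f"{prefix}.{node.name}" if prefix else node.name
    if not node.children:
        return [path]
    paths: List[str] = []
    for child in node.children:
        paths.extend(leaf_paths(child, path))
    return paths


# Made-up example: one flat column and one struct column with two children.
meta = FileMetadata(
    schema=SchemaNode("schema", "STRUCT", [
        SchemaNode("id", "INT64"),
        SchemaNode("point", "STRUCT", [
            SchemaNode("x", "DOUBLE"),
            SchemaNode("y", "DOUBLE"),
        ]),
    ]),
    num_rows=1000,
    num_rowgroups=4,
    key_value_metadata={"writer": "example"},
)
columns = leaf_paths(meta.schema)
```

Because the schema is a tree of (type, name, children) entries, nested columns can be discovered by a simple recursive walk without reading any column data.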
This PR relaxes cudf's protobuf pinnings to help with compatibility issues.

`cudf` uses `protobuf` in two places. The first place `protobuf` is used is at build time, to generate a Python module from a `.proto` file in `python/cudf/cmake/Modules/ProtobufHelpers.cmake`: https://github.com/rapidsai/cudf/blob/f8e5a89e983065e1202f1151dd499bea3102a537/python/cudf/cmake/Modules/ProtobufHelpers.cmake#L16-L17

The second place `protobuf` is used is in the generated file `python/cudf/cudf/utils/metadata/orc_column_statistics_pb2.py`, which is [imported here](https://github.com/rapidsai/cudf/blob/f8e5a89e983065e1202f1151dd499bea3102a537/python/cudf/cudf/io/orc.py#L14-L16).

The generated Python module used at runtime should be compatible with newer versions of `protobuf` than the version used to build the Python module, from my understanding of https://protobuf.dev/support/cross-version-runtime-guarantee/. Therefore, we only require that the runtime pinning of `protobuf` is of the same major version and an equal-or-greater minor version. That allows us to relax this pinning.

Follow-up to rapidsai#12864, see that PR for more context.

Authors:
- Bradley Dice (https://github.com/bdice)
- GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
- GALI PREM SAGAR (https://github.com/galipremsagar)
- Ray Douglass (https://github.com/raydouglass)

URL: rapidsai#13770
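The runtime requirement stated above (same major version, equal-or-greater minor version) can be written as a small check. This is a sketch of the rule only, not code from cudf: `parse_major_minor` and `runtime_is_compatible` are hypothetical helpers, and the version strings in the usage lines are illustrative rather than the pins chosen in the PR.

```python
from typing import Tuple


def parse_major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from a dotted version string such as '4.21.12'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def runtime_is_compatible(build_version: str, runtime_version: str) -> bool:
    """Apply the rule from the PR text: the runtime must share the build's
    major version and have an equal-or-greater minor version."""
    build_major, build_minor = parse_major_minor(build_version)
    run_major, run_minor = parse_major_minor(runtime_version)
    return run_major == build_major and run_minor >= build_minor


# Illustrative version numbers only.
newer_minor_ok = runtime_is_compatible("4.21.0", "4.23.1")
older_minor_ok = runtime_is_compatible("4.21.0", "4.20.3")
other_major_ok = runtime_is_compatible("4.21.0", "5.0.0")
```

Under this rule a newer minor release of the runtime is accepted, while an older minor or a different major version is rejected.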
galipremsagar approved these changes on Jul 28, 2023
e731598
to
5af5ca8
Resolves #13774