Forward-merge branch-23.08 to branch-23.10 #13784

Merged 4 commits on Jul 28, 2023

@bdice bdice commented Jul 28, 2023

Resolves #13774

karthikeyann and others added 3 commits July 26, 2023 22:41
This is part of the plan to support AST-based filter predicate pushdown in Parquet. This PR adds predicate pushdown for row group filtering.

The statistics of the columns in each row group are loaded into a device column, and the AST filter is applied to the min and max of each column to select the row groups to read. The user-provided AST must first be converted to another AST that can be applied to the min and max values of each column (the 'Statistics AST'). After the row groups are parsed, the user-provided AST is applied to the output columns to filter out any remaining non-matching rows in those row groups.

A new `column_name_reference` is introduced to help users create ASTs that reference columns by name, since the user may not know the column indices before reading. Because the AST engine accepts only column index references, a transformation is applied to the user-provided AST. Two new AST transformation classes are introduced:
1. `named_to_reference_converter` - Converts column name references to column index references
2. `stats_expression_converter` - Converts the above output table filtering AST to 'Statistics AST'.
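
The two conversions described above can be sketched in plain Python. This is only an illustration of the idea, not the libcudf implementation: the `ColumnName`, `ColumnIndex`, `Literal`, and `Op` classes are hypothetical, the sketch handles only `column <op> literal` comparisons, and the stats table layout (min of column `i` at `2*i`, max at `2*i + 1`) is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Union


# Hypothetical mini-AST for illustration; these are not libcudf types.
@dataclass(frozen=True)
class ColumnName:
    name: str


@dataclass(frozen=True)
class ColumnIndex:
    index: int


@dataclass(frozen=True)
class Literal:
    value: int


@dataclass(frozen=True)
class Op:
    op: str  # one of "<", "<=", ">", ">=", "==", "and", "or"
    left: "Expr"
    right: "Expr"


Expr = Union[ColumnName, ColumnIndex, Literal, Op]


def names_to_indices(expr, schema):
    """First conversion: replace column-name references with index references."""
    if isinstance(expr, ColumnName):
        return ColumnIndex(schema.index(expr.name))
    if isinstance(expr, Op):
        return Op(expr.op,
                  names_to_indices(expr.left, schema),
                  names_to_indices(expr.right, schema))
    return expr


def to_stats_expr(expr):
    """Second conversion: rewrite a row filter into a filter over (min, max) stats.

    Assumed layout: stats column 2*i is the min of column i, 2*i + 1 is its max.
    """
    if expr.op in ("and", "or"):
        return Op(expr.op, to_stats_expr(expr.left), to_stats_expr(expr.right))
    col, lit = expr.left, expr.right  # assumes the form `column <op> literal`
    lo, hi = ColumnIndex(2 * col.index), ColumnIndex(2 * col.index + 1)
    if expr.op == "<":   # a row with col < lit can exist only if min < lit
        return Op("<", lo, lit)
    if expr.op == ">":   # a row with col > lit can exist only if max > lit
        return Op(">", hi, lit)
    if expr.op == "==":  # lit must lie within [min, max]
        return Op("and", Op("<=", lo, lit), Op(">=", hi, lit))
    raise ValueError(f"unsupported operator: {expr.op}")


# Usage: price > 10 and id == 3, for a file whose columns are ["id", "price"].
user_filter = Op("and",
                 Op(">", ColumnName("price"), Literal(10)),
                 Op("==", ColumnName("id"), Literal(3)))
indexed_filter = names_to_indices(user_filter, ["id", "price"])
stats_filter = to_stats_expr(indexed_filter)
```

The stats filter is deliberately conservative: it may keep a row group that contains no matching rows, but it never drops one that does, which is why the original filter is still applied to the output afterwards.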

Note: `column_name_reference` is only supported for predicate pushdown filtering. It is not supported for other AST operations such as transforms and joins.

- [x] rapidsai#13472 
- [x] Convert column chunk min, max to cudf type column.
- [x] Add AST filter interface to parquet reader options
- [x] Convert AST to Statistics AST
- [x] Apply statistics AST on Stats values to get row_groups
- [x] Apply AST as filter on output columns.
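
The last two checklist items amount to a two-phase read: prune row groups using their statistics, then filter the surviving rows. A minimal sketch of that flow for a single `column > threshold` predicate follows. The dictionary layout and the `read_filtered` helper are hypothetical and run on the host in plain Python, whereas the PR evaluates the ASTs on device columns.

```python
def read_filtered(row_groups, column, threshold):
    """Return (matching_rows, groups_read) for the predicate row[column] > threshold."""
    matching_rows = []
    groups_read = 0
    for group in row_groups:
        # Phase 1: statistics check. If even the max does not exceed the
        # threshold, no row in this group can match, so skip it entirely.
        if not group["stats"][column]["max"] > threshold:
            continue
        groups_read += 1
        # Phase 2: a surviving group can still hold non-matching rows,
        # so apply the original predicate to every row.
        matching_rows.extend(r for r in group["rows"] if r[column] > threshold)
    return matching_rows, groups_read
```

In this sketch a row group whose `max` is at or below the threshold is never touched, which is where the I/O and decode savings of pushdown would come from.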

Depends on rapidsai#13472

Authors:
  - Karthikeyan (https://github.com/karthikeyann)

Approvers:
  - Mike Wilson (https://github.com/hyperbolic2346)
  - Bradley Dice (https://github.com/bdice)
  - Ray Douglass (https://github.com/raydouglass)

URL: rapidsai#13348

Closes rapidsai#11675
Adds `read_parquet_metadata` to libcudf.
The metadata contains the following information:
- schema - (type, name, children)
- num_rows
- num_rowgroups
- key-value string metadata in file footer
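
As a rough picture of the shape of this information, the fields listed above could be modeled as follows. The class and field names here (`SchemaNode`, `ParquetMetadataSketch`, `key_value_metadata`) are hypothetical and chosen for illustration; they are not the libcudf types or accessors added by this PR.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SchemaNode:
    """One schema entry: a type, a name, and nested children (for structs and lists)."""
    type: str
    name: str
    children: List["SchemaNode"] = field(default_factory=list)


@dataclass
class ParquetMetadataSketch:
    """Hypothetical container mirroring the fields listed above."""
    schema: SchemaNode
    num_rows: int
    num_rowgroups: int
    key_value_metadata: Dict[str, str] = field(default_factory=dict)


def leaf_paths(node, prefix=""):
    """Walk the schema tree and return the dotted path of every leaf column."""
    path = f"{prefix}.{node.name}" if prefix else node.name
    if not node.children:
        return [path]
    paths = []
    for child in node.children:
        paths.extend(leaf_paths(child, path))
    return paths
```

Because the schema is a tree rather than a flat list, nested columns can be discovered without reading any data pages.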

To reviewers: there is a request to add more information to the metadata. Refer to rapidsai#11214.

Authors:
  - Karthikeyan (https://github.com/karthikeyann)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Divye Gala (https://github.com/divyegala)
  - Ray Douglass (https://github.com/raydouglass)

URL: rapidsai#13663

This PR relaxes cudf's protobuf pinnings to help with compatibility issues. `cudf` uses `protobuf` in two places.

The first place `protobuf` is used is at build time, to generate a Python module from a `.proto` file in `python/cudf/cmake/Modules/ProtobufHelpers.cmake`: https://github.com/rapidsai/cudf/blob/f8e5a89e983065e1202f1151dd499bea3102a537/python/cudf/cmake/Modules/ProtobufHelpers.cmake#L16-L17

The second place `protobuf` is used is in the generated file `python/cudf/cudf/utils/metadata/orc_column_statistics_pb2.py` which is [imported here](https://github.com/rapidsai/cudf/blob/f8e5a89e983065e1202f1151dd499bea3102a537/python/cudf/cudf/io/orc.py#L14-L16).

The generated Python module used at runtime should be compatible with newer versions of `protobuf` than the version used to build the Python module, from my understanding of https://protobuf.dev/support/cross-version-runtime-guarantee/. Therefore, we only require that the runtime pinning of `protobuf` is of the same major version and an equal-or-greater minor version. That allows us to relax this pinning.
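
The compatibility rule stated above (same major version, equal-or-greater minor version) can be written as a small check. This is a sketch of the rule only: the helper names are hypothetical, versions are assumed to look like `major.minor[.patch]`, and the version numbers in the usage note are examples rather than the pins used by cudf.

```python
def parse_major_minor(version):
    """Return (major, minor) from a 'major.minor[.patch]' version string."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def runtime_is_compatible(build_version, runtime_version):
    """True if the runtime protobuf satisfies the rule relative to the build-time one."""
    build_major, build_minor = parse_major_minor(build_version)
    run_major, run_minor = parse_major_minor(runtime_version)
    return run_major == build_major and run_minor >= build_minor
```

For example, with a module generated using 4.21.12, a runtime of 4.23.4 passes the check, while 4.20.0 (older minor) and 5.0.0 (different major) do not.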

Follow-up to rapidsai#12864, see that PR for more context.

Authors:
  - Bradley Dice (https://github.com/bdice)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Ray Douglass (https://github.com/raydouglass)

URL: rapidsai#13770
@bdice bdice requested review from a team as code owners July 28, 2023 22:08
@github-actions github-actions bot added libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API. CMake CMake build issue conda labels Jul 28, 2023
@bdice bdice force-pushed the branch-23.10-merge-23.08 branch from e731598 to 5af5ca8 Compare July 28, 2023 22:10
@ajschmidt8 ajschmidt8 merged commit 7746af4 into rapidsai:branch-23.10 Jul 28, 2023