Fix parquet predicate filtering with column projection (#15113)
Fixes #15051

Predicate filtering in the parquet reader did not work when column projection was used. This PR removes that limitation. With this change, a filter can refer to columns either by name or by index:
- column name reference: the filter may specify any column by name, even if that column is not present in the column projection.
- column reference (index): the indices used must be the indices of the output columns, in the requested order.

This is achieved by extracting the column names from the filter and adding them to the output buffers. After predicate filtering is done, these filter-only columns are removed and only the requested columns are returned.
The change also reads statistics data for the output columns only, instead of for all root columns.

Summary of changes:
- `get_column_names_in_expression` extracts the column names used in the filter.
- The extra columns in the filter are added to the output buffers during reader initialization.
  - `cpp/src/io/parquet/reader_impl_helpers.cpp`, `cpp/src/io/parquet/reader_impl.cpp`
- Instead of extracting statistics data for all root columns, the reader extracts it only for the output columns (including columns in the filter).
  - `cpp/src/io/parquet/predicate_pushdown.cpp`
  - To do this, the output column schemas and their dtypes need to be cached.
  - The statistics data extraction code is updated to check for `schema_idx` in the row group metadata.
  - The filter no longer needs to be converted again for all root columns; the passed output-column reference filter is reused.
  - The rest of the code is the same.
- After the output filter predicate is calculated, the filter-only columns are removed.
- Moved the `named_to_reference_converter` constructor to the cpp file and removed the unused constructor.
- Small include<> cleanup.

Authors:
- Karthikeyan (https://github.com/karthikeyann)
- Vukasin Milovanovic (https://github.com/vuule)
- Muhammad Haseeb (https://github.com/mhaseeb123)

Approvers:
- Lawrence Mitchell (https://github.com/wence-)
- Vukasin Milovanovic (https://github.com/vuule)
- Muhammad Haseeb (https://github.com/mhaseeb123)

URL: #15113
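
The approach described in this commit message (collect the column names used by the filter, append any that are missing from the projection to the output buffers, evaluate the predicate, then drop the filter-only columns) can be illustrated with a small standalone sketch. This is not the libcudf implementation: `Table`, `Filter`, `column_names_in_filter`, and `read_with_filter` are hypothetical names invented for the example, the "file" is an in-memory map of integer columns, and the filter is restricted to the form `column > threshold`.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a parquet file: column name -> integer values.
using Table = std::unordered_map<std::string, std::vector<int>>;

// Hypothetical filter of the form `column > threshold`.
struct Filter {
  std::string column;
  int threshold;
};

// Toy analogue of extracting the column names referenced by a filter.
// A real expression tree may reference many columns; this one has exactly one.
std::vector<std::string> column_names_in_filter(Filter const& filter)
{
  return {filter.column};
}

// Reads `projection` from `file`, keeping only rows that satisfy `filter`.
// The filter column does not have to be part of the projection.
std::vector<std::vector<int>> read_with_filter(Table const& file,
                                               std::vector<std::string> const& projection,
                                               Filter const& filter)
{
  // 1. Output columns = requested columns, followed by filter-only columns.
  std::vector<std::string> output = projection;
  for (auto const& name : column_names_in_filter(filter)) {
    if (std::find(output.begin(), output.end(), name) == output.end()) {
      output.push_back(name);
    }
  }

  // 2. Materialize a buffer for every output column.
  std::vector<std::vector<int>> buffers;
  for (auto const& name : output) { buffers.push_back(file.at(name)); }

  // 3. Evaluate the predicate against the filter column's buffer.
  auto const filter_idx = static_cast<std::size_t>(
    std::distance(output.begin(), std::find(output.begin(), output.end(), filter.column)));
  std::vector<std::vector<int>> filtered(output.size());
  for (std::size_t row = 0; row < buffers[filter_idx].size(); ++row) {
    if (buffers[filter_idx][row] <= filter.threshold) { continue; }
    for (std::size_t col = 0; col < output.size(); ++col) {
      filtered[col].push_back(buffers[col][row]);
    }
  }

  // 4. Drop the filter-only columns, which were appended after the projection.
  filtered.resize(projection.size());
  return filtered;
}
```

In this sketch, reading columns `a` and `b` with the filter `c > 10` returns exactly two columns, even though `c` had to be loaded to evaluate the predicate. Appending the filter-only columns after the requested ones keeps the indices of the requested columns unchanged, which is consistent with the rule above that index references use output-column order.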
1 parent c7fe7fe, commit 47ed345
Showing 9 changed files with 276 additions and 65 deletions.