Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix parquet predicate filtering with column projection #15113

Merged

Conversation

karthikeyann
Copy link
Contributor

@karthikeyann karthikeyann commented Feb 21, 2024

Description

Fixes #15051

The predicate filtering in parquet did not work while column projection is used. This PR fixes that limitation.

With this PR change, the user will be able to use both column name reference and column index reference in the filter.

  • column name reference: the filters may specify any columns by name even if they are not present in column projection.
  • column reference (index): The indices used should be the indices of output columns in the requested order.

This is achieved by extracting column names from filter and add to output buffers, after predicate filtering is done, these filter-only columns are removed and only requested columns are returned.
The change includes reading only output columns' statistics data instead of all root columns.

Summary of changes:

  • get_column_names_in_expression extracts column names in filter.
  • The extra columns in filter are added to output buffers during reader initialization
    • cpp/src/io/parquet/reader_impl_helpers.cpp, cpp/src/io/parquet/reader_impl.cpp
  • instead of extracting statistics data of all root columns, it extracts for only output columns (including columns in filter)
    • cpp/src/io/parquet/predicate_pushdown.cpp
    • To do this, output column schemas and its dtypes should be cached.
    • statistics data extraction code is updated to check for schema_idx in row group metadata.
    • No need to convert filter again for all root columns, reuse the passed output columns reference filter.
    • Rest of the code is same.
  • After the output filter predicate is calculated, these filter-only columns are removed
  • moved named_to_reference_converter constructor to cpp, and remove used constructor.
  • small include<> cleanup

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@karthikeyann karthikeyann added bug Something isn't working 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. cuIO cuIO issue non-breaking Non-breaking change labels Feb 21, 2024
@GregoryKimball
Copy link
Contributor

This is great! Nice work @karthikeyann. Let's please add a step to drop any filter columns that weren't part of the requested column set before returning the table.

@wence-
Copy link
Contributor

wence- commented Feb 26, 2024

Thanks, like @GregoryKimball it would be nice if the column selection were disjoint from the filtering selection logic in terms of determining what columns are finally returned. So that I can filter on (say) "A" and "B", but return "C" and "D".

Copy link
Contributor

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few questions

cpp/src/io/parquet/reader_impl_helpers.hpp Show resolved Hide resolved
cpp/src/io/parquet/reader_impl_helpers.hpp Outdated Show resolved Hide resolved
cpp/src/io/parquet/reader_impl_helpers.hpp Outdated Show resolved Hide resolved
cpp/src/io/parquet/predicate_pushdown.cpp Outdated Show resolved Hide resolved
karthikeyann and others added 4 commits March 1, 2024 14:29
get_columns() need not specify all columns in filter.
columns in filter are read, and discard finally at the output after
filtering.
@karthikeyann karthikeyann added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Mar 1, 2024
@karthikeyann karthikeyann marked this pull request as ready for review March 1, 2024 10:04
@karthikeyann karthikeyann requested a review from a team as a code owner March 1, 2024 10:04
Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

first pass - small suggestions

cpp/include/cudf/io/parquet.hpp Outdated Show resolved Hide resolved
cpp/src/io/parquet/reader_impl_helpers.hpp Outdated Show resolved Hide resolved
cpp/src/io/parquet/reader_impl.cpp Outdated Show resolved Hide resolved
cpp/src/io/parquet/reader_impl_helpers.hpp Outdated Show resolved Hide resolved
cpp/src/io/parquet/reader_impl_helpers.hpp Outdated Show resolved Hide resolved
cpp/src/io/parquet/reader_impl.hpp Outdated Show resolved Hide resolved
cpp/src/io/parquet/predicate_pushdown.cpp Outdated Show resolved Hide resolved
cpp/src/io/parquet/predicate_pushdown.cpp Outdated Show resolved Hide resolved
@karthikeyann karthikeyann changed the base branch from branch-24.04 to branch-24.06 April 9, 2024 03:07
Copy link

copy-pr-bot bot commented Apr 9, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@karthikeyann
Copy link
Contributor Author

/ok to test

@GregoryKimball
Copy link
Contributor

@etseidl Does this change look like it will work for your application?

@etseidl
Copy link
Contributor

etseidl commented Apr 24, 2024

@etseidl Does this change look like it will work for your application?

Yes, I believe so.

@vuule vuule self-requested a review May 8, 2024 00:23
@karthikeyann karthikeyann requested a review from wence- May 10, 2024 07:41
@karthikeyann
Copy link
Contributor Author

/ok to test

@mhaseeb123 mhaseeb123 self-requested a review May 10, 2024 20:02
Copy link
Contributor

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! LGTM

cpp/include/cudf/io/parquet.hpp Show resolved Hide resolved
Comment on lines +233 to +234
min.set_index(stats_idx, thrust::nullopt, {});
max.set_index(stats_idx, thrust::nullopt, {});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit (non-blocking): Possibly worthwhile migrating set_index to std::optional.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

min_value and min from Statistics struct uses thrust::optional, which is passed here.
https://github.com/rapidsai/cudf/blob/064dd7b02166cc67e882b708d66621bc3fafd70b/cpp/src/io/parquet/parquet.hpp uses thrust::optional everywhere (except at 2 places). Not sure why.
@vuule

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of the thrift data structures use thrust::optional over std::optional because some are used on device. I assume these will migrate to cuda::std::optional eventually. #15091 (comment)

cpp/src/io/parquet/reader_impl_helpers.hpp Show resolved Hide resolved
auto read_opts = cudf::io::parquet_reader_options::builder(cudf::io::source_info{filepath})
.columns({"col_double", "col_uint32"})
.filter(read_ref_expr);
EXPECT_THROW(cudf::io::read_parquet(read_opts), cudf::logic_error);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit (non-blocking): should these throw (per #12885) respectively cudf::data_type_error and std::out_of_range?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice suggestion. These exceptions are thrown from AST. I will create another PR to fix it in AST code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if there are two reasons to throw, we need two separate tests

@mhaseeb123
Copy link
Member

/ok to test

@@ -127,6 +123,7 @@ class aggregate_reader_metadata {
int64_t num_rows;
size_type num_row_groups;

std::vector<data_type> _output_types;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to keep _output_types cached here or can we clear this after the output has been built? Does extract output_dtypes like inside reader::impl::preprocess_file incur a lot of repetitive overhead.

Copy link
Member

@mhaseeb123 mhaseeb123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the effort. The changes look good. Not a blocker but please confirm if we need to cache output_dtypes at this time or if just constructing it on the fly would be better.

@karthikeyann
Copy link
Contributor Author

Thanks for the effort. The changes look good. Not a blocker but please confirm if we need to cache output_dtypes at this time or if just constructing it on the fly would be better.

Updated to construct on the fly. will work with read_chunk() as well.

@karthikeyann
Copy link
Contributor Author

/ok to test

@karthikeyann
Copy link
Contributor Author

/ok to test

@karthikeyann
Copy link
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit 47ed345 into rapidsai:branch-24.06 May 16, 2024
70 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Parquet reader cannot use filters in tandem with column projection
6 participants