Fix parquet predicate filtering with column projection #15113

Merged
19 commits
be089f3
fix stats filter conversion dtypes and names
karthikeyann Feb 21, 2024
f458410
filter columns limitation fixed.
karthikeyann Mar 1, 2024
b01b2d8
address review comments, added docstring
karthikeyann Mar 1, 2024
b348db4
Merge branch 'branch-24.04' into fix-pq_filter_col_projection
karthikeyann Mar 1, 2024
4a07e3d
add docstring for filter
karthikeyann Mar 1, 2024
6ee2bcf
Merge branch 'branch-24.04' into fix-pq_filter_col_projection
karthikeyann Mar 6, 2024
acb0723
update docs with example
karthikeyann Mar 6, 2024
bff38f5
Merge branch 'fix-pq_filter_col_projection' of github.com:karthikeyan…
karthikeyann Mar 6, 2024
d643ce1
Merge branch 'branch-24.04' into fix-pq_filter_col_projection
karthikeyann Mar 6, 2024
e79552c
Merge branch 'branch-24.06' into fix-pq_filter_col_projection
karthikeyann Apr 9, 2024
e40cffc
address review comments, include cleanup, reorg code
karthikeyann Apr 24, 2024
926a75a
Merge branch 'branch-24.06' into fix-pq_filter_col_projection
karthikeyann Apr 24, 2024
a220d7d
fix col index ref on projection
karthikeyann May 10, 2024
c0e734c
Merge branch 'branch-24.06' into fix-pq_filter_col_projection
karthikeyann May 10, 2024
96ea0e8
Merge branch 'branch-24.06' into fix-pq_filter_col_projection
vuule May 14, 2024
47c5413
Merge branch 'branch-24.06' into fix-pq_filter_col_projection
mhaseeb123 May 15, 2024
9e4008e
remove caching output dtypes
karthikeyann May 16, 2024
cc3bd26
Merge branch 'branch-24.06' into fix-pq_filter_col_projection
karthikeyann May 16, 2024
f64294e
wMerge branch 'fix-pq_filter_col_projection' of github.com:karthikeya…
karthikeyann May 16, 2024
12 changes: 10 additions & 2 deletions cpp/include/cudf/io/parquet.hpp
@@ -195,6 +195,10 @@ class parquet_reader_options {
   /**
    * @brief Sets AST based filter for predicate pushdown.
    *
+   * The filter can utilize cudf::ast::column_name_reference to reference a column by its name,
+   * even if it's not necessarily present in the requested projected columns.
+   * To refer to output column indices, you can use cudf::ast::column_reference.
+   *
    * @param filter AST expression to use as filter
    */
   void set_filter(ast::expression const& filter) { _filter = filter; }
@@ -292,9 +296,13 @@ class parquet_reader_options_builder {
   }

   /**
-   * @brief Sets vector of individual row groups to read.
+   * @brief Sets AST based filter for predicate pushdown.
    *
-   * @param filter Vector of row groups to read
+   * The filter can utilize cudf::ast::column_name_reference to reference a column by its name,
+   * even if it's not necessarily present in the requested projected columns.
+   * To refer to output column indices, you can use cudf::ast::column_reference.
+   *
+   * @param filter AST expression to use as filter
    * @return this for chaining
    */
   parquet_reader_options_builder& filter(ast::expression const& filter)
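
The distinction documented above (name references may point at columns outside the projection, while index references count positions among the output columns) can be sketched in plain C++ without libcudf. `NameRef`, `IndexRef`, `find_column`, and `resolve` are hypothetical names introduced only for this illustration; they model the documented behavior and are not libcudf's internal implementation.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <optional>
#include <string>
#include <variant>
#include <vector>

// Hypothetical stand-ins for the two reference kinds named in the docstring.
struct NameRef {
  std::string name;  // looked up in the file schema
};
struct IndexRef {
  std::size_t index;  // position among the projected (output) columns
};
using ColumnRef = std::variant<NameRef, IndexRef>;

// Returns the position of `name` in `names`, or nullopt if it is absent.
inline std::optional<std::size_t> find_column(std::vector<std::string> const& names,
                                              std::string const& name)
{
  auto const it = std::find(names.begin(), names.end(), name);
  if (it == names.end()) { return std::nullopt; }
  return static_cast<std::size_t>(std::distance(names.begin(), it));
}

// Maps a reference to an index into the file schema.
// A NameRef only needs to exist in the schema; an IndexRef must fall
// inside the projection and is translated through the projected name.
inline std::optional<std::size_t> resolve(ColumnRef const& ref,
                                          std::vector<std::string> const& schema,
                                          std::vector<std::string> const& projection)
{
  if (auto const* by_name = std::get_if<NameRef>(&ref)) {
    return find_column(schema, by_name->name);
  }
  auto const idx = std::get<IndexRef>(ref).index;
  if (idx >= projection.size()) { return std::nullopt; }
  return find_column(schema, projection[idx]);
}
```

Under this model, with a three-column schema whose first and third columns are `col_uint32` and `col_double` and a projection of only `col_double`, `NameRef{"col_uint32"}` still resolves (to schema column 0), whereas `IndexRef{0}` resolves to `col_double` and `IndexRef{1}` does not resolve at all.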
34 changes: 25 additions & 9 deletions cpp/tests/io/parquet_reader_test.cpp
@@ -1413,18 +1413,34 @@ TEST_F(ParquetReaderTest, FilterWithColumnProjection)
   auto lit = cudf::ast::literal{val};
   auto col_ref = cudf::ast::column_name_reference{"col_uint32"};
   auto col_index = cudf::ast::column_reference{0};
-  auto read_expr = cudf::ast::operation(cudf::ast::ast_operator::LESS, col_ref, lit);
   auto filter_expr = cudf::ast::operation(cudf::ast::ast_operator::LESS, col_index, lit);

-  auto predicate = cudf::compute_column(src, filter_expr);
-  auto projected_table = cudf::table_view{{src.get_column(2)}};
-  auto expected = cudf::apply_boolean_mask(projected_table, *predicate);
+  auto predicate = cudf::compute_column(src, filter_expr);

-  auto read_opts = cudf::io::parquet_reader_options::builder(cudf::io::source_info{filepath})
-                     .columns({"col_double"})
-                     .filter(read_expr);
-  auto result = cudf::io::read_parquet(read_opts);
-  CUDF_TEST_EXPECT_TABLES_EQUAL(*result.tbl, *expected);
+  {  // column_name_reference in parquet filter (not present in column projection)
+    auto read_expr = cudf::ast::operation(cudf::ast::ast_operator::LESS, col_ref, lit);
+    auto projected_table = cudf::table_view{{src.get_column(2)}};
+    auto expected = cudf::apply_boolean_mask(projected_table, *predicate);
+
+    auto read_opts = cudf::io::parquet_reader_options::builder(cudf::io::source_info{filepath})
+                       .columns({"col_double"})
+                       .filter(read_expr);
+    auto result = cudf::io::read_parquet(read_opts);
+    CUDF_TEST_EXPECT_TABLES_EQUAL(*result.tbl, *expected);
+  }
+
+  {  // column_reference in parquet filter (indices as per order of column projection)
+    auto col_index2 = cudf::ast::column_reference{1};
+    auto read_ref_expr = cudf::ast::operation(cudf::ast::ast_operator::LESS, col_index, lit);
+
+    auto projected_table = cudf::table_view{{src.get_column(2), src.get_column(0)}};
+    auto expected = cudf::apply_boolean_mask(projected_table, *predicate);
+    auto read_opts = cudf::io::parquet_reader_options::builder(cudf::io::source_info{filepath})
+                       .columns({"col_double", "col_uint32"})
+                       .filter(read_ref_expr);
+    auto result = cudf::io::read_parquet(read_opts);
+    CUDF_TEST_EXPECT_TABLES_EQUAL(*result.tbl, *expected);
+  }
 }

 TEST_F(ParquetReaderTest, FilterReferenceExpression)
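
The expected tables in the test above are built by evaluating the predicate on the full source table and then applying the resulting row mask to only the projected columns. A minimal sketch of that filter-then-project pattern in standard C++ follows, assuming a toy column-major table of doubles. `Column`, `Table`, `less_than`, `project`, and `apply_mask` are hypothetical helpers for illustration and only loosely mirror `cudf::compute_column` and `cudf::apply_boolean_mask`.

```cpp
#include <cstddef>
#include <stdexcept>
#include <utility>
#include <vector>

using Column = std::vector<double>;
using Table  = std::vector<Column>;  // column-major; all columns share one length

// Builds a row mask from one column: true where the value is below the threshold.
inline std::vector<bool> less_than(Column const& col, double threshold)
{
  std::vector<bool> mask;
  mask.reserve(col.size());
  for (double v : col) { mask.push_back(v < threshold); }
  return mask;
}

// Selects columns by position, in the order given (throws on a bad index).
inline Table project(Table const& src, std::vector<std::size_t> const& indices)
{
  Table out;
  out.reserve(indices.size());
  for (auto i : indices) { out.push_back(src.at(i)); }
  return out;
}

// Keeps only the rows whose mask entry is true.
inline Table apply_mask(Table const& tbl, std::vector<bool> const& mask)
{
  Table out;
  out.reserve(tbl.size());
  for (auto const& col : tbl) {
    if (col.size() != mask.size()) {
      throw std::invalid_argument("mask and column lengths differ");
    }
    Column kept;
    for (std::size_t r = 0; r < col.size(); ++r) {
      if (mask[r]) { kept.push_back(col[r]); }
    }
    out.push_back(std::move(kept));
  }
  return out;
}
```

The mask is computed from a column that need not appear in the output, which is the situation the first test block exercises: the filter reads `col_uint32` while only `col_double` is returned.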