Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Parquet reader cannot use filters in tandem with column projection #15051

Closed
wence- opened this issue Feb 14, 2024 · 1 comment · Fixed by #15113
Closed

[BUG] Parquet reader cannot use filters in tandem with column projection #15051

wence- opened this issue Feb 14, 2024 · 1 comment · Fixed by #15113
Assignees
Labels
1 - On Deck To be worked on next bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code.

Comments

@wence-
Copy link
Contributor

wence- commented Feb 14, 2024

Describe the bug

Reading a column-projected parquet file (that has $N$ on disk columns) and simultaneously applying a filter expression to the read columns is not possible, unless the columns we request to read are the first $[0, \dots, K), K \le N$.

The problem appears to be that the filter expression must (eventually) refer to columns by their index, however one part of the code (that uses row-group filters) expects these indices to match the indices of the columns in the un-projected metadata, whereas another part needs the indices to match the requested projected column indices. It is not possible to arrange this to be the case for either a reordering or a non-dense subset of columns.

Steps/Code to reproduce bug

diff --git a/cpp/tests/io/parquet_reader_test.cpp b/cpp/tests/io/parquet_reader_test.cpp
index abbd0c97f0..1113d6c578 100644
--- a/cpp/tests/io/parquet_reader_test.cpp
+++ b/cpp/tests/io/parquet_reader_test.cpp
@@ -1406,6 +1406,28 @@ TEST_F(ParquetReaderTest, FilterIdentity)
   CUDF_TEST_EXPECT_TABLES_EQUAL(*result.tbl, *result2.tbl);
 }
 
+TEST_F(ParquetReaderTest, FilterWithColumnProjection)
+{
+  auto [src, filepath] = create_parquet_with_stats("FilterWithColumnProjection.parquet");
+  auto val = cudf::numeric_scalar<uint32_t>{10};
+  auto lit = cudf::ast::literal{val};
+  auto col_ref = cudf::ast::column_name_reference{"col_uint32"};
+  auto col_index = cudf::ast::column_reference{0};
+  auto read_expr = cudf::ast::operation(cudf::ast::ast_operator::LESS, col_ref, lit);
+  auto filter_expr = cudf::ast::operation(cudf::ast::ast_operator::LESS, col_index, lit);
+
+  auto predicate = cudf::compute_column(src, filter_expr);
+  auto projected_table = cudf::table_view{{src.get_column(2), src.get_column(0)}};
+  auto expected = cudf::apply_boolean_mask(projected_table, *predicate);
+
+  auto read_opts =
+    cudf::io::parquet_reader_options::builder(cudf::io::source_info{filepath})
+     .columns({"col_double", "col_uint32"})
+     .filter(read_expr);
+  auto result = cudf::io::read_parquet(read_opts);
+  CUDF_TEST_EXPECT_TABLES_EQUAL(*result.tbl, *expected);
+}
+
 TEST_F(ParquetReaderTest, FilterReferenceExpression)
 {
   auto [src, filepath] = create_parquet_with_stats("FilterReferenceExpression.parquet");

Build the cudf tests and then run:

$ ./PARQUET_TEST --gtest_filter=ParquetReaderTest.FilterWithColumnProjection
Note: Google Test filter = ParquetReaderTest.FilterWithColumnProjection
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from ParquetReaderTest
[ RUN      ] ParquetReaderTest.FilterWithColumnProjection
unknown file: Failure
C++ exception with description "CUDF failure at:/home/coder/cudf/cpp/src/io/parquet/predicate_pushdown.cpp:111: Invalid type and stats combination" thrown in the test body.

[  FAILED  ] ParquetReaderTest.FilterWithColumnProjection (13 ms)
[----------] 1 test from ParquetReaderTest (13 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (13 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] ParquetReaderTest.FilterWithColumnProjection

 1 FAILED TEST

Expected behavior

This should work.

@wence- wence- added bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. cuIO cuIO issue labels Feb 14, 2024
@GregoryKimball GregoryKimball added the 1 - On Deck To be worked on next label Feb 14, 2024
@karthikeyann
Copy link
Contributor

Looking into it.
Issue is with using output schema as schema of parquet while reading statistics data. parquet schema needs to be extracted regardless of output schema.

rapids-bot bot pushed a commit that referenced this issue May 16, 2024
Fixes #15051

The predicate filtering in parquet did not work while column projection is used. This PR fixes that limitation.

With this PR change, the user will be able to use both column name reference and column index reference in the filter.
- column name reference: the filters may specify any columns by name even if they are not present in column projection.
- column reference (index): The indices used should be the indices of output columns in the requested order.

This is achieved by extracting column names from filter and add to output buffers, after predicate filtering is done, these filter-only columns are removed and only requested columns are returned.
The change includes reading only output columns' statistics data instead of all root columns.

Summary of changes:
- `get_column_names_in_expression` extracts column names in filter.
- The extra columns in filter are added to output buffers during reader initialization
  - `cpp/src/io/parquet/reader_impl_helpers.cpp`, `cpp/src/io/parquet/reader_impl.cpp`
- instead of extracting statistics data of all root columns, it extracts for only output columns (including columns in filter)
  - `cpp/src/io/parquet/predicate_pushdown.cpp`
  - To do this, output column schemas and its dtypes should be cached.
  - statistics data extraction code is updated to check for `schema_idx` in row group metadata.
  - No need to convert filter again for all root columns, reuse the passed output columns reference filter. 
  - Rest of the code is same.
- After the output filter predicate is calculated, these filter-only columns are removed
- moved `named_to_reference_converter` constructor to cpp, and remove used constructor.
- small include<> cleanup

Authors:
  - Karthikeyan (https://github.com/karthikeyann)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Muhammad Haseeb (https://github.com/mhaseeb123)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Muhammad Haseeb (https://github.com/mhaseeb123)

URL: #15113
@GregoryKimball GregoryKimball removed this from libcudf Jul 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
1 - On Deck To be worked on next bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants