[FEA] AST filtering in parquet reader #13348

Merged
63 commits
846ca2e
Read Statistics in parquet Reader, update writer to use Statistics
karthikeyann May 12, 2023
58d5472
fix mistake in detail api signature
karthikeyann May 13, 2023
aeac4a9
add filter (AST) to parquet reader options
karthikeyann May 13, 2023
0412e49
Apply AST as filter on output columns, unit test
karthikeyann May 14, 2023
c2e6589
style check, doxygen check failure fixes
karthikeyann May 14, 2023
dbcc3ce
add filter to required pq reader functions
karthikeyann May 26, 2023
26b124e
Merge branch 'branch-23.08' of github.com:rapidsai/cudf into fea-parq…
karthikeyann May 26, 2023
0a0c8d1
cleanup statistics_blob
karthikeyann May 30, 2023
81c2026
add expression_transformer in AST
karthikeyann May 30, 2023
b1e9798
add column chunk statistics based row group filtering
karthikeyann May 30, 2023
2c7d211
example unit test cases for filter AST
karthikeyann May 30, 2023
9589b94
Merge branch 'branch-23.08' of github.com:rapidsai/cudf into fea-parq…
karthikeyann May 30, 2023
420edd8
rename column chunk statistics_blob to statistics
karthikeyann May 30, 2023
90ca3eb
style fix clang-format
karthikeyann May 30, 2023
02c3033
style fix
karthikeyann May 30, 2023
013df79
Parse column chunk metadata statistics in parquet reader
karthikeyann May 30, 2023
54f7e80
Merge branch 'branch-23.08' of github.com:rapidsai/cudf into fea-parq…
karthikeyann Jun 1, 2023
03760a3
address review comments
karthikeyann Jun 1, 2023
8515929
Merge branch 'fea-parquet_colchunk_stats' of github.com:karthikeyann/…
karthikeyann Jun 1, 2023
a7d7bd0
Merge branch 'branch-23.08' into fea-parquet_predicate_row_group
karthikeyann Jun 2, 2023
67ac0a5
Merge branch 'branch-23.08' of github.com:rapidsai/cudf into fea-parq…
karthikeyann Jun 9, 2023
031cd83
Merge branch 'branch-23.08' of github.com:rapidsai/cudf into fea-parq…
karthikeyann Jun 10, 2023
d770050
Merge branch 'branch-23.08' into fea-parquet_predicate_row_group
karthikeyann Jun 10, 2023
bdbe925
Merge branch 'fea-parquet_predicate_row_group' of github.com:karthike…
karthikeyann Jun 10, 2023
bb1454d
support all types except compound types (dict, struct, list)
karthikeyann Jun 13, 2023
9e58086
Merge branch 'branch-23.08' of github.com:rapidsai/cudf into fea-parq…
karthikeyann Jun 13, 2023
7654aff
Merge branch 'branch-23.08' of github.com:rapidsai/cudf into fea-parq…
karthikeyann Jun 13, 2023
13b4270
Merge branch 'branch-23.08' of github.com:rapidsai/cudf into fea-parq…
karthikeyann Jun 19, 2023
8f90fb4
add ast::column_name_reference as expression
karthikeyann Jun 20, 2023
ed85a01
Avoid emplace_back empty string to schema_info
karthikeyann Jun 20, 2023
d7cec2b
add populate_metadata in parquet reader::impl
karthikeyann Jun 20, 2023
feb42be
add named_to_reference_converter of parquet filter expression
karthikeyann Jun 20, 2023
49d06d9
Merge branch 'branch-23.08' of github.com:rapidsai/cudf into fea-parq…
karthikeyann Jun 20, 2023
b26cd35
fix string type stats column creation
karthikeyann Jun 22, 2023
a86e153
add typed unit tests for Filter Predicate pushdown
karthikeyann Jun 22, 2023
14b2d82
Merge branch 'branch-23.08' of github.com:rapidsai/cudf into fea-parq…
karthikeyann Jun 22, 2023
5a559d4
update tests
karthikeyann Jun 22, 2023
f8b734f
fix reference invalidation by replacing vector with list
karthikeyann Jun 22, 2023
b82dc88
cleanup using enum
karthikeyann Jun 26, 2023
2fa5363
if no stats or all nulls skip filtering
karthikeyann Jun 27, 2023
c141041
unit tests for corner cases
karthikeyann Jun 27, 2023
9f1dbed
Merge branch 'branch-23.08' of github.com:rapidsai/cudf into fea-parq…
karthikeyann Jun 27, 2023
d735acd
address review comments (@bdice)
karthikeyann Jul 2, 2023
474f31e
Merge branch 'branch-23.08' of github.com:rapidsai/cudf into fea-parq…
karthikeyann Jul 2, 2023
80d9a44
move possibly_null_value to operators.hpp
karthikeyann Jul 6, 2023
13295db
split convert<T> as per storage type specializations
karthikeyann Jul 6, 2023
3e06303
Merge branch 'branch-23.08' of github.com:rapidsai/cudf into fea-parq…
karthikeyann Jul 6, 2023
1d2dcd2
use cudf::is_floating
karthikeyann Jul 6, 2023
f1ecac2
address review commments (@bdice)
karthikeyann Jul 12, 2023
cc5998e
Merge branch 'branch-23.08' of github.com:rapidsai/cudf into fea-parq…
karthikeyann Jul 12, 2023
0d66893
update commment
karthikeyann Jul 12, 2023
989a558
review comments, resolve merge issue
karthikeyann Jul 13, 2023
5d0f592
add anoymous namespace
karthikeyann Jul 14, 2023
ab58f61
Merge branch 'branch-23.08' of github.com:rapidsai/cudf into fea-parq…
karthikeyann Jul 14, 2023
f2e4cbc
style fix
karthikeyann Jul 17, 2023
b8a751b
Merge branch 'branch-23.08' of github.com:rapidsai/cudf into fea-parq…
karthikeyann Jul 17, 2023
b4c1e69
add float nan test
karthikeyann Jul 19, 2023
2de4e1d
add Cython bindings
karthikeyann Jul 19, 2023
3e270c7
style fix
karthikeyann Jul 19, 2023
2f04e5e
Merge branch 'branch-23.08' of github.com:rapidsai/cudf into fea-parq…
karthikeyann Jul 19, 2023
ef2da72
address review comments (bdice)
karthikeyann Jul 21, 2023
18eeb88
Merge branch 'branch-23.08' of github.com:rapidsai/cudf into fea-parq…
karthikeyann Jul 21, 2023
b507b28
Merge branch 'branch-23.08' into fea-parquet_predicate_row_group
karthikeyann Jul 26, 2023
1 change: 1 addition & 0 deletions conda/recipes/libcudf/meta.yaml
@@ -114,6 +114,7 @@ outputs:
- test -f $PREFIX/lib/libcudf_identify_stream_usage_mode_testing.so
- test -f $PREFIX/include/cudf/aggregation.hpp
- test -f $PREFIX/include/cudf/ast/detail/expression_parser.hpp
- test -f $PREFIX/include/cudf/ast/detail/expression_transformer.hpp
- test -f $PREFIX/include/cudf/ast/detail/operators.hpp
- test -f $PREFIX/include/cudf/ast/expressions.hpp
- test -f $PREFIX/include/cudf/binaryop.hpp
1 change: 1 addition & 0 deletions cpp/CMakeLists.txt
@@ -395,6 +395,7 @@ add_library(
src/io/parquet/page_enc.cu
src/io/parquet/page_hdr.cu
src/io/parquet/page_string_decode.cu
src/io/parquet/predicate_pushdown.cpp
src/io/parquet/reader.cpp
src/io/parquet/reader_impl.cpp
src/io/parquet/reader_impl_helpers.cpp
27 changes: 8 additions & 19 deletions cpp/include/cudf/ast/detail/expression_parser.hpp
@@ -15,12 +15,12 @@
*/
#pragma once

#include <cudf/ast/detail/operators.hpp>
#include <cudf/ast/expressions.hpp>
#include <cudf/scalar/scalar_device_view.cuh>
#include <cudf/table/table_view.hpp>
#include <cudf/types.hpp>

#include <thrust/optional.h>
#include <thrust/scan.h>

#include <functional>
@@ -72,24 +72,6 @@ struct alignas(8) device_data_reference {
}
};

// Type trait for wrapping nullable types in a thrust::optional. Non-nullable
// types are returned as is.
template <typename T, bool has_nulls>
struct possibly_null_value;

template <typename T>
struct possibly_null_value<T, true> {
using type = thrust::optional<T>;
};

template <typename T>
struct possibly_null_value<T, false> {
using type = T;
};

template <typename T, bool has_nulls>
using possibly_null_value_t = typename possibly_null_value<T, has_nulls>::type;

// Type used for intermediate storage in expression evaluation.
template <bool has_nulls>
using IntermediateDataType = possibly_null_value_t<std::int64_t, has_nulls>;
@@ -193,6 +175,13 @@ class expression_parser {
*/
cudf::size_type visit(operation const& expr);

/**
* @brief Visit a column name reference expression.
*
* @param expr Column name reference expression.
* @return cudf::size_type Index of device data reference for the expression.
*/
cudf::size_type visit(column_name_reference const& expr);
/**
* @brief Internal class used to track the utilization of intermediate storage locations.
*
64 changes: 64 additions & 0 deletions cpp/include/cudf/ast/detail/expression_transformer.hpp
@@ -0,0 +1,64 @@

/*
* Copyright (c) 2023, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#pragma once

#include <cudf/ast/expressions.hpp>

namespace cudf::ast::detail {
/**
* @brief Base class for expression transformers, the "visitor" counterpart of the `expression` class.
*
* This class can be used to implement recursive traversal of an AST and to validate or
* translate an AST expression.
*/
class expression_transformer {
public:
/**
* @brief Visit a literal expression.
*
* @param expr Literal expression
* @return Reference wrapper of transformed expression
*/
virtual std::reference_wrapper<expression const> visit(literal const& expr) = 0;

/**
* @brief Visit a column reference expression.
*
* @param expr Column reference expression
* @return Reference wrapper of transformed expression
*/
virtual std::reference_wrapper<expression const> visit(column_reference const& expr) = 0;

/**
* @brief Visit an operation expression.
*
* @param expr Operation expression
* @return Reference wrapper of transformed expression
*/
virtual std::reference_wrapper<expression const> visit(operation const& expr) = 0;

/**
* @brief Visit a column name reference expression.
*
* @param expr Column name reference expression
* @return Reference wrapper of transformed expression
*/
virtual std::reference_wrapper<expression const> visit(column_name_reference const& expr) = 0;

virtual ~expression_transformer() {}
};
} // namespace cudf::ast::detail
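The visitor interface above can be illustrated with a minimal, self-contained sketch. It assumes simplified stand-in types in a hypothetical `sketch` namespace (these are not the cudf classes) and shows one plausible transformer: a converter that rewrites name-based column references into index-based ones. The `name_to_index_converter` and `resolve` names are illustrative assumptions. Newly created nodes are kept in a `std::list` so that references handed out earlier are not invalidated when more nodes are added.

```cpp
#include <cassert>
#include <functional>
#include <list>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <utility>

// Stand-in types for illustration only; these are NOT the cudf::ast classes.
namespace sketch {
struct transformer;
struct column_reference;
struct column_name_reference;

struct expression {
  virtual std::reference_wrapper<expression const> accept(transformer& visitor) const = 0;
  virtual ~expression() = default;
};

// Visitor interface: one overload per concrete expression type.
struct transformer {
  virtual std::reference_wrapper<expression const> visit(column_reference const& expr)      = 0;
  virtual std::reference_wrapper<expression const> visit(column_name_reference const& expr) = 0;
  virtual ~transformer() = default;
};

struct column_reference : expression {
  explicit column_reference(int index) : index(index) {}
  std::reference_wrapper<expression const> accept(transformer& visitor) const override
  {
    return visitor.visit(*this);  // double dispatch on the concrete type
  }
  int index;
};

struct column_name_reference : expression {
  explicit column_name_reference(std::string name) : name(std::move(name)) {}
  std::reference_wrapper<expression const> accept(transformer& visitor) const override
  {
    return visitor.visit(*this);
  }
  std::string name;
};

// Hypothetical transformer: rewrites name references into index references.
class name_to_index_converter : public transformer {
 public:
  explicit name_to_index_converter(std::unordered_map<std::string, int> indices)
    : _indices(std::move(indices))
  {
  }

  std::reference_wrapper<expression const> visit(column_reference const& expr) override
  {
    return expr;  // already index-based, returned unchanged
  }

  std::reference_wrapper<expression const> visit(column_name_reference const& expr) override
  {
    auto const it = _indices.find(expr.name);
    if (it == _indices.end()) { throw std::invalid_argument("unknown column: " + expr.name); }
    // std::list never relocates its elements, so earlier references stay valid.
    _owned.emplace_back(it->second);
    return _owned.back();
  }

 private:
  std::unordered_map<std::string, int> _indices;
  std::list<column_reference> _owned;  // owns the nodes created by this converter
};

// Convenience helper: resolve one name and return the resulting column index.
inline int resolve(name_to_index_converter& converter, std::string const& name)
{
  column_name_reference const by_name(name);
  auto const& out = by_name.accept(converter).get();
  auto const* ref = dynamic_cast<column_reference const*>(&out);
  assert(ref != nullptr);  // this converter always yields an index reference
  return ref->index;
}
}  // namespace sketch
```

Because the converter owns the nodes it creates, the returned references are only valid while the converter object is alive.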
20 changes: 20 additions & 0 deletions cpp/include/cudf/ast/detail/operators.hpp
@@ -20,6 +20,8 @@
#include <cudf/utilities/error.hpp>
#include <cudf/utilities/type_dispatcher.hpp>

#include <thrust/optional.h>

#include <cuda/std/type_traits>

#include <cmath>
@@ -33,6 +35,24 @@ namespace ast {

namespace detail {

// Type trait for wrapping nullable types in a thrust::optional. Non-nullable
// types are returned as is.
template <typename T, bool has_nulls>
struct possibly_null_value;

template <typename T>
struct possibly_null_value<T, true> {
using type = thrust::optional<T>;
};

template <typename T>
struct possibly_null_value<T, false> {
using type = T;
};

template <typename T, bool has_nulls>
using possibly_null_value_t = typename possibly_null_value<T, has_nulls>::type;

// Traits for valid operator / type combinations
template <typename Op, typename LHS, typename RHS>
constexpr bool is_valid_binary_op = cuda::std::is_invocable_v<Op, LHS, RHS>;
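As a rough illustration of how the `possibly_null_value` trait selects a storage type, here is a host-only sketch. It assumes `std::optional` as a stand-in for `thrust::optional` and lives in a hypothetical `sketch` namespace; the `add` function is an illustrative assumption, not part of libcudf.

```cpp
#include <cstdint>
#include <optional>
#include <type_traits>

// Host-only sketch: std::optional stands in for thrust::optional.
namespace sketch {
template <typename T, bool has_nulls>
struct possibly_null_value;

template <typename T>
struct possibly_null_value<T, true> {
  using type = std::optional<T>;  // nullable: wrap the value
};

template <typename T>
struct possibly_null_value<T, false> {
  using type = T;  // non-nullable: use the value type as is
};

template <typename T, bool has_nulls>
using possibly_null_value_t = typename possibly_null_value<T, has_nulls>::type;

// Adds two values; in the nullable case a null operand yields a null result.
template <bool has_nulls>
possibly_null_value_t<std::int64_t, has_nulls> add(
  possibly_null_value_t<std::int64_t, has_nulls> lhs,
  possibly_null_value_t<std::int64_t, has_nulls> rhs)
{
  if constexpr (has_nulls) {
    if (!lhs.has_value() || !rhs.has_value()) { return std::nullopt; }
    return *lhs + *rhs;
  } else {
    return lhs + rhs;
  }
}

// The trait resolves at compile time, so the non-nullable path carries no wrapper overhead.
static_assert(std::is_same_v<possibly_null_value_t<std::int64_t, false>, std::int64_t>);
static_assert(
  std::is_same_v<possibly_null_value_t<std::int64_t, true>, std::optional<std::int64_t>>);
}  // namespace sketch
```

`has_nulls` cannot be deduced from the arguments, so callers name it explicitly, for example `sketch::add<true>(lhs, rhs)`.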
72 changes: 71 additions & 1 deletion cpp/include/cudf/ast/expressions.hpp
@@ -29,7 +29,8 @@ namespace ast {
// Forward declaration.
namespace detail {
class expression_parser;
}
class expression_transformer;
} // namespace detail

/**
* @brief A generic expression that can be evaluated to return a value.
@@ -46,6 +47,15 @@ struct expression {
*/
virtual cudf::size_type accept(detail::expression_parser& visitor) const = 0;

/**
* @brief Accepts a visitor class.
*
* @param visitor The `expression_transformer` transforming this expression tree
* @return Reference wrapper of transformed expression
*/
virtual std::reference_wrapper<expression const> accept(
detail::expression_transformer& visitor) const = 0;

/**
* @brief Returns true if the expression may evaluate to null.
*
@@ -305,6 +315,12 @@ class literal : public expression {
*/
cudf::size_type accept(detail::expression_parser& visitor) const override;

/**
* @copydoc expression::accept
*/
std::reference_wrapper<expression const> accept(
detail::expression_transformer& visitor) const override;

[[nodiscard]] bool may_evaluate_null(table_view const& left,
table_view const& right,
rmm::cuda_stream_view stream) const override
@@ -398,6 +414,12 @@ class column_reference : public expression {
*/
cudf::size_type accept(detail::expression_parser& visitor) const override;

/**
* @copydoc expression::accept
*/
std::reference_wrapper<expression const> accept(
detail::expression_transformer& visitor) const override;

[[nodiscard]] bool may_evaluate_null(table_view const& left,
table_view const& right,
rmm::cuda_stream_view stream) const override
@@ -458,6 +480,12 @@ class operation : public expression {
*/
cudf::size_type accept(detail::expression_parser& visitor) const override;

/**
* @copydoc expression::accept
*/
std::reference_wrapper<expression const> accept(
detail::expression_transformer& visitor) const override;

[[nodiscard]] bool may_evaluate_null(table_view const& left,
table_view const& right,
rmm::cuda_stream_view stream) const override
@@ -474,6 +502,48 @@ class operation : public expression {
std::vector<std::reference_wrapper<expression const>> const operands;
};

/**
* @brief An expression referring to data from a column in a table.
*/
class column_name_reference : public expression {
public:
/**
* @brief Construct a new column name reference object
*
* @param column_name Name of this column in the table metadata (provided when the expression is
* evaluated).
*/
column_name_reference(std::string column_name) : column_name(std::move(column_name)) {}

/**
* @brief Get the column name.
*
* @return The name of this column reference
*/
[[nodiscard]] std::string get_column_name() const { return column_name; }

/**
* @copydoc expression::accept
*/
cudf::size_type accept(detail::expression_parser& visitor) const override;

/**
* @copydoc expression::accept
*/
std::reference_wrapper<expression const> accept(
detail::expression_transformer& visitor) const override;

[[nodiscard]] bool may_evaluate_null(table_view const& left,
table_view const& right,
rmm::cuda_stream_view stream) const override
{
return true;
}

private:
std::string column_name;
};

} // namespace ast

} // namespace cudf
4 changes: 2 additions & 2 deletions cpp/include/cudf/detail/transform.hpp
@@ -41,8 +41,8 @@ std::unique_ptr<column> transform(column_view const& input,
*
* @param stream CUDA stream used for device memory operations and kernel launches.
*/
std::unique_ptr<column> compute_column(table_view const table,
ast::operation const& expr,
std::unique_ptr<column> compute_column(table_view const& table,
ast::expression const& expr,
rmm::cuda_stream_view stream,
rmm::mr::device_memory_resource* mr);

30 changes: 30 additions & 0 deletions cpp/include/cudf/io/parquet.hpp
@@ -16,6 +16,7 @@

#pragma once

#include <cudf/ast/expressions.hpp>
#include <cudf/io/detail/parquet.hpp>
#include <cudf/io/types.hpp>
#include <cudf/table/table_view.hpp>
@@ -62,6 +63,9 @@ class parquet_reader_options {
// Number of rows to read; `nullopt` is all
std::optional<size_type> _num_rows;

// Predicate filter as AST to filter output rows.
std::optional<std::reference_wrapper<ast::expression const>> _filter;

// Whether to store string data as categorical type
bool _convert_strings_to_categories = false;
// Whether to use PANDAS metadata to load columns
@@ -160,6 +164,13 @@ class parquet_reader_options {
*/
[[nodiscard]] auto const& get_row_groups() const { return _row_groups; }

/**
* @brief Returns AST based filter for predicate pushdown.
*
* @return AST expression to use as filter
*/
[[nodiscard]] auto const& get_filter() const { return _filter; }

/**
* @brief Returns timestamp type used to cast timestamp columns.
*
@@ -181,6 +192,13 @@ class parquet_reader_options {
*/
void set_row_groups(std::vector<std::vector<size_type>> row_groups);

/**
* @brief Sets AST based filter for predicate pushdown.
*
* @param filter AST expression to use as filter
*/
void set_filter(ast::expression const& filter) { _filter = filter; }

/**
* @brief Sets to enable/disable conversion of strings to categories.
*
@@ -273,6 +291,18 @@ class parquet_reader_options_builder {
return *this;
}

/**
* @brief Sets AST based filter for predicate pushdown.
*
* @param filter AST expression to use as filter
* @return this for chaining
*/
parquet_reader_options_builder& filter(ast::expression const& filter)
{
options.set_filter(filter);
return *this;
}

/**
* @brief Sets enable/disable conversion of strings to categories.
*
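One consequence of storing the filter as `std::optional<std::reference_wrapper<ast::expression const>>` is that the options object holds a reference rather than a copy, so the expression presumably has to outlive the options that refer to it. The following self-contained sketch mirrors that storage pattern with simplified stand-in types in a hypothetical `sketch` namespace (not the cudf classes).

```cpp
#include <functional>
#include <optional>
#include <string>
#include <utility>

// Stand-in types for illustration only; these are NOT the cudf classes.
namespace sketch {
struct expression {
  explicit expression(std::string text) : text(std::move(text)) {}
  std::string text;
};

class reader_options {
 public:
  // Stores a reference, not a copy: the caller keeps ownership of `filter`.
  void set_filter(expression const& filter) { _filter = filter; }

  // Empty optional means "no filter was set".
  [[nodiscard]] auto const& get_filter() const { return _filter; }

 private:
  std::optional<std::reference_wrapper<expression const>> _filter;
};

class reader_options_builder {
 public:
  // Returns *this so that calls can be chained.
  reader_options_builder& filter(expression const& filter)
  {
    _options.set_filter(filter);
    return *this;
  }

  [[nodiscard]] reader_options build() const { return _options; }

 private:
  reader_options _options;
};
}  // namespace sketch
```

With this design, passing a temporary expression to `filter` would leave a dangling reference once the temporary is destroyed, so the expression should be a named object whose lifetime covers the read.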
12 changes: 11 additions & 1 deletion cpp/src/ast/expression_parser.cpp
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2020-2021, NVIDIA CORPORATION.
* Copyright (c) 2020-2023, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
@@ -193,6 +193,16 @@ cudf::size_type expression_parser::visit(operation const& expr)
return index;
}

// TODO: Eliminate column name references from expression_parser, because the
// code paths that support them diverge:
// 1. Column name references are specific to cuIO.
// 2. Column name references are not supported in libcudf table operations such as join and
// transform.
cudf::size_type expression_parser::visit(column_name_reference const& expr)
{
CUDF_FAIL("Column name references are not supported in the AST expression parser.");
}

cudf::data_type expression_parser::output_type() const
{
return _data_references.empty() ? cudf::data_type(cudf::type_id::EMPTY)