Enable AST-based joining #8214

Merged on Jul 14, 2021 (84 commits)
84 commits
38b01ec
Add scaffolding for new predicated join.
vyasr May 7, 2021
fe976e2
Pass a memory resource around where appropriate.
vyasr May 7, 2021
e3e2d5f
Add new code paths for expression-based joining to follow.
vyasr May 7, 2021
3e84671
Update AST evaluator infrastructure with copied classes that support …
vyasr May 11, 2021
1426eed
Use AST to actually evaluate the join predicate.
vyasr May 11, 2021
95a06a0
Remove extraneous equality comparator.
vyasr May 11, 2021
4a76131
Add various tests and isolate one small bug.
vyasr May 11, 2021
cd913ce
Use table references to indicate which table to use as input.
vyasr May 11, 2021
1051969
Stop passing extra indexes through.
vyasr May 11, 2021
fbab8b1
Enable unary operators.
vyasr May 11, 2021
449286b
Unify row_* and two_table_* classes.
vyasr May 11, 2021
990809c
Fully enable unary ops in joins.
vyasr May 11, 2021
fe35939
Convert the evaluate functions into methods of the evaluator.
vyasr May 14, 2021
171775d
Reroute evaluate_row_expression through evaluate_join expression and …
vyasr May 14, 2021
d3f908a
Remove sizes and pointers from ast_plan and make them constructor-local.
vyasr May 15, 2021
1e4db5a
Store a reference to the plan directly in the evaluator.
vyasr May 15, 2021
55e4c57
Make ast_plan a trivial class.
vyasr May 15, 2021
6ed4902
Create new dev plan struct to isolate just the device pointers to pro…
vyasr May 15, 2021
47b6cf2
Make two_table_output and its subclasses nested classes of the evalua…
vyasr May 17, 2021
922b6fc
Rename two_table to expression.
vyasr May 17, 2021
ebba0f2
Rename evaluate_row_expression and evaluate_join_expression to overlo…
vyasr May 17, 2021
548da10
Remove duplicate join functions and rename.
vyasr May 17, 2021
d6f6a3a
Construct linearizer as part of ast_plan.
vyasr May 21, 2021
c9863fb
Allow linearizer to handle left and right tables.
vyasr May 21, 2021
bce90fa
Error on non-boolean expressions.
vyasr May 21, 2021
8df7076
Template expression_evaluator on the output type.
vyasr May 21, 2021
fa3a543
Template the behavior of resolve_output.
vyasr May 21, 2021
5e6eaeb
Use new output type for boolean test.
vyasr May 21, 2021
41b5548
Update join size estimation to use the expression.
vyasr May 21, 2021
6455a36
Compute explicit join size and simplify join code by removing estimat…
vyasr May 21, 2021
3407133
Some minor cleanup.
vyasr May 21, 2021
5d6777b
Rename estimate function.
vyasr May 21, 2021
ff4446b
Clean up a few TODO items.
vyasr May 21, 2021
71a3405
Various minor cleanup tasks, moving shmem calcs into dev plan.
vyasr May 24, 2021
5fd48e2
Move conditional join tests into a separate file, use fixtures to cle…
vyasr May 26, 2021
0bc9228
Rename predicate join to conditional join.
vyasr May 25, 2021
bd875ab
Enable inner and left joins separately and start adding left join tests.
vyasr May 27, 2021
241e5ab
Standardize grid layout for computing output size.
vyasr May 27, 2021
a74c85e
Inline computation of join output size.
vyasr May 27, 2021
ff2ebfe
Add appropriate filtering for all types of joins throughout the code.
vyasr May 27, 2021
082b33c
Refactor tests to use namespace constants for references to make it e…
vyasr May 27, 2021
c65a79f
Enable left semi and left anti joins.
vyasr May 27, 2021
44b8cb7
Enable full joins.
vyasr May 28, 2021
57d1c14
Restructure test fixtures to reduce duplication.
vyasr May 28, 2021
60fb013
Enable test by comparison to hash join.
vyasr May 28, 2021
cba5566
Move output test to the GPU to speed up tests.
vyasr May 28, 2021
8948ec2
Change test back to actually checking all output pairs rather than va…
vyasr May 28, 2021
d008709
Add comparison to hash join for all join types and fix counting bug f…
vyasr Jun 1, 2021
a339568
Initial version of benchmarks for conditional joins.
vyasr Jun 1, 2021
19520fa
Update benchmarks to run for all join types, restrict data sizes to f…
vyasr Jun 1, 2021
76475a5
Add type trait to handle possibly null values and wrap intermediate s…
vyasr Jun 2, 2021
9be7bd2
Modify plan to compute shared memory allocation based on the presence…
vyasr Jun 2, 2021
7d3e54c
Add nullability to all kernels so that it's configurable from the cal…
vyasr Jun 2, 2021
d0d247d
Enable nulls for compute_column.
vyasr Jun 2, 2021
1c6cedd
Add some tests of null transforms.
vyasr Jun 2, 2021
0b09b64
Enable nullable columns in conditional joins.
vyasr Jun 3, 2021
8ec2039
Pass output object to evaluate rather than on construction.
vyasr Jun 3, 2021
ed49757
Create new container for output in the AST evaluation.
vyasr Jun 5, 2021
1a76423
Use new container to assign output.
vyasr Jun 5, 2021
0094648
Reorder template arguments to facilitate eventual deduction.
vyasr Jun 5, 2021
799fa2b
Remove template from evaluator and just template the functions.
vyasr Jun 5, 2021
c9cb0f7
Add appropriate const qualifiers.
vyasr Jun 7, 2021
a2852a7
Document all conditional join functions.
vyasr Jun 7, 2021
4e87fdf
Remove index flipping, which does not work for conditional joins.
vyasr Jun 7, 2021
19e0088
Standardize IntermediateDataType type alias.
vyasr Jun 7, 2021
d498c8c
Capture value type by reference rather than pointer to make it more t…
vyasr Jun 7, 2021
d1de764
Convert default value_container to be an owning type.
vyasr Jun 7, 2021
85b098f
Rename some structs for clarity and replace pair with optional everyw…
vyasr Jun 7, 2021
3a2a431
Add thorough documentation.
vyasr Jun 8, 2021
8c60494
Add null equality to all conditional join APIs.
vyasr Jun 8, 2021
342136c
Use compare_nulls in output resolution for the equality operator.
vyasr Jun 8, 2021
94307b2
Remove long TODO.
vyasr Jun 8, 2021
14ecdab
Remove unnecessary includes.
vyasr Jun 8, 2021
216b47f
Use CRTP to enforce a clear API for expression result types.
vyasr Jun 8, 2021
28e26d7
Add benchmarks for nullability.
vyasr Jun 8, 2021
d1dadba
Add nullable benchmarks for transform, and verify that the current is…
vyasr Jun 8, 2021
8705c3d
Fix bug in calculation of dynamic shared memory request.
vyasr Jun 18, 2021
3eb3a24
Fix missing stream sync and properly pass null equality comparison th…
vyasr Jun 18, 2021
92e5353
Merge remote-tracking branch 'origin/branch-21.08' into feature/ast_e…
vyasr Jun 29, 2021
6c9251b
Address PR comments.
vyasr Jun 29, 2021
e1390bc
Merge remote-tracking branch 'origin/branch-21.08' into feature/ast_e…
vyasr Jun 29, 2021
c1f80d6
Run clang-format.
vyasr Jun 29, 2021
500d39d
Merge remote-tracking branch 'origin/branch-21.08' into feature/ast_e…
vyasr Jul 12, 2021
ffcf145
Merge branch 'feature/ast_equijoin' of github.com:vyasr/cudf into fea…
vyasr Jul 12, 2021
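
The commits above build up a conditional join, in which an arbitrary boolean predicate over a left row and a right row (rather than key equality) decides which row pairs match. As a rough CPU-side illustration of those semantics only, here is a nested-loop reference sketch for the inner and left variants. Every name in it (`conditional_join_reference`, `JoinKind`, `kNoMatch`, `IndexPair`) is hypothetical; it is not the cudf API and not the GPU implementation in this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical sentinel marking "no matching right row" in a left join.
constexpr std::ptrdiff_t kNoMatch = -1;

using IndexPair = std::pair<std::ptrdiff_t, std::ptrdiff_t>;

enum class JoinKind { Inner, Left };

// Nested-loop reference: emit (left index, right index) for every pair of
// rows on which the predicate is true. For a left join, a left row with no
// match is emitted once, paired with kNoMatch.
template <typename T, typename Predicate>
std::vector<IndexPair> conditional_join_reference(std::vector<T> const& left,
                                                  std::vector<T> const& right,
                                                  Predicate pred,
                                                  JoinKind kind)
{
  std::vector<IndexPair> out;
  for (std::size_t i = 0; i < left.size(); ++i) {
    bool matched = false;
    for (std::size_t j = 0; j < right.size(); ++j) {
      if (pred(left[i], right[j])) {
        out.emplace_back(static_cast<std::ptrdiff_t>(i), static_cast<std::ptrdiff_t>(j));
        matched = true;
      }
    }
    if (kind == JoinKind::Left && !matched) {
      out.emplace_back(static_cast<std::ptrdiff_t>(i), kNoMatch);
    }
  }
  // Postcondition: a left join emits at least one pair per left row.
  assert(kind != JoinKind::Left || out.size() >= left.size());
  return out;
}
```

Calling it with a predicate such as `[](int a, int b) { return a < b; }` gives an inequality join, which a hash-based equality join cannot express directly. A reference like this is quadratic in the input sizes, so it is only suitable for checking small cases.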
2 changes: 1 addition & 1 deletion cpp/benchmarks/CMakeLists.txt
@@ -101,7 +101,7 @@ ConfigureBench(STREAM_COMPACTION_BENCH stream_compaction/drop_duplicates_benchma

###################################################################################################
# - join benchmark --------------------------------------------------------------------------------
-ConfigureBench(JOIN_BENCH join/join_benchmark.cu)
+ConfigureBench(JOIN_BENCH join/join_benchmark.cu join/conditional_join_benchmark.cu)

###################################################################################################
# - iterator benchmark ----------------------------------------------------------------------------
76 changes: 53 additions & 23 deletions cpp/benchmarks/ast/transform_benchmark.cpp
@@ -30,21 +30,21 @@
#include <thrust/iterator/counting_iterator.h>

#include <algorithm>
#include <iostream>
#include <list>
#include <numeric>
#include <random>
#include <vector>

enum class TreeType {
IMBALANCED_LEFT // All operator expressions have a left child operator expression and a right
// child column reference
};

-template <typename key_type, TreeType tree_type, bool reuse_columns>
+template <typename key_type, TreeType tree_type, bool reuse_columns, bool Nullable>
class AST : public cudf::benchmark {
};

-template <typename key_type, TreeType tree_type, bool reuse_columns>
+template <typename key_type, TreeType tree_type, bool reuse_columns, bool Nullable>
static void BM_ast_transform(benchmark::State& state)
{
const cudf::size_type table_size{(cudf::size_type)state.range(0)};
@@ -56,10 +56,24 @@ static void BM_ast_transform(benchmark::State& state)
auto columns = std::vector<cudf::column_view>(n_cols);

auto data_iterator = thrust::make_counting_iterator(0);
-std::generate_n(column_wrappers.begin(), n_cols, [=]() {
-return cudf::test::fixed_width_column_wrapper<key_type>(data_iterator,
-data_iterator + table_size);
-});
+
+if constexpr (Nullable) {
+auto validities = std::vector<bool>(table_size);
+std::random_device rd;
+std::mt19937 gen(rd());
+
+std::generate(
+validities.begin(), validities.end(), [&]() { return gen() > (0.5 * gen.max()); });
+std::generate_n(column_wrappers.begin(), n_cols, [=]() {
+return cudf::test::fixed_width_column_wrapper<key_type>(
+data_iterator, data_iterator + table_size, validities.begin());
+});
+} else {
+std::generate_n(column_wrappers.begin(), n_cols, [=]() {
+return cudf::test::fixed_width_column_wrapper<key_type>(data_iterator,
+data_iterator + table_size);
+});
+}
std::transform(
column_wrappers.begin(), column_wrappers.end(), columns.begin(), [](auto const& col) {
return static_cast<cudf::column_view>(col);
@@ -113,22 +127,23 @@ static void BM_ast_transform(benchmark::State& state)
(tree_levels + 1) * sizeof(key_type));
}

-#define AST_TRANSFORM_BENCHMARK_DEFINE(name, key_type, tree_type, reuse_columns) \
-BENCHMARK_TEMPLATE_DEFINE_F(AST, name, key_type, tree_type, reuse_columns) \
-(::benchmark::State & st) { BM_ast_transform<key_type, tree_type, reuse_columns>(st); }
-
-AST_TRANSFORM_BENCHMARK_DEFINE(ast_int32_imbalanced_unique,
-int32_t,
-TreeType::IMBALANCED_LEFT,
-false);
-AST_TRANSFORM_BENCHMARK_DEFINE(ast_int32_imbalanced_reuse,
-int32_t,
-TreeType::IMBALANCED_LEFT,
-true);
-AST_TRANSFORM_BENCHMARK_DEFINE(ast_double_imbalanced_unique,
-double,
-TreeType::IMBALANCED_LEFT,
-false);
+#define AST_TRANSFORM_BENCHMARK_DEFINE(name, key_type, tree_type, reuse_columns, nullable) \
+BENCHMARK_TEMPLATE_DEFINE_F(AST, name, key_type, tree_type, reuse_columns, nullable) \
+(::benchmark::State & st) { BM_ast_transform<key_type, tree_type, reuse_columns, nullable>(st); }
+
+AST_TRANSFORM_BENCHMARK_DEFINE(
+ast_int32_imbalanced_unique, int32_t, TreeType::IMBALANCED_LEFT, false, false);
+AST_TRANSFORM_BENCHMARK_DEFINE(
+ast_int32_imbalanced_reuse, int32_t, TreeType::IMBALANCED_LEFT, true, false);
+AST_TRANSFORM_BENCHMARK_DEFINE(
+ast_double_imbalanced_unique, double, TreeType::IMBALANCED_LEFT, false, false);
+
+AST_TRANSFORM_BENCHMARK_DEFINE(
+ast_int32_imbalanced_unique_nulls, int32_t, TreeType::IMBALANCED_LEFT, false, true);
+AST_TRANSFORM_BENCHMARK_DEFINE(
+ast_int32_imbalanced_reuse_nulls, int32_t, TreeType::IMBALANCED_LEFT, true, true);
+AST_TRANSFORM_BENCHMARK_DEFINE(
+ast_double_imbalanced_unique_nulls, double, TreeType::IMBALANCED_LEFT, false, true);

static void CustomRanges(benchmark::internal::Benchmark* b)
{
@@ -155,3 +170,18 @@ BENCHMARK_REGISTER_F(AST, ast_double_imbalanced_unique)
->Apply(CustomRanges)
->Unit(benchmark::kMillisecond)
->UseManualTime();
+
+BENCHMARK_REGISTER_F(AST, ast_int32_imbalanced_unique_nulls)
+->Apply(CustomRanges)
+->Unit(benchmark::kMillisecond)
+->UseManualTime();
+
+BENCHMARK_REGISTER_F(AST, ast_int32_imbalanced_reuse_nulls)
+->Apply(CustomRanges)
+->Unit(benchmark::kMillisecond)
+->UseManualTime();
+
+BENCHMARK_REGISTER_F(AST, ast_double_imbalanced_unique_nulls)
+->Apply(CustomRanges)
+->Unit(benchmark::kMillisecond)
+->UseManualTime();
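
The nullable benchmark variants in this file build a per-row validity mask by drawing from a `std::mt19937` seeded by `std::random_device` and comparing each draw against half of `gen.max()`. Below is a minimal standalone sketch of the same idea. It swaps in `std::bernoulli_distribution` and a caller-supplied seed so that runs are reproducible; both are substitutions for illustration, not what the diff does, and `make_validity_mask` is a hypothetical helper name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical helper: returns one bool per row, where true means "valid"
// (non-null). Each row is valid independently with the given probability.
std::vector<bool> make_validity_mask(std::size_t num_rows,
                                     double valid_probability,
                                     std::uint32_t seed)
{
  assert(valid_probability >= 0.0 && valid_probability <= 1.0);
  std::mt19937 gen(seed);  // fixed seed: same mask on every call with the same arguments
  std::bernoulli_distribution is_valid(valid_probability);

  std::vector<bool> mask(num_rows);
  std::generate(mask.begin(), mask.end(), [&]() { return is_valid(gen); });
  return mask;
}
```

A mask produced this way could be passed as the validity iterator when building test columns, much as `validities.begin()` is in the diff. Whether a fixed seed is preferable for a benchmark is a judgment call: it makes timings comparable across runs, at the cost of always exercising the same null pattern.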