Refactor all AST-related APIs and internals including conditional joins and compute_column #8783

vyasr · 2021-07-19T23:26:28Z

This PR has a large changeset, but most of it just shuffles existing code around to handle a number of tasks that we had delayed to expedite work on our abstract syntax tree evaluator (#5494, #7418) and associated APIs (conditional joins (#8214, #5397) and column computation (#5494)). The changes include:

Decoupling conditional join code from hash joins. Enable AST-based joining #8214 made use of a number of functions defined for hash joins without trying to improve the organization of code into shared files. This PR reorganizes that code for better conceptual clarity and faster parallel compilation and recompilation.
Separating AST parsing logic out from the compute_column API. Properly separating the parsing and evaluation of an expression helps untangle the code, reduces recompilation times when any AST-related files are touched, and makes it easier to delineate public and private APIs.
Hiding AST details. A discussion on the AST PR emphasized that a public AST namespace exposed a lot of what should be implementation details. With this PR we expose just enough for users to be able to construct expressions (a single ast/expressions.hpp header), completely hiding all parsing and evaluation logic. Moving more code from headers to source files further improves the situation and also improves compilation times by reducing the dependence of public header files on private ones (e.g. ast/expressions.hpp no longer requires ast/detail/operators.hpp). The compute_column function has now been moved to transform.hpp.
Simplifying some of the internals. The ast_plan and the linearizer have been combined into one class, the expression_parser. The rename emphasizes the purpose of the class rather than the implementation details, and the combination removes a lot of otherwise superfluous accessors to the buffers created during parsing. We have no use case for parsing an expression on the host without moving to the device, and the two were already effectively coupled internally, so this change reflects that reality in the class structure.
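To make the intended layering concrete, here is a minimal host-only sketch of the structure described above. It is an illustration under stated assumptions, not the libcudf implementation: the `sketch` namespace, the `instruction` struct, the `double`-only rows, and the stack-based evaluation are all hypothetical, and the real code parses on the host and evaluates on the device. The names `expression`, `expression_parser`, and `compute_column` are borrowed from this PR only to show which piece plays which role.

```cpp
// Hypothetical, host-only sketch; none of this is the actual libcudf API.
#include <cassert>
#include <cstddef>
#include <vector>

namespace sketch {

// "Public" side: just enough for a user to construct an expression tree.
enum class op { ADD, MUL };

struct expression {
  enum class kind { COLUMN_REF, LITERAL, OPERATION };
  kind k{kind::LITERAL};
  std::size_t column_index{0};     // used by COLUMN_REF
  double value{0.0};               // used by LITERAL
  op oper{op::ADD};                // used by OPERATION
  expression const* lhs{nullptr};  // operands must outlive the parent node
  expression const* rhs{nullptr};

  static expression column(std::size_t i) {
    expression e; e.k = kind::COLUMN_REF; e.column_index = i; return e;
  }
  static expression literal(double v) {
    expression e; e.k = kind::LITERAL; e.value = v; return e;
  }
  static expression operation(op o, expression const& l, expression const& r) {
    expression e; e.k = kind::OPERATION; e.oper = o; e.lhs = &l; e.rhs = &r; return e;
  }
};

namespace detail {

// "Private" side: one class both walks the tree and owns the flat buffer,
// so no accessors are needed to hand buffers from a linearizer to a plan.
class expression_parser {
 public:
  explicit expression_parser(expression const& root) { visit(root); }

  // Evaluate the linearized (postfix) instructions against one row.
  double evaluate(std::vector<double> const& row) const {
    std::vector<double> stack;
    for (auto const& ins : instructions_) {
      switch (ins.k) {
        case expression::kind::COLUMN_REF: stack.push_back(row.at(ins.column_index)); break;
        case expression::kind::LITERAL: stack.push_back(ins.value); break;
        case expression::kind::OPERATION: {
          double const r = stack.back(); stack.pop_back();
          double const l = stack.back(); stack.pop_back();
          stack.push_back(ins.oper == op::ADD ? l + r : l * r);
          break;
        }
      }
    }
    assert(stack.size() == 1);  // a well-formed tree leaves exactly one result
    return stack.back();
  }

 private:
  struct instruction {
    expression::kind k;
    std::size_t column_index;
    double value;
    op oper;
  };

  // Post-order traversal: operands are emitted before their operator.
  void visit(expression const& e) {
    if (e.k == expression::kind::OPERATION) {
      visit(*e.lhs);
      visit(*e.rhs);
    }
    instructions_.push_back({e.k, e.column_index, e.value, e.oper});
  }

  std::vector<instruction> instructions_;
};

}  // namespace detail

// Public entry point: callers never see the parser or its buffers.
inline std::vector<double> compute_column(std::vector<std::vector<double>> const& rows,
                                          expression const& root) {
  detail::expression_parser const parser{root};
  std::vector<double> out;
  out.reserve(rows.size());
  for (auto const& row : rows) { out.push_back(parser.evaluate(row)); }
  return out;
}

}  // namespace sketch
```

In this sketch a caller builds `(col0 + col1) * 2` from named `expression` objects and passes the root to `sketch::compute_column`; for the rows `{1, 2}` and `{3, 4}` the result is `{6, 14}`. Only the `expression` type and the free function would need to live in a public header, which is the separation the list above is aiming for.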

ttnghia · 2021-07-20T19:33:41Z

Wow, the PR is huge!!!
I would suggest breaking it down into several smaller PRs so that it is easier to review. Otherwise, reviewers may find it difficult or be reluctant to review, and the PR will take longer to be reviewed and merged 😄

vyasr · 2021-07-20T21:55:00Z

> Wow, the PR is huge!!!
> I would suggest breaking it down into several smaller PRs so that it is easier to review. Otherwise, reviewers may find it difficult or be reluctant to review, and the PR will take longer to be reviewed and merged 😄

I agree, this PR is huge. Pretty much this exact mindset of not lumping too many changes into one PR is what led to this big PR, because all of the upstream PRs were limited in scope to facilitate review. More than 90% of this PR is just moving code around, not new logic, so I felt more comfortable making a large one and providing the summary in the PR description. That said, I'm happy to try and split this up if @harrism and @hyperbolic2346 would prefer. I probably didn't do a perfect job of committing in a sequence such that there's an exact correspondence between subsets of commits and smaller, self-contained PRs, but it's probably close enough that some amount of cherry-picking and copy-pasting would get me there reasonably quickly.

harrism · 2021-07-20T00:19:30Z

cpp/benchmarks/ast/transform_benchmark.cpp

@@ -119,7 +119,7 @@ static void BM_ast_transform(benchmark::State& state)
  // Execute benchmark
  for (auto _ : state) {
    cuda_event_timer raii(state, true);  // flush_l2_cache = true, stream = 0
-    cudf::ast::compute_column(table, expression_tree_root);
+    cudf::compute_column(table, expression_tree_root);


Without the ast namespace, compute_column is an extremely generic function name. If this is going to be in the top-level cudf namespace, can the name be improved to better tell users what it does?

cpp/include/cudf/ast/detail/expression_evaluator.cuh

cpp/include/cudf/ast/detail/expression_parser.hpp

harrism · 2021-07-20T22:07:31Z

I don't know enough about AST to do a wonderful review. I started yesterday but then realized this is for 21.10. Just submitted my pending comments. Splitting it up, or providing a guide to what I should focus on would help.

Unfortunately, the diffs from GitHub are not helpful in some files: they interleave unrelated code.

vyasr · 2021-07-20T22:15:50Z

I'll give splitting this up a go and follow up.

vyasr · 2021-07-21T18:42:01Z

I think my original commits are atomic enough that I should be able to split this into multiple PRs (although unfortunately not entirely independent, so there will be some required sequence). I'm going to leave this PR as a draft for now so that there's a record of all the relevant work, then I'll close it once all of the subsequent associated PRs are completed and merged.

vyasr · 2021-08-16T18:26:02Z

The changes in this PR have now been incorporated via #8815, #8900, #8930, #8957, #8928, and #9045.

vyasr requested review from a team as code owners July 19, 2021 23:26

vyasr requested review from harrism and hyperbolic2346 and removed request for a team July 19, 2021 23:26

github-actions bot added the CMake (CMake build issue) and libcudf (Affects libcudf (C++/CUDA) code.) labels Jul 19, 2021

vyasr self-assigned this Jul 19, 2021

vyasr added the breaking (Breaking change), improvement (Improvement / enhancement to an existing function), and tech debt labels Jul 19, 2021

vyasr added 18 commits July 20, 2021 09:15

Move single_dispatch_binary_operator closer to the only call site.

153e5c6

Add a new nullable function for table_views and address various minor outstanding PR comments from PR rapidsai#8214.

c7e6afb

Delete lots of unused includes.

0c59157

Move common join utilities to a single file.

0d81295

Move common utilities from the kernels file into common utils.

676a085

Move conditional join kernels to a separate file.

b10c8f3

Rename nested_loop_join.cuh to conditional_loop_join.cuh.

697f182

Move conditional joins to a separate compilation unit.

6738107

Move plan to linearizer from transform.

7ff56cb

Move evaluator into separate header.

c227ed5

Move compute_column detail API to preexisting transform namespace.

e9d0d19

Move compute_column API to preexisting transform namespace.

825a6a3

Move compute_column source to transform.

3609285

Move most conditional join implementation into source file from header.

7518b7e

Remove some more includes.

4b306f6

Move all node APIs requiring components of detail out of public header to recompilation of AST-consuming code when AST internals change.

56df9c0

Move public-facing operators to nodes.hpp and remove operators header.

cd937ba

Rename nodes to expressions.

1b4188e

vyasr added 9 commits July 20, 2021 09:18

Rename linearizer to expression_parser.

7b3b793

Combine ast_plan and expression_parser.

8a60400

Remove some now unnecessary APIs.

a25bf13

Fix expression_parser's check of widest data type fitting in intermediate shared mem story to account for nullability.

b9f0ede

Rename plan variables to parser.

a962818

Make expression data variable names constant.

ee95276

Add missing docstrings.

b5a9680

Move evaluator.cuh to expression_evaluator.cuh for consistency in naming.

1328e16

Fix a possible cause of UB and add a check for join output sizes that exceed cudf limits.

2d25a71

vyasr force-pushed the refactor/ast_join branch from d303768 to 2d25a71 Compare July 20, 2021 16:30

harrism requested changes Jul 20, 2021

vyasr marked this pull request as draft July 21, 2021 18:40

vyasr mentioned this pull request Jul 21, 2021

Refactor conditional joins #8815

Merged

vyasr added this to the Conditional Joins milestone Jul 26, 2021

vyasr mentioned this pull request Aug 4, 2021

Move compute_column API out of ast namespace #8957

Merged

vyasr closed this Aug 16, 2021

vyasr deleted the refactor/ast_join branch January 14, 2022 17:59

GregoryKimball modified the milestones: Conditional Joins, Expression evaluation Oct 24, 2022
