Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor all AST-related APIs and internals including conditional joins and compute_column #8783

Closed
wants to merge 27 commits into from

Conversation

vyasr
Copy link
Contributor

@vyasr vyasr commented Jul 19, 2021

This PR has a large changset, but most of the contents are just shuffling around already existing code to handle a number of previously existing tasks that we delayed to expedite work on our abstract syntax tree evaluator (#5494, #7418) and associated APIs (conditional joins (#8214, #5397) and column computation (#5494)). The changes include:

  1. Decoupling conditional join code from hash joins. Enable AST-based joining #8214 made use of a number of functions defined for hash joins without trying to improve the organization of code into shared files. This PR improves that organization for improved conceptual clarity and faster parallel compilation and recompilation times.
  2. Separating AST parsing logic out from the compute_column API. Properly separating the parsing and evaluation of an expression helps deconvolute the code, reduce recompilation times when any AST-related files are touched, and helps easily delineate public and private APIs.
  3. Hiding AST details. A discussion on the AST PR emphasized that a public AST namespace exposed lots of what should be implementation detail. With this PR we expose just enough for users to be able to construct expressions (a single ast/expressions.hpp header), completely hiding all parsing and evaluation logic. Moving more code from headers to source files further improves the situation and also improves compilation times by reducing the dependence of public header files on private ones (e.g. ast/expressions.hpp no longer requires ast/detail/operators.hpp). The compute_column function has now been moved to transform.hpp.
  4. Simplifying some of the internals. The ast_plan and the linearizer have been combined into one class, the expression_parser. The rename emphasizes the purpose of the class rather than the implementation details, and the combination helps remove a lot of otherwise superfluous accessors to the buffers created during parsing. We have no use case for parsing an expression on the host without moving to the device and the two were already effectively coupled internally, so this change manifests that reality in the class structure.

@vyasr vyasr requested review from a team as code owners July 19, 2021 23:26
@vyasr vyasr requested review from harrism and hyperbolic2346 and removed request for a team July 19, 2021 23:26
@github-actions github-actions bot added CMake CMake build issue libcudf Affects libcudf (C++/CUDA) code. labels Jul 19, 2021
@vyasr vyasr self-assigned this Jul 19, 2021
@vyasr vyasr added breaking Breaking change improvement Improvement / enhancement to an existing function tech debt labels Jul 19, 2021
@vyasr vyasr force-pushed the refactor/ast_join branch from d303768 to 2d25a71 Compare July 20, 2021 16:30
@ttnghia
Copy link
Contributor

ttnghia commented Jul 20, 2021

Wow, the PR is huge!!!
I would argue to break it down into several smaller PRs, so the reviewers are easier to review. Otherwise, the reviewers are difficult to review or reluctant to review, thus the PR will need more time to be reviewed and merged 😄

@vyasr
Copy link
Contributor Author

vyasr commented Jul 20, 2021

Wow, the PR is huge!!!
I would argue to break it down into several smaller PRs, so the reviewers are easier to review. Otherwise, the reviewers are difficult to review or reluctant to review, thus the PR will need more time to be reviewed and merged 😄

I agree, this PR is huge. Pretty much this exact mindset of not lumping too many changes into one PR is what led to this big PR because all of the upstream PRs were limited in scope to facilitate review. More than 90% of this PR is just moving code around, not new logic, so I felt more comfortable making a large one and providing the summary in the PR description. That said, I'm happy to try and split this up if @harrism and @hyperbolic2346 would prefer. I probably didn't do a perfect job of committing in a sequence that there's an exact correspondence between a subsets of commits and smaller, self-contained PRs, but it's probably close enough that some amount of cherry-picking and copy-pasting would get me there reasonably quickly.

@@ -119,7 +119,7 @@ static void BM_ast_transform(benchmark::State& state)
// Execute benchmark
for (auto _ : state) {
cuda_event_timer raii(state, true); // flush_l2_cache = true, stream = 0
cudf::ast::compute_column(table, expression_tree_root);
cudf::compute_column(table, expression_tree_root);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without the ast namespace, compute_column is an extremelty generic function name. If this is going to be in the top-level cudf namespace, can the name be improved to better tell users what it does?

@harrism
Copy link
Member

harrism commented Jul 20, 2021

I don't know enough about AST to do a wonderful review. I started yesterday but then realized this is for 21.10. Just submitted my pending comments. Splitting it up, or providing a guide to what I should focus on would help.

Unfortunately the diffs from github are not helpful in some files -- it is interleaving unrelated code.

@vyasr
Copy link
Contributor Author

vyasr commented Jul 20, 2021

I'll give splitting this up a go and follow up.

@vyasr vyasr marked this pull request as draft July 21, 2021 18:40
@vyasr
Copy link
Contributor Author

vyasr commented Jul 21, 2021

I think my original commits are atomic enough that I should be able to split this into multiple PRs (although unfortunately not entirely independent, so there will be some required sequence). I'm going to leave this PR as a draft for now so that there's a record of all the relevant work, then I'll close it once all of the subsequent associated PRs are completed and merged.

@vyasr
Copy link
Contributor Author

vyasr commented Aug 16, 2021

This changes in this PR have now been incorporated via #8815, #8900, #8930, #8957, #8928, and #9045.

@vyasr vyasr closed this Aug 16, 2021
@vyasr vyasr deleted the refactor/ast_join branch January 14, 2022 17:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking Breaking change CMake CMake build issue improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants