Enable AST-based joining #8214

Merged: 84 commits into rapidsai:branch-21.08 on Jul 14, 2021

Conversation

@vyasr (Contributor) commented May 11, 2021

This PR implements conditional joins using expressions that are decomposed into abstract syntax trees for evaluation. This PR builds on the AST evaluation framework established in #5494 and #7418, but significantly refactors the internals and generalizes them to enable 1) expressions on two tables and 2) operations on nullable columns. This PR uses the nested loop join code created in #5397 for inner joins, but also substantially generalizes that code to enable 1) all types of joins, 2) joins with arbitrary AST expressions rather than just equality, and 3) handling of null values (with user-specified null_equality). A significant chunk of the code is currently out of place, but since this changeset is rather large I've opted not to move things in ways that will make reviewing this PR significantly more challenging. I will make a follow-up to address those issues once this PR is merged.

@github-actions bot added the conda and libcudf (Affects libcudf C++/CUDA code) labels May 11, 2021
@vyasr added the feature request (New feature or request) and non-breaking (Non-breaking change) labels May 11, 2021
@vyasr changed the base branch from branch-0.20 to branch-0.19 May 11, 2021 18:29
@vyasr changed the base branch from branch-0.19 to branch-0.20 May 11, 2021 18:30
@github-actions bot removed the conda label May 11, 2021
codecov bot commented May 15, 2021

Codecov Report

❗ No coverage was uploaded for the pull request base (branch-21.08@7823a18).
The diff coverage is n/a.

❗ The current head f5720e1 differs from the pull request's most recent head ffcf145. Consider uploading reports for commit ffcf145 to get more accurate results.

@@               Coverage Diff               @@
##             branch-21.08    #8214   +/-   ##
===============================================
  Coverage                ?   10.67%           
===============================================
  Files                   ?      109           
  Lines                   ?    18667           
  Branches                ?        0           
===============================================
  Hits                    ?     1993           
  Misses                  ?    16674           
  Partials                ?        0           


@vyasr changed the title from Enable AST-based joining to Enable AST-based jo ining May 17, 2021
@vyasr changed the title back to Enable AST-based joining May 17, 2021
@vyasr force-pushed the feature/ast_equijoin branch from 25473b3 to 692f801 on May 26, 2021 20:11
@github-actions bot added the CMake (CMake build issue), conda, and Java (Affects Java cuDF API) labels May 26, 2021
@vyasr changed the base branch from branch-21.06 to branch-21.08 May 26, 2021 20:20
@vyasr force-pushed the feature/ast_equijoin branch from 692f801 to c77a857 on May 26, 2021 23:08
@github-actions bot removed the conda and Java labels May 26, 2021
@vyasr force-pushed the feature/ast_equijoin branch from 2d87e06 to bde476e on June 7, 2021 16:25
@vyasr (Contributor, Author) commented Jun 8, 2021

Notes on performance

Existing APIs

The changes to the AST evaluation have a largely negligible effect on the performance of preexisting code paths. For reference, here are the benchmarks of the compute_column API from before this PR:

--------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                                    Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------------------------------------------------------------------------
AST<int32_t, TreeType::IMBALANCED_LEFT, false>/ast_int32_imbalanced_unique/100000/1/manual_time          0.018 ms        0.044 ms        37157 bytes_per_second=40.354G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false>/ast_int32_imbalanced_unique/100000/5/manual_time          0.032 ms        0.057 ms        21585 bytes_per_second=69.3189G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false>/ast_int32_imbalanced_unique/100000/10/manual_time         0.051 ms        0.071 ms        13812 bytes_per_second=79.9436G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false>/ast_int32_imbalanced_unique/1000000/1/manual_time         0.068 ms        0.089 ms        10439 bytes_per_second=109.384G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false>/ast_int32_imbalanced_unique/1000000/5/manual_time         0.207 ms        0.228 ms         3426 bytes_per_second=107.89G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false>/ast_int32_imbalanced_unique/1000000/10/manual_time        0.386 ms        0.407 ms         1822 bytes_per_second=106.072G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false>/ast_int32_imbalanced_unique/10000000/1/manual_time        0.592 ms        0.617 ms         1261 bytes_per_second=125.837G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false>/ast_int32_imbalanced_unique/10000000/5/manual_time         2.01 ms         2.04 ms          355 bytes_per_second=110.928G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false>/ast_int32_imbalanced_unique/10000000/10/manual_time        3.83 ms         3.85 ms          186 bytes_per_second=107.075G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false>/ast_int32_imbalanced_unique/100000000/1/manual_time        5.94 ms         5.97 ms          129 bytes_per_second=125.373G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false>/ast_int32_imbalanced_unique/100000000/5/manual_time        20.2 ms         20.3 ms           31 bytes_per_second=110.39G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false>/ast_int32_imbalanced_unique/100000000/10/manual_time       44.9 ms         45.0 ms           15 bytes_per_second=91.1891G/s

And here are the benchmarks on this branch:

--------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                                    Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------------------------------------------------------------------------
AST<int32_t, TreeType::IMBALANCED_LEFT, false>/ast_int32_imbalanced_unique/100000/1/manual_time          0.024 ms        0.049 ms        29031 bytes_per_second=31.4297G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false>/ast_int32_imbalanced_unique/100000/5/manual_time          0.038 ms        0.063 ms        18185 bytes_per_second=58.1815G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false>/ast_int32_imbalanced_unique/100000/10/manual_time         0.058 ms        0.077 ms        12201 bytes_per_second=71.0981G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false>/ast_int32_imbalanced_unique/1000000/1/manual_time         0.073 ms        0.094 ms         9611 bytes_per_second=101.435G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false>/ast_int32_imbalanced_unique/1000000/5/manual_time         0.210 ms        0.232 ms         3352 bytes_per_second=106.268G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false>/ast_int32_imbalanced_unique/1000000/10/manual_time        0.390 ms        0.412 ms         1798 bytes_per_second=104.968G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false>/ast_int32_imbalanced_unique/10000000/1/manual_time        0.600 ms        0.626 ms         1241 bytes_per_second=124.215G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false>/ast_int32_imbalanced_unique/10000000/5/manual_time         2.04 ms         2.06 ms          354 bytes_per_second=109.671G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false>/ast_int32_imbalanced_unique/10000000/10/manual_time        3.87 ms         3.89 ms          184 bytes_per_second=105.911G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false>/ast_int32_imbalanced_unique/100000000/1/manual_time        6.01 ms         6.04 ms          127 bytes_per_second=123.929G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false>/ast_int32_imbalanced_unique/100000000/5/manual_time        20.5 ms         20.5 ms           35 bytes_per_second=109.258G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false>/ast_int32_imbalanced_unique/100000000/10/manual_time       44.9 ms         44.9 ms           15 bytes_per_second=91.3625G/s

There is a small but reproducible additional fixed cost from some of these changes (mostly from abstractions I introduced, like the dev_ast_plan, to reduce passing raw pointers around), but it amounts to added microseconds that are only visible at small data sizes like the first few rows.

New join code

With respect to the join code, using the benchmarks posted on #5397 (the benchmarks vs. hash join) as a baseline, we see roughly a 4x slowdown for joining via the AST compared to a raw equality comparison. Here are the numbers copy-pasted from that PR:

------------------------------------------------------------------------------------------------------------
Nested Join Benchmark                                                      Time             CPU   Iterations
------------------------------------------------------------------------------------------------------------
Join<int32_t, int32_t>/join_32bit/100000/100000/manual_time              231 ms          231 ms            3
Join<int32_t, int32_t>/join_32bit/100000/400000/manual_time              823 ms          823 ms            1
Join<int32_t, int32_t>/join_32bit/100000/1000000/manual_time            1909 ms         1908 ms            1

And here are the numbers now:

----------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                              Time             CPU   Iterations
----------------------------------------------------------------------------------------------------------------------------------------
ConditionalJoin<int32_t, int32_t>/conditional_inner_join_32bit/100000/100000/manual_time             963 ms          963 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_inner_join_32bit/100000/400000/manual_time            3344 ms         3344 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_inner_join_32bit/100000/1000000/manual_time           8327 ms         8326 ms            1

I haven't done any rigorous analysis yet, but offhand this performance hit doesn't seem terribly surprising. There are a number of potential contributing factors:

  1. We need to evaluate the expression twice (once when computing the output join size and once when actually joining), so we incur the additional overhead of the expression evaluation relative to a raw equality check twice.
  2. The kernel now uses more shared memory because it is split between the join cache and the AST evaluator.
  3. The additional logic for the AST probably increases register usage and therefore decreases occupancy.
  4. The old code only supported inner and left joins, whereas now there's some extra logic to handle other join types.

All told, a 4x performance hit for those reasons doesn't seem unreasonable.

@vyasr (Contributor, Author) commented Jun 8, 2021

@jrhemstad the issue I raised at the meeting today is in the expression_result class template in transform.cuh. Basically there are two issues that I see:

  1. This template is currently defined as an owning container of a possibly nullable scalar, but it is specialized for column views to be a non-owning container. The most consistent option would be to force users to create the value type of the base implementation and pass it by reference to the constructor, but it's hard to justify this choice on any other grounds. The user would have to query the appropriate type from the class's namespace (e.g. expression_result<true, int>::ValueType), and ultimately they would be constructing an object expression_result<has_bools, T>(expression_result<true, int>::ValueType obj) solely to pass an argument with the necessary API to expression_evaluator::evaluate. That seems like a poor reason to require more work from a consumer of the AST API, so I opted for this solution.
  2. The column view specialization uses the index in set_value to write data to the correct offset, whereas the base implementation ignores the index but offers the same API. The reason I did this is that there are problems with the obvious alternatives:
    1. We could require the calling code to construct an expression result with a pointer to the exact data location that we want to write to. However, to do that the user would need to know the column's data type to compute the correct offset, and the requisite templating balloons compile time.
    2. We could construct the object with a column and an index and remove the index from set_value. This change would be cleaner overall since then we could stop passing the output index through various stages of the expression_evaluator. However, this option is much slower (4-6x performance hit), most likely because the output index ends up somewhere lower in the memory hierarchy with this approach than when the index is passed as a function argument.

If we choose to leave the code as is, we could use something like CRTP to more clearly delineate an API, but that seems pretty heavy-handed for something internal-facing like this.

EDIT
I changed my mind on my last point and decided CRTP would be reasonable in this case. It is cleaner than just implicitly requiring an interface by usage.

@vyasr marked this pull request as ready for review June 8, 2021 02:40
@vyasr requested a review from a team as a code owner June 8, 2021 02:40
@vyasr requested reviews from jrhemstad and codereport June 8, 2021 02:40
@vyasr (Contributor, Author) commented Jun 8, 2021

A couple of additional notes on outstanding tasks.

  1. I added benchmarks of the code paths for null columns for compute_column. The memory throughput is significantly lower than for the non-null code paths. The difference may simply be due to the additional overhead of passing around a thrust::optional and whatever effects result from that (reduced occupancy etc), but it may also be related to [PERF] AST kernel has nonzero stack frame #5902. I haven't tried inspecting the PTX yet to investigate further. I don't think this issue needs to be a blocker for merging this PR, but it would be very nice to figure out.
  2. I also added nullable benchmarks for the conditional joins. These benchmarks work fine for smaller data sizes, but I run into a cudf::cuda_error once the data size is large enough. I am unable to reproduce this error for non-nullable data at even larger data sizes, and I am also unable to reproduce it with compute_column for large nullable data (although it's possible that I just haven't gone large enough yet). Those two things lead me to believe that the most likely culprit is somewhere in the join code itself, perhaps in its usage of shared memory or the way the expression result data is passed through the code. This problem should probably be resolved before this PR can be finalized.

EDIT 6/18
I've fixed the second problem above, which IMO was the only hard blocker for merging. I'll work on profiling more, but whenever reviewers are happy I think we can merge this and work on improving performance incrementally (if no obvious bottlenecks are exposed in review).

@bdice (Contributor) left a comment
@vyasr Thanks for pinging me, it's great to see what you did with this PR. This looks like a good architecture, overall. I caught one bug where left should be right. I put in some minor comments on docstrings and constness. I probably will not have time to read this again in the next 2-3 weeks, so I'm submitting this PR review as a comment rather than requesting changes/approving.

Review threads (resolved) on: cpp/include/cudf/ast/detail/linearizer.hpp, cpp/include/cudf/ast/detail/transform.cuh, cpp/src/join/nested_loop_join.cuh
@codereport (Contributor) left a comment

Just a couple of small comments while reviewing.

Review threads (resolved) on: cpp/include/cudf/ast/detail/linearizer.hpp
@jrhemstad (Contributor) left a comment
All makes sense now. Great work.

table_view left,
table_view right,
ast::expression binary_predicate,
null_equality compare_nulls = null_equality::EQUAL,
A Member commented:
Since the caller can specify the join condition via the AST, I'm surprised to see this parameter here. Instead I would expect there to be either two types of AST equality expressions or a parameterized AST equality operator to control the null behavior. The caller would then build the appropriate type of AST to match the null handling they need.

A Contributor replied:
Yeah, we were just talking about this earlier today. It quickly devolved into asking whether or not we need two versions of every operator that has different null behavior. We kept the null_equality for now to be consistent with the other join APIs, but I agree it is weird to have it and we should do something better.

@vyasr (Contributor, Author) replied:
@jrhemstad and I had a similar discussion earlier today, and I think we decided moving in that direction makes sense as well.

@vyasr (Contributor, Author) commented Jul 12, 2021:
I think the only operators for which we might want this are = and !=, right? I can't imagine reasonable semantics for any other operator. != is borderline, but I can imagine that other tools might try to define different semantics for NULL != NULL, I just don't know what those are for consumers of this API.

The Member replied:
I suspect it will only be necessary for ==. That being said, in the future it could be very useful in an AST expression to conditionally branch based on the result of an "is this null" operator. Then users could implement their own fancy-pants versions of null comparison or other special-case null handling in their custom expressions.

@codereport (Contributor) left a comment
Looks great 👍

@jrhemstad (Contributor):

@gpucibot merge

@rapids-bot bot merged commit 397bf0a into rapidsai:branch-21.08 Jul 14, 2021
@vyasr (Contributor, Author) commented Jul 15, 2021

@jrhemstad @codereport I did some more in-depth profiling against the old nested loop join and got back some useful information, as well as some good news. As expected, the biggest differences are:

  1. Significantly higher L1 memory traffic (56 G vs 31 G, an 80% increase) due to the increased internal state needed to evaluate the condition.
  2. Substantially higher register pressure (94 vs 40 registers/thread, a 135% increase), again due to the increased internal state for evaluation.
  3. Reduced occupancy due to the increased register usage. On an RTX 8000 the theoretical occupancy is only 62.5% (20 resident warps per SM vs the maximum of 32, which the raw nested loop join achieves), entirely because the high register usage limits occupancy.
  4. Increased latency due to the lower occupancy.

Here's a PDF of the nsight-compute comparison profile if you'd like to see it.

The good news is that in benchmarking the nested loop join code I realized I had incorrectly assumed I was using hardware comparable to what was used for the benchmarks on #5397. It turns out that the benchmarks I posted above on this PR were actually run on a slower GPU (Tesla T4) than I suspect was used for the benchmarks on #5397 (something comparable to an RTX 8000, which I now have benchmarks on). For comparison, here are the results on a Tesla T4:

------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                          Time             CPU   Iterations
------------------------------------------------------------------------------------------------------------------------------------
ConditionalJoin<int32_t, int32_t>/conditional_inner_join_32bit/100000/100000/manual_time         976 ms          976 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_inner_join_32bit/100000/400000/manual_time        3329 ms         3329 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_inner_join_32bit/100000/1000000/manual_time       8287 ms         8286 ms            1
ConditionalJoin<int32_t, int32_t>/nested_inner_join_32bit/100000/100000/manual_time              495 ms          495 ms            2
ConditionalJoin<int32_t, int32_t>/nested_inner_join_32bit/100000/400000/manual_time             1924 ms         1924 ms            1
ConditionalJoin<int32_t, int32_t>/nested_inner_join_32bit/100000/1000000/manual_time            4778 ms         4778 ms            1

and here are the results on an RTX 8000

------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                          Time             CPU   Iterations
------------------------------------------------------------------------------------------------------------------------------------
ConditionalJoin<int32_t, int32_t>/conditional_inner_join_32bit/100000/100000/manual_time         433 ms          433 ms            2
ConditionalJoin<int32_t, int32_t>/conditional_inner_join_32bit/100000/400000/manual_time        1401 ms         1401 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_inner_join_32bit/100000/1000000/manual_time       3452 ms         3452 ms            1
ConditionalJoin<int32_t, int32_t>/nested_inner_join_32bit/100000/100000/manual_time              212 ms          212 ms            3
ConditionalJoin<int32_t, int32_t>/nested_inner_join_32bit/100000/400000/manual_time              809 ms          809 ms            1
ConditionalJoin<int32_t, int32_t>/nested_inner_join_32bit/100000/1000000/manual_time            1999 ms         1999 ms            1

As you can see, the original nested loop join benchmark is nearly identical to the RTX 8000 one, whereas my conditional join benchmarks had been run on the T4. Evidently the slowdown is ~2x for the 100k * 100k join, and actually meaningfully less than 2x for the larger 100k * 1MM join, so we are in better shape than I originally reported.

vyasr added a commit to vyasr/cudf that referenced this pull request Jul 20, 2021
vyasr added a commit to vyasr/cudf that referenced this pull request Jul 21, 2021
rapids-bot bot pushed a commit that referenced this pull request Jul 28, 2021
#8214 made use of a number of functions defined for hash joins, but to expedite the review process no attempt was made to improve the organization of code into shared files. This PR improves that organization for better conceptual clarity and faster parallel compilation and recompilation times. As a result, most of the changes in this PR simply move preexisting code to new files and remove a number of unnecessary headers. The main new code consists of a few minor improvements that were suggested but not implemented on #8214: 1) a new `nullable(table_view const&)` API for checking whether any columns in a table are nullable (requested [here](#8214 (comment))), and 2) some extra comments on the behavior of conditional joins when the condition returns null (requested [here](#8214 (comment))).

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Mark Harris (https://github.com/harrism)
  - Mike Wilson (https://github.com/hyperbolic2346)

URL: #8815
@vyasr deleted the feature/ast_equijoin branch January 14, 2022 17:59
Labels: 3 - Ready for Review (Ready for review by team), CMake (CMake build issue), feature request (New feature or request), libcudf (Affects libcudf C++/CUDA code), non-breaking (Non-breaking change)
5 participants