Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve hash join detail functions #10273

Merged

Conversation

PointKernel
Copy link
Member

@PointKernel PointKernel commented Feb 11, 2022

This PR includes several changes in the hash join detail functions:

  • Fixes a bug where public join APIs were invoked in detail join functions. External invocations are replaced with the corresponding detail ones.
  • Uses structured bindings to improve code readability
  • Add an early exit before the actual join operation to improve performance

@PointKernel PointKernel added bug Something isn't working 3 - Ready for Review Ready for review by team libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change labels Feb 11, 2022
@PointKernel PointKernel self-assigned this Feb 11, 2022
@PointKernel PointKernel requested a review from a team as a code owner February 11, 2022 22:17
cpp/src/join/join.cu Outdated Show resolved Hide resolved
@codecov
Copy link

codecov bot commented Feb 11, 2022

Codecov Report

Merging #10273 (bd460a2) into branch-22.04 (a7d88cd) will increase coverage by 0.25%.
The diff coverage is n/a.

❗ Current head bd460a2 differs from pull request most recent head f9b1d3d. Consider uploading reports for the commit f9b1d3d to get more accurate results

Impacted file tree graph

@@               Coverage Diff                @@
##           branch-22.04   #10273      +/-   ##
================================================
+ Coverage         10.42%   10.68%   +0.25%     
================================================
  Files               119      122       +3     
  Lines             20603    20857     +254     
================================================
+ Hits               2148     2228      +80     
- Misses            18455    18629     +174     
Impacted Files Coverage Δ
python/cudf/cudf/_fuzz_testing/fuzzer.py 0.00% <ø> (ø)
python/cudf/cudf/_fuzz_testing/io.py 0.00% <ø> (ø)
python/cudf/cudf/_fuzz_testing/main.py 0.00% <ø> (ø)
python/cudf/cudf/_version.py 0.00% <ø> (ø)
python/cudf/cudf/comm/gpuarrow.py 0.00% <ø> (ø)
python/cudf/cudf/core/_base_index.py 0.00% <ø> (ø)
python/cudf/cudf/core/column/categorical.py 0.00% <ø> (ø)
python/cudf/cudf/core/column/column.py 0.00% <ø> (ø)
python/cudf/cudf/core/column/datetime.py 0.00% <ø> (ø)
python/cudf/cudf/core/column/methods.py 0.00% <ø> (ø)
... and 62 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 48c4dc3...f9b1d3d. Read the comment docs.

@@ -134,23 +135,24 @@ std::unique_ptr<table> left_join(table_view const& left_input,
table_view const left = scatter_columns(matched.second.front(), left_on, left_input);
table_view const right = scatter_columns(matched.second.back(), right_on, right_input);

auto join_indices = left_join(left.select(left_on), right.select(right_on), compare_nulls);
auto const [left_join_indices, right_join_indices] = cudf::detail::left_join(
left.select(left_on), right.select(right_on), compare_nulls, stream, mr);

if ((left_on.empty() || right_on.empty()) ||
Copy link
Contributor

@bdice bdice Feb 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this if statement is for an early exit with an empty result. However, we're calling cudf::detail::left_join before the early-exit. None of the early exit conditions left_on.empty(), right_on.empty(), or is_trivial_join(left, right, ...) appear to depend on the result of the call to cudf::detail::left_join.

It seems like we could move these lines up, so that the if statement and early exit could be evaluated earlier in the function. Similarly for full_join below. (Am I missing something important?)

Apologies for expanding the scope of the PR review twice -- this just seemed important for performance.

#7454 is where this section of code was last changed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! And tbh no need to apologize at all. I'm working on improving the hash join performance #10248 thus found small issues here and there. That's also why this PR exists 😄. The goal is to improve hash join!

@PointKernel PointKernel added improvement Improvement / enhancement to an existing function and removed improvement Improvement / enhancement to an existing function labels Feb 12, 2022
@PointKernel PointKernel changed the title Fix a bug in hash join detail functions: external APIs should not be used Improve hash join detail functions Feb 12, 2022
Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome. 🔥 🔥 🔥

@PointKernel PointKernel added improvement Improvement / enhancement to an existing function and removed bug Something isn't working labels Feb 12, 2022
@PointKernel
Copy link
Member Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 7f2a16f into rapidsai:branch-22.04 Feb 12, 2022
@PointKernel PointKernel deleted the fix-hash-join-detail-func-bug branch May 26, 2022 17:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants