Adds launch bounds hints to mixed join kernels to address regression seen in NDS q72 in Spark #10534

abellina · 2022-03-29T20:59:17Z

The following change addresses a performance degradation we noticed in the mixed_join and compute_mixed_join_output_size kernels. It appears to be tied to the theoretical occupancy of these kernels, which is limited by the number of registers they use.

The regression is triggered by #9727, which improves handling of unreachable code paths. Somehow, that change alters the number of registers these kernels need. Both mixed_join and compute_mixed_join_output_size are very sensitive to register count, per Nsight Compute. With the patch, the registers required went from 92 to 102 for mixed_join and from 118 to 141 for compute_mixed_join_output_size.

The fix here tells the compiler what our block size is (128 threads). In our testing, this allows the compiler to reduce the number of registers required to 128 for compute_mixed_join_output_size and 96 for mixed_join. This led to better occupancy (I think @nvdbaranec measured it going from 30% to 50%), and I saw the wall clock time of q72 (which started all this) go from 133s to 121s, which is within the ballpark I'd expect.
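For readers unfamiliar with the mechanism, here is a minimal sketch of a block-size hint using the CUDA `__launch_bounds__` qualifier. The kernel `scale_kernel` and the constant `kBlockSize` are hypothetical stand-ins, not the libcudf kernels touched by this PR; only the 128-thread block size comes from the description above. The host code queries the register count and resident blocks per SM through the CUDA runtime, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical constant: the PR description only states a block size of 128 threads.
constexpr int kBlockSize = 128;

// Illustrative kernel, not the real mixed_join. __launch_bounds__(block_size)
// promises the compiler that the kernel is never launched with more than
// block_size threads per block, so it can budget registers for that size.
template <int block_size>
__global__ void __launch_bounds__(block_size) scale_kernel(float const* in, float* out, int n)
{
  int const i = blockIdx.x * block_size + threadIdx.x;
  if (i < n) { out[i] = 2.0f * in[i]; }
}

int main()
{
  // Registers per thread the compiler chose for this instantiation.
  cudaFuncAttributes attr{};
  cudaFuncGetAttributes(&attr, scale_kernel<kBlockSize>);

  // Blocks that can be resident on one SM at this block size
  // (0 bytes of dynamic shared memory).
  int blocks_per_sm = 0;
  cudaOccupancyMaxActiveBlocksPerMultiprocessor(
    &blocks_per_sm, scale_kernel<kBlockSize>, kBlockSize, 0);

  // Theoretical occupancy = resident threads / maximum threads per SM.
  cudaDeviceProp prop{};
  cudaGetDeviceProperties(&prop, 0);
  double const occupancy =
    100.0 * blocks_per_sm * kBlockSize / prop.maxThreadsPerMultiProcessor;

  std::printf("registers/thread: %d, theoretical occupancy: %.2f%%\n", attr.numRegs, occupancy);
  return 0;
}
```

Building the same file with and without the `__launch_bounds__` qualifier and comparing the printed register count is one way to see whether the hint changes anything for a given kernel. A kernel this small is unlikely to show a difference; the effect reported here was on much larger kernels.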

vyasr · 2022-03-29T21:02:42Z

@abellina you'll need to run clang-format locally.

If we're going to make these changes we should probably also make them in mixed_join_kernels_semi.cu and mixed_join_size_kernels_semi.cu. I don't know if you have any benchmarks showing impacts there, but in principle those could run into similar problems.

abellina · 2022-03-29T21:04:51Z

@vyasr, thanks. The change looked good to me, so I thought "how could the style be wrong"... well, it was. I'll fix it shortly.

Happy to add the check to the semi kernels. I can look for a query that uses them. q72 is special because it is dominated by kernel time, especially in the mixed join, so it is very sensitive.

vyasr · 2022-03-29T21:44:48Z

CI is now failing here because the black fix in #10523 was not backported to 22.04 (because we were in code freeze and didn't want to push the fix if we didn't have to). I think a decision on whether or not to backport that is probably dependent on whether or not to push forward with this change in 22.04 or 22.06.

vyasr · 2022-03-29T22:18:43Z

This PR is blocked by #10535.

hyperbolic2346

Looks good to me once the lint issues are resolved. Thanks for your help there, @vyasr

abellina · 2022-03-30T03:34:17Z

@vyasr, regarding this:

If we're going to make these changes we should probably also make them in mixed_join_kernels_semi.cu and mixed_join_size_kernels_semi.cu. I don't know if you have any benchmarks showing impacts there, but in principle those could run into similar problems.

I made the same change in mixed_join_kernels_semi.cu and mixed_join_size_kernels_semi.cu, and the effect is pretty minimal. compute_mixed_join_output_size_semi went from 96 to 89 registers (occupancy unchanged at 31.25%) and mixed_join_semi went from 82 to 78 registers (occupancy changed from 31.25% to 37.5%). I used q94 for this test, which doesn't spend nearly as much time in the join as q72, and its wall clock time didn't show the regression. q94 invokes the semi join 25 times, whereas q72 invokes the mixed inner join 1,055 times.

Given the above, and with apologies that my test isn't that useful, do you still want the semi change in this PR?


abellina · 2022-03-30T14:33:06Z

I discussed my previous comment with @nvdbaranec offline. I'll add the patch to the semi kernels as a separate commit; if people want to back it out, let me know. At least this way we are consistent.


codecov · 2022-03-30T15:33:05Z

Codecov Report

Merging #10534 (c581fc4) into branch-22.04 (ee3bb0b) will not change coverage.
The diff coverage is n/a.

@@              Coverage Diff              @@
##           branch-22.04   #10534   +/-   ##
=============================================
  Coverage         86.17%   86.17%           
=============================================
  Files               141      141           
  Lines             22510    22510           
=============================================
  Hits              19398    19398           
  Misses             3112     3112


vyasr · 2022-03-30T15:59:58Z

@abellina I think we may as well include it. The semi kernels are intrinsically less complicated, so I'm not surprised that they weren't as sensitive in this case, but you never know what future changes might have an effect here. The launch bounds are accurate, so we may as well be consistent, as @nvdbaranec says.

cpp/src/join/mixed_join_kernels.cu

ttnghia

This also needs admin merge.

abellina added 2 commits March 29, 2022 15:48

Hint block size in mixed_join_size_kernels

920d173

Hint block size in mixed_join_kernels

e17958d

abellina added the Performance (Performance related issue) and non-breaking (Non-breaking change) labels Mar 29, 2022

abellina requested a review from a team as a code owner March 29, 2022 20:59

abellina requested review from hyperbolic2346 and karthikeyann and removed request for a team March 29, 2022 20:59

github-actions bot added the libcudf label (Affects libcudf (C++/CUDA) code) Mar 29, 2022

abellina added the improvement label (Improvement / enhancement to an existing function) Mar 29, 2022

abellina requested a review from vyasr March 29, 2022 21:00

Fix style issues

7b260f2

hyperbolic2346 approved these changes Mar 30, 2022


Merge branch 'branch-22.04' of github.com:rapidsai/cudf into perf/occupancy_in_mixed_join

93d69c9

Hint block size in mixed_join_kernels_semi and mixed_join_size_kernels_semi

c581fc4

ttnghia added the Spark label (Functionality that helps Spark RAPIDS) Mar 30, 2022

ttnghia reviewed Mar 30, 2022


ttnghia approved these changes Mar 30, 2022


ajschmidt8 merged commit 4770599 into rapidsai:branch-22.04 Mar 30, 2022

bdice mentioned this pull request Feb 22, 2023

[FEA] String support for AST expressions #8858

Closed

GregoryKimball mentioned this pull request Jul 5, 2023

[BUG] Improve performance of mixed joins on H100 #13662

Closed
