Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds launch bounds hints to mixed join kernels to address regression seen in NDS q72 in Spark #10534

Merged

Conversation

abellina
Copy link
Contributor

The following change addresses a performance degradation we noticed in the mixed_join and compute_mixed_join_output_size that looks to be tied to the theoretical occupancy of these kernels, as limited by the number of registers used.

The regression is triggered by this patch: #9727, which improves handling of unreachable code paths. That said, somehow, this change is altering the number of registers these kernels need. Both mixed_join and compute_mixed_join_output_size are very sensitive to the register count, per NSight compute. With the patch, the register required changed from 92 to 102, and 118 to 141 respectively.

The fix here hints the compiler what our block size is (128 threads). This, from our testing, allows the compiler to reduce the number of registers required to 128 for compute_mixed_join_output_size and 96 for mixed_join. This lead to better occupancy (I think @nvdbaranec measured it going from 30% to 50%) and I saw the wall clock time of q72 (which started all this) to go from 133s to 121s, which is within the ballpark I'd expect.

@abellina abellina added Performance Performance related issue non-breaking Non-breaking change labels Mar 29, 2022
@abellina abellina requested a review from a team as a code owner March 29, 2022 20:59
@abellina abellina requested review from hyperbolic2346 and karthikeyann and removed request for a team March 29, 2022 20:59
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Mar 29, 2022
@abellina abellina added the improvement Improvement / enhancement to an existing function label Mar 29, 2022
@abellina abellina requested a review from vyasr March 29, 2022 21:00
@vyasr
Copy link
Contributor

vyasr commented Mar 29, 2022

@abellina you'll need to run clang-format locally.

If we're going to make these changes we should probably also make them in mixed_join_kernels_semi.cu and mixed_join_size_kernels_semi.cu. I don't know if you have any benchmarks showing impacts there, but in principle those could run into similar problems.

@abellina
Copy link
Contributor Author

@vyasr, thanks. The changed looked good to me so I thought "how could the style be wrong".. well it was, I'll fix shortly.

Happy to add the check to the semi kernels. I can look for a query that uses it. q72 is special because it is dominated by kernel time, especially the mixed join, so it is very sensitive.

@vyasr
Copy link
Contributor

vyasr commented Mar 29, 2022

CI is now failing here because the black fix in #10523 was not backported to 22.04 (because we were in code freeze and didn't want to push the fix if we didn't have to). I think a decision on whether or not to backport that is probably dependent on whether or not to push forward with this change in 22.04 or 22.06.

@vyasr
Copy link
Contributor

vyasr commented Mar 29, 2022

This PR is blocked by #10535.

Copy link
Contributor

@hyperbolic2346 hyperbolic2346 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me once the lint issues are resolved. Thanks for your help there, @vyasr

@abellina
Copy link
Contributor Author

@vyasr, regarding this:

If we're going to make these changes we should probably also make them in mixed_join_kernels_semi.cu and mixed_join_size_kernels_semi.cu. I don't know if you have any benchmarks showing impacts there, but in principle those could run into similar problems.

I made the same change in mixed_join_kernels_semi.cu and mixed_join_size_kernels_semi.cu, and the effect is pretty minimal. compute_mixed_join_output_size_semi went from 96 to 89 registers (occupancy unchanged at 31.25%) and mixed_join_semi went from 82 to 78 (occupancy changed from 31.25 to 37.5%). I used q94 for this test, which doesn't spend nearly as much time in the join as q72, and the wall clock time of this query didn't show the regression. q94 invokes the semi join 25 times, whereas q72 invokes the mixed inner join 1,055 times.

Given the above, and apologies as my test isn't that useful, do you still want the semi change in this PR?

@abellina
Copy link
Contributor Author

Discussed with @nvdbaranec offline about the below, I'll add the patch to semi as a separate commit, and if people want to back it out let me know. But at least this way we are consistent.

@vyasr, regarding this:

If we're going to make these changes we should probably also make them in mixed_join_kernels_semi.cu and mixed_join_size_kernels_semi.cu. I don't know if you have any benchmarks showing impacts there, but in principle those could run into similar problems.

I made the same change in mixed_join_kernels_semi.cu and mixed_join_size_kernels_semi.cu, and the effect is pretty minimal. compute_mixed_join_output_size_semi went from 96 to 89 registers (occupancy unchanged at 31.25%) and mixed_join_semi went from 82 to 78 (occupancy changed from 31.25 to 37.5%). I used q94 for this test, which doesn't spend nearly as much time in the join as q72, and the wall clock time of this query didn't show the regression. q94 invokes the semi join 25 times, whereas q72 invokes the mixed inner join 1,055 times.

Given the above, and apologies as my test isn't that useful, do you still want the semi change in this PR?

@codecov
Copy link

codecov bot commented Mar 30, 2022

Codecov Report

Merging #10534 (c581fc4) into branch-22.04 (ee3bb0b) will not change coverage.
The diff coverage is n/a.

@@              Coverage Diff              @@
##           branch-22.04   #10534   +/-   ##
=============================================
  Coverage         86.17%   86.17%           
=============================================
  Files               141      141           
  Lines             22510    22510           
=============================================
  Hits              19398    19398           
  Misses             3112     3112           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c42cee3...c581fc4. Read the comment docs.

@vyasr
Copy link
Contributor

vyasr commented Mar 30, 2022

@abellina I think we may as well include it. The semi kernels are intrinsically less complicated so I'm not surprised that they weren't as sensitive in this case, but you never know what future changes might have an effect here and the launch bounds are accurate so we may as well be consistent as @nvdbaranec says.

@ttnghia ttnghia added the Spark Functionality that helps Spark RAPIDS label Mar 30, 2022
Copy link
Contributor

@ttnghia ttnghia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also needs admin merge.

@ajschmidt8 ajschmidt8 merged commit 4770599 into rapidsai:branch-22.04 Mar 30, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Performance Performance related issue Spark Functionality that helps Spark RAPIDS
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants