Spark Decimal128 hashing #9919
Codecov Report
@@ Coverage Diff @@
## branch-22.02 #9919 +/- ##
================================================
- Coverage 10.49% 10.40% -0.09%
================================================
Files 119 119
Lines 20305 20557 +252
================================================
+ Hits 2130 2139 +9
- Misses 18175 18418 +243
Continue to review full report at Codecov.
Some early feedback to get discussions going. I'll need to think some more and revisit this PR.
I left a comment that got lost in a resolved thread:
Using |
For the code/kernels I'm touching in this PR I've made the |
(review in progress, to be continued)
Only one other comment for now.
Final round of suggestions. These are non-blocking and can all be punted to a follow-up PR if needed.
cuDF CI will be unstuck once #10008 is merged, but these suggestions might be good to try locally until that PR is merged, and hopefully CI will pass before code freeze.
Ready to merge as soon as CI passes.
@gpucibot merge
Followup to #9919 -- kernel merging and code cleanup for Murmur3 hash. Partial fix for #10081.

Benchmarked the `compute_bytes` kernel with aligned reads vs. unaligned reads and saw no difference. Looking into it further to confirm that the `uint32_t` construction was doing the same thing implicitly. Due to byte alignment, the string alignment will require the `getblock32` function regardless. The benchmarks run with 100-, 103-, and 104-byte strings showed negligible performance differences, which indicates that forced misalignment does not negatively impact the hash speed.

Authors:
- Ryan Lee (https://github.com/rwlee)
- Bradley Dice (https://github.com/bdice)

Approvers:
- Bradley Dice (https://github.com/bdice)
- Christopher Harris (https://github.com/cwharris)

URL: #10143
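For readers unfamiliar with the aligned vs. unaligned distinction discussed in the follow-up, here is a minimal host-side sketch of an alignment-agnostic 32-bit block read. The name `read_block32` and its signature are hypothetical and are not cuDF's `getblock32`; the sketch only illustrates assembling a little-endian block byte by byte so the read is valid at any starting address.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not cuDF's getblock32): reads the 32-bit block at
// `block_index` from `data`, which may start at any byte address.
// Bytes are combined in little-endian order, so no aligned load is needed.
std::uint32_t read_block32(std::uint8_t const* data, std::size_t block_index)
{
  assert(data != nullptr);
  std::uint8_t const* p = data + 4 * block_index;
  return static_cast<std::uint32_t>(p[0]) |
         (static_cast<std::uint32_t>(p[1]) << 8) |
         (static_cast<std::uint32_t>(p[2]) << 16) |
         (static_cast<std::uint32_t>(p[3]) << 24);
}
```

Calling it on `buffer + 1` instead of `buffer` forces a misaligned start, which is roughly the situation the 103-byte benchmark case exercises.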
cudf work for NVIDIA/spark-rapids#3878
Shortens the hashed data by removing leading zero values -- while ensuring that a sign bit is left in place -- and flips the endianness before hashing the value.
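The shortening step described above can be sketched on the host as follows. This is an illustration under stated assumptions, not the kernel code from this PR: `minimal_big_endian_bytes` is a hypothetical name, `__int128` is a GCC/Clang extension, and the handling of negative values (dropping redundant `0xFF` sign-extension bytes) is assumed by analogy with Java's `BigInteger.toByteArray()`, which is believed to be the byte layout Spark hashes for large decimals.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not cuDF's API): returns the shortest big-endian
// two's-complement encoding of `value`, always keeping a sign bit.
std::vector<std::uint8_t> minimal_big_endian_bytes(__int128 value)
{
  constexpr std::size_t width = sizeof(__int128);

  // Split the value into bytes, least significant first (host-endianness independent).
  auto const bits = static_cast<unsigned __int128>(value);
  std::uint8_t le[width];
  for (std::size_t i = 0; i < width; ++i) {
    le[i] = static_cast<std::uint8_t>(bits >> (8 * i));
  }

  // Redundant high bytes are 0x00 for non-negative values, 0xFF for negative ones.
  bool const negative    = value < 0;
  std::uint8_t const pad = negative ? 0xFF : 0x00;

  // Drop a high byte only if the next byte still carries the correct sign bit.
  std::size_t length = width;
  while (length > 1 && le[length - 1] == pad &&
         ((le[length - 2] & 0x80) != 0) == negative) {
    --length;
  }
  assert(length >= 1);

  // Flip endianness: emit the kept bytes most significant first.
  std::vector<std::uint8_t> out;
  out.reserve(length);
  for (std::size_t i = length; i > 0; --i) { out.push_back(le[i - 1]); }
  return out;
}
```

For example, 128 encodes as the two bytes `00 80` rather than the single byte `80`, because the leading zero byte is what keeps the sign bit clear; the resulting bytes would then be passed to the hash function.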