-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor MD5 implementation. #9212
Conversation
rerun tests |
Codecov Report
@@ Coverage Diff @@
## branch-21.12 #9212 +/- ##
===============================================
Coverage ? 10.75%
===============================================
Files ? 116
Lines ? 19480
Branches ? 0
===============================================
Hits ? 2096
Misses ? 17384
Partials ? 0 Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks awesome. I would be curious to know if this improved overall build time as well since hash_functions.cuh
is also included in the row_operators.cuh
header file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just need to remove some unreferenced variables.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice cleanup. I recommended some further improvements, but be aware that I may not have caught all of the places where those improvements could be made because there seems to be quite a bit of code duplication in this file. I don't know if you want to try and tackle it in this PR or not, but it may be possible to unify a lot of these implementations to a greater degree with some suitably defined device functions. If you think that's out of scope that's fine, just double-check that my suggestions wouldn't also apply in other places.
@davidwendt @vyasr Thanks for the great suggestions! I'll follow up on this soon. Many of these suggestions will also help improve #9215. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have any reason to block this PR beyond other reviewers' comments. Just one question.
@vyasr @davidwendt If you want to do a final review of this PR, feel free. I'm aiming to merge by EOD tomorrow to unblock some next steps on SHA work. Thanks! I ran benchmarks. For a column of |
For performance, I think the lowest hanging fruit is to move the |
@jrhemstad I agree shared memory should be a good perf boost -- I'll give that a shot when I refactor with the SHA code. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a big improvement. Got some further suggestions before merging, but it's almost there. Any performance impact of these changes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a big improvement. Got some further suggestions before merging, but it's almost there. Any performance impact of these changes?
@vyasr This is ready for a final review.
See #9212 (comment). It's slightly slower but I'll refactor this to use shared memory when refactoring with SHA, which should boost performance. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a couple small questions but I think this is good to go however you choose to address them.
@gpucibot merge |
This PR refactors the MD5 hash implementation in libcudf. I used the MD5 code as a reference while working on SHA (extending #6020, PR #9215 to follow).
List of high-level changes:
MD5Hash
and related logic frominclude/cudf/detail/utilities/hash_functions.cuh
tosrc/hash/md5_hash.cu
because it is only used in that file and nowhere else. We don't need to include and build MD5 inhash_functions.cuh
for all the collections/sorting/groupby tools that only use Murmur3 variants andIdentityHash
. (This will be a bigger deal once we add the SHA hash functions, soon to follow this PR, because the size ofhash_functions.cuh
would be substantially larger without this separation.)MD5Hash
constructor that accepted and stored a seed whose value was unused.hash_circular_buffer
and refactored dispatch logic.No changes were made to the feature scope or public APIs of the MD5 feature, so existing unit tests and bindings should remain the same.