Murmur3 hash kernel cleanup #10143
Conversation
Codecov Report
@@              Coverage Diff               @@
##           branch-22.04   #10143      +/-   ##
================================================
- Coverage         10.42%   10.16%   -0.26%
================================================
  Files               119      122       +3
  Lines             20603    24693    +4090
================================================
+ Hits               2148     2511     +363
- Misses            18455    22182    +3727
Continue to review full report at Codecov.
Here are some starting ideas for us to refactor. We can work through these in a pairing session.
constexpr uint32_t c3     = 0xe6546b64;
constexpr uint32_t rot_c1 = 15;
constexpr uint32_t rot_c2 = 13;
auto getblock32 = [] __device__(uint32_t const* p, int i) -> uint32_t {
Let's pull this lambda out into a separate device function (not define it as an inline lambda), like rotl32 and fmix32. See additional comments below about why it isn't safe for this function to take uint32_t const* and why it must instead take std::byte const*. With that change there's also no need for the offset parameter int i, because we can do the offsetting with pointer arithmetic at the call site. A minimal sketch of the refactor is below.
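To make the shape concrete, here's a minimal sketch (the function signature and the little-endian byte assembly are my assumptions here, not the final cuDF code):

```cpp
#include <cstddef>
#include <cstdint>

// Assembles a little-endian uint32_t one byte at a time. Unlike dereferencing a
// uint32_t const*, this is well-defined even when `data` is not 4-byte aligned.
__device__ inline uint32_t getblock32(std::byte const* data)
{
  return static_cast<uint32_t>(data[0]) | (static_cast<uint32_t>(data[1]) << 8) |
         (static_cast<uint32_t>(data[2]) << 16) | (static_cast<uint32_t>(data[3]) << 24);
}

// Call site: pointer arithmetic replaces the old offset parameter `int i`:
//   uint32_t const k1 = getblock32(data + i * sizeof(uint32_t));
```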
  k1 *= c1;
- k1 = rotl32(k1, 15);
+ k1 = rotl32(k1, rot_c1);
Future PR: We might define the common functions and magic values shared between MurmurHash3_32 and SparkMurmurHash3_32, like rotl32, fmix32, and getblock32, in a common base class, and only override the Spark-specific bits in a derived class. CRTP might be an even better choice, like I did for the SHA-family functions (draft #9215) - just needs a bit of analysis to decide which way to go. A rough sketch of the CRTP option follows.
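As an illustration of the CRTP direction (class and member names here are hypothetical, loosely modeled on the SHA refactor, not a proposal for the exact interface):

```cpp
#include <cstddef>
#include <cstdint>

// Shared helpers and magic constants live in the CRTP base; Hasher is the derived type.
template <typename Hasher>
struct murmur_base {
  __device__ uint32_t rotl32(uint32_t x, uint32_t r) const { return (x << r) | (x >> (32 - r)); }

  __device__ uint32_t fmix32(uint32_t h) const
  {
    h ^= h >> 16;
    h *= 0x85ebca6b;
    h ^= h >> 13;
    h *= 0xc2b2ae35;
    h ^= h >> 16;
    return h;
  }

  // Shared code calls the derived-class hook without virtual dispatch:
  __device__ uint32_t finish_tail(std::byte const* tail, uint32_t len, uint32_t h) const
  {
    return static_cast<Hasher const*>(this)->process_tail(tail, len, h);
  }
};

struct MurmurHash3_32_sketch : murmur_base<MurmurHash3_32_sketch> {
  __device__ uint32_t process_tail(std::byte const*, uint32_t, uint32_t h) const
  {
    return h;  // standard Murmur3 tail handling would go here
  }
};

struct SparkMurmurHash3_32_sketch : murmur_base<SparkMurmurHash3_32_sketch> {
  __device__ uint32_t process_tail(std::byte const*, uint32_t, uint32_t h) const
  {
    return h;  // Spark-specific tail handling (signed char bytes) would go here
  }
};
```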
original_benchmark.txt

There's a very tiny performance hit after the most recent set of changes; I don't think it's large enough to be a concern.
I think this PR is beneficial and its scope is reasonable -- I'd hesitate to make it larger since we're at a stopping point and have good benchmarks. I have some ideas for further refactors that I'd like to take on later.

I'm going to apply my own suggestions for renaming variables and will merge branch-22.04 so we can get an updated build time metrics report.
@@ -105,6 +105,14 @@ struct MurmurHash3_32 {
  return h;
}
Note to self for a future PR: Do we need MurmurHash3_32 to be a templated class? Currently the class takes a template parameter Key and has an operator()(Key key) with no template parameters, which calls a templated compute(T key). However, the way it's called in row_operators.cuh seems to indicate that we could instead have a plain (non-template) class with a templated operator(). That's the way we typically do type dispatching, and it's reversed here for no clear reason. The calling code uses a type dispatch on element_hasher_with_seed.

(This would probably affect performance and/or compile time, but I don't know if it would be better or worse.)

A sketch of the non-template shape is below.
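For comparison, the non-template shape would look roughly like this (hypothetical and heavily simplified; the seed member and the empty body are stand-ins for the real hashing logic):

```cpp
#include <cstdint>

// A plain (non-template) functor: the element type T is chosen when the type
// dispatcher instantiates operator()<T>, matching our usual dispatch pattern.
struct element_murmur_hasher {
  uint32_t seed{0};

  template <typename T>
  __device__ uint32_t operator()(T const& key) const
  {
    // per-type Murmur3 computation would go here
    return seed;
  }
};
```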
@@ -131,60 +139,69 @@ struct MurmurHash3_32 {
  return combined;
}

result_type __device__ inline operator()(Key const& key) const { return compute(key); }
// TODO Do we need this operator() and/or compute? Probably not both.
Note to self for a future PR: I would try removing the compute method and moving its definition into operator(). I think we might be able to safely remove the template <typename T> on compute(T). The template parameter T has to match the class template parameter Key, from what I can see, and may be redundant. Any exceptions to this would probably be solved by removing the class template parameter Key and switching to just an operator() template parameter. A sketch of the merged form is below.
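A sketch of the merged form under that assumption (hypothetical; the real class has more members, and hash_value_type stands in for the actual result_type):

```cpp
#include <cstdint>

using hash_value_type = uint32_t;  // assumption: stand-in for the real result_type

// compute(T) is gone; its body lives directly in operator(), whose argument type
// is already fixed by the class template parameter Key.
template <typename Key>
struct MurmurHash3_32_merged {
  hash_value_type __device__ inline operator()(Key const& key) const
  {
    hash_value_type h = 0;
    // hashing logic formerly in compute(T key) goes here
    return h;
  }
};
```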
rerun tests

@cwharris This is ready for review, I just fixed the labels so it no longer says “in progress.” 😊

@gpucibot merge
Follow-up to #9919 -- kernel merging and code cleanup for the Murmur3 hash. Partial fix for #10081.

Benchmarked the compute_bytes kernel with aligned reads vs. unaligned reads and saw no difference; looking into it further to confirm that the uint32_t construction was doing the same thing implicitly. Due to byte alignment, string hashing will require the getblock32 function regardless. In any case, the benchmarks run with 100-, 103-, and 104-byte strings showed negligible performance differences, which indicates that forced misalignment does not negatively impact hash speed. A small illustration of the aligned-vs-unaligned read question is below.
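To illustrate the aligned-vs-unaligned point (a host-side sketch, not the cuDF kernel code):

```cpp
#include <cstdint>
#include <cstring>

// Undefined behavior if p is not 4-byte aligned.
inline uint32_t load_u32_cast(unsigned char const* p)
{
  return *reinterpret_cast<uint32_t const*>(p);
}

// Well-defined for any alignment, and typically compiles to the same single load
// instruction on architectures that support unaligned access.
inline uint32_t load_u32_memcpy(unsigned char const* p)
{
  uint32_t v;
  std::memcpy(&v, p, sizeof(v));
  return v;
}
```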