-
Notifications
You must be signed in to change notification settings - Fork 927
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MD5 refactoring. #10445
MD5 refactoring. #10445
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some explanations to help reviewers.
@@ -48,6 +48,107 @@ T __device__ inline normalize_nans_and_zeros(T const& key) | |||
return key; | |||
} | |||
|
|||
__device__ inline uint32_t rotate_bits_left(uint32_t x, uint32_t r) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved some common functions/classes into this header so that it can be re-used by the SHA hash in a follow-up PR. This code was not edited from the previous file except to replace some int8
arguments with uint32_t
arguments for the bit rotation functions. This change aligns with the input data types and expected types for the CUDA intrinsics __funnelshift_l
/ __funnelshift_r
.
HASH_BENCHMARK_DEFINE(HASH_SERIAL_MURMUR3, nulls) | ||
HASH_BENCHMARK_DEFINE(HASH_SPARK_MURMUR3, nulls) | ||
HASH_BENCHMARK_DEFINE(HASH_MD5, nulls) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A good chunk of the diffs in this PR are just from moving code around. I put MD5 at the bottom of the list of hashing functions defined in the hash_id
enum, and reorganized everything else to have the same order. That means that the MD5 and SHA features can be listed next to each other in a subsequent PR #9215. MD5 and SHA are in the same family of cryptographic hash functions, so it's a logical grouping.
@@ -90,11 +191,6 @@ struct MurmurHash3_32 { | |||
MurmurHash3_32() = default; | |||
constexpr MurmurHash3_32(uint32_t seed) : m_seed(seed) {} | |||
|
|||
[[nodiscard]] __device__ inline uint32_t rotl32(uint32_t x, uint32_t r) const | |||
{ | |||
return __funnelshift_l(x, x, r); // Equivalent to (x << r) | (x >> (32 - r)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We no longer need to define rotl32
in MurmurHash3_32
because it's identical to the cudf::detail::rotate_bits_left
utility that I moved into this file.
cpp/src/hash/md5_hash.cu
Outdated
return make_strings_column( | ||
input.num_rows(), std::move(offsets_column), std::move(chars_column), 0, std::move(null_mask)); | ||
// Build an output null mask from the logical AND of all input columns' null masks. | ||
auto [null_mask, null_count] = cudf::detail::bitmask_and(input, stream); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This changes the null behavior so that a hash of null input data returns a null value, rather than the hash of an empty byte string. This ensures that users can distinguish between the hash of empty bytes and the hash of null inputs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I retracted this change in behavior because it caused tests to fail. We do need to fix this so that null inputs produce null outputs, and I will open a separate PR to fix it.
], | ||
) | ||
def test_series_hash_values(method, validation_data): | ||
@pytest.mark.parametrize("method", ["md5"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This list of methods will be expanded in #9215.
auto __device__ inline get_element_pointer_and_size(Element const& element) | ||
{ | ||
if constexpr (is_fixed_width<Element>() && !is_chrono<Element>()) { | ||
return thrust::make_pair(reinterpret_cast<uint8_t const*>(&element), sizeof(Element)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
File this away for future imrpovement, but this should just return a device_span<byte>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 👍
Codecov Report
@@ Coverage Diff @@
## branch-22.04 #10445 +/- ##
================================================
+ Coverage 86.13% 86.18% +0.04%
================================================
Files 139 139
Lines 22438 22468 +30
================================================
+ Hits 19328 19363 +35
+ Misses 3110 3105 -5
Continue to review full report at Codecov.
|
rerun tests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The enum change in HashType.java looks good to me.
@gpucibot merge |
This PR refactors the MD5 hashing functionality. It moves some code that will be shared logic for SHA hashing (#9215), and reduces the diff of that PR to make it easier to review.