-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[REVIEW] Add MD5 to existing hashing functionality #5438
Conversation
@rwlee we may have a project around this, worth syncing? |
Please update the changelog in order to start CI tests. View the gpuCI docs here. |
1 similar comment
Please update the changelog in order to start CI tests. View the gpuCI docs here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks pretty good.
Some mismatches in code style: lack of auto use, west const, C-style casts. Not a big deal, but it would be great to keep the code consistent.
uint32_t C = hash_state->hash_value[2]; | ||
uint32_t D = hash_state->hash_value[3]; | ||
|
||
uint32_t* buffer_ints = (uint32_t*)hash_state->buffer; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use reinterpret_cast
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Switched to a memcpy
std::memcpy(&buffer_element_as_int, hash_state->buffer + g * 4, 4); |
const md5_hash_constants_type* hash_constants, | ||
const md5_shift_constants_type* shift_constants) const | ||
{ | ||
uint8_t* data = (uint8_t*)&key; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reinterpret_cast here too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
uint8_t const* data = reinterpret_cast<uint8_t const*>(&key); |
uint64_t full_length = (uint64_t)hash_state->message_length; | ||
full_length = full_length << 3; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
uint64_t full_length = (uint64_t)hash_state->message_length; | |
full_length = full_length << 3; | |
auto const full_length = (static_cast<uint64_t>hash_state->message_length) << 3; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
auto const full_length = (static_cast<uint64_t>(hash_state->message_length)) << 3; |
cpp/src/hash/hash_constants.cu
Outdated
/** | ||
* @copydoc cudf::detail::get_hex_to_char_mapping | ||
*/ | ||
const hex_to_char_mapping_type* get_hex_to_char_mapping() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like other places that require global thread-safe initialization use thread_safe_per_context_cache
instead of a std::mutex
. Can you use it here too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Refactored to thread_safe_per_context_cache, then removed entirely in favor of __device__ __constant__
cpp/include/cudf/detail/hashing.hpp
Outdated
std::vector<uint32_t> const& initial_hash = {}, | ||
rmm::mr::device_memory_resource* mr = rmm::mr::get_default_resource(), | ||
cudaStream_t stream = 0); | ||
|
||
std::unique_ptr<column> identity_hash( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is this function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be the wrapper call for the IdentityHash function, I did not end up implementing it and will remove it.
CUDF_FAIL("Unsupported hash type"); | ||
} | ||
|
||
void CUDA_HOST_DEVICE_CALLABLE operator()(Key const& key, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this one empty? Might warrant a comment if I'm not missing something trivial.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did a bunch of testing today to understand and explain this issue better. This operatore()
is acting as a catch all case representing default behavior, that allows me to override specific cases like a string_view outside of the struct. Without this un-templated operator function, I get a bunch of compile errors ../include/cudf/detail/utilities/hash_functions.cuh(165): error: no suitable constructor exists to convert from "const __nv_bool" to "cudf::data_type"
for a bunch of different types.
Ideally the default behavior should actually be a CUDF_FAIL("Unsupported hash type")
but adding that to the empty function causes ../include/cudf/detail/utilities/hash_functions.cuh(173): error: device code does not support exception handling
errors.
During my testing this afternoon, non-fixed width types never hit the CUDF_FAIL
on line 146 -- the same was seen for other types I was trying to filter out as unsupported column data types. It's clear the current method of filtering out unsupported types doesn't work, any guidance on how to fix this would be appreciated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this isn't right. You're not hitting CUDF_FAIL
because you're doing device-side dispatch and calling your MD5Hash object in device code. CUDF_FAIL
is a host-only construct. I'm surprised this even compiles.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
EDIT: the advice does not stand, see the row_hasher
instead as suggested below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd suggest looking at row_hasher
here:
cudf/cpp/include/cudf/table/row_operators.cuh
Line 411 in 855e735
class row_hasher { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Required a bit of a rework, the use of size_of
rather than sizeof
was causing failures with primitive types and were being type dispatched to other operator functions.
Breaks the python murmur3 hash functionality, fixing it now but will likely require another review. |
Codecov Report
@@ Coverage Diff @@
## branch-0.15 #5438 +/- ##
===============================================
+ Coverage 84.08% 84.43% +0.35%
===============================================
Files 80 80
Lines 13062 13424 +362
===============================================
+ Hits 10983 11335 +352
- Misses 2079 2089 +10
Continue to review full report at Codecov.
|
cpp/include/cudf/hashing.hpp
Outdated
std::vector<uint32_t> const& initial_hash = {}, | ||
rmm::mr::device_memory_resource* mr = rmm::mr::get_default_resource()); | ||
|
||
std::unique_ptr<column> murmur_hash3_32( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't want the same function exposed two ways. Either the Python should be updated to call the hash
with the HASH_MURMUR3
enum or get rid of the hash
API and add a specific MD5 hash. I'd opt for the former.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mirrored the hash_id
enum over to the python side and removed the murmur hash specific function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Python changes LGTM
cpp/src/hash/hash_constants.cu
Outdated
/** | ||
* @copydoc cudf::detail::get_md5_hash_constants | ||
*/ | ||
const md5_hash_constants_type* get_md5_hash_constants() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
__constant__ md5_shift_constants_type g_md5_shift_constants[16]
= { 7, 12, 17, 22, 5, 9, 14, 20, 4, 11, 16, 23, 6, 10, 15, 21,};
This initialization will happen automatically in device memory (unsure when. likely when cuda context is created)
Limit to Constant memory size is 64 KB.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks OK to me, but I didn't see any java code changes, so not really sure why a java code owner review is needed for this change.
rerun tests |
Resolves #4989
Refactors hashing support in preparation for MD5 support.