-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Additional refactoring of hash functions #10462
Conversation
…to its own function.
…urrently broken)." This reverts commit 6c293ed.
@@ -593,3 +595,6 @@ struct IdentityHash { | |||
|
|||
template <typename Key> | |||
using default_hash = MurmurHash3_32<Key>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the default_hash
be in a detail
namespace or not? I put the hash function implementations like MurmurHash3_32
in a detail
namespace.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My recollection is that the default_hash
is only used as the default hash function for various internal hash tables (join, groupby, etc), so it's only necessary for developers. As such, detail
seems appropriate.
Codecov Report
@@ Coverage Diff @@
## branch-22.06 #10462 +/- ##
================================================
- Coverage 86.38% 86.38% -0.01%
================================================
Files 142 142
Lines 22334 22335 +1
================================================
Hits 19294 19294
- Misses 3040 3041 +1
Continue to review full report at Codecov.
|
// Read a 4-byte value from the data pointer as individual bytes for safe | ||
// unaligned access (very likely for string types). | ||
auto block = reinterpret_cast<uint8_t const*>(data + offset); | ||
return block[0] | (block[1] << 8) | (block[2] << 16) | (block[3] << 24); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Future PR, but this would be more efficient to round up to the nearest 4B aligned address and do a 4B load and then shift the bits as appropriate to get them back in the expected order.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You'd need to do two 4B loads and shift both results together if you were reading 4B aligned addresses, right? This is meant to handle the case where the 4 bytes are not aligned (could cross 4B alignment boundaries).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, you'd have to do two when it's not already aligned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have code for this sitting around in various places.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok great. I'll figure that out in a future PR -- there are still other changes I have in mind (e.g. device_span<std::byte>
), but this PR is needed to unblock other work so I'm trying to limit its scope.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few minor thoughts, but LGTM.
@@ -117,8 +118,8 @@ std::unique_ptr<detail::merge_pairs_map_type> initialize_merge_pairs_map( | |||
|
|||
merge_pairs_map->insert(iter, | |||
iter + input.size(), | |||
cuco::detail::MurmurHash3_32<hash_value_type>{}, | |||
thrust::equal_to<hash_value_type>{}, | |||
cuco::detail::MurmurHash3_32<cudf::hash_value_type>{}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of scope for this PR (really, out of scope for any PR in this repository), but it seems bad that we're relying on a cuco detail namespace here. @jrhemstad do we need MurmurHash3_32
to be exposed more publicly in cuco
? If we expect callers to use it as a provided hash function then it shouldn't be detail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree this is awkward / undesirable. This may be resolved or improved by #10401.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, no. The issue is that we need to provide a stream and that requires re-supplying the other defaulted arguments. https://github.com/NVIDIA/cuCollections/blob/fb58a38701f1c24ecfe07d8f1f208bbe80930da5/include/cuco/static_map.cuh#L224-L231
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, but IMO it's a design issue in cuco if supplying a stream while using the default hash requires pulling the default hash out of a detail namespace.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with that. (Sorry, the "Actually, no" was about whether #10401 would improve this situation. It would not.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. I think in the long run we probably don't want to be able to provide the hash function and equality comparator as template parameters of these methods, but rather as parameters of the constructor. It doesn't really make sense to be able to insert and query with different ones. Unfortunately we currently abuse this ability in libcudf, so I don't think removing it is feasible in the short term, but in the longer term getting rid of this would make it easier to provide streams without having this problem since the hash/equality operators would be defined on construction and the user wouldn't need to provide those templates unless they wanted to override the default.
Co-authored-by: Vyas Ramasubramani <[email protected]>
@gpucibot merge |
Updated as part of rapidsai/cudf#10462 Update namespace for default_hash. Also update a python call that was changed in cudf. Authors: - Chuck Hastings (https://github.com/ChuckHastings) Approvers: - Seunghwa Kang (https://github.com/seunghwak) - Joseph Nke (https://github.com/jnke2016) - Rick Ratzel (https://github.com/rlratzel) URL: #2244
Additional work related to #10081.
This is breaking because it reorganizes several public names/namespaces.
Summary of changes in this PR:
cudf
namespace now wraps the contents ofhash_functions.cuh
, and some public names are now classified asdetail
APIs.SparkMurmurHash3_32
has been updated to align with the design and naming conventions ofMurmurHash3_32