Additional refactoring of hash functions #10462

bdice · 2022-03-18T22:14:02Z

Additional work related to #10081.

This is a breaking change because it reorganizes several public names and namespaces.

Summary of changes in this PR:

- The cudf namespace now wraps the contents of hash_functions.cuh, and some public names are now classified as detail APIs.
- SparkMurmurHash3_32 has been updated to align with the design and naming conventions of MurmurHash3_32.

bdice · 2022-04-18T18:58:53Z

cpp/include/cudf/detail/utilities/hash_functions.cuh

@@ -593,3 +595,6 @@ struct IdentityHash {

 template <typename Key>
 using default_hash = MurmurHash3_32<Key>;


Should the default_hash be in a detail namespace or not? I put the hash function implementations like MurmurHash3_32 in a detail namespace.

My recollection is that default_hash is only used as the default hash function for various internal hash tables (join, groupby, etc.), so only developers need it. As such, detail seems appropriate.
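For readers tracking the breaking change, the resulting layout can be sketched roughly as follows. This is a minimal host-only illustration: the names `hash_value_type`, `MurmurHash3_32`, and `default_hash` come from this PR, but the `uint32_t` width and the functor body are placeholder assumptions, not the libcudf implementation.

```cpp
#include <cstdint>
#include <type_traits>

namespace cudf {

// Assumption: a 32-bit hash value type. The PR only states that the name is
// defined in the cudf namespace.
using hash_value_type = std::uint32_t;

namespace detail {

// Placeholder functor standing in for the real hash. The body is NOT
// MurmurHash3; it only gives the sketch something to compile and call.
template <typename Key>
struct MurmurHash3_32 {
  constexpr MurmurHash3_32() = default;

  constexpr hash_value_type operator()(Key const& key) const
  {
    return static_cast<hash_value_type>(key) * 0x9E3779B1u;
  }
};

// The alias under discussion, placed next to the implementation it names.
template <typename Key>
using default_hash = MurmurHash3_32<Key>;

}  // namespace detail
}  // namespace cudf

// The alias and the functor are the same type.
static_assert(
  std::is_same_v<cudf::detail::default_hash<int>, cudf::detail::MurmurHash3_32<int>>);
```

With this layout, code that uses the alias has to spell it `cudf::detail::default_hash<Key>`, which signals that it is a developer-facing name rather than part of the public API.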

python/cudf/cudf/core/resample.py

codecov · 2022-04-18T20:13:41Z

Codecov Report

Merging #10462 (0fcbb23) into branch-22.06 (94a5d41) will decrease coverage by 0.00%.
The diff coverage is n/a.

@@               Coverage Diff                @@
##           branch-22.06   #10462      +/-   ##
================================================
- Coverage         86.38%   86.38%   -0.01%     
================================================
  Files               142      142              
  Lines             22334    22335       +1     
================================================
  Hits              19294    19294              
- Misses             3040     3041       +1

| Impacted Files | Coverage Δ | |
|---|---|---|
| python/cudf/cudf/utils/gpu_utils.py | `50.00% <0.00%> (-4.29%)` | ⬇️ |
| python/cudf/cudf/core/reshape.py | `89.82% <0.00%> (-0.27%)` | ⬇️ |
| python/cudf/cudf/core/frame.py | `93.41% <0.00%> (-0.26%)` | ⬇️ |
| python/cudf/cudf/comm/gpuarrow.py | `79.76% <0.00%> (-0.24%)` | ⬇️ |
| python/cudf/cudf/core/indexed_frame.py | `91.70% <0.00%> (-0.07%)` | ⬇️ |
| python/cudf/cudf/core/dataframe.py | `93.74% <0.00%> (-0.01%)` | ⬇️ |
| python/cudf/cudf/core/column/column.py | `89.45% <0.00%> (ø)` | |
| python/cudf/cudf/core/column/categorical.py | `89.77% <0.00%> (ø)` | |
| python/cudf/cudf/core/column/string.py | `89.22% <0.00%> (+0.12%)` | ⬆️ |
| python/cudf/cudf/core/groupby/groupby.py | `91.64% <0.00%> (+0.15%)` | ⬆️ |

... and 3 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9409559...0fcbb23.

cpp/include/cudf/detail/utilities/hash_functions.cuh

jrhemstad · 2022-04-19T21:33:08Z

cpp/include/cudf/detail/utilities/hash_functions.cuh

+    // Read a 4-byte value from the data pointer as individual bytes for safe
+    // unaligned access (very likely for string types).
+    auto block = reinterpret_cast<uint8_t const*>(data + offset);
+    return block[0] | (block[1] << 8) | (block[2] << 16) | (block[3] << 24);


Future PR, but it would be more efficient to round up to the nearest 4B-aligned address, do a 4B load, and then shift the bits as appropriate to get them back in the expected order.

You'd need to do two 4B loads and shift both results together if you were reading 4B aligned addresses, right? This is meant to handle the case where the 4 bytes are not aligned (could cross 4B alignment boundaries).

Yep, you'd have to do two when it's not already aligned.

We have code for this sitting around in various places.

OK, great. I'll figure that out in a future PR -- there are still other changes I have in mind (e.g. device_span<std::byte>), but this PR is needed to unblock other work so I'm trying to limit its scope.
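A rough host-side sketch of the two approaches discussed in this thread, for comparison. Assumptions: little-endian byte order, data stored in 4B-aligned `uint32_t` words, and one more readable word after the word containing the offset whenever the read is unaligned. The function names are hypothetical, and this is not the existing code mentioned above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Byte-wise load, in the spirit of the diff above: valid for any alignment.
// Each byte is widened to 32 bits before shifting.
inline std::uint32_t load_bytewise(std::uint8_t const* data, std::size_t offset)
{
  std::uint8_t const* block = data + offset;
  return static_cast<std::uint32_t>(block[0]) |
         (static_cast<std::uint32_t>(block[1]) << 8) |
         (static_cast<std::uint32_t>(block[2]) << 16) |
         (static_cast<std::uint32_t>(block[3]) << 24);
}

// Hypothetical alternative: one aligned 4B load when the offset is already
// aligned, otherwise two aligned 4B loads whose results are shifted together.
// Assumes little-endian byte order.
inline std::uint32_t load_two_aligned(std::uint32_t const* words,
                                      std::size_t num_words,
                                      std::size_t byte_offset)
{
  std::size_t const index = byte_offset / 4;
  unsigned const shift    = static_cast<unsigned>(byte_offset % 4) * 8;
  assert(index < num_words);
  if (shift == 0) { return words[index]; }  // already aligned: single load
  assert(index + 1 < num_words);            // the read crosses a 4B boundary
  return (words[index] >> shift) | (words[index + 1] << (32 - shift));
}
```

The unaligned path trades four 1B loads for two 4B loads plus shifts, which matches the point made in the replies: a single aligned load only suffices when the address is already 4B-aligned.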

vyasr

A few minor thoughts, but LGTM.

cpp/src/io/parquet/chunk_dict.cu

vyasr · 2022-04-20T17:56:24Z

cpp/src/text/subword/load_merges_file.cu

@@ -117,8 +118,8 @@ std::unique_ptr<detail::merge_pairs_map_type> initialize_merge_pairs_map(

  merge_pairs_map->insert(iter,
                          iter + input.size(),
-                          cuco::detail::MurmurHash3_32<hash_value_type>{},
-                          thrust::equal_to<hash_value_type>{},
+                          cuco::detail::MurmurHash3_32<cudf::hash_value_type>{},


Out of scope for this PR (really, out of scope for any PR in this repository), but it seems bad that we're relying on a cuco detail namespace here. @jrhemstad do we need MurmurHash3_32 to be exposed more publicly in cuco? If we expect callers to use it as a provided hash function then it shouldn't be detail.

I agree this is awkward / undesirable. This may be resolved or improved by #10401.

Actually, no. The issue is that we need to provide a stream and that requires re-supplying the other defaulted arguments. https://github.com/NVIDIA/cuCollections/blob/fb58a38701f1c24ecfe07d8f1f208bbe80930da5/include/cuco/static_map.cuh#L224-L231

Sure, but IMO it's a design issue in cuco if supplying a stream while using the default hash requires pulling the default hash out of a detail namespace.

I agree with that. (Sorry, the "Actually, no" was about whether #10401 would improve this situation. It would not.)

Got it. I think in the long run we probably don't want to be able to provide the hash function and equality comparator as template parameters of these methods, but rather as parameters of the constructor. It doesn't really make sense to be able to insert and query with different ones. Unfortunately we currently abuse this ability in libcudf, so I don't think removing it is feasible in the short term, but in the longer term getting rid of this would make it easier to provide streams without having this problem since the hash/equality operators would be defined on construction and the user wouldn't need to provide those templates unless they wanted to override the default.
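To make the constructor-based design concrete, here is a toy host-only sketch. Everything in it is hypothetical (`toy_map`, `fake_stream`, the linear-probing scheme); it is not the cuco API, only an illustration of fixing the hash and equality operators at construction so that a per-call stream argument does not force the caller to re-supply them.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <optional>
#include <utility>
#include <vector>

// Stand-in for a CUDA stream handle so the sketch compiles without CUDA.
struct fake_stream {
  int id = 0;
};

template <typename Key,
          typename Value,
          typename Hash     = std::hash<Key>,
          typename KeyEqual = std::equal_to<Key>>
class toy_map {
 public:
  // Hash and equality are chosen once, here, and reused by every operation.
  explicit toy_map(std::size_t capacity, Hash hash = Hash{}, KeyEqual equal = KeyEqual{})
    : slots_(capacity), hash_(hash), equal_(equal)
  {
    assert(capacity > 0);
  }

  // Only the stream is a per-call argument; no hash or comparator is needed.
  // Returns false if the key is already present or the table is full.
  bool insert(Key const& key, Value const& value, fake_stream /*stream*/ = {})
  {
    std::size_t const start = hash_(key);
    for (std::size_t i = 0; i < slots_.size(); ++i) {
      auto& slot = slots_[(start + i) % slots_.size()];
      if (!slot) {
        slot.emplace(key, value);
        return true;
      }
      if (equal_(slot->first, key)) { return false; }
    }
    return false;
  }

  std::optional<Value> find(Key const& key, fake_stream /*stream*/ = {}) const
  {
    std::size_t const start = hash_(key);
    for (std::size_t i = 0; i < slots_.size(); ++i) {
      auto const& slot = slots_[(start + i) % slots_.size()];
      if (!slot) { return std::nullopt; }
      if (equal_(slot->first, key)) { return slot->second; }
    }
    return std::nullopt;
  }

 private:
  std::vector<std::optional<std::pair<Key, Value>>> slots_;
  Hash hash_;
  KeyEqual equal_;
};
```

In this shape a caller who only wants to pass a stream writes `m.insert(key, value, stream)` and never names the default hash, so nothing needs to be pulled out of a detail namespace. The cost is that inserting and querying with different operators is no longer possible, which is the ability noted above as currently relied on in libcudf.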

cpp/include/cudf/detail/utilities/hash_functions.cuh

Co-authored-by: Vyas Ramasubramani <[email protected]>

bdice · 2022-04-20T18:35:44Z

@gpucibot merge

Updated as part of rapidsai/cudf#10462

Update namespace for default_hash. Also update a python call that was changed in cudf.

Authors:
- Chuck Hastings (https://github.com/ChuckHastings)

Approvers:
- Seunghwa Kang (https://github.com/seunghwak)
- Joseph Nke (https://github.com/jnke2016)
- Rick Ratzel (https://github.com/rlratzel)

URL: #2244

bdice added 9 commits March 8, 2022 12:48

Refactor float normalization.

63f187a

Refactor namespaces.

84de276

Remove this-> for consistency.

bda5910

Unify Spark/non-Spark implementations and separate tail processing in…

13b831a

…to its own function.

Move MurmurHash3_32 and default_hash into cudf::detail.

6929e63

Make SparkMurmurHash3_32 inherit from MurmurHash3_32 (tests currently…

6c293ed

… broken).

Revert "Make SparkMurmurHash3_32 inherit from MurmurHash3_32 (tests c…

7cdec5f

…urrently broken)." This reverts commit 6c293ed.

Make default constructor constexpr.

cafd0b3

Define hash_value_type in cudf namespace.

a24f52d

github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Mar 18, 2022

Merge branch 'branch-22.06' into hashing-refactor-2

bd0d981

bdice added the improvement Improvement / enhancement to an existing function label Mar 18, 2022

bdice self-assigned this Mar 18, 2022

Merge remote-tracking branch 'upstream/branch-22.06' into hashing-ref…

8caa85c

…actor-2

github-actions bot added the Python Affects Python cuDF API. label Apr 17, 2022

bdice added 2 commits April 17, 2022 13:54

Replace rotl32 with rotate_bits_left.

7927735

Update bpe_tokenizer.cuh.

f02fa68

bdice added the breaking Breaking change label Apr 18, 2022

bdice added 2 commits April 18, 2022 10:59

Merge remote-tracking branch 'upstream/branch-22.06' into hashing-ref…

b01a59c

…actor-2

Fix subword includes.

e491115

bdice commented Apr 18, 2022

python/cudf/cudf/core/resample.py (outdated, resolved)

bdice marked this pull request as ready for review April 18, 2022 18:59

bdice requested review from a team as code owners April 18, 2022 18:59

bdice requested review from vyasr, charlesbluca and hyperbolic2346 April 18, 2022 18:59

Revert copyright change.

1e63821

github-actions bot removed the Python Affects Python cuDF API. label Apr 19, 2022

bdice removed request for a team and charlesbluca April 19, 2022 20:15

bdice mentioned this pull request Apr 19, 2022

Add row hasher with nested column support #10641

Merged

jrhemstad reviewed Apr 19, 2022

cpp/include/cudf/detail/utilities/hash_functions.cuh (outdated, resolved)

jrhemstad reviewed Apr 19, 2022

jrhemstad approved these changes Apr 19, 2022

Add [[fallthrough]].

0a17018

vyasr approved these changes Apr 20, 2022

Make operator() const.

0fcbb23

Co-authored-by: Vyas Ramasubramani <[email protected]>

rapids-bot bot merged commit c8c7271 into rapidsai:branch-22.06 Apr 20, 2022

ChuckHastings mentioned this pull request Apr 25, 2022

cudf moved the default_hash into the cudf::detail namespace rapidsai/cugraph#2244

Merged

benfred mentioned this pull request Aug 23, 2022

[BUG] HugeCTR doesn't compile with cudf v22.06+ NVIDIA-Merlin/HugeCTR#353

Closed
