Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Additional refactoring of hash functions #10462

Merged
merged 18 commits into from
Apr 20, 2022

Conversation

bdice
Copy link
Contributor

@bdice bdice commented Mar 18, 2022

Additional work related to #10081.

This is breaking because it reorganizes several public names/namespaces.

Summary of changes in this PR:

  • The cudf namespace now wraps the contents of hash_functions.cuh, and some public names are now classified as detail APIs.
  • SparkMurmurHash3_32 has been updated to align with the design and naming conventions of MurmurHash3_32

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Mar 18, 2022
@bdice bdice added the improvement Improvement / enhancement to an existing function label Mar 18, 2022
@bdice bdice self-assigned this Mar 18, 2022
@github-actions github-actions bot added the Python Affects Python cuDF API. label Apr 17, 2022
@bdice bdice added the breaking Breaking change label Apr 18, 2022
@@ -593,3 +595,6 @@ struct IdentityHash {

template <typename Key>
using default_hash = MurmurHash3_32<Key>;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the default_hash be in a detail namespace or not? I put the hash function implementations like MurmurHash3_32 in a detail namespace.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My recollection is that the default_hash is only used as the default hash function for various internal hash tables (join, groupby, etc), so it's only necessary for developers. As such, detail seems appropriate.

@bdice bdice marked this pull request as ready for review April 18, 2022 18:59
@bdice bdice requested review from a team as code owners April 18, 2022 18:59
@codecov
Copy link

codecov bot commented Apr 18, 2022

Codecov Report

Merging #10462 (0fcbb23) into branch-22.06 (94a5d41) will decrease coverage by 0.00%.
The diff coverage is n/a.

@@               Coverage Diff                @@
##           branch-22.06   #10462      +/-   ##
================================================
- Coverage         86.38%   86.38%   -0.01%     
================================================
  Files               142      142              
  Lines             22334    22335       +1     
================================================
  Hits              19294    19294              
- Misses             3040     3041       +1     
Impacted Files Coverage Δ
python/cudf/cudf/utils/gpu_utils.py 50.00% <0.00%> (-4.29%) ⬇️
python/cudf/cudf/core/reshape.py 89.82% <0.00%> (-0.27%) ⬇️
python/cudf/cudf/core/frame.py 93.41% <0.00%> (-0.26%) ⬇️
python/cudf/cudf/comm/gpuarrow.py 79.76% <0.00%> (-0.24%) ⬇️
python/cudf/cudf/core/indexed_frame.py 91.70% <0.00%> (-0.07%) ⬇️
python/cudf/cudf/core/dataframe.py 93.74% <0.00%> (-0.01%) ⬇️
python/cudf/cudf/core/column/column.py 89.45% <0.00%> (ø)
python/cudf/cudf/core/column/categorical.py 89.77% <0.00%> (ø)
python/cudf/cudf/core/column/string.py 89.22% <0.00%> (+0.12%) ⬆️
python/cudf/cudf/core/groupby/groupby.py 91.64% <0.00%> (+0.15%) ⬆️
... and 3 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9409559...0fcbb23. Read the comment docs.

@github-actions github-actions bot removed the Python Affects Python cuDF API. label Apr 19, 2022
@bdice bdice removed request for a team and charlesbluca April 19, 2022 20:15
Comment on lines +371 to +374
// Read a 4-byte value from the data pointer as individual bytes for safe
// unaligned access (very likely for string types).
auto block = reinterpret_cast<uint8_t const*>(data + offset);
return block[0] | (block[1] << 8) | (block[2] << 16) | (block[3] << 24);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Future PR, but this would be more efficient to round up to the nearest 4B aligned address and do a 4B load and then shift the bits as appropriate to get them back in the expected order.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'd need to do two 4B loads and shift both results together if you were reading 4B aligned addresses, right? This is meant to handle the case where the 4 bytes are not aligned (could cross 4B alignment boundaries).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, you'd have to do two when it's not already aligned.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have code for this sitting around in various places.

Copy link
Contributor Author

@bdice bdice Apr 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok great. I'll figure that out in a future PR -- there are still other changes I have in mind (e.g. device_span<std::byte>), but this PR is needed to unblock other work so I'm trying to limit its scope.

Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few minor thoughts, but LGTM.

cpp/src/io/parquet/chunk_dict.cu Outdated Show resolved Hide resolved
@@ -117,8 +118,8 @@ std::unique_ptr<detail::merge_pairs_map_type> initialize_merge_pairs_map(

merge_pairs_map->insert(iter,
iter + input.size(),
cuco::detail::MurmurHash3_32<hash_value_type>{},
thrust::equal_to<hash_value_type>{},
cuco::detail::MurmurHash3_32<cudf::hash_value_type>{},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of scope for this PR (really, out of scope for any PR in this repository), but it seems bad that we're relying on a cuco detail namespace here. @jrhemstad do we need MurmurHash3_32 to be exposed more publicly in cuco? If we expect callers to use it as a provided hash function then it shouldn't be detail.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree this is awkward / undesirable. This may be resolved or improved by #10401.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, no. The issue is that we need to provide a stream and that requires re-supplying the other defaulted arguments. https://github.com/NVIDIA/cuCollections/blob/fb58a38701f1c24ecfe07d8f1f208bbe80930da5/include/cuco/static_map.cuh#L224-L231

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, but IMO it's a design issue in cuco if supplying a stream while using the default hash requires pulling the default hash out of a detail namespace.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with that. (Sorry, the "Actually, no" was about whether #10401 would improve this situation. It would not.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. I think in the long run we probably don't want to be able to provide the hash function and equality comparator as template parameters of these methods, but rather as parameters of the constructor. It doesn't really make sense to be able to insert and query with different ones. Unfortunately we currently abuse this ability in libcudf, so I don't think removing it is feasible in the short term, but in the longer term getting rid of this would make it easier to provide streams without having this problem since the hash/equality operators would be defined on construction and the user wouldn't need to provide those templates unless they wanted to override the default.

Co-authored-by: Vyas Ramasubramani <[email protected]>
@bdice
Copy link
Contributor Author

bdice commented Apr 20, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit c8c7271 into rapidsai:branch-22.06 Apr 20, 2022
rapids-bot bot pushed a commit to rapidsai/cugraph that referenced this pull request Apr 26, 2022
Updated as part of rapidsai/cudf#10462

Update namespace for default_hash.

Also update a python call that was changed in cudf.

Authors:
  - Chuck Hastings (https://github.com/ChuckHastings)

Approvers:
  - Seunghwa Kang (https://github.com/seunghwak)
  - Joseph Nke (https://github.com/jnke2016)
  - Rick Ratzel (https://github.com/rlratzel)

URL: #2244
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking Breaking change improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants