Refactor hash functions and hash_combine
#10379
Conversation
Codecov Report
@@            Coverage Diff             @@
##           branch-22.04   #10379   +/-   ##
=============================================
  Coverage         10.50%   10.50%
=============================================
  Files               126      127     +1
  Lines             21218    21200    -18
=============================================
  Hits               2228     2228
+ Misses            18990    18972    -18
=============================================
Continue to review full report at Codecov.
Thanks for the detailed description, made it very easy to review the code :)
Looks good 👍
@@ -83,16 +83,15 @@ void __device__ inline uint32ToLowercaseHexString(uint32_t num, char* destinatio
// algorithms are optimized for their respective platforms. You can still
// compile and run any of them on any platform, but your performance with the
// non-native version will be less than optimal.
template <typename Key>
The precedent for C++ hash function objects is that the type itself is a template, not the call operator.
https://en.cppreference.com/w/cpp/utility/hash
This is important if a hash function for different types requires different state or different behavior in the ctor to initialize state.
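For illustration, here is a minimal sketch of the `std::hash`-style design being described, where the hasher type itself is the template and any per-type state is set up in the constructor. The names are hypothetical and this is not libcudf's actual code:

```cpp
#include <cstdint>
#include <string_view>

// Hypothetical sketch: the hasher *type* is templated, as with std::hash<Key>.
// Per-Key state (here, a seed) is initialized in the constructor, and a
// specialization for one Key can carry different state or different setup.
template <typename Key>
struct example_hasher {
  explicit example_hasher(uint32_t seed = 0) : seed_{seed} {}

  uint32_t operator()(Key const& key) const
  {
    // Placeholder mixing, assuming an integral Key for brevity.
    return seed_ ^ static_cast<uint32_t>(key);
  }

 private:
  uint32_t seed_;
};

// A specialization can hold different state or initialize it differently.
template <>
struct example_hasher<std::string_view> {
  explicit example_hasher(uint32_t seed = 0) : seed_{seed} {}

  uint32_t operator()(std::string_view key) const
  {
    uint32_t h = seed_;
    for (char c : key) { h = h * 31u + static_cast<unsigned char>(c); }
    return h;
  }

 private:
  uint32_t seed_;
};
```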
@jrhemstad I wasn't familiar with that, thanks for the reference. Do you think we need to retain that class template behavior, or is this comment just informational to explain the previous design? In practice, we don't have any constructor behavior that varies by key type. It seems like a hash functor with type specializations of `operator()` is more in line with the rest of libcudf's design, unless you think that matching `std::hash` is crucial here.
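For contrast, a rough sketch of the alternative design being suggested here, again with hypothetical names rather than libcudf's actual classes: the functor type is not templated, and the key type is deduced per call by a templated `operator()`.

```cpp
#include <cstdint>

// Hypothetical sketch of the alternative design: one non-template functor
// type, with the key type deduced at each call site via a templated
// operator(). Any state (here, a seed) is shared across key types.
struct example_hasher {
  explicit example_hasher(uint32_t seed = 0) : seed_{seed} {}

  template <typename Key>
  uint32_t operator()(Key const& key) const
  {
    // Placeholder mixing, assuming an integral Key for brevity.
    return seed_ ^ static_cast<uint32_t>(key);
  }

 private:
  uint32_t seed_;
};

// Usage: the same object hashes different key types.
// example_hasher h{42};
// uint32_t a = h(int32_t{7});
// uint32_t b = h(int64_t{7});
```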
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer to keep the class template.
> It seems like a hash functor with type specializations of operator() is more in line with the rest of libcudf's design

Class templates are actually more in line with the "dispatch to invoke" pattern I've advocated for in the past. See https://github.com/rapidsai/cudf/pull/8217/files and the Better Code talk I gave about this pattern.
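For context, a rough sketch of what "dispatch to invoke" looks like with a class-template hasher. This is a simplified toy, not the code from #8217 or libcudf's real `type_dispatcher`; the names and the two-type enum are hypothetical:

```cpp
#include <cstdint>

// Hypothetical "dispatch to invoke" sketch: the dispatcher maps a runtime
// type id to a compile-time Key type, instantiates the class template for
// that type, and invokes it directly.
template <typename Key>
struct typed_hasher {
  uint32_t operator()(Key const& key) const
  {
    return static_cast<uint32_t>(key) * 0x9e3779b9u;  // placeholder mixing
  }
};

enum class type_id { INT32, INT64 };

// Toy stand-in for a type dispatcher (a real one covers all column types);
// the Hasher class template is instantiated and invoked inside the dispatch.
template <template <typename> class Hasher>
uint32_t dispatch_and_invoke(type_id id, void const* data)
{
  switch (id) {
    case type_id::INT32: return Hasher<int32_t>{}(*static_cast<int32_t const*>(data));
    case type_id::INT64: return Hasher<int64_t>{}(*static_cast<int64_t const*>(data));
  }
  return 0;
}

// Usage:
// int32_t v = 42;
// uint32_t h = dispatch_and_invoke<typed_hasher>(type_id::INT32, &v);
```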
LGreatTM 👍
rerun tests
I wanted to check the benchmarks again before merging. Commit 65edea2 shows the same performance as before.
@gpucibot merge
This PR refactors a few pieces of libcudf's hash functions:

- Define `hash_combine` only once (with 32/64 bit overloads), rather than several times in the codebase. A minimal sketch of this kind of combiner follows this description.
- Remove the class template parameter from `MurmurHash3_32` and related classes. This template parameter was redundant: we already use a template for the argument of the `compute` method, which is called by `operator()`, so I put the template parameter on `operator()` instead of the whole class. I think this removal of the template parameter could be considered API-breaking, so I added the `breaking` label. (I retracted this change after conversation with @jrhemstad. I'll look into a different way to do this soon, using a dispatch-to-invoke approach as in "Add dispatch_to_invoke for better type dispatching" #8217.)

This addresses part of issue #10081. I have a few more things I'd like to try, but this felt like a nicely scoped PR, so I stopped here for the moment.
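As referenced above, here is a minimal sketch of a Boost-style `hash_combine` with 32- and 64-bit overloads. The constants and mixing formula follow the widely used Boost recipe and are illustrative; they may differ in detail from libcudf's actual implementation.

```cpp
#include <cstdint>

// Boost-style hash_combine, sketched with 32- and 64-bit overloads.
// In device code these would additionally be marked __host__ __device__.
constexpr uint32_t hash_combine(uint32_t lhs, uint32_t rhs)
{
  return lhs ^ (rhs + 0x9e3779b9u + (lhs << 6) + (lhs >> 2));
}

constexpr uint64_t hash_combine(uint64_t lhs, uint64_t rhs)
{
  return lhs ^ (rhs + 0x9e3779b97f4a7c15ull + (lhs << 6) + (lhs >> 2));
}

// Usage: fold per-column hash values into a single row hash.
// uint32_t row_hash = hash_combine(hash_combine(0u, col0_hash), col1_hash);
```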
Benchmarking info (outdated)
I benchmarked the code before and after making these changes and saw a small but consistent decrease in runtime.

The benchmarks in `HashBenchmark/{HASH_MURMUR3,HASH_SERIAL_MURMUR3,HASH_SPARK_MURMUR3}_{nulls,no_nulls}/*` all decreased or saw no change in runtime, with a geometric mean of 2.87% less time. The benchmarks in `Hashing/hash_partition/*` all decreased or saw no change in runtime, with a geometric mean of 2.37% less time.

For both sets of benchmarks, the largest data sizes saw more significant decreases in runtime, with a best improvement of 7.38% less time in `HashBenchmark/HASH_MURMUR3_nulls/16777216` (similar for other large data sizes) and a best improvement of 10.54% less time in `Hashing/hash_partition/1048576/256/64` (similar for other large data sizes).

See the comment below for updated benchmarks.