Spark Decimal128 hashing #9919

rwlee · 2021-12-16T07:07:24Z

Shortens the data to be hashed by removing leading zero values (while ensuring that a sign bit is left in place) and flips the endianness before hashing the value.
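As an illustration only, here is a minimal host-side sketch of that shortening step. It assumes the goal is the minimal big-endian two's complement form (the layout Java's `BigInteger.toByteArray` produces), and that for negative values the redundant leading bytes are `0xFF` sign-extension bytes rather than zeros. Neither assumption is stated in this description. The names `to_le_bytes` and `shorten_and_reverse` are hypothetical and are not the functions added in this PR.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: sign-extend a 64-bit value into 16 little-endian bytes,
// standing in for the storage of a 128-bit decimal's unscaled value.
std::array<std::byte, 16> to_le_bytes(std::int64_t v)
{
  std::array<std::byte, 16> out{};
  auto const u = static_cast<std::uint64_t>(v);
  for (std::size_t i = 0; i < 8; ++i) {
    out[i] = static_cast<std::byte>((u >> (8 * i)) & 0xFFu);
  }
  std::byte const fill = v < 0 ? std::byte{0xFF} : std::byte{0x00};
  for (std::size_t i = 8; i < 16; ++i) { out[i] = fill; }
  return out;
}

// Hypothetical helper: drop redundant most-significant bytes, keeping one byte
// that still carries the sign bit, then emit the remainder in big-endian order.
std::vector<std::byte> shorten_and_reverse(std::array<std::byte, 16> const& le)
{
  bool const negative   = (std::to_integer<unsigned>(le[15]) & 0x80u) != 0;
  std::byte const fill  = negative ? std::byte{0xFF} : std::byte{0x00};
  std::size_t length    = le.size();

  // A top byte is redundant only if it is pure sign extension AND the byte
  // below it already has the same sign bit.
  while (length > 1 && le[length - 1] == fill &&
         ((std::to_integer<unsigned>(le[length - 2]) & 0x80u) != 0) == negative) {
    --length;
  }

  // Flip endianness: most-significant retained byte first.
  std::vector<std::byte> out(length);
  for (std::size_t i = 0; i < length; ++i) { out[i] = le[length - 1 - i]; }
  return out;
}
```

Under these assumptions, 128 keeps two bytes (`00 80`) because dropping the zero byte would make the value look negative, while -1 collapses to the single byte `FF`.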

codecov · 2021-12-16T10:14:32Z

Codecov Report

Merging #9919 (7fe8405) into branch-22.02 (967a333) will decrease coverage by 0.08%.
The diff coverage is n/a.

@@               Coverage Diff                @@
##           branch-22.02    #9919      +/-   ##
================================================
- Coverage         10.49%   10.40%   -0.09%     
================================================
  Files               119      119              
  Lines             20305    20557     +252     
================================================
+ Hits               2130     2139       +9     
- Misses            18175    18418     +243

| Impacted Files | Coverage Δ | |
|---|---|---|
| python/custreamz/custreamz/kafka.py | `29.16% <0.00%> (-0.63%)` | ⬇️ |
| python/dask_cudf/dask_cudf/sorting.py | `92.66% <0.00%> (-0.25%)` | ⬇️ |
| python/dask_cudf/dask_cudf/core.py | `70.85% <0.00%> (-0.17%)` | ⬇️ |
| python/cudf/cudf/__init__.py | `0.00% <0.00%> (ø)` | |
| python/cudf/cudf/api/types.py | `0.00% <0.00%> (ø)` | |
| python/cudf/cudf/core/frame.py | `0.00% <0.00%> (ø)` | |
| python/cudf/cudf/core/index.py | `0.00% <0.00%> (ø)` | |
| python/cudf/cudf/io/parquet.py | `0.00% <0.00%> (ø)` | |
| python/cudf/cudf/core/dtypes.py | `0.00% <0.00%> (ø)` | |
| python/cudf/cudf/core/scalar.py | `0.00% <0.00%> (ø)` | |

... and 32 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update f041034...7fe8405.

cpp/include/cudf/detail/utilities/hash_functions.cuh

cpp/tests/hashing/hash_test.cpp

bdice

Some early feedback to get discussions going. I'll need to think some more and revisit this PR.

cpp/include/cudf/detail/utilities/hash_functions.cuh

cpp/tests/hashing/hash_test.cpp

cpp/include/cudf/detail/utilities/hash_functions.cuh

harrism · 2022-01-12T00:44:59Z

I left a comment that got lost in a resolved thread:

What about std::byte, using std::to_integer to convert to the integer type needed at any point where we need computation other than the operators supported on std::byte (only supports bitwise operations)?
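To make the suggestion concrete, the sketch below shows the two sides of `std::byte`: bitwise operators work directly on it, while arithmetic requires an explicit `std::to_integer` conversion. The helper names `has_sign_bit` and `load_block32_le` are hypothetical illustrations, not code from this PR.

```cpp
#include <cstddef>
#include <cstdint>

// Bitwise operations are supported on std::byte directly.
bool has_sign_bit(std::byte b)
{
  return (b & std::byte{0x80}) != std::byte{0x00};
}

// Anything beyond bitwise operators needs an integer type, so each byte is
// converted with std::to_integer before shifting into a wider value.
// Assembles four bytes as a little-endian 32-bit block.
std::uint32_t load_block32_le(std::byte const* data)
{
  return std::to_integer<std::uint32_t>(data[0]) |
         (std::to_integer<std::uint32_t>(data[1]) << 8) |
         (std::to_integer<std::uint32_t>(data[2]) << 16) |
         (std::to_integer<std::uint32_t>(data[3]) << 24);
}
```

The explicit conversions are the point of the design: a `std::byte` buffer cannot be accidentally treated as characters or numbers, so every place that does arithmetic on raw data is visible in the code.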

bdice · 2022-01-12T02:38:57Z

> I left a comment that got lost in a resolved thread:
>
> What about std::byte, using std::to_integer to convert to the integer type needed at any point where we need computation other than the operators supported on std::byte (only supports bitwise operations)?

Using std::byte is a great idea, thank you for the suggestion. For scoping purposes, I would propose using std::byte in the decimal128-specific code (shortening) and in the signature of compute_bytes in this PR. We can make that change for the rest of the hash function implementation(s) along with handling some of the other considerations I outlined in my review in a follow-up PR.

rwlee · 2022-01-18T19:25:50Z

> Using std::byte is a great idea, thank you for the suggestion. For scoping purposes, I would propose using std::byte in the decimal128-specific code (shortening) and in the signature of compute_bytes in this PR. We can make that change for the rest of the hash function implementation(s) along with handling some of the other considerations I outlined in my review in a follow-up PR.

For the code/kernels I'm touching in this PR I've made the std::byte changes -- can you take a look?

cpp/include/cudf/detail/utilities/hash_functions.cuh

bdice

(review in progress, to be continued)

cpp/include/cudf/detail/utilities/hash_functions.cuh

bdice

Only one other comment for now.

cpp/include/cudf/detail/utilities/hash_functions.cuh

bdice

Final round of suggestions. These are non-blocking and can all be punted to a follow-up PR if needed.

cuDF CI will be unstuck once #10008 is merged, but these suggestions might be good to try locally until that PR is merged, and hopefully CI will pass before code freeze.

cpp/include/cudf/detail/utilities/hash_functions.cuh

rwlee · 2022-01-20T02:14:04Z

Ready to merge as soon as CI passes

rwlee · 2022-01-20T04:04:49Z

@gpucibot merge

Followup to #9919: kernel merging and code cleanup for the Murmur3 hash. Partial fix for #10081.

Benchmarked the `compute_bytes` kernel with aligned reads vs. unaligned reads and saw no difference. Looking into it further to confirm that the `uint32_t` construction was doing the same thing implicitly. Because strings are only guaranteed byte alignment, string hashing will require the `getblock32` function regardless. Benchmarks run with 100-, 103-, and 104-byte strings showed negligible performance differences, which indicates that forced misalignment does not negatively impact hash speed.

Authors:
- Ryan Lee (https://github.com/rwlee)
- Bradley Dice (https://github.com/bdice)

Approvers:
- Bradley Dice (https://github.com/bdice)
- Christopher Harris (https://github.com/cwharris)

URL: #10143
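For readers unfamiliar with the aligned vs. unaligned distinction mentioned in that follow-up, here is a rough host-side sketch of an alignment-agnostic 32-bit read. The function reuses the name `getblock32` for readability, but its signature and body are assumptions for illustration and are not the libcudf implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for a getblock32-style helper: reads four bytes
// starting at any byte offset. Copying through memcpy avoids dereferencing a
// misaligned uint32_t pointer, which is undefined behavior in C++.
std::uint32_t getblock32(std::byte const* data, std::size_t offset)
{
  std::uint32_t block;
  std::memcpy(&block, data + offset, sizeof(block));
  return block;
}

// Aligned counterpart for comparison: valid only when the buffer really is
// 4-byte aligned, which byte-aligned string data cannot guarantee.
std::uint32_t aligned_read32(std::uint32_t const* blocks, std::size_t index)
{
  return blocks[index];
}
```

Both functions return the value in the machine's native byte order, so a caller that needs a fixed byte order would still have to handle that separately.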

rwlee added 5 commits December 7, 2021 10:05

decimal128 hash kernel no tests

c5868a3

Merge remote-tracking branch 'pub/branch-22.02' into rwlee/2202hash

2f51ec4

Fix decimal hashes above 4 bytes

01c1bf5

clean up and merge common code

46698b3

Merge remote-tracking branch 'pub/branch-22.02' into rwlee/2202hash

c3c0d70

rwlee requested a review from a team as a code owner December 16, 2021 07:07

rwlee requested review from harrism and ttnghia December 16, 2021 07:07

github-actions bot added the libcudf (Affects libcudf (C++/CUDA) code) label Dec 16, 2021

rwlee added the improvement (Improvement / enhancement to an existing function), non-breaking (Non-breaking change), Spark (Functionality that helps Spark RAPIDS), and 3 - Ready for Review (Ready for review by team) labels Dec 16, 2021

style and small comment

512c53a

ttnghia reviewed Dec 16, 2021

cpp/include/cudf/detail/utilities/hash_functions.cuh (outdated)

ttnghia reviewed Dec 16, 2021

cpp/include/cudf/detail/utilities/hash_functions.cuh

ttnghia reviewed Dec 17, 2021

cpp/tests/hashing/hash_test.cpp (outdated)

jrhemstad requested a review from bdice December 17, 2021 17:03

const variables

0cc7f3d

bdice requested changes Dec 17, 2021

Document spark hashing and other cleanup

accb551

rwlee requested review from bdice and ttnghia January 6, 2022 01:43

rwlee added 2 commits January 6, 2022 11:57

Merge remote-tracking branch 'pub/branch-22.02' into rwlee/2202hash

680bd85

fix up merge

4fb4922

nartal1 mentioned this pull request Jan 6, 2022

Add in HashPartitioning support for decimal 128 [databricks] NVIDIA/spark-rapids#4470

Merged

ttnghia reviewed Jan 7, 2022

cpp/include/cudf/detail/utilities/hash_functions.cuh (outdated)

rwlee added 3 commits January 14, 2022 14:11

revert kernel changes and fix tests

aa7e08a

Merge remote-tracking branch 'pub/branch-22.02' into rwlee/2202hash

0a0740b

Switch to std::byte

bfbd47e

rwlee added 2 commits January 18, 2022 12:11

Add 16 byte value test

3f2a371

Merge remote-tracking branch 'pub/branch-22.02' into rwlee/2202hash

13c36af

jrhemstad reviewed Jan 19, 2022

cpp/include/cudf/detail/utilities/hash_functions.cuh (outdated)

jrhemstad reviewed Jan 19, 2022

cpp/include/cudf/detail/utilities/hash_functions.cuh (outdated)

bdice reviewed Jan 19, 2022

bdice requested changes Jan 19, 2022

cpp/include/cudf/detail/utilities/hash_functions.cuh (outdated)

rwlee added 2 commits January 19, 2022 15:05

Small cleanup and optimizations

7ef7b00

Merge remote-tracking branch 'pub/branch-22.02' into rwlee/2202hash

0a35b1d

bdice mentioned this pull request Jan 19, 2022

Improvements in hash_functions.cuh #10081

Closed

bdice self-requested a review January 19, 2022 23:29

bdice approved these changes Jan 19, 2022

rwlee added 2 commits January 19, 2022 17:10

Remove need for special case handling

a61c1b3

Merge remote-tracking branch 'pub/branch-22.02' into rwlee/2202hash

7fe8405

bdice assigned bdice and rwlee Jan 20, 2022

bdice added the 5 - Ready to Merge (Testing and reviews complete, ready to merge) label and removed the 3 - Ready for Review (Ready for review by team) label Jan 20, 2022

rapids-bot bot merged commit c00f42b into rapidsai:branch-22.02 Jan 20, 2022

rwlee mentioned this pull request Jan 27, 2022

Murmur3 hash kernel cleanup #10143

Merged
