Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spark Decimal128 hashing #9919

Merged
merged 22 commits into from
Jan 20, 2022
Merged

Conversation

rwlee
Copy link
Contributor

@rwlee rwlee commented Dec 16, 2021

cudf work for NVIDIA/spark-rapids#3878

Shortens the hashed data by removing preceding zero values -- ensuring the leave a sign bit -- and flipping the endianness before hashing the value.

@rwlee rwlee requested a review from a team as a code owner December 16, 2021 07:07
@rwlee rwlee requested review from harrism and ttnghia December 16, 2021 07:07
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Dec 16, 2021
@rwlee rwlee added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change Spark Functionality that helps Spark RAPIDS 3 - Ready for Review Ready for review by team labels Dec 16, 2021
@codecov
Copy link

codecov bot commented Dec 16, 2021

Codecov Report

Merging #9919 (7fe8405) into branch-22.02 (967a333) will decrease coverage by 0.08%.
The diff coverage is n/a.

Impacted file tree graph

@@               Coverage Diff                @@
##           branch-22.02    #9919      +/-   ##
================================================
- Coverage         10.49%   10.40%   -0.09%     
================================================
  Files               119      119              
  Lines             20305    20557     +252     
================================================
+ Hits               2130     2139       +9     
- Misses            18175    18418     +243     
Impacted Files Coverage Δ
python/custreamz/custreamz/kafka.py 29.16% <0.00%> (-0.63%) ⬇️
python/dask_cudf/dask_cudf/sorting.py 92.66% <0.00%> (-0.25%) ⬇️
python/dask_cudf/dask_cudf/core.py 70.85% <0.00%> (-0.17%) ⬇️
python/cudf/cudf/__init__.py 0.00% <0.00%> (ø)
python/cudf/cudf/api/types.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/frame.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/index.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/parquet.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/dtypes.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/scalar.py 0.00% <0.00%> (ø)
... and 32 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f041034...7fe8405. Read the comment docs.

@jrhemstad jrhemstad requested a review from bdice December 17, 2021 17:03
Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some early feedback to get discussions going. I'll need to think some more and revisit this PR.

cpp/include/cudf/detail/utilities/hash_functions.cuh Outdated Show resolved Hide resolved
cpp/include/cudf/detail/utilities/hash_functions.cuh Outdated Show resolved Hide resolved
cpp/include/cudf/detail/utilities/hash_functions.cuh Outdated Show resolved Hide resolved
cpp/tests/hashing/hash_test.cpp Outdated Show resolved Hide resolved
@rwlee rwlee requested review from bdice and ttnghia January 6, 2022 01:43
@harrism
Copy link
Member

harrism commented Jan 12, 2022

I left a comment that got lost in a resolved thread:

What about std::byte, using std::to_integer to convert to the integer type needed at any point where we need computation other than the operators supported on std::byte (only supports bitwise operations)?

@bdice
Copy link
Contributor

bdice commented Jan 12, 2022

I left a comment that got lost in a resolved thread:

What about std::byte, using std::to_integer to convert to the integer type needed at any point where we need computation other than the operators supported on std::byte (only supports bitwise operations)?

Using std::byte is a great idea, thank you for the suggestion. For scoping purposes, I would propose using std::byte in the decimal128-specific code (shortening) and in the signature of compute_bytes in this PR. We can make that change for the rest of the hash function implementation(s) along with handling some of the other considerations I outlined in my review in a follow-up PR.

@rwlee
Copy link
Contributor Author

rwlee commented Jan 18, 2022

Using std::byte is a great idea, thank you for the suggestion. For scoping purposes, I would propose using std::byte in the decimal128-specific code (shortening) and in the signature of compute_bytes in this PR. We can make that change for the rest of the hash function implementation(s) along with handling some of the other considerations I outlined in my review in a follow-up PR.

For the code/kernels I'm touching in this PR I've made the std::byte changes -- can you take a look?

Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(review in progress, to be continued)

cpp/include/cudf/detail/utilities/hash_functions.cuh Outdated Show resolved Hide resolved
cpp/include/cudf/detail/utilities/hash_functions.cuh Outdated Show resolved Hide resolved
cpp/include/cudf/detail/utilities/hash_functions.cuh Outdated Show resolved Hide resolved
cpp/include/cudf/detail/utilities/hash_functions.cuh Outdated Show resolved Hide resolved
cpp/include/cudf/detail/utilities/hash_functions.cuh Outdated Show resolved Hide resolved
cpp/include/cudf/detail/utilities/hash_functions.cuh Outdated Show resolved Hide resolved
cpp/include/cudf/detail/utilities/hash_functions.cuh Outdated Show resolved Hide resolved
cpp/include/cudf/detail/utilities/hash_functions.cuh Outdated Show resolved Hide resolved
Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only one other comment for now.

cpp/include/cudf/detail/utilities/hash_functions.cuh Outdated Show resolved Hide resolved
@bdice bdice self-requested a review January 19, 2022 23:29
Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Final round of suggestions. These are non-blocking and can all be punted to a follow-up PR if needed.

cuDF CI will be unstuck once #10008 is merged, but these suggestions might be good to try locally until that PR is merged, and hopefully CI will pass before code freeze.

cpp/include/cudf/detail/utilities/hash_functions.cuh Outdated Show resolved Hide resolved
cpp/include/cudf/detail/utilities/hash_functions.cuh Outdated Show resolved Hide resolved
cpp/include/cudf/detail/utilities/hash_functions.cuh Outdated Show resolved Hide resolved
cpp/include/cudf/detail/utilities/hash_functions.cuh Outdated Show resolved Hide resolved
@rwlee
Copy link
Contributor Author

rwlee commented Jan 20, 2022

Ready to merge as soon as CI passes

@bdice bdice assigned bdice and rwlee Jan 20, 2022
@bdice bdice added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels Jan 20, 2022
@rwlee
Copy link
Contributor Author

rwlee commented Jan 20, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit c00f42b into rapidsai:branch-22.02 Jan 20, 2022
rapids-bot bot pushed a commit that referenced this pull request Feb 7, 2022
Followup to #9919 -- kernel merging and code cleanup for Murmur3 hash.

Partial fix for #10081.

Benchmarked `compute_bytes` kernel with aligned read vs unaligned read and saw no difference. Looking into it further to confirm that the `uint32_t` construction was doing the same thing implicitly.

Due to byte alignment, the string alignment will require the `getblock32` function regardless. Regardless, the benchmarks ran with 100, 103, and 104 byte strings had negligible performance differences. This reflects forced misalignment not negatively impacting the hash speed.

Authors:
  - Ryan Lee (https://github.com/rwlee)
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Christopher Harris (https://github.com/cwharris)

URL: #10143
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Spark Functionality that helps Spark RAPIDS
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants