[REVIEW] Spark Murmur3 hash functionality #7024

Merged 28 commits on Jan 4, 2021

@rwlee rwlee commented Dec 16, 2020

Resolves #6863

Expands the existing Murmur3 hashing functionality to match Spark's Murmur3 hashing algorithm. Two changes are involved: tail processing for unaligned bytes is modified, and booleans are processed as 32-bit integers rather than as single bytes.
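To make the two differences concrete, here is a minimal host-side sketch. It is not the libcudf device code from this PR: every function name below is hypothetical, and the Spark tail handling reflects my understanding of Spark's `Murmur3_x86_32.hashUnsafeBytes`, where each leftover byte is sign-extended to an int and mixed as a full block. The sketch assumes a little-endian host.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Illustrative helpers only; names are hypothetical and not the libcudf API.
inline std::uint32_t rotl32(std::uint32_t x, int r) { return (x << r) | (x >> (32 - r)); }

inline std::uint32_t mix_k1(std::uint32_t k1)
{
  k1 *= 0xcc9e2d51u;
  k1 = rotl32(k1, 15);
  return k1 * 0x1b873593u;
}

inline std::uint32_t mix_h1(std::uint32_t h1, std::uint32_t k1)
{
  h1 ^= k1;
  h1 = rotl32(h1, 13);
  return h1 * 5u + 0xe6546b64u;
}

inline std::uint32_t fmix32(std::uint32_t h)
{
  h ^= h >> 16;
  h *= 0x85ebca6bu;
  h ^= h >> 13;
  h *= 0xc2b2ae35u;
  h ^= h >> 16;
  return h;
}

// Shared body: mix every complete 4-byte block (assumes a little-endian host).
inline std::uint32_t mix_blocks(const unsigned char* data, std::size_t nblocks, std::uint32_t h1)
{
  for (std::size_t i = 0; i < nblocks; ++i) {
    std::uint32_t k1;
    std::memcpy(&k1, data + 4 * i, sizeof(k1));
    h1 = mix_h1(h1, mix_k1(k1));
  }
  return h1;
}

// Reference MurmurHash3_x86_32: the 1-3 leftover bytes are packed into one
// partial word, which is mixed with mix_k1 only and XORed into the state.
inline std::uint32_t murmur3_standard(const void* key, std::size_t len, std::uint32_t seed)
{
  auto const* data          = static_cast<const unsigned char*>(key);
  std::size_t const nblocks = len / 4;
  std::uint32_t h1          = mix_blocks(data, nblocks, seed);
  unsigned char const* tail = data + 4 * nblocks;
  std::size_t const rem     = len % 4;
  if (rem > 0) {
    std::uint32_t k1 = 0;
    for (std::size_t i = rem; i-- > 0;) { k1 = (k1 << 8) | tail[i]; }
    h1 ^= mix_k1(k1);
  }
  return fmix32(h1 ^ static_cast<std::uint32_t>(len));
}

// Spark-style variant (assumption, see lead-in): each leftover byte is
// sign-extended to 32 bits and run through the full block mix on its own.
inline std::uint32_t murmur3_spark(const void* key, std::size_t len, std::uint32_t seed)
{
  auto const* data          = static_cast<const unsigned char*>(key);
  std::size_t const nblocks = len / 4;
  std::uint32_t h1          = mix_blocks(data, nblocks, seed);
  for (std::size_t i = 4 * nblocks; i < len; ++i) {
    auto const k1 = static_cast<std::uint32_t>(
      static_cast<std::int32_t>(static_cast<std::int8_t>(data[i])));
    h1 = mix_h1(h1, mix_k1(k1));
  }
  return fmix32(h1 ^ static_cast<std::uint32_t>(len));
}

// Booleans are widened to a 32-bit integer before hashing, not hashed as one byte.
inline std::uint32_t hash_bool_spark(bool value, std::uint32_t seed)
{
  std::uint32_t const widened = value ? 1u : 0u;
  return murmur3_spark(&widened, sizeof(widened), seed);
}
```

For inputs whose length is a multiple of four the two variants agree, since only the tail differs. A boolean hashed through `hash_bool_spark` differs from hashing the same value as a single byte because the length folded into the final mix is 4 rather than 1.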

@rwlee rwlee added the 3 - Ready for Review, libcudf, Spark, improvement, and non-breaking labels Dec 16, 2020
@rwlee rwlee requested a review from a team as a code owner December 16, 2020 22:28
@rwlee rwlee changed the title from "[REVIEW]" to "[REVIEW] Spark Murmur3 hash functionality" Dec 16, 2020
codecov bot commented Dec 17, 2020

Codecov Report

Merging #7024 (a4e95fe) into branch-0.18 (ca1a4d6) will increase coverage by 0.02%.
The diff coverage is n/a.

@@               Coverage Diff               @@
##           branch-0.18    #7024      +/-   ##
===============================================
+ Coverage        82.09%   82.11%   +0.02%     
===============================================
  Files               97       97              
  Lines            16474    16477       +3     
===============================================
+ Hits             13524    13530       +6     
+ Misses            2950     2947       -3     
| Impacted Files | Coverage Δ |
|---|---|
| python/cudf/cudf/_fuzz_testing/fuzzer.py | 0.00% <0.00%> (ø) |
| python/cudf/cudf/utils/hash_vocab_utils.py | 100.00% <0.00%> (ø) |
| python/cudf/cudf/core/abc.py | 91.48% <0.00%> (+4.25%) ⬆️ |
| python/cudf/cudf/utils/gpu_utils.py | 58.53% <0.00%> (+4.87%) ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@nvdbaranec nvdbaranec self-requested a review December 18, 2020 19:17
@harrism harrism added the 6 - Okay to Auto-Merge label and removed the 3 - Ready for Review label Jan 4, 2021
@rapids-bot rapids-bot bot merged commit 8860baf into rapidsai:branch-0.18 Jan 4, 2021
rapids-bot bot pushed a commit that referenced this pull request Mar 24, 2021
#7024 added a Spark variant of Murmur3 hashing, but it is inconsistent with Apache Spark's hash calculations in a few areas:
- `-0.0` and `0.0` are not treated the same by Apache Spark for floats and doubles
- byte and short integral values are upcast to a 32-bit unsigned int (i.e., zero-filled) before calculating the hash
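Both points in this list come down to which 32-bit pattern is fed into the hash. The sketch below uses hypothetical helper names and is not part of libcudf or Spark. It shows that `0.0f` and `-0.0f` compare equal yet have different IEEE-754 bit patterns, so a hash over the raw bits tells them apart, and that zero-filling and sign-extending a negative byte produce different 32-bit words. Which widening matches Spark for a given type should be checked against Spark itself; the code only illustrates that the choice changes the hash input.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helpers for illustration; not part of libcudf or Spark.

// Raw IEEE-754 bit pattern of a float, i.e. what a bitwise hash would consume.
inline std::uint32_t float_bits(float f)
{
  std::uint32_t bits;
  std::memcpy(&bits, &f, sizeof(bits));
  return bits;
}

// Zero-filled widening: the upper 24 bits are always 0.
inline std::uint32_t widen_zero_filled(std::int8_t v)
{
  return static_cast<std::uint32_t>(static_cast<std::uint8_t>(v));
}

// Sign-extended widening: the upper 24 bits copy the sign bit,
// as in Java's implicit byte -> int promotion.
inline std::uint32_t widen_sign_extended(std::int8_t v)
{
  return static_cast<std::uint32_t>(static_cast<std::int32_t>(v));
}
```

For non-negative bytes the two widenings agree, so a mismatch of this kind only shows up on negative inputs.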

In addition, libcudf allows hashing of timestamp columns, but the JNI bindings asserted when timestamp columns were passed in, which prevented hashing on timestamps directly.

Authors:
  - Jason Lowe (@jlowe)

Approvers:
  - Nghia Truong (@ttnghia)
  - Jake Hemstad (@jrhemstad)
  - Alessandro Bellina (@abellina)
  - MithunR (@mythrocks)
  - Robert (Bobby) Evans (@revans2)

URL: #7672
Linked issue: [FEA] Murmur3 that matches spark hashing for partitioning