[REVIEW] Spark Murmur3 hash functionality #7024

rwlee · 2020-12-16T22:28:43Z

Resolves #6863

Expands the existing Murmur3 hashing functionality to match Spark's Murmur3 hashing algorithm by changing how unaligned tail bytes are processed and by hashing booleans as 32-bit integers rather than as single bytes.
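To make the two differences concrete, here is a minimal pure-Python sketch, not the libcudf kernel. It assumes the tail handling of Apache Spark's `Murmur3_x86_32.hashUnsafeBytes` as I understand it: each leftover byte is sign-extended and run through the full block mix, whereas reference Murmur3 folds the tail into a single `k1`. The function names and the default seed of 42 are illustrative choices, not identifiers from this PR.

```python
import struct

MASK32 = 0xFFFFFFFF
C1 = 0xCC9E2D51
C2 = 0x1B873593


def rotl32(x, r):
    """Rotate a 32-bit value left by r bits."""
    return ((x << r) | (x >> (32 - r))) & MASK32


def mix_k1(k1):
    k1 = (k1 * C1) & MASK32
    k1 = rotl32(k1, 15)
    return (k1 * C2) & MASK32


def mix_h1(h1, k1):
    h1 ^= k1
    h1 = rotl32(h1, 13)
    return (h1 * 5 + 0xE6546B64) & MASK32


def fmix(h1, length):
    """Final avalanche step; also mixes in the input length in bytes."""
    h1 ^= length
    h1 ^= h1 >> 16
    h1 = (h1 * 0x85EBCA6B) & MASK32
    h1 ^= h1 >> 13
    h1 = (h1 * 0xC2B2AE35) & MASK32
    h1 ^= h1 >> 16
    return h1


def _hash_blocks(data, seed):
    """Mix all complete little-endian 4-byte blocks; shared by both variants."""
    h1 = seed & MASK32
    aligned = len(data) - len(data) % 4
    for i in range(0, aligned, 4):
        (k1,) = struct.unpack_from("<I", data, i)
        h1 = mix_h1(h1, mix_k1(k1))
    return h1, aligned


def murmur3_32(data, seed=0):
    """Reference Murmur3 (x86, 32-bit): the tail is folded into one k1."""
    h1, aligned = _hash_blocks(data, seed)
    k1 = 0
    for shift, byte in enumerate(data[aligned:]):
        k1 |= byte << (8 * shift)
    if len(data) > aligned:
        h1 ^= mix_k1(k1)  # note: no rotate/multiply of h1 for the tail
    return fmix(h1, len(data))


def spark_murmur3_32(data, seed=42):
    """Spark-style variant (assumed): each tail byte gets a full block mix."""
    h1, aligned = _hash_blocks(data, seed)
    for byte in data[aligned:]:
        signed = byte - 256 if byte >= 128 else byte  # sign-extend like a Java byte
        h1 = mix_h1(h1, mix_k1(signed & MASK32))
    return fmix(h1, len(data))


def spark_hash_bool(value, seed=42):
    """Hash a boolean as a 4-byte integer (1 or 0) rather than as one byte."""
    return spark_murmur3_32(struct.pack("<i", 1 if value else 0), seed)
```

For inputs whose length is a multiple of four bytes the two functions agree, since only the tail handling differs. Hashing a boolean as a 32-bit integer keeps it on that aligned path.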

codecov · 2020-12-17T01:30:36Z

Codecov Report

Merging #7024 (a4e95fe) into branch-0.18 (ca1a4d6) will increase coverage by 0.02%.
The diff coverage is n/a.

@@               Coverage Diff               @@
##           branch-0.18    #7024      +/-   ##
===============================================
+ Coverage        82.09%   82.11%   +0.02%     
===============================================
  Files               97       97              
  Lines            16474    16477       +3     
===============================================
+ Hits             13524    13530       +6     
+ Misses            2950     2947       -3

Impacted Files	Coverage Δ
python/cudf/cudf/_fuzz_testing/fuzzer.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/utils/hash_vocab_utils.py	`100.00% <0.00%> (ø)`
python/cudf/cudf/core/abc.py	`91.48% <0.00%> (+4.25%)`	⬆️
python/cudf/cudf/utils/gpu_utils.py	`58.53% <0.00%> (+4.87%)`	⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@jlowe

#7024 added a Spark variant of Murmur3 hashing, but it is inconsistent with Apache Spark's hash calculations in a few areas:

- `-0.0` and `0.0` are not treated the same by Apache Spark for floats and doubles
- byte and short integral values are upcast to a 32-bit unsigned int (i.e.: zero-filled) before calculating the hash

In addition libcudf allows hashing of timestamp columns but the JNI bindings asserted if timestamp columns were passed in, disabling the ability to hash on timestamps directly.

Authors:
- Jason Lowe (@jlowe)

Approvers:
- Nghia Truong (@ttnghia)
- Jake Hemstad (@jrhemstad)
- Alessandro Bellina (@abellina)
- MithunR (@mythrocks)
- Robert (Bobby) Evans (@revans2)

URL: #7672
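Both inconsistencies come down to which 32-bit (or 64-bit) pattern is handed to the hash. The sketch below is plain Python with hypothetical helper names, not code from either PR. It shows only that `-0.0` and `0.0` have different raw IEEE 754 bit patterns, and that sign-filling and zero-filling a narrow integer give different 32-bit values for negative inputs. It does not reproduce the actual libcudf or Spark code paths.

```python
import struct


def float_bits(value):
    """Raw IEEE 754 single-precision bits as an unsigned 32-bit int."""
    return struct.unpack("<I", struct.pack("<f", value))[0]


def double_bits(value):
    """Raw IEEE 754 double-precision bits as an unsigned 64-bit int."""
    return struct.unpack("<Q", struct.pack("<d", value))[0]


def sign_extend_to_32(value):
    """Widen a signed integer to a 32-bit pattern, filling with the sign bit."""
    return value & 0xFFFFFFFF


def zero_extend_to_32(value, bits):
    """Widen the low `bits` of a value to a 32-bit pattern, filling with zeros."""
    return value & ((1 << bits) - 1)


# The two zeros compare equal as numbers but differ in the sign bit.
print(hex(float_bits(0.0)), hex(float_bits(-0.0)))

# A negative byte widens to different patterns under the two conventions.
print(hex(sign_extend_to_32(-1)), hex(zero_extend_to_32(-1, 8)))
```

Non-negative bytes and shorts widen to the same pattern under both conventions, so a mismatch of this kind only shows up in tests that include negative values and signed zeros.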

rwlee and others added 24 commits November 16, 2020 16:33

Serial murmur3 hash with configurable seed

ab25cc8

murmur3 testing, JNI, and kernel fix

999adcf

Merge remote-tracking branch 'pub/branch-0.17' into rwlee/sparkmurmur3

4f6b7ed

Update python API

13ed6ce

Fix md5 rebase error

da8fbb1

Fix python cudf hash function mapping

de90192

update changelog

66ca93d

Merge branch 'branch-0.17' into rwlee/sparkmurmur3

cbd7f3f

resolve rebase switch of stream and mr arg order

c8a724f

PR fixes, first set

62c69df

Merge remote-tracking branch 'pub/branch-0.17' into rwlee/sparkmurmur3

f47652d

fix java tests

5f1f6e5

Merge remote-tracking branch 'pub/branch-0.17' into rwlee/sparkmurmur3

638eda2

Reconfigure thrust calls

63e9eea

first pass, spark specific hash

cdbd5bb

Fix tail processing

be831d7

Merge branch 'branch-0.18' into rwlee/sparkspecific

cdc41b8

remove extra python def

9364601

Fix tail processing and update tests

5197578

Merge branch 'branch-0.18' into rwlee/sparkspecific

be53a1e

fix cast and const formatting

f884e96

Merge branch 'branch-0.18' into rwlee/sparkspecific

ebec0c3

intermediate state, last fixes

4cd4359

Fix boolean hashing

93cb82c

rwlee added 3 - Ready for Review, libcudf, Spark, improvement, non-breaking labels Dec 16, 2020

rwlee requested a review from a team as a code owner December 16, 2020 22:28

rwlee requested review from isVoid and rgsl888prabhu December 16, 2020 22:28

formatting fix

2ec035c

rwlee changed the title ~~[REVIEW]~~ [REVIEW] Spark Murmur3 hash functionality Dec 16, 2020

rwlee mentioned this pull request Dec 16, 2020

Spark SQL hash function using murmur3 NVIDIA/spark-rapids#1207

Merged

revans2 reviewed Dec 17, 2020

java/src/main/java/ai/rapids/cudf/ColumnVector.java (outdated, resolved)

java/src/main/java/ai/rapids/cudf/ColumnVector.java (resolved)

jrhemstad reviewed Dec 17, 2020

cpp/include/cudf/detail/utilities/hash_functions.cuh (outdated, resolved)

jrhemstad reviewed Dec 17, 2020

cpp/include/cudf/detail/utilities/hash_functions.cuh (outdated, resolved)

jrhemstad requested changes Dec 17, 2020

sameerz mentioned this pull request Dec 17, 2020

[FEA] have murmur3 hash function that matches exactly with spark NVIDIA/spark-rapids#937

Closed

remove host callability

0bad7c0

nvdbaranec requested changes Dec 18, 2020

cpp/include/cudf/detail/utilities/hash_functions.cuh (resolved)

cpp/include/cudf/detail/utilities/hash_functions.cuh (outdated, resolved)

cpp/include/cudf/detail/utilities/hash_functions.cuh (resolved)

jrhemstad approved these changes Dec 18, 2020

trailing const formatting

4b6db38

nvdbaranec self-requested a review December 18, 2020 19:17

nvdbaranec approved these changes Dec 18, 2020

Merge branch 'branch-0.18' into rwlee/sparkspecific

a4e95fe

revans2 approved these changes Jan 4, 2021

galipremsagar approved these changes Jan 4, 2021

harrism removed request for harrism, isVoid and rgsl888prabhu January 4, 2021 20:33

harrism added 6 - Okay to Auto-Merge and removed 3 - Ready for Review labels Jan 4, 2021

rapids-bot bot merged commit 8860baf into rapidsai:branch-0.18 Jan 4, 2021

jlowe mentioned this pull request Mar 22, 2021

Fix SparkMurmurHash3_32 hash inconsistencies with Apache Spark #7672

Merged
