[REVIEW] Add MD5 to existing hashing functionality #5438

rwlee · 2020-06-10T19:21:07Z

Resolves #4989

Refactors hashing support in preparation for MD5 support.

lmeyerov · 2020-07-15T21:15:19Z

@rwlee we may have a project around this, worth syncing?

GPUtester · 2020-07-21T07:53:17Z

Please update the changelog in order to start CI tests.

View the gpuCI docs here.

vuule

Looks pretty good.
Some mismatches in code style: lack of auto use, west const, C-style casts. Not a big deal, but it would be great to keep the code consistent.

vuule · 2020-07-22T04:29:55Z

cpp/include/cudf/detail/utilities/hash_functions.cuh

+  uint32_t C = hash_state->hash_value[2];
+  uint32_t D = hash_state->hash_value[3];
+
+  uint32_t* buffer_ints = (uint32_t*)hash_state->buffer;


use reinterpret_cast

Switched to a memcpy

cudf/cpp/include/cudf/detail/utilities/hash_functions.cuh

Line 88 in 3b68a9d

std::memcpy(&buffer_element_as_int, hash_state->buffer + g * 4, 4);
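Why a memcpy rather than a pointer cast can be sketched in plain host C++. The helper name `load_u32` and the test buffer are assumptions for illustration and are not code from this PR. The sketch only shows that copying four bytes into a `uint32_t` sidesteps the alignment and aliasing concerns of dereferencing a cast pointer.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not from the PR): read the word_index-th 32-bit word
// out of a byte buffer without casting the buffer pointer.
inline std::uint32_t load_u32(std::uint8_t const* buffer, std::size_t word_index)
{
  std::uint32_t value = 0;
  // memcpy is well-defined even when buffer + word_index * 4 is not 4-byte aligned.
  std::memcpy(&value, buffer + word_index * 4, sizeof(value));
  return value;
}
```

Compilers typically lower a fixed-size memcpy like this to a single load, so the safer form usually costs nothing.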

vuule · 2020-07-22T04:31:50Z

cpp/include/cudf/detail/utilities/hash_functions.cuh

+                                         const md5_hash_constants_type* hash_constants,
+                                         const md5_shift_constants_type* shift_constants) const
+  {
+    uint8_t* data = (uint8_t*)&key;


reinterpret_cast here too

cudf/cpp/include/cudf/detail/utilities/hash_functions.cuh

Line 111 in 3b68a9d

uint8_t const* data = reinterpret_cast<uint8_t const*>(&key);

-- switched

vuule · 2020-07-22T04:40:00Z

cpp/include/cudf/detail/utilities/hash_functions.cuh

+  uint64_t full_length = (uint64_t)hash_state->message_length;
+  full_length          = full_length << 3;


Suggested change

-  uint64_t full_length = (uint64_t)hash_state->message_length;
-  full_length          = full_length << 3;
+  auto const full_length = (static_cast<uint64_t>(hash_state->message_length)) << 3;

cudf/cpp/include/cudf/detail/utilities/hash_functions.cuh

Line 138 in 3b68a9d

auto const full_length = (static_cast<uint64_t>(hash_state->message_length)) << 3;
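A small host-side sketch of why the cast comes before the shift: MD5 appends the message length in bits as a 64-bit value, so the byte count has to be widened before it is multiplied by 8. The helper name `length_in_bits` and the 32-bit parameter type are assumptions for illustration; the actual type of `message_length` is not shown in this thread.

```cpp
#include <cstdint>

// Hypothetical helper: convert a byte count into the 64-bit bit count
// that MD5 appends to the padded message.
inline std::uint64_t length_in_bits(std::uint32_t message_length_bytes)
{
  // Widen first, then shift, so the multiplication by 8 cannot wrap at 32 bits.
  return static_cast<std::uint64_t>(message_length_bytes) << 3;
}
```

Shifting before widening would silently drop the top three bits for byte counts of 512 MiB or more.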

vuule · 2020-07-22T05:02:18Z

cpp/src/hash/hash_constants.cu

+/**
+ * @copydoc cudf::detail::get_hex_to_char_mapping
+ */
+const hex_to_char_mapping_type* get_hex_to_char_mapping()


Looks like other places that require global thread-safe initialization use thread_safe_per_context_cache instead of a std::mutex. Can you use it here too?

Refactored to thread_safe_per_context_cache, then removed entirely in favor of __device__ __constant__
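For readers wondering what a hex-to-char mapping is for in an MD5 implementation, here is a minimal host-side sketch of turning digest bytes into a hex string with a lookup table. The names `hex_digits` and `to_hex` are hypothetical, and the layout of the PR's `hex_to_char_mapping_type` is not shown in this thread, so this is not the PR's implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical lookup table: nibble value -> lowercase hex character.
constexpr char hex_digits[] = "0123456789abcdef";

// Convert raw bytes (for example a 16-byte MD5 digest) to a hex string.
inline std::string to_hex(std::uint8_t const* bytes, std::size_t size)
{
  std::string out;
  out.reserve(size * 2);
  for (std::size_t i = 0; i < size; ++i) {
    out.push_back(hex_digits[bytes[i] >> 4]);    // high nibble
    out.push_back(hex_digits[bytes[i] & 0x0F]);  // low nibble
  }
  return out;
}
```

Because the table is immutable and known at compile time, it needs no runtime initialization or locking on the host, which is the same property the thread obtains on the device by using constant memory.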

vuule · 2020-07-22T05:04:32Z

cpp/include/cudf/detail/hashing.hpp

                             std::vector<uint32_t> const& initial_hash = {},
                             rmm::mr::device_memory_resource* mr = rmm::mr::get_default_resource(),
                             cudaStream_t stream                 = 0);

+std::unique_ptr<column> identity_hash(


what is this function?

It would have been the wrapper call for the IdentityHash function. I did not end up implementing it and will remove it.

vuule · 2020-07-22T05:06:27Z

cpp/include/cudf/detail/utilities/hash_functions.cuh

+    CUDF_FAIL("Unsupported hash type");
+  }
+
+  void CUDA_HOST_DEVICE_CALLABLE operator()(Key const& key,


Why is this one empty? Might warrant a comment if I'm not missing something trivial.

I did a bunch of testing today to understand and explain this issue better. This operator() acts as a catch-all that represents the default behavior, which lets me override specific cases such as string_view outside of the struct. Without this un-templated operator function, I get compile errors for a bunch of different types, for example: ../include/cudf/detail/utilities/hash_functions.cuh(165): error: no suitable constructor exists to convert from "const __nv_bool" to "cudf::data_type"

Ideally the default behavior would be a CUDF_FAIL("Unsupported hash type"), but adding that to the empty function fails to compile: ../include/cudf/detail/utilities/hash_functions.cuh(173): error: device code does not support exception handling

During my testing this afternoon, non-fixed-width types never hit the CUDF_FAIL on line 146, and the same was true of the other types I was trying to filter out as unsupported column data types. It's clear the current method of filtering out unsupported types doesn't work. Any guidance on how to fix this would be appreciated.

Yeah, this isn't right. You're not hitting CUDF_FAIL because you're doing device-side dispatch and calling your MD5Hash object in device code. CUDF_FAIL is a host-only construct. I'm surprised this even compiles.

EDIT: this advice does not stand; see row_hasher instead, as suggested below.

I'd suggest looking at row_hasher here:

cudf/cpp/include/cudf/table/row_operators.cuh

Line 411 in 855e735

class row_hasher {

This required a bit of a rework. Using size_of rather than sizeof was causing failures with primitive types, which were being type-dispatched to other operator functions.
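The underlying problem in this thread (a default overload that cannot throw in device code) can be pictured with a host-side C++ sketch of compile-time overload selection. This is one common pattern and not necessarily how row_hasher or the final PR handles it. The type `demo_hasher` and its return values are hypothetical.

```cpp
#include <string>
#include <type_traits>

// Hypothetical hasher: names and return values are illustrative only.
struct demo_hasher {
  // Selected only for arithmetic (fixed-width) types.
  template <typename T, std::enable_if_t<std::is_arithmetic<T>::value, int> = 0>
  int operator()(T const&) const
  {
    return static_cast<int>(sizeof(T));
  }

  // Catch-all for every other type: reports "unsupported" through the return
  // value instead of throwing, since device code cannot use exceptions.
  template <typename T, std::enable_if_t<!std::is_arithmetic<T>::value, int> = 0>
  int operator()(T const&) const
  {
    return -1;
  }
};
```

With this shape, an unsupported type selects the catch-all at compile time, and a host-side check before the kernel launch can then reject it with an exception.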

rwlee · 2020-08-06T22:36:41Z

This breaks the Python murmur3 hash functionality. I'm fixing it now, but it will likely require another review.

codecov · 2020-08-07T11:04:52Z

Codecov Report

Merging #5438 into branch-0.15 will increase coverage by 0.35%.
The diff coverage is n/a.

@@               Coverage Diff               @@
##           branch-0.15    #5438      +/-   ##
===============================================
+ Coverage        84.08%   84.43%   +0.35%     
===============================================
  Files               80       80              
  Lines            13062    13424     +362     
===============================================
+ Hits             10983    11335     +352     
- Misses            2079     2089      +10

Impacted Files	Coverage Δ
python/cudf/cudf/core/reshape.py	`88.67% <0.00%> (-0.43%)`	⬇️
python/cudf/cudf/io/feather.py	`100.00% <0.00%> (ø)`
python/cudf/cudf/comm/serialize.py	`0.00% <0.00%> (ø)`
python/custreamz/custreamz/kafka.py	`28.88% <0.00%> (ø)`
python/custreamz/custreamz/_version.py	`0.00% <0.00%> (ø)`
python/dask_cudf/dask_cudf/_version.py	`0.00% <0.00%> (ø)`
python/dask_cudf/dask_cudf/io/tests/test_orc.py	`100.00% <0.00%> (ø)`
python/dask_cudf/dask_cudf/io/tests/test_json.py	`100.00% <0.00%> (ø)`
...ython/dask_cudf/dask_cudf/io/tests/test_parquet.py	`100.00% <0.00%> (ø)`
python/cudf/cudf/utils/applyutils.py	`98.75% <0.00%> (+0.02%)`	⬆️
... and 31 more

Powered by Codecov. Last update 69af94b...7379b23.

jrhemstad · 2020-08-07T21:55:27Z

cpp/include/cudf/hashing.hpp

                             std::vector<uint32_t> const& initial_hash = {},
                             rmm::mr::device_memory_resource* mr = rmm::mr::get_default_resource());

+std::unique_ptr<column> murmur_hash3_32(


We don't want the same function exposed two ways. Either the Python should be updated to call the hash with the HASH_MURMUR3 enum or get rid of the hash API and add a specific MD5 hash. I'd opt for the former.

I mirrored the hash_id enum over to the Python side and removed the murmur-hash-specific function.
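The single-entry-point design discussed here can be sketched as one function that selects the algorithm from an enum, rather than one exported function per algorithm. The enumerator list and the function `hash_name` below are assumptions for illustration. Only the names hash_id and HASH_MURMUR3 appear in this thread.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical mirror of an enum like hash_id; HASH_MD5 is an assumed name.
enum class hash_id { HASH_MURMUR3, HASH_MD5 };

// One entry point that selects the algorithm from the enum value.
inline std::string hash_name(hash_id id)
{
  switch (id) {
    case hash_id::HASH_MURMUR3: return "murmur3";
    case hash_id::HASH_MD5: return "md5";
  }
  // Host-side code can throw for values outside the enum.
  throw std::invalid_argument("Unsupported hash type");
}
```

Adding another algorithm then means adding an enumerator and a case, while the public signature stays the same for both C++ and Python callers.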

kkraus14

Python changes LGTM

karthikeyann · 2020-08-11T17:13:41Z

cpp/src/hash/hash_constants.cu

+/**
+ * @copydoc cudf::detail::get_md5_hash_constants
+ */
+const md5_hash_constants_type* get_md5_hash_constants()


__constant__ md5_shift_constants_type g_md5_shift_constants[16] = { 7, 12, 17, 22, 5, 9, 14, 20, 4, 11, 16, 23, 6, 10, 15, 21,};

https://forums.developer.nvidia.com/t/constant-memory-which-is-device-side-only-avoiding-cudamemcpytosymbol/50804/2

This initialization happens automatically in device memory (unsure exactly when; likely when the CUDA context is created).
Constant memory is limited to 64 KB.
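To show how a 16-entry shift table like the one quoted above is consumed, here is a host-side C++ sketch using the standard MD5 layout: 64 steps in four rounds of 16, with each round cycling through its own group of four shift amounts. The helper names are hypothetical, and the indexing follows the MD5 specification, not code confirmed from this PR. A device version would keep the table in `__constant__` memory as suggested.

```cpp
#include <cstdint>

// The 16 shift amounts quoted above, as a host-side constexpr table.
constexpr std::uint32_t md5_shift_constants[16] = {
  7, 12, 17, 22, 5, 9, 14, 20, 4, 11, 16, 23, 6, 10, 15, 21};

// Hypothetical helper: shift amount for MD5 step i (0..63).
// i / 16 picks the round, i % 4 picks the position within the group of four.
constexpr std::uint32_t shift_for_step(int i)
{
  return md5_shift_constants[(i / 16) * 4 + (i % 4)];
}

// 32-bit left rotation, the operation the shift amounts feed into.
// Valid for 0 < n < 32, which covers every value in the table.
constexpr std::uint32_t rotate_left(std::uint32_t x, std::uint32_t n)
{
  return (x << n) | (x >> (32 - n));
}
```

The table is only 64 bytes here, so the 64 KB constant-memory limit mentioned above is not a practical concern for these constants.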

revans2

This looks OK to me, but I didn't see any Java code changes, so I'm not sure why a Java code owner review is needed for this change.

kkraus14 · 2020-08-12T21:52:18Z

rerun tests

rwlee requested a review from a team as a code owner June 10, 2020 19:21

rwlee requested review from karthikeyann and vuule and removed request for a team June 10, 2020 19:21

kkraus14 added the 2 - In Progress, libcudf, and Spark labels Jun 11, 2020

refactor that details proposed code

b626432

rwlee force-pushed the rwlee/md5 branch from 5ae4e55 to 0d3845c Compare July 21, 2020 07:49

rwlee requested review from a team as code owners July 21, 2020 07:49

rwlee changed the base branch from branch-0.14 to branch-0.15 July 21, 2020 07:52

rwlee added 2 commits July 21, 2020 08:07

Initial MD5 implementation

0d3845c

Modify CHANGELOG and fix copywrite headers

d32552d

rwlee changed the title ~~[WIP] Add MD5 to existing hashing functionality~~ Add MD5 to existing hashing functionality Jul 21, 2020

Merge branch 'branch-0.15' into rwlee/md5

a5ed47f

harrism changed the title ~~Add MD5 to existing hashing functionality~~ [WIP] Add MD5 to existing hashing functionality Jul 21, 2020

harrism assigned rwlee Jul 21, 2020

rwlee changed the title ~~[WIP] Add MD5 to existing hashing functionality~~ [REVIEW] Add MD5 to existing hashing functionality Jul 21, 2020

rwlee added the 3 - Ready for Review label Jul 21, 2020

style fixes

f4ad66e

vuule requested changes Jul 22, 2020

rwlee and others added 2 commits August 4, 2020 21:29

remove extra include

efd8fdb

Merge branch 'branch-0.15' into rwlee/md5

ba9efae

jrhemstad approved these changes Aug 5, 2020

fix python api

f3daf4b

rwlee requested review from karthikeyann and vuule August 7, 2020 21:42

jrhemstad requested changes Aug 7, 2020

fix code attribution and add license link

f41c8cc

vuule requested changes Aug 7, 2020

kkraus14 approved these changes Aug 8, 2020

rwlee added 2 commits August 8, 2020 02:07

value naming

614182b

add hash_id to python interface

6aaa7b8

karthikeyann requested changes Aug 11, 2020

rwlee added 2 commits August 11, 2020 21:51

Switch hash constant handling

3b68a9d

remove ide insert

7379b23

jrhemstad approved these changes Aug 11, 2020

karthikeyann approved these changes Aug 12, 2020

vuule approved these changes Aug 12, 2020

revans2 approved these changes Aug 12, 2020

jrhemstad added the 5 - Ready to Merge label and removed the 2 - In Progress and 3 - Ready for Review labels Aug 12, 2020

kkraus14 merged commit 8aae2e4 into rapidsai:branch-0.15 Aug 13, 2020

bdice mentioned this pull request Sep 10, 2021

Add SHA-1 and SHA-2 hash functions. #9215

Closed

bdice mentioned this pull request Oct 20, 2021

Refactor MD5 implementation. #9212

Merged
