Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[REVIEW] Add MD5 to existing hashing functionality #5438

Merged
merged 20 commits into from
Aug 13, 2020

Conversation

rwlee
Copy link
Contributor

@rwlee rwlee commented Jun 10, 2020

Resolves #4989

Refactors hashing support in preparation for MD5 support.

@rwlee rwlee requested a review from a team as a code owner June 10, 2020 19:21
@rwlee rwlee requested review from karthikeyann and vuule and removed request for a team June 10, 2020 19:21
@kkraus14 kkraus14 added 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS labels Jun 11, 2020
@lmeyerov
Copy link

@rwlee we may have a project around this, worth syncing?

@rwlee rwlee requested review from a team as code owners July 21, 2020 07:49
@rwlee rwlee changed the base branch from branch-0.14 to branch-0.15 July 21, 2020 07:52
@GPUtester
Copy link
Collaborator

Please update the changelog in order to start CI tests.

View the gpuCI docs here.

1 similar comment
@GPUtester
Copy link
Collaborator

Please update the changelog in order to start CI tests.

View the gpuCI docs here.

@rwlee rwlee changed the title [WIP] Add MD5 to existing hashing functionality Add MD5 to existing hashing functionality Jul 21, 2020
@harrism harrism changed the title Add MD5 to existing hashing functionality [WIP] Add MD5 to existing hashing functionality Jul 21, 2020
@rwlee rwlee changed the title [WIP] Add MD5 to existing hashing functionality [REVIEW] Add MD5 to existing hashing functionality Jul 21, 2020
@rwlee rwlee added the 3 - Ready for Review Ready for review by team label Jul 21, 2020
Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good.
Some mismatches in code style: lack of auto use, west const, C-style casts. Not a big deal, but it would be great to keep the code consistent.

cpp/include/cudf/detail/utilities/hash_functions.cuh Outdated Show resolved Hide resolved
cpp/include/cudf/detail/utilities/hash_functions.cuh Outdated Show resolved Hide resolved
uint32_t C = hash_state->hash_value[2];
uint32_t D = hash_state->hash_value[3];

uint32_t* buffer_ints = (uint32_t*)hash_state->buffer;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use reinterpret_cast

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switched to a memcpy

std::memcpy(&buffer_element_as_int, hash_state->buffer + g * 4, 4);

const md5_hash_constants_type* hash_constants,
const md5_shift_constants_type* shift_constants) const
{
uint8_t* data = (uint8_t*)&key;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reinterpret_cast here too

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

uint8_t const* data = reinterpret_cast<uint8_t const*>(&key);
-- switched

Comment on lines 199 to 200
uint64_t full_length = (uint64_t)hash_state->message_length;
full_length = full_length << 3;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
uint64_t full_length = (uint64_t)hash_state->message_length;
full_length = full_length << 3;
auto const full_length = (static_cast<uint64_t>hash_state->message_length) << 3;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

auto const full_length = (static_cast<uint64_t>(hash_state->message_length)) << 3;

cpp/src/hash/hash_constants.cu Outdated Show resolved Hide resolved
cpp/src/hash/hash_constants.cuh Outdated Show resolved Hide resolved
/**
* @copydoc cudf::detail::get_hex_to_char_mapping
*/
const hex_to_char_mapping_type* get_hex_to_char_mapping()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like other places that require global thread-safe initialization use thread_safe_per_context_cache instead of a std::mutex. Can you use it here too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactored to thread_safe_per_context_cache, then removed entirely in favor of __device__ __constant__

std::vector<uint32_t> const& initial_hash = {},
rmm::mr::device_memory_resource* mr = rmm::mr::get_default_resource(),
cudaStream_t stream = 0);

std::unique_ptr<column> identity_hash(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is this function?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be the wrapper call for the IdentityHash function, I did not end up implementing it and will remove it.

CUDF_FAIL("Unsupported hash type");
}

void CUDA_HOST_DEVICE_CALLABLE operator()(Key const& key,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this one empty? Might warrant a comment if I'm not missing something trivial.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a bunch of testing today to understand and explain this issue better. This operatore() is acting as a catch all case representing default behavior, that allows me to override specific cases like a string_view outside of the struct. Without this un-templated operator function, I get a bunch of compile errors ../include/cudf/detail/utilities/hash_functions.cuh(165): error: no suitable constructor exists to convert from "const __nv_bool" to "cudf::data_type" for a bunch of different types.

Ideally the default behavior should actually be a CUDF_FAIL("Unsupported hash type") but adding that to the empty function causes ../include/cudf/detail/utilities/hash_functions.cuh(173): error: device code does not support exception handling errors.

During my testing this afternoon, non-fixed width types never hit the CUDF_FAIL on line 146 -- the same was seen for other types I was trying to filter out as unsupported column data types. It's clear the current method of filtering out unsupported types doesn't work, any guidance on how to fix this would be appreciated.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this isn't right. You're not hitting CUDF_FAIL because you're doing device-side dispatch and calling your MD5Hash object in device code. CUDF_FAIL is a host-only construct. I'm surprised this even compiles.

Copy link
Contributor

@vuule vuule Jul 24, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EDIT: the advice does not stand, see the row_hasher instead as suggested below.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest looking at row_hasher here:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Required a bit of a rework, the use of size_of rather than sizeof was causing failures with primitive types and were being type dispatched to other operator functions.

@rwlee
Copy link
Contributor Author

rwlee commented Aug 6, 2020

Breaks the python murmur3 hash functionality, fixing it now but will likely require another review.

@codecov
Copy link

codecov bot commented Aug 7, 2020

Codecov Report

Merging #5438 into branch-0.15 will increase coverage by 0.35%.
The diff coverage is n/a.

Impacted file tree graph

@@               Coverage Diff               @@
##           branch-0.15    #5438      +/-   ##
===============================================
+ Coverage        84.08%   84.43%   +0.35%     
===============================================
  Files               80       80              
  Lines            13062    13424     +362     
===============================================
+ Hits             10983    11335     +352     
- Misses            2079     2089      +10     
Impacted Files Coverage Δ
python/cudf/cudf/core/reshape.py 88.67% <0.00%> (-0.43%) ⬇️
python/cudf/cudf/io/feather.py 100.00% <0.00%> (ø)
python/cudf/cudf/comm/serialize.py 0.00% <0.00%> (ø)
python/custreamz/custreamz/kafka.py 28.88% <0.00%> (ø)
python/custreamz/custreamz/_version.py 0.00% <0.00%> (ø)
python/dask_cudf/dask_cudf/_version.py 0.00% <0.00%> (ø)
python/dask_cudf/dask_cudf/io/tests/test_orc.py 100.00% <0.00%> (ø)
python/dask_cudf/dask_cudf/io/tests/test_json.py 100.00% <0.00%> (ø)
...ython/dask_cudf/dask_cudf/io/tests/test_parquet.py 100.00% <0.00%> (ø)
python/cudf/cudf/utils/applyutils.py 98.75% <0.00%> (+0.02%) ⬆️
... and 31 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 69af94b...7379b23. Read the comment docs.

@rwlee rwlee requested review from karthikeyann and vuule August 7, 2020 21:42
std::vector<uint32_t> const& initial_hash = {},
rmm::mr::device_memory_resource* mr = rmm::mr::get_default_resource());

std::unique_ptr<column> murmur_hash3_32(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't want the same function exposed two ways. Either the Python should be updated to call the hash with the HASH_MURMUR3 enum or get rid of the hash API and add a specific MD5 hash. I'd opt for the former.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mirrored the hash_id enum over to the python side and removed the murmur hash specific function.

Copy link
Collaborator

@kkraus14 kkraus14 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Python changes LGTM

cpp/include/cudf/detail/utilities/hash_functions.cuh Outdated Show resolved Hide resolved
/**
* @copydoc cudf::detail::get_md5_hash_constants
*/
const md5_hash_constants_type* get_md5_hash_constants()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

__constant__ md5_shift_constants_type g_md5_shift_constants[16] 
   = {  7,  12,  17,  22,  5,  9,  14,  20,  4,  11,  16,  23,  6,  10,  15,  21,};

https://forums.developer.nvidia.com/t/constant-memory-which-is-device-side-only-avoiding-cudamemcpytosymbol/50804/2

This initialization will happen automatically in device memory (unsure when. likely when cuda context is created)
Limit to Constant memory size is 64 KB.

Copy link
Contributor

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks OK to me, but I didn't see any java code changes, so not really sure why a java code owner review is needed for this change.

@jrhemstad jrhemstad added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 2 - In Progress Currently a work in progress 3 - Ready for Review Ready for review by team labels Aug 12, 2020
@kkraus14
Copy link
Collaborator

rerun tests

@kkraus14 kkraus14 merged commit 8aae2e4 into rapidsai:branch-0.15 Aug 13, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] row-wise hashing using common hash functions like MD5 & SHA-2
8 participants