MD5 refactoring. #10445

Merged
rapids-bot merged 7 commits into rapidsai:branch-22.04 on Mar 17, 2022

Conversation

bdice
Contributor

@bdice bdice commented Mar 16, 2022

This PR refactors the MD5 hashing functionality. It moves some code that will be shared logic for SHA hashing (#9215), and reduces the diff of that PR to make it easier to review.

@bdice self-assigned this Mar 16, 2022
@github-actions bot added labels Mar 16, 2022: Java (Affects Java cuDF API), Python (Affects Python cuDF API), libcudf (Affects libcudf (C++/CUDA) code)
@bdice added labels Mar 16, 2022: improvement (Improvement / enhancement to an existing function), non-breaking (Non-breaking change)
@bdice marked this pull request as ready for review March 16, 2022 17:45
@bdice requested review from a team as code owners March 16, 2022 17:45
Contributor Author

@bdice bdice left a comment

Added some explanations to help reviewers.

@@ -48,6 +48,107 @@ T __device__ inline normalize_nans_and_zeros(T const& key)
  return key;
}

__device__ inline uint32_t rotate_bits_left(uint32_t x, uint32_t r)
Contributor Author

I moved some common functions/classes into this header so that it can be re-used by the SHA hash in a follow-up PR. This code was not edited from the previous file except to replace some int8 arguments with uint32_t arguments for the bit rotation functions. This change aligns with the input data types and expected types for the CUDA intrinsics __funnelshift_l / __funnelshift_r.
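
To make the rotation semantics concrete, here is a minimal host-side sketch in plain C++. It is not the libcudf code: the device versions rely on the `__funnelshift_l` / `__funnelshift_r` intrinsics, which reduce the shift amount modulo 32. The name `rotate_bits_right` and the helper `check_rotations` are assumptions made for illustration; only `rotate_bits_left` appears in the diff.

```cpp
#include <cassert>
#include <cstdint>

// Host-side stand-in for a 32-bit left rotation (illustrative only).
// The shift amount is reduced modulo 32, so r == 0 and r == 32 both return x
// and no shift ever uses the full word width, which would be undefined in C++.
constexpr std::uint32_t rotate_bits_left(std::uint32_t x, std::uint32_t r)
{
  return (x << (r & 31u)) | (x >> ((32u - r) & 31u));
}

// Mirror image of the function above; the name is assumed, not from the diff.
constexpr std::uint32_t rotate_bits_right(std::uint32_t x, std::uint32_t r)
{
  return (x >> (r & 31u)) | (x << ((32u - r) & 31u));
}

// Small self-check: rotating left and then right by the same amount is a no-op.
inline void check_rotations()
{
  assert(rotate_bits_left(0x12345678u, 8u) == 0x34567812u);
  assert(rotate_bits_right(rotate_bits_left(0xCAFEF00Du, 13u), 13u) == 0xCAFEF00Du);
}
```

Taking the rotation amount as `uint32_t` keeps both operands the same type, which matches the 32-bit operand and shift arguments that the intrinsics expect.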

HASH_BENCHMARK_DEFINE(HASH_SERIAL_MURMUR3, nulls)
HASH_BENCHMARK_DEFINE(HASH_SPARK_MURMUR3, nulls)
HASH_BENCHMARK_DEFINE(HASH_MD5, nulls)
Contributor Author

A good chunk of the diffs in this PR are just from moving code around. I put MD5 at the bottom of the list of hashing functions defined in the hash_id enum, and reorganized everything else to have the same order. That means that the MD5 and SHA features can be listed next to each other in a subsequent PR #9215. MD5 and SHA are in the same family of cryptographic hash functions, so it's a logical grouping.

@@ -90,11 +191,6 @@ struct MurmurHash3_32 {
  MurmurHash3_32() = default;
  constexpr MurmurHash3_32(uint32_t seed) : m_seed(seed) {}

  [[nodiscard]] __device__ inline uint32_t rotl32(uint32_t x, uint32_t r) const
  {
    return __funnelshift_l(x, x, r);  // Equivalent to (x << r) | (x >> (32 - r))
Contributor Author

We no longer need to define rotl32 in MurmurHash3_32 because it's identical to the cudf::detail::rotate_bits_left utility that I moved into this file.

return make_strings_column(
input.num_rows(), std::move(offsets_column), std::move(chars_column), 0, std::move(null_mask));
// Build an output null mask from the logical AND of all input columns' null masks.
auto [null_mask, null_count] = cudf::detail::bitmask_and(input, stream);
Contributor Author

This changes the null behavior so that a hash of null input data returns a null value, rather than the hash of an empty byte string. This ensures that users can distinguish between the hash of empty bytes and the hash of null inputs.

Contributor Author

I retracted this change in behavior because it caused tests to fail. We do need to fix this so that null inputs produce null outputs, and I will open a separate PR to fix it.
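
For reference, the null semantics discussed in this thread (an output row is null whenever any input column is null in that row) can be sketched on the host as a word-wise AND of validity bitmasks. This is only an illustration under stated assumptions, not the `cudf::detail::bitmask_and` implementation: the names `bitmask_word` and `bitmask_and_sketch` are hypothetical, one bit per row is assumed with 1 meaning valid, and every mask is assumed to hold enough 32-bit words to cover `num_rows`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using bitmask_word = std::uint32_t;  // hypothetical alias: one bit per row, 1 = valid

// AND together the validity masks of several columns and count the null rows.
// Assumes each mask has at least ceil(num_rows / 32) words.
inline std::pair<std::vector<bitmask_word>, std::size_t> bitmask_and_sketch(
  std::vector<std::vector<bitmask_word>> const& masks, std::size_t num_rows)
{
  std::size_t const num_words = (num_rows + 31) / 32;

  // Start from "all valid" and clear a bit as soon as any column is null there.
  std::vector<bitmask_word> result(num_words, ~bitmask_word{0});
  for (auto const& mask : masks) {
    for (std::size_t w = 0; w < num_words; ++w) { result[w] &= mask[w]; }
  }

  // Count rows whose combined validity bit is 0, ignoring padding bits.
  std::size_t null_count = 0;
  for (std::size_t row = 0; row < num_rows; ++row) {
    bool const valid = ((result[row / 32] >> (row % 32)) & 1u) != 0u;
    if (!valid) { ++null_count; }
  }
  return {std::move(result), null_count};
}
```

With two four-row columns whose masks are `0b1011` and `0b1110`, the combined mask is `0b1010`, so rows 0 and 2 would be null in the output.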

],
)
def test_series_hash_values(method, validation_data):
@pytest.mark.parametrize("method", ["md5"])
Contributor Author

This list of methods will be expanded in #9215.

auto __device__ inline get_element_pointer_and_size(Element const& element)
{
  if constexpr (is_fixed_width<Element>() && !is_chrono<Element>()) {
    return thrust::make_pair(reinterpret_cast<uint8_t const*>(&element), sizeof(Element));
Contributor

File this away for future improvement, but this should just return a device_span<byte>.
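
A rough host-side sketch of what that suggestion might look like: return a single span-like view of the element's bytes rather than a pointer/size pair. The `byte_span` type and the `element_bytes` function below are hypothetical stand-ins written for illustration; they are not cuDF's `device_span` and not the function in this diff, and the sketch only covers trivially copyable element types.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Hypothetical minimal span type standing in for a device_span of bytes.
struct byte_span {
  std::uint8_t const* ptr;
  std::size_t len;

  constexpr std::uint8_t const* data() const { return ptr; }
  constexpr std::size_t size() const { return len; }
  constexpr std::uint8_t const* begin() const { return ptr; }
  constexpr std::uint8_t const* end() const { return ptr + len; }
};

// View the object representation of a fixed-width element as bytes.
// The returned span refers to `element`, so it must not outlive it.
template <typename Element>
byte_span element_bytes(Element const& element)
{
  static_assert(std::is_trivially_copyable<Element>::value,
                "only trivially copyable elements can be viewed as raw bytes");
  return byte_span{reinterpret_cast<std::uint8_t const*>(&element), sizeof(Element)};
}
```

Compared with a pair, a span keeps the pointer and length together under named accessors and works directly in a range-based `for` loop.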

Contributor

@codereport codereport left a comment

LGTM 👍

@codecov

codecov bot commented Mar 17, 2022

Codecov Report

Merging #10445 (32ea649) into branch-22.04 (4596244) will increase coverage by 0.04%.
The diff coverage is 100.00%.

@@               Coverage Diff                @@
##           branch-22.04   #10445      +/-   ##
================================================
+ Coverage         86.13%   86.18%   +0.04%     
================================================
  Files               139      139              
  Lines             22438    22468      +30     
================================================
+ Hits              19328    19363      +35     
+ Misses             3110     3105       -5     
| Impacted Files | Coverage Δ | |
|---|---|---|
| python/cudf/cudf/core/tools/numeric.py | 89.24% <100.00%> (+0.11%) | ⬆️ |
| python/dask_cudf/dask_cudf/backends.py | 86.44% <100.00%> (+1.47%) | ⬆️ |
| ...ython/dask_cudf/dask_cudf/io/tests/test_parquet.py | 100.00% <100.00%> (ø) | |
| python/cudf/cudf/core/column/string.py | 88.39% <0.00%> (+0.12%) | ⬆️ |
| python/cudf/cudf/core/groupby/groupby.py | 91.57% <0.00%> (+0.22%) | ⬆️ |
| python/cudf/cudf/core/column/numerical.py | 95.28% <0.00%> (+0.29%) | ⬆️ |
| python/cudf/cudf/core/tools/datetimes.py | 84.49% <0.00%> (+0.30%) | ⬆️ |
| python/cudf/cudf/core/column/lists.py | 90.56% <0.00%> (+0.47%) | ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update d4ce5d5...32ea649.

@bdice
Contributor Author

bdice commented Mar 17, 2022

rerun tests

Contributor

@jbrennan333 jbrennan333 left a comment

The enum change in HashType.java looks good to me.

@bdice
Contributor Author

bdice commented Mar 17, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 04933a2 into rapidsai:branch-22.04 Mar 17, 2022
Labels
improvement (Improvement / enhancement to an existing function), Java (Affects Java cuDF API), libcudf (Affects libcudf (C++/CUDA) code), non-breaking (Non-breaking change), Python (Affects Python cuDF API)
6 participants