[WIP] SHA-1 and SHA-2 hashes #6020

rwlee · 2020-08-18T06:43:54Z

Resolves #4989

Linked issue was resolved when MD5 functionality went in. This PR adds SHA hashes to existing hashing functionality.

In this WIP PR, the SHA-2 family of hashes is split into two separate kernels: one for SHA-224 and SHA-256, and the other for SHA-384 and SHA-512. The underlying algorithm is very similar, but the two groups differ in word size, buffer size, iteration count, and shift constants. Given these differences, is it reasonable to split the algorithm despite how similar the resulting code will be?
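For illustration only, one alternative to two hand-written kernels is to collect the differing parameters in a traits type and write the shared logic once as a template. The sketch below is host-side C++ and is not code from this PR: the names `sha256_traits`, `sha512_traits`, `rotr`, and `small_sigma0` are hypothetical, and only the small sigma-0 function from the SHA-2 message schedule is shown. The numeric values (word widths, block sizes, round counts, shift amounts) are the ones defined by the SHA-2 specification.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical traits types (not from the PR). Each SHA-2 width supplies its
// own word type, block size, round count, and shift amounts.
struct sha256_traits {  // also covers SHA-224
  using word_type = std::uint32_t;
  static constexpr std::size_t block_bytes = 64;
  static constexpr int rounds              = 64;
  // small sigma0(x) = rotr(x, r0) ^ rotr(x, r1) ^ (x >> s)
  static constexpr int sigma0_r0 = 7, sigma0_r1 = 18, sigma0_s = 3;
};

struct sha512_traits {  // also covers SHA-384
  using word_type = std::uint64_t;
  static constexpr std::size_t block_bytes = 128;
  static constexpr int rounds              = 80;
  static constexpr int sigma0_r0 = 1, sigma0_r1 = 8, sigma0_s = 7;
};

// Rotate right by n bits; n must be in [1, bit width - 1].
template <typename T>
constexpr T rotr(T x, int n)
{
  return static_cast<T>((x >> n) | (x << (sizeof(T) * 8 - n)));
}

// One definition serves both word sizes; only the traits differ.
template <typename Traits>
constexpr typename Traits::word_type small_sigma0(typename Traits::word_type x)
{
  return rotr(x, Traits::sigma0_r0) ^ rotr(x, Traits::sigma0_r1) ^
         (x >> Traits::sigma0_s);
}
```

Whether this is preferable to two separate kernels depends on how much of the remaining code (padding, length encoding, output truncation) can be parameterized the same way.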

GPUtester · 2020-08-18T06:44:22Z

Please update the changelog in order to start CI tests.

View the gpuCI docs here.

jrhemstad · 2020-08-18T13:21:48Z

cpp/src/hash/hash_constants.hpp

+  uint8_t buffer[64];
+};
+
+__device__ __constant__ sha256_word_type sha256_hash_constants[64] = {


I'm concerned about using up all of the available constant memory with all of these lookup tables. There's only 64KB available.

Also, what will the access pattern be across threads in the same warp? If it is random, rather than uniform, I think just __device__ const would be better.

Happy to change the approach; I was asked to use __device__ __constant__ in the MD5 hashing. Should I swap back to the thread_safe_per_context_cache?

Oops, Mark's reply didn't show up in my initial comment.

Access pattern for each thread is sequential, but there's nothing synchronizing each thread across the warp -- so I think the access within a warp would effectively be random.
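As a rough aid to the sizing concern in this thread, the arithmetic can be written out. The sketch below is host-side C++, not code from the PR; all names in it are hypothetical. It counts only the two SHA-2 round-constant tables (64 32-bit words and 80 64-bit words, per the SHA-2 specification) against the 64KB budget quoted above. It does not include any other tables in the library, and it says nothing about the access-pattern question.

```cpp
#include <cstddef>
#include <cstdint>

// The 64KB constant-memory budget quoted in the review comment.
constexpr std::size_t constant_memory_bytes = 64 * 1024;

// Round-constant table sizes from the SHA-2 specification:
// SHA-224/256 use 64 32-bit words, SHA-384/512 use 80 64-bit words.
constexpr std::size_t sha256_table_bytes = 64 * sizeof(std::uint32_t);
constexpr std::size_t sha512_table_bytes = 80 * sizeof(std::uint64_t);

// Combined footprint of the two SHA-2 tables only.
constexpr std::size_t sha2_tables_bytes = sha256_table_bytes + sha512_table_bytes;

// Bytes of the budget left for every other constant-memory symbol.
constexpr std::size_t remaining_bytes(std::size_t used)
{
  return constant_memory_bytes - used;
}
```

The total that matters is the sum over every lookup table placed in constant memory across the library, so this figure is a lower bound on usage, not a full accounting.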

jrhemstad · 2020-08-18T13:22:47Z

cpp/src/hash/hashing.cu

+}
+
+std::unique_ptr<column> sha256_base(table_view const& input,
+                                    bool truncate_output,


Don't use a bool parameter. Use an enum class with a descriptive name so it's more obvious at the callsite what this controls.
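A minimal sketch of the suggestion, in host-side C++. The enum name `digest_truncation`, its enumerators, and the helper `sha256_family_digest_bytes` are hypothetical; the PR does not say what names were chosen. It also assumes that `truncate_output` selects the SHA-224 output length (28 bytes) instead of the SHA-256 length (32 bytes).

```cpp
#include <cstddef>

// Hypothetical replacement for the `bool truncate_output` parameter.
enum class digest_truncation { NONE, TRUNCATE };

// Digest size in bytes for the shared SHA-224/SHA-256 code path.
// Assumption: TRUNCATE means SHA-224 (28 bytes), NONE means SHA-256 (32 bytes).
constexpr std::size_t sha256_family_digest_bytes(digest_truncation t)
{
  return t == digest_truncation::TRUNCATE ? 28 : 32;
}
```

At the callsite, `sha256_family_digest_bytes(digest_truncation::TRUNCATE)` states its intent, whereas a bare `true` would require looking up the signature.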

GPUtester · 2020-08-29T02:18:16Z

Please update the changelog in order to start CI tests.

View the gpuCI docs here.

harrism · 2020-10-06T04:53:06Z

Moving to 0.17

harrism · 2020-11-23T03:36:59Z

@rwlee is this still being developed?

jrhemstad · 2021-02-03T15:56:44Z

Moving to 0.19.

github-actions · 2021-03-14T19:12:53Z

This PR has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates. This PR will be labeled inactive-90d if there is no activity in the next 60 days.

harrism · 2021-07-22T00:35:46Z

@rwlee @sameerz this PR is very stale. Can it be closed, or do you still have plans for it?

sameerz · 2021-07-22T00:39:21Z

@harrism I think it can be closed for now. We can reopen once we get back to this.

This PR refactors the MD5 hash implementation in libcudf. I used the MD5 code as a reference while working on SHA (extending #6020, PR #9215 to follow).

List of high-level changes:

- I moved the implementation of `MD5Hash` and related logic from `include/cudf/detail/utilities/hash_functions.cuh` to `src/hash/md5_hash.cu` because it is only used in that file and nowhere else. We don't need to include and build MD5 in `hash_functions.cuh` for all the collections/sorting/groupby tools that only use Murmur3 variants and `IdentityHash`. (This will be a bigger deal once we add the SHA hash functions, soon to follow this PR, because the size of `hash_functions.cuh` would be substantially larger without this separation.)
- I removed an `MD5Hash` constructor that accepted and stored a seed whose value was unused.
- Improved use of namespaces.
- Use named constants instead of magic numbers.
- Introduced a `hash_circular_buffer` and refactored dispatch logic.

No changes were made to the feature scope or public APIs of the MD5 feature, so existing unit tests and bindings should remain the same.

Authors:
- Bradley Dice (https://github.com/bdice)

Approvers:
- David Wendt (https://github.com/davidwendt)
- Mark Harris (https://github.com/harrism)
- Jake Hemstad (https://github.com/jrhemstad)
- Vyas Ramasubramani (https://github.com/vyasr)

URL: #9212

initial sha structure

7e5dec8

rwlee requested a review from a team as a code owner August 18, 2020 06:43

rwlee requested review from cwharris and davidwendt August 18, 2020 06:43

rwlee added the 2 - In Progress, Spark, and libcudf labels Aug 18, 2020

jrhemstad requested changes Aug 18, 2020

first past sha1 implementation

e4e5c14

rwlee force-pushed the rwlee/sha branch from f1aa1cd to e4e5c14 Compare August 29, 2020 02:16

harrism changed the base branch from branch-0.16 to branch-0.17 October 6, 2020 04:53

github-actions bot added the inactive-30d label Mar 14, 2021

sameerz closed this Jul 22, 2021

davidwendt mentioned this pull request Jul 28, 2021

[FEA] Add support for SHA256 and SHA512 to cudf::hash #8641

Closed

This was referenced Sep 10, 2021

Refactor MD5 implementation. #9212

Merged

Add SHA-1 and SHA-2 hash functions. #9215

Closed


rwlee commented Aug 18, 2020

GPUtester commented Aug 18, 2020

jrhemstad Aug 18, 2020

harrism Aug 19, 2020

rwlee Aug 21, 2020

rwlee Aug 22, 2020

jrhemstad Aug 18, 2020

GPUtester commented Aug 29, 2020

harrism commented Oct 6, 2020

harrism commented Nov 23, 2020

jrhemstad commented Feb 3, 2021

github-actions bot commented Mar 14, 2021

harrism commented Jul 22, 2021

sameerz commented Jul 22, 2021
