[WIP] SHA-1 and SHA-2 hashes #6020

Closed · wants to merge 2 commits

Conversation

@rwlee (Contributor) commented Aug 18, 2020

Resolves #4989

The linked issue was resolved when the MD5 functionality went in; this PR adds SHA hashes to the existing hashing functionality.

In this WIP PR, the SHA-2 family of hashes is split into two separate kernels: one for SHA-224 and SHA-256, and another for SHA-384 and SHA-512. The underlying algorithm is very similar in both cases but uses different word sizes, buffer sizes, iteration counts, and shift constants. Given these differences, is it reasonable to split the algorithm despite how similar the resulting code will be?
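
For context, the parameter differences driving the split look roughly like this (an illustrative sketch, not the PR's actual code; the trait struct names are made up):

#include <cstdint>

struct sha256_family_traits {              // SHA-224 / SHA-256
  using word_type = uint32_t;              // 32-bit words
  static constexpr int block_bytes = 64;   // 512-bit message blocks
  static constexpr int rounds      = 64;   // 64 compression rounds
};

struct sha512_family_traits {              // SHA-384 / SHA-512
  using word_type = uint64_t;              // 64-bit words
  static constexpr int block_bytes = 128;  // 1024-bit message blocks
  static constexpr int rounds      = 80;   // 80 compression rounds
};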

@rwlee rwlee requested a review from a team as a code owner August 18, 2020 06:43
@rwlee rwlee requested review from cwharris and davidwendt August 18, 2020 06:43
@GPUtester (Collaborator)

Please update the changelog in order to start CI tests.

View the gpuCI docs here.

@rwlee added labels on Aug 18, 2020: 2 - In Progress (Currently a work in progress), Spark (Functionality that helps Spark RAPIDS), libcudf (Affects libcudf (C++/CUDA) code)
uint8_t buffer[64];
};

__device__ __constant__ sha256_word_type sha256_hash_constants[64] = {

Contributor:

I'm concerned about using up all of the available constant memory with all of these lookup tables. There's only 64KB available.

Member:

Also, what will the access pattern be across threads in the same warp? If it is random, rather than uniform, I think just __device__ const would be better.

Contributor Author (rwlee):

Happy to change the approach; I was asked to use __device__ __constant__ in the MD5 hashing. Should I swap back to the thread_safe_per_context_cache?

Contributor Author (rwlee):

Oops, Mark's reply didn't show up in my initial comment.

The access pattern for each thread is sequential, but nothing synchronizes the threads across a warp, so I think the access within a warp would effectively be random.
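
For context on this exchange, the two declaration styles being compared look roughly like this (illustrative only; the constant values are elided and the array names are made up for the comparison):

#include <cstdint>

// Current declaration in the PR: the constant cache is fastest when every thread
// in a warp reads the same address in the same instruction (uniform access),
// and all __constant__ data shares the 64 KB constant memory budget.
__device__ __constant__ uint32_t sha256_constants_cmem[64] = { /* values elided */ };

// Suggested alternative for divergent (effectively random) per-thread access:
// ordinary device memory, typically served through the read-only data / L1 cache.
__device__ const uint32_t sha256_constants_gmem[64] = { /* values elided */ };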

}

std::unique_ptr<column> sha256_base(table_view const& input,
bool truncate_output,

Contributor:

Don't use a bool parameter. Use an enum class with a descriptive name so it's more obvious at the callsite what this controls.
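
For illustration, the suggestion amounts to something like the following declaration sketch (the enum and parameter names are hypothetical, other parameters are omitted, and `column`/`table_view` are the same cudf types as in the snippet above):

// A named enum makes the call site self-describing.
enum class sha_output_width { full, truncated };

std::unique_ptr<column> sha256_base(table_view const& input,
                                    sha_output_width output_width /*, ... */);

// Call site reads clearly:
//   auto result = sha256_base(input, sha_output_width::truncated /*, ... */);
// versus the opaque:
//   auto result = sha256_base(input, true /*, ... */);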

@GPUtester (Collaborator)

Please update the changelog in order to start CI tests.

View the gpuCI docs here.

@harrism (Member) commented Oct 6, 2020

Moving to 0.17

@harrism harrism changed the base branch from branch-0.16 to branch-0.17 October 6, 2020 04:53
@harrism (Member) commented Nov 23, 2020

@rwlee is this still being developed?

@jrhemstad (Contributor)

Moving to 0.19.

@github-actions

This PR has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates. This PR will be labeled inactive-90d if there is no activity in the next 60 days.

@harrism (Member) commented Jul 22, 2021

@rwlee @sameerz this PR is very stale. Can it be closed, or do you still have plans for it?

@sameerz (Contributor) commented Jul 22, 2021

@harrism I think it can be closed for now. We can reopen once we get back to this.

rapids-bot bot pushed a commit that referenced this pull request Oct 21, 2021
This PR refactors the MD5 hash implementation in libcudf. I used the MD5 code as a reference while working on SHA (extending #6020, PR #9215 to follow).

List of high-level changes:
- Moved the implementation of `MD5Hash` and related logic from `include/cudf/detail/utilities/hash_functions.cuh` to `src/hash/md5_hash.cu`, because it is only used in that file and nowhere else. We don't need to include and build MD5 in `hash_functions.cuh` for all the collections/sorting/groupby tools that only use Murmur3 variants and `IdentityHash`. (This will be a bigger deal once we add the SHA hash functions, soon to follow this PR, because the size of `hash_functions.cuh` would be substantially larger without this separation.)
- Removed an `MD5Hash` constructor that accepted and stored a seed whose value was unused.
- Improved use of namespaces.
- Used named constants instead of magic numbers.
- Introduced a `hash_circular_buffer` and refactored dispatch logic (a rough sketch of the idea follows after this commit message).

No changes were made to the feature scope or public APIs of the MD5 feature, so existing unit tests and bindings should remain the same.

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - David Wendt (https://github.com/davidwendt)
  - Mark Harris (https://github.com/harrism)
  - Jake Hemstad (https://github.com/jrhemstad)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #9212
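
As a rough illustration of the circular-buffer idea mentioned in the commit above (a simplified sketch, not the actual libcudf `hash_circular_buffer`; the type and member names here are assumptions):

#include <cstdint>
#include <cstring>

// Accumulate bytes into a fixed-size message block and run a block-processing
// step whenever the block fills, as MD5/SHA-style hashes require.
template <int block_size, typename BlockProcessor>
struct circular_block_buffer {
  uint8_t buffer[block_size];
  int used = 0;
  BlockProcessor process_block;  // invoked with a full block_size-byte block

  void put(uint8_t const* data, int len)
  {
    while (len > 0) {
      int const space    = block_size - used;
      int const copy_len = len < space ? len : space;
      std::memcpy(buffer + used, data, copy_len);
      used += copy_len;
      data += copy_len;
      len  -= copy_len;
      if (used == block_size) {  // block is full: consume it and start over
        process_block(buffer);
        used = 0;
      }
    }
  }
};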
Labels: 2 - In Progress, libcudf, Spark
Projects: none
Merging this pull request may close: [FEA] row-wise hashing using common hash functions like MD5 & SHA-2
5 participants