
[FEA] row-wise hashing using common hash functions like MD5 & SHA-2 #4989

Closed
rwlee opened this issue Apr 22, 2020 · 2 comments · Fixed by #5438
Labels: feature request (New feature or request), libcudf (Affects libcudf C++/CUDA code), Spark (Functionality that helps Spark RAPIDS)
rwlee (Contributor) commented Apr 22, 2020

Is your feature request related to a problem? Please describe.
We would like to use cudf to hash each row of a column, expanding the existing hash functionality to support other common hash functions such as MD5, SHA-2, etc.

Describe the solution you'd like
The existing hash functionality exists here:
https://github.com/rapidsai/cudf/blob/branch-0.14/cpp/include/cudf/hashing.hpp#L34

Ideally this would be enhanced to support an additional optional argument that specifies which hash function to use.

Additional context
This feature request is somewhat similar to #4913 but hashes each row rather than hashing an entire column to a single value.
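As a minimal sketch of the requested semantics (pure Python with `hashlib`, not the cudf API): one digest per row, with an optional argument selecting the hash function. The `hash_rows` helper, its signature, and the string-concatenation scheme for combining column values are all hypothetical illustrations, not the eventual libcudf design.

```python
import hashlib

def hash_rows(rows, algorithm="md5"):
    """Return one hex digest per row; `algorithm` selects the hash function.

    Hypothetical helper: each row's column values are serialized and fed
    into a single hash state, mirroring the row-wise behavior requested here.
    """
    digests = []
    for row in rows:
        h = hashlib.new(algorithm)
        for value in row:
            h.update(str(value).encode("utf-8"))
        digests.append(h.hexdigest())
    return digests

# One MD5 digest per row; switching `algorithm` switches the hash function.
print(hash_rows([(1, "a"), (2, "b")], algorithm="md5"))
```

The optional `algorithm` parameter here plays the role of the proposed extra argument to the existing `cudf::hash` entry point.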

@rwlee rwlee added feature request New feature or request Needs Triage Need team to review and classify Spark Functionality that helps Spark RAPIDS labels Apr 22, 2020
revans2 (Contributor) commented Apr 22, 2020

One of the hard parts with this is that most of these hashes are wider than 64 bits:

- MD5 is a 128-bit hash (https://en.wikipedia.org/wiki/MD5).
- SHA-1 is 160 bits.
- SHA-2 is actually a family of several widths that we would want to support: 224, 256, 384, and 512 bits.

So there is no simple way to return these hashes in binary form with existing cudf data types. Spark returns a hex-encoded string for these results, which would be one option. Alternatively, you could wait for array support to land and return a binary array of the correct length.
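The width problem above can be made concrete with the stdlib `hashlib`: each hex character encodes 4 bits, so the hex-string lengths below follow directly from the bit widths listed (this is just an illustration of the output sizes, not a cudf implementation).

```python
import hashlib

# Digest widths for the hash functions discussed above.
# A hex digest has width_bits / 4 characters, since each hex char is 4 bits.
for name, width_bits in [("md5", 128), ("sha1", 160), ("sha224", 224),
                         ("sha256", 256), ("sha384", 384), ("sha512", 512)]:
    digest = hashlib.new(name, b"example").hexdigest()
    assert len(digest) * 4 == width_bits
    print(f"{name}: {width_bits} bits -> {len(digest)}-char hex string")
```

Even the narrowest of these (MD5 at 128 bits) exceeds the 64-bit integer columns cudf offers, which is why a string or fixed-length binary representation comes up.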

Hash functions that do fit into a long/int, and that we would like to see, are CRC32, MurMur3, and xxhash64.
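Of these, CRC32 is the easiest to demonstrate with only the Python stdlib (MurmurHash3 and xxhash64 need third-party packages, so CRC32 stands in here): its 32-bit result fits directly in an existing integer column type, avoiding the width problem entirely.

```python
import zlib

# CRC32 yields an unsigned 32-bit value, so it fits in existing cudf
# integer column types with no new output representation needed.
checksum = zlib.crc32(b"example row")
assert 0 <= checksum < 2**32
print(checksum)
```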

@kkraus14 kkraus14 added libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Apr 23, 2020
@rwlee rwlee self-assigned this Jun 2, 2020
lmeyerov commented Jul 15, 2020

Growing the use case a bit for DPI-style & compression use cases:

Hashing / multi-hashing:

-- hash: rabin-karp, bloom filter
-- size: 8b - 128b (can represent a bytestream as a single column?)

Collision checking:

-- Given a stream, what are the substring matches? <-- feels like categorical encoding with the option to throw out appears-only-once
