
[FEA] row-wise hashing using common hash functions like MD5 & SHA-2 #4989

Closed
rwlee opened this issue Apr 22, 2020 · 2 comments · Fixed by #5438
Labels: feature request (New feature or request), libcudf (Affects libcudf C++/CUDA code), Spark (Functionality that helps Spark RAPIDS)
rwlee (Contributor) commented Apr 22, 2020

Is your feature request related to a problem? Please describe.
We would like to use cudf to hash each row of a column, expanding the existing hash functionality to support other common hash functions such as MD5, SHA-2, etc.

Describe the solution you'd like
The existing hash functionality exists here:
https://github.com/rapidsai/cudf/blob/branch-0.14/cpp/include/cudf/hashing.hpp#L34

Ideally this would be enhanced to support an additional optional argument that specifies which hash function to use.

Additional context
This feature request is somewhat similar to #4913 but hashes each row rather than hashing an entire column to a single value.
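As a minimal sketch of the requested semantics (pure Python with `hashlib`, not the cudf API): one digest per row, with an optional argument selecting the hash function. The `hash_rows` helper, its signature, and the string-concatenation scheme for combining column values are all hypothetical illustrations, not the eventual libcudf design.

```python
import hashlib

def hash_rows(rows, algorithm="md5"):
    """Return one hex digest per row; `algorithm` selects the hash function.

    Hypothetical helper: each row's column values are serialized and fed
    into a single hash state, mirroring the row-wise behavior requested here.
    """
    digests = []
    for row in rows:
        h = hashlib.new(algorithm)
        for value in row:
            h.update(str(value).encode("utf-8"))
        digests.append(h.hexdigest())
    return digests

# One MD5 digest per row; switching `algorithm` switches the hash function.
print(hash_rows([(1, "a"), (2, "b")], algorithm="md5"))
```

The optional `algorithm` parameter here plays the role of the proposed extra argument to the existing `cudf::hash` entry point.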

@rwlee rwlee added feature request New feature or request Needs Triage Need team to review and classify Spark Functionality that helps Spark RAPIDS labels Apr 22, 2020
revans2 (Contributor) commented Apr 22, 2020

One of the hard parts with this is that most of these hashes are wider than 64 bits:

- MD5 is a 128-bit hash (https://en.wikipedia.org/wiki/MD5).
- SHA-1 is 160 bits.
- SHA-2 is actually a family of several widths that we would want to support: 224, 256, 384, and 512 bits.

So there is no simple way to return these hashes in binary form with existing cudf data types. Spark returns a hex-encoded string for these results, which would be one option. Alternatively, you could wait for array support to land and return a binary array of the correct length.
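The width problem above can be made concrete with the stdlib `hashlib`: each hex character encodes 4 bits, so the hex-string lengths below follow directly from the bit widths listed (this is just an illustration of the output sizes, not a cudf implementation).

```python
import hashlib

# Digest widths for the hash functions discussed above.
# A hex digest has width_bits / 4 characters, since each hex char is 4 bits.
for name, width_bits in [("md5", 128), ("sha1", 160), ("sha224", 224),
                         ("sha256", 256), ("sha384", 384), ("sha512", 512)]:
    digest = hashlib.new(name, b"example").hexdigest()
    assert len(digest) * 4 == width_bits
    print(f"{name}: {width_bits} bits -> {len(digest)}-char hex string")
```

Even the narrowest of these (MD5 at 128 bits) exceeds the 64-bit integer columns cudf offers, which is why a string or fixed-length binary representation comes up.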

Hash functions that do fit into a long/int, and that we would like to see, are CRC32, MurMur3, and xxhash64.
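Of these, CRC32 is the easiest to demonstrate with only the Python stdlib (MurmurHash3 and xxhash64 need third-party packages, so CRC32 stands in here): its 32-bit result fits directly in an existing integer column type, avoiding the width problem entirely.

```python
import zlib

# CRC32 yields an unsigned 32-bit value, so it fits in existing cudf
# integer column types with no new output representation needed.
checksum = zlib.crc32(b"example row")
assert 0 <= checksum < 2**32
print(checksum)
```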

@kkraus14 kkraus14 added libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Apr 23, 2020
@rwlee rwlee self-assigned this Jun 2, 2020
lmeyerov commented Jul 15, 2020

Growing the use case a bit for DPI-style & compression use cases:

Hashing / multi-hashing:

-- hash: rabin-karp, bloom filter
-- size: 8b - 128b (can represent a bytestream as a single column?)

Collision checking:

-- Given a stream, what are the substring matches? <-- feels like categorical encoding with the option to throw out appears-only-once
