[QST] Efficient minhashing with cuDF #12950

ayushdg · 2023-03-15T00:43:19Z

What is your question?
Minhashing or locality sensitive hashing is a popular technique used to group similar documents based on the similarity b/w words/shingles in those documents.

Right now I have an approach that computes the minhashes for documents as follows:
Series of documents -> n-grams(shingles)-> hash_values -> groupby+min

Documents

0            doc 1
1    this is doc 2
2            doc 3
Name: doc_text, dtype: object

exploded n-grams

0     doc 
0     oc 1
1     this
1     his 
1     is i
1     s is
1     is 
1     is d
1     s do
1     doc
1     doc 
1     oc 2
3     doc 
3     oc 3
Name: doc_text, dtype: object

hash_values -> groupby+min

2    2005254014
0     216665549
1    1081029400
dtype: uint32

The need to explode each document into a bunch of tokens increase the memory consumption per document and adds the need to groupby document_id at the end

I was wondering if some sort of UDF or custom kernel might be more efficient for this task which can parallelize the sliding window hashing + storing minhash that would avoid this extra memory usage and the need for a groupby at the end.

cc: @davidwendt @wence-

The text was updated successfully, but these errors were encountered:

ayushdg · 2023-03-23T18:11:31Z

Edit: Also it is common in practice to compute multiple minhashes k, either by using k different hash function, 1 hash function and k different permutations of that hash, or 1 hash function with k different seeds.

Taking the last case as an example we would compute 1 minhash for each seed per document, and end up with k such minhashes.

Adds the `nvtext::minhash()` function to compute the MinHash of strings in a string column without needing extra memory to hold the intermediate substrings and hash values. This also uses the `MurmurHash3_32` algorithm directly so the hash results have better parity with external algorithms. The API also accepts a hash seed value. ``` std::unique_ptr<cudf::column> minhash( cudf::strings_column_view const& input, cudf::size_type width = 4, cudf::hash_value_type seed = cudf::DEFAULT_HASH_SEED, rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource()); ``` Closes #12950 Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Lawrence Mitchell (https://github.com/wence-) - Bradley Dice (https://github.com/bdice) - Karthikeyan (https://github.com/karthikeyann) - Ayush Dattagupta (https://github.com/ayushdg) URL: #12961

ayushdg added question Further information is requested Needs Triage Need team to review and classify labels Mar 15, 2023

davidwendt self-assigned this Mar 15, 2023

davidwendt mentioned this issue Mar 16, 2023

Add nvtext::minhash function #12961

Merged

3 tasks

rapids-bot bot closed this as completed in #12961 Apr 21, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[QST] Efficient minhashing with cuDF #12950

[QST] Efficient minhashing with cuDF #12950

ayushdg commented Mar 15, 2023

ayushdg commented Mar 23, 2023

[QST] Efficient minhashing with cuDF #12950

[QST] Efficient minhashing with cuDF #12950

Comments

ayushdg commented Mar 15, 2023

ayushdg commented Mar 23, 2023