[QST] Efficient minhashing with cuDF #12950
Labels
2 - In Progress
Currently a work in progress
feature request
New feature or request
libcudf
Affects libcudf (C++/CUDA) code.
strings
strings issues (C++ and Python)
What is your question?
Minhashing, a form of locality-sensitive hashing, is a popular technique for grouping similar documents based on the similarity between the words/shingles in those documents.
Right now I have an approach that computes the minhashes for documents as follows:
Series of documents -> n-grams (shingles) -> hash_values -> groupby + min
The need to explode each document into a bunch of tokens increases the memory consumption per document and adds the need for a groupby on document_id at the end.
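To make the cost concrete, here is a rough CPU-only sketch of the explode + groupby pipeline in plain Python. It is not cuDF code: `char_ngrams` and `minhash_explode_groupby` are hypothetical names, the shingles are assumed to be character n-grams, and `zlib.crc32` is only a stand-in for whatever hash function is actually used. The point is the intermediate row count, which grows with roughly one row per shingle per document.

```python
import zlib


def char_ngrams(text, n):
    """Return the character n-grams (shingles) of text.

    A text shorter than n yields itself as its only shingle.
    """
    if len(text) <= n:
        return [text]
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def minhash_explode_groupby(docs, n=5, seed=0):
    """Hypothetical sketch of: documents -> shingles -> hashes -> groupby + min.

    Returns (one minhash per document, number of exploded rows).
    """
    # Step 1: explode each document into (doc_id, shingle) rows.
    rows = [
        (doc_id, gram)
        for doc_id, doc in enumerate(docs)
        for gram in char_ngrams(doc, n)
    ]
    # Step 2: hash every exploded row (crc32 is an assumption, not the real hash).
    hashed = [(doc_id, zlib.crc32(gram.encode("utf-8"), seed)) for doc_id, gram in rows]
    # Step 3: group by doc_id and keep the minimum hash per group.
    mins = {}
    for doc_id, h in hashed:
        mins[doc_id] = min(h, mins.get(doc_id, h))
    return [mins[i] for i in range(len(docs))], len(rows)
```

For a 19-character document and n=5 this materializes 15 rows before reducing them back to a single value, which is the overhead described above.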
I was wondering whether some sort of UDF or custom kernel might be more efficient for this task: one that parallelizes the sliding-window hashing and stores the minhash directly, which would avoid the extra memory usage and the need for a groupby at the end.
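As a rough illustration of what such a fused pass would compute, here is a plain-Python sketch (again not cuDF or CUDA code). `minhash_sliding` is a hypothetical name, shingles are assumed to be character n-grams, and seeding `zlib.crc32` with different start values is only a placeholder for a proper family of hash functions. Each document is scanned once with a sliding window, and only one running minimum per seed is kept, so no exploded intermediate and no groupby are needed.

```python
import zlib


def minhash_sliding(text, n=5, seeds=(0, 1, 2)):
    """Hypothetical sketch: one pass over a document, one running min per seed.

    Memory per document is O(len(seeds)), independent of the shingle count.
    """
    # crc32 values fit in 32 bits, so this is a safe upper bound to start from.
    mins = [0xFFFFFFFF] * len(seeds)
    # A text shorter than n is treated as a single shingle.
    last_start = max(len(text) - n, 0)
    for start in range(last_start + 1):
        gram = text[start:start + n].encode("utf-8")
        for k, seed in enumerate(seeds):
            h = zlib.crc32(gram, seed)  # stand-in hash, seeded per signature slot
            if h < mins[k]:
                mins[k] = h
    return mins


def minhash_all(docs, n=5, seeds=(0, 1, 2)):
    """Documents are independent, so this loop is what a kernel could parallelize."""
    return [minhash_sliding(doc, n, seeds) for doc in docs]
```

In a GPU version, the outer loop over documents would presumably map to threads or thread groups, with the running minimum held in registers; that mapping is an assumption about how such a kernel could be laid out, not a description of an existing one.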
cc: @davidwendt @wence-