[FEA] Word ngram based minhashes #15055
Labels
feature request
New feature or request
libcudf
Affects libcudf (C++/CUDA) code.
strings
strings issues (C++ and Python)
Is your feature request related to a problem? Please describe.
The current minhash API (#12961) slides a fixed-width character window over each document and computes the document's minhashes from the resulting character ngrams.
An alternative approach is to minhash documents using word-based ngrams instead of character-based ones. Research seems to suggest this leads to better-quality results (fewer false positives during locality-sensitive hashing).
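For reference, the character-window scheme can be sketched in plain Python as follows. This is only an illustration of the idea, not the libcudf implementation: the function name `char_ngram_minhash` is hypothetical, and `zlib.crc32` with a seed as its starting value is an assumed stand-in for whatever seeded hash the real API uses.

```python
import zlib


def char_ngram_minhash(text, width, seeds):
    """Return one minhash per seed over the character ngrams of `text`.

    Hypothetical helper for illustration only.
    """
    # Slide a fixed-width window over the characters; a text shorter than
    # the window contributes a single (short) ngram.
    count = max(len(text) - width + 1, 1)
    ngrams = [text[i:i + width] for i in range(count)]
    # For each seed, hash every ngram and keep the smallest value.
    # crc32 is an assumption here, standing in for a real seeded hash.
    return [
        min(zlib.crc32(g.encode("utf-8"), seed) for g in ngrams)
        for seed in seeds
    ]


signature = char_ngram_minhash("the quick brown fox", width=5, seeds=[0, 1, 2])
```

Each seed yields one value, so a document's signature has as many entries as there are seeds.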
Describe the solution you'd like
Add support for word-ngram-based minhash that is accelerated on GPUs.
Describe alternatives you've considered
Currently, most CPU libraries achieve this by first tokenizing the string into word ngrams and then looping over these tokens to compute the minhash.
It is possible to somewhat mimic that with a
str.word_tokenize + str.hash_values + groupby + min
in cuDF, but this requires much more intermediate memory, which reduces the batch size of documents that can be processed. It is also slower than a custom minhash implementation.
Another challenge is that the definition of a "word" differs across languages, and there are often different language-specific approaches to creating word ngrams.
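The CPU-style approach described above (tokenize into word ngrams, then loop over the tokens computing the minhash) might look roughly like this in plain Python. It is a minimal sketch under stated assumptions: words are split on whitespace, which is exactly the language-dependent choice noted above; the names `word_ngrams` and `word_ngram_minhash` are hypothetical; and `zlib.crc32` seeded through its starting value stands in for a real seeded hash function.

```python
import zlib


def word_ngrams(text, n):
    """Split on whitespace (an assumption) and join each run of n words."""
    words = text.split()
    if not words:
        return []
    if len(words) < n:
        # Fewer than n words: treat the whole text as a single ngram.
        return [" ".join(words)]
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def word_ngram_minhash(text, n, seeds):
    """Return one minhash per seed, or an empty list for an empty document."""
    grams = word_ngrams(text, n)
    if not grams:
        return []
    signature = []
    for seed in seeds:
        # Hash every word ngram with this seed and keep the minimum.
        hashes = (zlib.crc32(g.encode("utf-8"), seed) for g in grams)
        signature.append(min(hashes))
    return signature


doc_a = word_ngram_minhash("the quick brown fox jumps", n=2, seeds=range(4))
doc_b = word_ngram_minhash("the quick brown fox leaps", n=2, seeds=range(4))
# Fraction of matching signature slots estimates Jaccard similarity.
estimate = sum(a == b for a, b in zip(doc_a, doc_b)) / len(doc_a)
```

Note that this loop holds only one document's ngrams at a time, whereas the tokenize + hash + groupby emulation materializes every ngram and every hash for the whole batch, which is where the extra intermediate memory comes from.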