[FEA] Word ngram based minhashes #15055

ayushdg · 2024-02-14T19:43:47Z

Is your feature request related to a problem? Please describe.
The current minhash API (#12961) slides a fixed-width character window over each document and computes the document's minhash from the resulting character ngrams.
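For readers unfamiliar with the character-window approach, the following is a minimal pure-Python sketch of the idea, not the cuDF implementation. The names `seeded_hash32`, `char_ngrams`, and `char_minhash` are hypothetical, and BLAKE2b is used only as a convenient seeded stand-in for whatever hash the library actually uses.

```python
import hashlib


def seeded_hash32(data: bytes, seed: int) -> int:
    """Stand-in seeded 32-bit hash (assumption: not the hash cuDF uses)."""
    salt = seed.to_bytes(4, "little")  # seed must fit in 32 bits
    digest = hashlib.blake2b(data, digest_size=4, salt=salt).digest()
    return int.from_bytes(digest, "little")


def char_ngrams(text: str, width: int) -> list[str]:
    """Slide a fixed-width character window over the text."""
    if len(text) <= width:
        return [text]  # short documents yield a single ngram
    return [text[i:i + width] for i in range(len(text) - width + 1)]


def char_minhash(text: str, width: int, seeds: list[int]) -> list[int]:
    """Return one minhash per seed: the smallest hash over all ngrams."""
    grams = char_ngrams(text, width)
    return [
        min(seeded_hash32(g.encode("utf-8"), seed) for g in grams)
        for seed in seeds
    ]
```

Calling `char_minhash("hello world", 5, [0, 1, 2])` returns a list of three 32-bit integers, one per seed.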

An alternative approach is to minhash documents using word-based ngrams instead of character-based ones. Research seems to suggest this leads to better-quality results (fewer false positives during locality-sensitive hashing).

Describe the solution you'd like
Add GPU-accelerated support for word ngram based minhashes.

Describe alternatives you've considered
Currently, most CPU libraries achieve this by first tokenizing the string into word ngrams and then looping over these tokens to compute the minhash.
It is possible to roughly mimic this in cuDF with str.word_tokenize + str.hash_values + groupby + min, but doing so requires much more intermediate memory, which reduces the batch size of documents that can be processed. It is also slower than a custom minhash implementation.
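The tokenize-then-loop pattern described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the code of any particular CPU library: words are taken to be whitespace-delimited, the helper names are hypothetical, and BLAKE2b stands in for the real hash function.

```python
import hashlib


def seeded_hash32(data: bytes, seed: int) -> int:
    """Stand-in seeded 32-bit hash (assumption: real libraries differ)."""
    salt = seed.to_bytes(4, "little")  # seed must fit in 32 bits
    digest = hashlib.blake2b(data, digest_size=4, salt=salt).digest()
    return int.from_bytes(digest, "little")


def word_ngrams(text: str, n: int) -> list[str]:
    """Tokenize on whitespace (an assumption) and join runs of n words."""
    words = text.split()
    if len(words) <= n:
        return [" ".join(words)]
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def word_ngram_minhash(text: str, n: int, seeds: list[int]) -> list[int]:
    """Loop over the word ngrams, keeping the smallest hash per seed."""
    grams = word_ngrams(text, n)
    signature = []
    for seed in seeds:
        smallest = 0xFFFFFFFF  # largest possible 32-bit value
        for gram in grams:
            smallest = min(smallest, seeded_hash32(gram.encode("utf-8"), seed))
        signature.append(smallest)
    return signature
```

Only a running minimum per seed is kept here, whereas the tokenize + hash + groupby route materializes every token and every hash before reducing, which is where the extra intermediate memory comes from.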

Another challenge is that the definition of a "word" differs between languages, and there are often different approaches to creating word ngrams depending on the language.



Experimental implementation for #15055

The input is a lists column of strings where each string in each row is expected as a word to be hashed. The minimum hash for that row is returned in a lists column where each row contains a minhash per input hash seed. Here the caller is expected to produce the words to be hashed.

```
std::unique_ptr<cudf::column> word_minhash(
  cudf::lists_column_view const& input,
  cudf::device_span<uint32_t const> seeds,
  rmm::cuda_stream_view stream,
  rmm::device_async_resource_ref mr);
```

Authors:
- David Wendt (https://github.com/davidwendt)

Approvers:
- Bradley Dice (https://github.com/bdice)
- Nghia Truong (https://github.com/ttnghia)
- GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #15368
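The input and output shapes described for `word_minhash` can be modeled on the CPU with the short sketch below. It is a reference model of the stated semantics only. `word_minhash_model` and `seeded_hash32` are hypothetical names, BLAKE2b is a stand-in hash, and the handling of empty rows is an assumption because the description does not cover it.

```python
import hashlib


def seeded_hash32(data: bytes, seed: int) -> int:
    """Stand-in seeded 32-bit hash (assumption: not libcudf's hash)."""
    salt = seed.to_bytes(4, "little")  # seed must fit in 32 bits
    digest = hashlib.blake2b(data, digest_size=4, salt=salt).digest()
    return int.from_bytes(digest, "little")


def word_minhash_model(rows: list[list[str]], seeds: list[int]) -> list[list[int]]:
    """Model of the described behavior: each row is a list of caller-produced
    words, and each output row holds one minhash per seed.

    Assumption: an empty row raises ValueError here; the source does not say
    how the real function treats empty rows.
    """
    result = []
    for words in rows:
        row_hashes = [
            min(seeded_hash32(w.encode("utf-8"), seed) for w in words)
            for seed in seeds
        ]
        result.append(row_hashes)
    return result
```

Because the caller supplies the words, any tokenization can be used upstream, including language-specific rules or word ngrams joined into single strings. Word order within a row does not affect the result, since only the minimum hash is kept.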

davidwendt · 2024-12-13T19:56:15Z

The word-based minhash API was deprecated in 24.12 and removed in 25.02.
We can reopen this issue if this becomes a new requirement in the future.

ayushdg added the "feature request" (New feature or request) label Feb 14, 2024

davidwendt self-assigned this Feb 14, 2024

GregoryKimball added the "libcudf" (Affects libcudf (C++/CUDA) code) and "strings" (strings issues (C++ and Python)) labels Feb 15, 2024

GregoryKimball added this to libcudf Feb 15, 2024

GregoryKimball added this to the Language model acceleration milestone Feb 15, 2024

GregoryKimball moved this to In progress in libcudf Feb 20, 2024

davidwendt mentioned this issue Mar 21, 2024 in "Word-based nvtext::minhash function" #15368 (merged)

davidwendt closed this as completed Dec 13, 2024
