Exposing lucene 6.x minhash filter. #20206
[[analysis-minhash-tokenfilter]]
== Minhash Token Filter

A token filter of type `minhash` hashes each token of the token stream and divides
s/minhash/min_hash/
Hi @a2lin, thanks for the PR. This needs documentation added too. By the way, why are you using shingles of 3..5 words? This will create many more tokens than you really need and isn't considered good practice. Instead I'd just use shingles of length 2.
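To make the token-count point concrete, here is a minimal Python sketch that counts word shingles for a 9-word sentence. The helper `word_shingles` is hypothetical and only imitates the idea of a shingle filter; it is not the Lucene or Elasticsearch implementation, and it ignores any unigram output a real shingle filter may be configured to emit.

```python
def word_shingles(tokens, min_size, max_size):
    """Return every contiguous word n-gram with min_size <= n <= max_size.

    Hypothetical helper for illustration only; not the Lucene ShingleFilter.
    """
    out = []
    for n in range(min_size, max_size + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out


tokens = "the quick brown fox jumps over the lazy dog".split()

wide = word_shingles(tokens, 3, 5)    # shingles of 3, 4 and 5 words
narrow = word_shingles(tokens, 2, 2)  # shingles of exactly 2 words

print(len(wide), len(narrow))  # prints "18 8"
```

For this sentence the 3..5 range yields 7 + 6 + 5 = 18 shingles, while length-2 shingles yield 8, so the wider range more than doubles the number of tokens that have to be hashed.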
@clintongormley Thanks for the advice. I poked around with 5 because the function comment for MinHashFilter suggests that the expected incoming tokens are 5 word shingles. Can you link me to an example of the documentation that this feature needs added? I could only find the shingled version of:
when I searched the code for the shingleTokenFilterFactory analogue.
@clintongormley Oops, I thought that was generated from the file that @jpountz commented on. I'll look again.
Oh sorry @a2lin - I completely missed the asciidoc!
ok to test
@a2lin thanks for fixing this!!!!!
@s1monw thanks for merging!
Exposing lucene 6.x minhash tokenfilter

Generate min hash tokens from an incoming stream of tokens that can be used to estimate document similarity.

Closes elastic#20149
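As a rough illustration of how min hash values can estimate document similarity, here is a minimal Python sketch of the generic MinHash technique. It is an assumption-laden teaching example, not the Lucene `MinHashFilter` algorithm: the function names, the use of `hashlib.blake2b` with a seed prefix, and the choice of 128 hash functions over 2-word shingles are all hypothetical choices made for this sketch.

```python
import hashlib


def shingles(text, size=2):
    """Split text into a set of lowercase word shingles of the given size."""
    words = text.lower().split()
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}


def _hash(seed, item):
    """Deterministic 64-bit hash of an item, varied by an integer seed."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes=128):
    """For each seeded hash function, keep the minimum hash over all items.

    `items` must be non-empty, otherwise min() raises ValueError.
    """
    return [min(_hash(seed, item) for item in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions where the two minimums agree."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def exact_jaccard(a, b):
    """Exact Jaccard similarity of two sets, for comparison."""
    return len(a & b) / len(a | b)


doc_a = shingles("the quick brown fox jumps over the lazy dog")
doc_b = shingles("the quick brown fox leaps over the lazy dog")

sig_a = minhash_signature(doc_a)
sig_b = minhash_signature(doc_b)

print("exact:", exact_jaccard(doc_a, doc_b))
print("estimate:", estimate_jaccard(sig_a, sig_b))
```

The estimate works because, for a single hash function, the probability that two sets share the same minimum hash equals their Jaccard similarity; averaging over many hash functions tightens the estimate. Two identical shingle sets always produce identical signatures, so their estimate is exactly 1.0.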
I've tried to expose the 6.x minhash filter, and wrote some documentation from my (very inexpert) understanding of how this works.
I was mostly playing around with it using the following scenario that I cribbed from the Lucene ticket.
result (hits)
query:
(bulk) indexed documents:
mapping:
Closes #20149.