Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closes #20345 Adds minhash filter for use in Elasticsearch #20346

Closed
wants to merge 1 commit into from
Closed

Closes #20345 Adds minhash filter for use in Elasticsearch #20346

wants to merge 1 commit into from

Conversation

softwaredoug
Copy link
Contributor

The minhash algorithm implements locality-sensitive hashing. Basically, every token is hashed into a hash table. The hash table is then sampled to generate a smaller set of representative tokens.

This has several applications

  • Dimensionality reduction that preserves information that can be used in similarity calculations. For example, efficient deduplication over large sets of text, more like this, etc.

The minhash filter has 4 parameters

  • bucket_size (def 512) how many buckets to use in the hash table. Will also be the number of tokens output from the minhash filter. More buckets means slower performance and more space, but greater precision.
  • num_hash_tables (def 1). As hash tables are sampled by looking at the first item in the bucket, adding more hash tables will provide added precision
  • bucket_depth (def 1). How many items should be stored in each bucket?
  • rotation (def true if num_buckets > 1). For any empty buckets, fills them with the contents of the next bucket

Implementation Notes/Questions

  • Are tests needed above the Lucene tests that already exist?
  • Do I need to provide documentation as well?
  • You'll note the fwd'ing to the Lucene filter factory. This is because of an annoying issue with the MinHashFilter. See this Lucene issue

import java.util.Map;


public class MinHashTokenFilterFactory extends AbstractTokenFilterFactory {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this be final?

@s1monw
Copy link
Contributor

s1monw commented Sep 6, 2016

Thanks for doing this!! I left some comments, I also wonder if we can get a simple test for all the parameters etc. like org.elasticsearch.index.analysis.KeepFilterFactoryTests. It also seems that you haven't signed the CLA, maybe you used a different mail address than in your git commits?

@s1monw
Copy link
Contributor

s1monw commented Sep 6, 2016

Do I need to provide documentation as well?

that would be fantastic

@s1monw s1monw self-assigned this Sep 6, 2016
@jasontedor
Copy link
Member

Isn't there already an open PR for this, #20206?

@s1monw
Copy link
Contributor

s1monw commented Sep 6, 2016

Isn't there already an open PR for this, #20206?

I agree the one you linked looks more complete, I think we should close this one?

@softwaredoug
Copy link
Contributor Author

@jasontedor so there is! I swear I searched for minhash, now I see it. Sorry about the dup. @s1monw not sure if you have a procedure for closing dups?

@s1monw I'll look into the CLA for future PRs.

@s1monw
Copy link
Contributor

s1monw commented Sep 6, 2016

thanks anyway @softwaredoug ;) doesn't happen too often - keep it coming!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants