
Exposing lucene 6.x minhash filter. #20206

Merged: 2 commits into elastic:master on Sep 7, 2016

Conversation

@a2lin (Contributor) commented Aug 29, 2016

I've tried to expose the Lucene 6.x minhash filter and wrote some documentation based on my (very inexpert) understanding of how it works.

I was mostly playing around with it using the following scenario, which I cribbed from the Lucene ticket.

result (hits):

{
  "hits":[
    {
      "_index":"lsh_test",
      "_type":"doc_type",
      "_id":"2",
      "_score":1.4709876,
      "_source":{
        "doc_lsh":"elasticsearch is also an open source enterprise search engine based on Lucene"
      }
    },
    {
      "_index":"lsh_test",
      "_type":"doc_type",
      "_id":"1",
      "_score":0.61897373,
      "_source":{
        "doc_lsh":"elasticsearch is also an open source search engine based on Lucene"
      }
    },
    {
      "_index":"lsh_test",
      "_type":"doc_type",
      "_id":"3",
      "_score":0.4758651,
      "_source":{
        "doc_lsh":"elasticsearch is also a popular open source enterprise search engine based on Lucene"
      }
    }
  ]
}

query:

{
  "query": {
    "match": {
      "doc_lsh": "elasticsearch is also an open source search engine"
    }
  }
}
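
For anyone reproducing this (not part of the original comment): the body above can be POSTed to the index's search endpoint, with the index name taken from the bulk payload below. Here query.json is just a placeholder file holding that body.

curl -XPOST 'localhost:9200/lsh_test/_search?pretty' -d @query.json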

(bulk) indexed documents:

{ "index" : { "_index" : "lsh_test", "_type" : "doc_type", "_id" : "1" }}
{"doc_lsh": "elasticsearch is also an open source search engine based on Lucene"}
{ "index" : { "_index" : "lsh_test", "_type" : "doc_type", "_id" : "2" }}
{"doc_lsh": "elasticsearch is also an open source enterprise search engine based on Lucene"}
{ "index" : { "_index" : "lsh_test", "_type" : "doc_type", "_id" : "3" }}
{"doc_lsh": "elasticsearch is also a popular open source enterprise search engine based on Lucene"}
{ "index" : { "_index" : "lsh_test", "_type" : "doc_type", "_id" : "4" }}
{"doc_lsh": "Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java"}

index settings and mapping:

{
  "settings":{
    "analysis":{
      "filter":{
        "five_shingle":{
          "type":"shingle",
          "max_shingle_size":5,
          "min_shingle_size":3,
          "output_unigrams":false
        },
        "minhash":{
          "type":"min_hash",
          "with_rotation":"false"
        }
      },
      "analyzer":{
        "lsh_analyzer":{
          "type":"custom",
          "tokenizer":"whitespace",
          "filter":[
            "five_shingle",
            "minhash"
          ]
        }
      }
    }
  },
  "mappings":{
    "doc_type":{
      "properties":{
        "doc_lsh":{
          "type":"text",
          "analyzer":"lsh_analyzer",
          "store":"true"
        }
      }
    }
  }
}
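
Not part of the PR, but a handy sanity check: once an index is created with the settings above (lsh_test is an assumption, matching the bulk payload), the _analyze API shows the min-hash tokens that lsh_analyzer actually emits for a given string.

GET lsh_test/_analyze
{
  "analyzer": "lsh_analyzer",
  "text": "elasticsearch is also an open source search engine based on Lucene"
}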

Closes #20149.

Excerpt from docs/reference/analysis/tokenfilters/minhash-tokenfilter.asciidoc (the documentation added in this PR):

[[analysis-minhash-tokenfilter]]
== Minhash Token Filter

A token filter of type `minhash` hashes each token of the token stream and divides the resulting hashes into buckets, keeping the lowest-valued hashes per bucket.

Review comment from a contributor on the excerpt above: s/minhash/min_hash/
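
As background for readers of the thread, the new filter exposes Lucene's MinHashFilter settings. The snippet below is only a sketch of how those options might be set in an analysis block; the option names (hash_count, bucket_count, hash_set_size, with_rotation) reflect my reading of the PR, the values shown are illustrative, and my_min_hash is just a placeholder filter name.

"filter": {
  "my_min_hash": {
    "type": "min_hash",
    "hash_count": 1,
    "bucket_count": 512,
    "hash_set_size": 1,
    "with_rotation": true
  }
}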

@clintongormley (Contributor)

Hi @a2lin

Thanks for the PR. This needs documentation added too. Btw, why are you using shingles of 3..5 words? This will create many more tokens than you really need and isn't considered good practice. Instead I'd just use shingles of length 2.
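
To make that suggestion concrete, a replacement for the five_shingle filter above might look like the following (a sketch only, reusing the shingle-filter syntax from the PR description):

"two_word_shingle": {
  "type": "shingle",
  "min_shingle_size": 2,
  "max_shingle_size": 2,
  "output_unigrams": false
}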

@a2lin (Contributor, Author) commented Aug 29, 2016

@clintongormley Thanks for the advice. I poked around with 5 because the function comment for MinHashFilter suggests that the expected incoming tokens are 5 word shingles.

Can you link me to an example of the documentation that needs to be added for this feature? When I searched the code for the shingleTokenFilterFactory analogue, I could only find the shingle version of:

docs/reference/analysis/tokenfilters/minhash-tokenfilter.asciidoc

@clintongormley (Contributor)

@a2lin (Contributor, Author) commented Aug 29, 2016

@clintongormley Oops, I thought that was generated from the file that @jpountz commented on. I'll look again.

@clintongormley (Contributor)

Oh sorry @a2lin - I completely missed the asciidoc!

@s1monw (Contributor) commented Sep 6, 2016

ok to test

@s1monw merged commit f825e8f into elastic:master on Sep 7, 2016

@s1monw (Contributor) commented Sep 7, 2016

@a2lin thanks for fixing this!!!!!

@a2lin (Contributor, Author) commented Sep 7, 2016

@s1monw thanks for merging!

MaineC pushed a commit to MaineC/elasticsearch that referenced this pull request Sep 7, 2016
Exposing lucene 6.x minhash tokenfilter

Generate min hash tokens from an incoming stream of tokens that can
be used to estimate document similarity.

Closes elastic#20149