Synonym graph causes strange score on match_phrase query #43308

Closed
bbfsdev opened this issue Jun 18, 2019 · 7 comments · Fixed by #43941
Labels
feedback_needed :Search Relevance/Analysis How text is split into tokens Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch

Comments

@bbfsdev

bbfsdev commented Jun 18, 2019

Elasticsearch version (bin/elasticsearch --version):
7.1.1 (also at 6.7.2).
FYI, the bug doesn't exist in 6.5.4.

Plugins installed: No plugins.

JVM version (java -version):
openjdk version "1.8.0_212"
OpenJDK Runtime Environment (build 1.8.0_212-b04)
OpenJDK 64-Bit Server VM (build 25.212-b04, mixed mode)

OS version (uname -a if on a Unix-like system):
Linux bbdev6.local 3.10.0-957.12.1.el7.x86_64 #1 SMP Mon Apr 29 14:59:59 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

centos-release-7-6.1810.2.el7.centos.x86_64

Description of the problem including expected versus actual behavior:
When a synonym graph filter is combined with the Hebrew hunspell filter and a match_phrase query is run against that field, the score is for some reason much larger than expected, apparently because of tokens that come from unrelated documents.

Steps to reproduce:
Just copy/paste these requests (shown in Kibana Console syntax rather than as curl commands):

DELETE test
{}

PUT test
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "analysis": {
      "filter": {
        "synonym_graph": {
          "type": "synonym_graph",
          "synonyms": [

          ],
          "tokenizer": "keyword"
        },
        "he_IL": {
          "locale": "he_IL",
          "type": "hunspell",
          "dedup": "true"
        }
      },
      "analyzer": {
        "hebrew_synonym": {
          "filter": [
            "synonym_graph",
            "he_IL"
          ],
          "tokenizer": "standard"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "fields": {
          "language": {
            "type": "text",
            "analyzer": "hebrew_synonym"
          }
        },
        "type": "text",
        "analyzer": "standard"
      }
    }
  }
}

POST test/_doc/1
{
  "content": "מבואנוס"
}

POST test/_doc/2
{
  "content": "מבואו"
}

POST test/_doc/3
{
  "content": "מבוא לספר הזוהר"
}

POST test/_doc/4
{
  "content": "מבוארות"
}

POST test/_doc/5
{
  "content": "מבואה"
}

POST test/_doc/6
{
  "content": "מבוארים"
}

POST test/_doc/7
{
  "content": "בואקום"
}

POST test/_doc/8
{
  "content": "בואינג"
}

POST test/_doc/9
{
  "content": "בואהבת"
}

POST test/_doc/10
{
  "content": "בואנו"
}

POST test/_doc/11
{
  "content": "מבואסים"
}

POST test/_doc/12
{
  "content": "בואם"
}

POST test/_doc/13
{
  "content": "בואהבת"
}

GET test/_search
{
  "explain": true,
  "query": {
    "match_phrase": {
      "content.language": {
        "query": "מבוא לספר הזוהר"
      }
    }
  }
}

POST test/_close
PUT test/_settings
{
  "analysis": {
    "filter": {
      "synonym_graph": {
        "type": "synonym_graph",
        "synonyms": [
            "זוהר לעם,זוהר,ספר הזוהר,הזוהר"
        ],
        "tokenizer": "keyword"
      }
    }
  }
}
POST test/_open

GET test/_search
{
  "explain": true,
  "query": {
    "match_phrase": {
      "content.language": {
        "query": "מבוא לספר הזוהר"
      }
    }
  }
}

Provide logs (if relevant):

# DELETE test
{
  "acknowledged" : true
}


# PUT test
{
  "acknowledged" : true,
  "shards_acknowledged" : true,
  "index" : "test"
}


# POST test/_doc/1
{
  "_index" : "test",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 0,
  "_primary_term" : 1
}


# POST test/_doc/2
{
  "_index" : "test",
  "_type" : "_doc",
  "_id" : "2",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 1,
  "_primary_term" : 1
}


# POST test/_doc/3
{
  "_index" : "test",
  "_type" : "_doc",
  "_id" : "3",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 2,
  "_primary_term" : 1
}


# POST test/_doc/4
{
  "_index" : "test",
  "_type" : "_doc",
  "_id" : "4",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 3,
  "_primary_term" : 1
}


# POST test/_doc/5
{
  "_index" : "test",
  "_type" : "_doc",
  "_id" : "5",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 4,
  "_primary_term" : 1
}


# POST test/_doc/6
{
  "_index" : "test",
  "_type" : "_doc",
  "_id" : "6",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 5,
  "_primary_term" : 1
}


# POST test/_doc/7
{
  "_index" : "test",
  "_type" : "_doc",
  "_id" : "7",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 6,
  "_primary_term" : 1
}


# POST test/_doc/8
{
  "_index" : "test",
  "_type" : "_doc",
  "_id" : "8",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 7,
  "_primary_term" : 1
}


# POST test/_doc/9
{
  "_index" : "test",
  "_type" : "_doc",
  "_id" : "9",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 8,
  "_primary_term" : 1
}


# POST test/_doc/10
{
  "_index" : "test",
  "_type" : "_doc",
  "_id" : "10",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 9,
  "_primary_term" : 1
}


# POST test/_doc/11
{
  "_index" : "test",
  "_type" : "_doc",
  "_id" : "11",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 10,
  "_primary_term" : 1
}


# POST test/_doc/12
{
  "_index" : "test",
  "_type" : "_doc",
  "_id" : "12",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 11,
  "_primary_term" : 1
}


# POST test/_doc/13
{
  "_index" : "test",
  "_type" : "_doc",
  "_id" : "13",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 12,
  "_primary_term" : 1
}


# GET test/_search
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 8.552871,
    "hits" : [
      {
        "_shard" : "[test][0]",
        "_node" : "76xQgsOsSz6OqfkSZmsVQw",
        "_index" : "test",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 8.552871,
        "_source" : {
          "content" : "מבוא לספר הזוהר"
        },
        "_explanation" : {
          "value" : 8.552871,
          "description" : """weight(content.language:"(מבוא בוא) ספר זוהר" in 2) [PerFieldSimilarity], result of:""",
          "details" : [
            .......
          ]
        }
      }
    ]
  }
}


# POST test/_close
{
  "acknowledged" : true
}


# PUT test/_settings
{
  "acknowledged" : true
}


# POST test/_open
{
  "acknowledged" : true,
  "shards_acknowledged" : true
}


# GET test/_search
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 38.78237,
    "hits" : [
      {
        "_shard" : "[test][0]",
        "_node" : "76xQgsOsSz6OqfkSZmsVQw",
        "_index" : "test",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 38.78237,
        "_source" : {
          "content" : "מבוא לספר הזוהר"
        },
        "_explanation" : {
          "value" : 38.78237,
          "description" : "weight(spanNear([spanOr([spanOr([content.language:מבואנוס, content.language:מבואו, content.language:מבוארות, content.language:מבואה, content.language:מבואסים, content.language:מבוארים, content.language:מבוא]), spanOr([content.language:בוא, content.language:בואנו, content.language:בואה, content.language:בואם, content.language:בואינג, content.language:בואו, content.language:בואקום, content.language:בואהבת])]), content.language:ספר, spanOr([spanNear([content.language:זוהר, content.language:עם], 0, true), content.language:זוהר, spanNear([content.language:ספר, content.language:זוהר], 0, true), content.language:זוהר])], 0, true) in 2) [PerFieldSimilarity], result of:",
          "details" : [
            {
   ..........
          ]
        }
      }
    ]
  }
}

The logs above contain two queries.
The first query was run while the synonym list was empty. The score is small (about 8.5) and the result is reasonable.
The second query was run with the synonym list "זוהר לעם,זוהר,ספר הזוהר,הזוהר". The synonyms might add something to the score, but the score is disproportionately large and, more interestingly, it depends on terms from documents that are related neither to the query nor to the synonyms (this can be seen in the explanation of the second query):
...
"description" : "weight(spanNear([spanOr([spanOr([content.language:מבואנוס, content.language:מבואו, content.language:מבוארות, content.language:מבואה, content.language:מבואסים, content.language:מבוארים, content.language:מבוא]), spanOr([content.language:בוא, content.language:בואנו, content.language:בואה, content.language:בואם, content.language:בואינג, content.language:בואו, content.language:בואקום, content.language:בואהבת])]), content.language:ספר, spanOr([spanNear([content.language:זוהר, content.language:עם], 0, true), content.language:זוהר, spanNear([content.language:ספר, content.language:זוהר], 0, true), content.language:זוהר])], 0, true) in 2) [PerFieldSimilarity], result of:"
...

@dnhatn dnhatn added the :Search Relevance/Analysis How text is split into tokens label Jun 18, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-search

@jimczi
Contributor

jimczi commented Jun 19, 2019

We use span queries here in order to avoid the combinatorial explosion on multi-word synonyms (see https://issues.apache.org/jira/browse/LUCENE-7699 for some context).
The score for span queries is computed from all terms that appear in the query, even if only a small portion of them match in the document. So it is expected that a query with a lot of multi-word synonyms has a bigger score than one without. However, I don't consider this an issue, since the score should be comparable between documents and we don't want to give higher scores to documents that match an alternative path. Can you explain why this is a problem for you and what you think the expected behavior would be?
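
As a rough illustration of why the score grows, here is a minimal Python sketch. It assumes the span weight sums a BM25-style IDF over every unique term in the rewritten query, which is a simplification of what the similarity does when it is handed several terms. The document frequencies are made-up placeholders (1 for every term), not values measured on this index, and the term counts (4 and 18) are the unique terms visible in the two explain outputs.

```python
import math


def bm25_idf(doc_freq: int, doc_count: int) -> float:
    """BM25-style inverse document frequency for a single term."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def summed_idf(doc_freqs, doc_count):
    """Sum the IDF of every query term, whether or not it matches the document."""
    return sum(bm25_idf(df, doc_count) for df in doc_freqs)


DOC_COUNT = 13  # documents indexed in the reproduction

# Assumption: every term occurs in exactly one document (placeholder values).
without_expansion = [1] * 4   # unique terms in the first explain output
with_expansion = [1] * 18     # unique terms in the second explain output

print("without expansion:", summed_idf(without_expansion, DOC_COUNT))
print("with expansion:   ", summed_idf(with_expansion, DOC_COUNT))
```

Under these assumptions the summed IDF grows linearly with the number of terms in the query, even though the matching document contains only a few of them.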

@bbfsdev
Author

bbfsdev commented Jun 23, 2019

Hello @jimczi, I read your comment and the Lucene issue. I cannot say that I fully understand combinatorial explosion and span queries, but a few things in this case do not work as I would expect:

  1. The explain output shows that the high score (several times higher than expected) is the sum of matches between the multi-term query and the document. But the words shown in the explain output appear neither in the document nor in the query; they appear in other documents. In other words, Elasticsearch gives a higher score for matching a query against a document on words that are in neither of them. This is the unexpected behavior. Please see the steps to reproduce.
  2. On Elasticsearch 6.5.4 the behavior is not reproducible, meaning it returns a sensible score as expected. On versions 7.1.1 and 6.7.2 the behavior reproduces. If this were not a bug, I would expect the same behavior in all versions.

If this is expected, I would love to better understand why, and how I should work with it.

@bbfsdev
Author

bbfsdev commented Jul 3, 2019

Pinging @elastic/es-search

@bbfsdev
Author

bbfsdev commented Jul 3, 2019

@jimczi Friendly ping.

jimczi added a commit to jimczi/elasticsearch that referenced this issue Jul 3, 2019
Disjunction over two individual terms in a phrase query with multi-word synonyms
wrongly applies a prefix query to each of these terms. This change fixes this bug
by inversing the logic to use prefixes on `phrase_prefix` queries only.

Closes elastic#43308
@jimczi
Contributor

jimczi commented Jul 3, 2019

Sorry for the late reply @bbfsdev, I was able to reproduce the issue and found the bug. We're expanding every position that holds multiple terms (different stems of the same word, for instance) into span prefix queries, which explains why the final query is so big.
This is the version of the query with the bug:

GET test/_validate/query?explain
>>
{
    "_shards": {
        "total": 1,
        "successful": 1,
        "failed": 0
    },
    "valid": true,
    "explanations": [
        {
            "index": "test",
            "valid": true,
            "explanation": "spanNear([spanOr([SpanMultiTermQueryWrapper(content.language:מבוא*), SpanMultiTermQueryWrapper(content.language:בוא*)]), content.language:ספר, spanOr([spanNear([content.language:זוהר, content.language:עם], 0, true), content.language:זוהר, spanNear([content.language:ספר, content.language:זוהר], 0, true), content.language:זוהר])], 0, true)"
        }
    ]
}

I opened #43941 to fix the bug, thanks for reporting!
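
To see how the two prefix queries in the explanation above pull in terms from other documents, here is a small Python sketch. The term list is copied from the span query in the reporter's second explain output; the real term dictionary may hold more entries, and `expand_prefix` is a hypothetical helper, not an Elasticsearch or Lucene API.

```python
# Terms copied from the span query in the second explain output of the report.
# Assumption: this list stands in for the field's term dictionary.
INDEXED_TERMS = [
    "מבואנוס", "מבואו", "מבוארות", "מבואה", "מבואסים", "מבוארים", "מבוא",
    "בוא", "בואנו", "בואה", "בואם", "בואינג", "בואו", "בואקום", "בואהבת",
    "ספר", "זוהר", "עם",
]


def expand_prefix(prefix, terms):
    """Return every indexed term that starts with the given prefix."""
    return [term for term in terms if term.startswith(prefix)]


# The two prefixes that appear in the buggy query.
for prefix in ("מבוא", "בוא"):
    matches = expand_prefix(prefix, INDEXED_TERMS)
    print(prefix + "*", "->", len(matches), "terms")
```

Each prefix rewrites to a disjunction over every term it matches, so terms that exist only in other documents end up in the query for document 3.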

jimczi added a commit that referenced this issue Jul 4, 2019
)

Disjunction over two individual terms in a phrase query with multi-word synonyms
wrongly applies a prefix query to each of these terms. This change fixes this bug
by inversing the logic to use prefixes on `phrase_prefix` queries only.

Closes #43308
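
The logic described in the commit message can be pictured with the toy Python function below. It is not the Elasticsearch code: `position_clause` and its string output are hypothetical, chosen only to mimic the explain format, and the sketch ignores details such as which position of a `phrase_prefix` query receives the prefix.

```python
def position_clause(terms, is_phrase_prefix):
    """Render the clause for one query position that holds several terms.

    Hypothetical helper: prefixes are applied only when the query is a
    phrase_prefix query, as the commit message describes.
    """
    if is_phrase_prefix:
        parts = [f"SpanMultiTermQueryWrapper({term}*)" for term in terms]
    else:
        parts = list(terms)
    return "spanOr([" + ", ".join(parts) + "])"


# match_phrase: plain terms only, no prefix expansion.
print(position_clause(["מבוא", "בוא"], is_phrase_prefix=False))
# phrase_prefix: each term becomes a prefix query.
print(position_clause(["מבוא", "בוא"], is_phrase_prefix=True))
```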
@nmdoliveira

Hi @jimczi, I don't know if this is the right place to ask, but I believe I have a very similar problem in version 6.7.0 with a multi_match phrase_prefix query with max_expansions set to 100.

Every expansion used to match the document (which I'm assuming is analogous to synonyms) contributes a small score, and all of these are then summed to get the final score, which makes some documents score really high just because they needed more expansions to match.

I indexed some fields with index_prefixes to improve this query (as made possible by #37436), and this seems to fix their score, but they get much smaller scores than the fields that do not have index_prefixes and match because of prefix expansions.

Do you think #43941 is going to fix this case too?

Thanks, and please let me know if I should open a separate issue.

@javanna javanna added the Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch label Jul 16, 2024