Synonym graph causes strange score on match_phrase query #43308

bbfsdev · 2019-06-18T00:43:59Z

Elasticsearch version (bin/elasticsearch --version):
7.1.1 (also reproduces on 6.7.2).
FYI, the bug does not exist in 6.5.4.

Plugins installed: No plugins.

JVM version (java -version):
openjdk version "1.8.0_212"
OpenJDK Runtime Environment (build 1.8.0_212-b04)
OpenJDK 64-Bit Server VM (build 25.212-b04, mixed mode)

OS version (uname -a if on a Unix-like system):
Linux bbdev6.local 3.10.0-957.12.1.el7.x86_64 #1 SMP Mon Apr 29 14:59:59 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

centos-release-7-6.1810.2.el7.centos.x86_64

Description of the problem including expected versus actual behavior:
When a synonym graph filter is combined with a Hebrew hunspell filter and a match_phrase query is run against that field, the score of the query is for some reason much larger than expected, because it includes tokens from unrelated documents.

Steps to reproduce:
Copy and paste the following requests (Kibana Console syntax):

DELETE test
{}

PUT test
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "analysis": {
      "filter": {
        "synonym_graph": {
          "type": "synonym_graph",
          "synonyms": [

          ],
          "tokenizer": "keyword"
        },
        "he_IL": {
          "locale": "he_IL",
          "type": "hunspell",
          "dedup": "true"
        }
      },
      "analyzer": {
        "hebrew_synonym": {
          "filter": [
            "synonym_graph",
            "he_IL"
          ],
          "tokenizer": "standard"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "fields": {
          "language": {
            "type": "text",
            "analyzer": "hebrew_synonym"
          }
        },
        "type": "text",
        "analyzer": "standard"
      }
    }
  }
}

POST test/_doc/1
{
  "content": "מבואנוס"
}

POST test/_doc/2
{
  "content": "מבואו"
}

POST test/_doc/3
{
  "content": "מבוא לספר הזוהר"
}

POST test/_doc/4
{
  "content": "מבוארות"
}

POST test/_doc/5
{
  "content": "מבואה"
}

POST test/_doc/6
{
  "content": "מבוארים"
}

POST test/_doc/7
{
  "content": "בואקום"
}

POST test/_doc/8
{
  "content": "בואינג"
}

POST test/_doc/9
{
  "content": "בואהבת"
}

POST test/_doc/10
{
  "content": "בואנו"
}

POST test/_doc/11
{
  "content": "מבואסים"
}

POST test/_doc/12
{
  "content": "בואם"
}

POST test/_doc/13
{
  "content": "בואהבת"
}

GET test/_search
{
  "explain": true,
  "query": {
    "match_phrase": {
      "content.language": {
        "query": "מבוא לספר הזוהר"
      }
    }
  }
}

POST test/_close
PUT test/_settings
{
  "analysis": {
    "filter": {
      "synonym_graph": {
        "type": "synonym_graph",
        "synonyms": [
            "זוהר לעם,זוהר,ספר הזוהר,הזוהר"
        ],
        "tokenizer": "keyword"
      }
    }
  }
}
POST test/_open

GET test/_search
{
  "explain": true,
  "query": {
    "match_phrase": {
      "content.language": {
        "query": "מבוא לספר הזוהר"
      }
    }
  }
}

Provide logs (if relevant):

# DELETE test
{
  "acknowledged" : true
}


# PUT test
{
  "acknowledged" : true,
  "shards_acknowledged" : true,
  "index" : "test"
}


# POST test/_doc/1
{
  "_index" : "test",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 0,
  "_primary_term" : 1
}


# POST test/_doc/2
{
  "_index" : "test",
  "_type" : "_doc",
  "_id" : "2",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 1,
  "_primary_term" : 1
}


# POST test/_doc/3
{
  "_index" : "test",
  "_type" : "_doc",
  "_id" : "3",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 2,
  "_primary_term" : 1
}


# POST test/_doc/4
{
  "_index" : "test",
  "_type" : "_doc",
  "_id" : "4",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 3,
  "_primary_term" : 1
}


# POST test/_doc/5
{
  "_index" : "test",
  "_type" : "_doc",
  "_id" : "5",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 4,
  "_primary_term" : 1
}


# POST test/_doc/6
{
  "_index" : "test",
  "_type" : "_doc",
  "_id" : "6",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 5,
  "_primary_term" : 1
}


# POST test/_doc/7
{
  "_index" : "test",
  "_type" : "_doc",
  "_id" : "7",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 6,
  "_primary_term" : 1
}


# POST test/_doc/8
{
  "_index" : "test",
  "_type" : "_doc",
  "_id" : "8",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 7,
  "_primary_term" : 1
}


# POST test/_doc/9
{
  "_index" : "test",
  "_type" : "_doc",
  "_id" : "9",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 8,
  "_primary_term" : 1
}


# POST test/_doc/10
{
  "_index" : "test",
  "_type" : "_doc",
  "_id" : "10",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 9,
  "_primary_term" : 1
}


# POST test/_doc/11
{
  "_index" : "test",
  "_type" : "_doc",
  "_id" : "11",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 10,
  "_primary_term" : 1
}


# POST test/_doc/12
{
  "_index" : "test",
  "_type" : "_doc",
  "_id" : "12",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 11,
  "_primary_term" : 1
}


# POST test/_doc/13
{
  "_index" : "test",
  "_type" : "_doc",
  "_id" : "13",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 12,
  "_primary_term" : 1
}


# GET test/_search
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 8.552871,
    "hits" : [
      {
        "_shard" : "[test][0]",
        "_node" : "76xQgsOsSz6OqfkSZmsVQw",
        "_index" : "test",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 8.552871,
        "_source" : {
          "content" : "מבוא לספר הזוהר"
        },
        "_explanation" : {
          "value" : 8.552871,
          "description" : """weight(content.language:"(מבוא בוא) ספר זוהר" in 2) [PerFieldSimilarity], result of:""",
          "details" : [
            .......
          ]
        }
      }
    ]
  }
}


# POST test/_close
{
  "acknowledged" : true
}


# PUT test/_settings
{
  "acknowledged" : true
}


# POST test/_open
{
  "acknowledged" : true,
  "shards_acknowledged" : true
}


# GET test/_search
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 38.78237,
    "hits" : [
      {
        "_shard" : "[test][0]",
        "_node" : "76xQgsOsSz6OqfkSZmsVQw",
        "_index" : "test",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 38.78237,
        "_source" : {
          "content" : "מבוא לספר הזוהר"
        },
        "_explanation" : {
          "value" : 38.78237,
          "description" : "weight(spanNear([spanOr([spanOr([content.language:מבואנוס, content.language:מבואו, content.language:מבוארות, content.language:מבואה, content.language:מבואסים, content.language:מבוארים, content.language:מבוא]), spanOr([content.language:בוא, content.language:בואנו, content.language:בואה, content.language:בואם, content.language:בואינג, content.language:בואו, content.language:בואקום, content.language:בואהבת])]), content.language:ספר, spanOr([spanNear([content.language:זוהר, content.language:עם], 0, true), content.language:זוהר, spanNear([content.language:ספר, content.language:זוהר], 0, true), content.language:זוהר])], 0, true) in 2) [PerFieldSimilarity], result of:",
          "details" : [
            {
   ..........
          ]
        }
      }
    ]
  }
}

The logs above show two queries.
The first query was run with an empty synonym list. The score is small (about 8.5) and the result is reasonable.
The second query was run with the synonym list "זוהר לעם,זוהר,ספר הזוהר,הזוהר". The synonyms might add something to the score, but the score is disproportionately large and, more interestingly, depends on documents that are related neither to the query nor to the synonyms (this can be seen in the explanation of the second query):
...
"description" : "weight(spanNear([spanOr([spanOr([content.language:מבואנוס, content.language:מבואו, content.language:מבוארות, content.language:מבואה, content.language:מבואסים, content.language:מבוארים, content.language:מבוא]), spanOr([content.language:בוא, content.language:בואנו, content.language:בואה, content.language:בואם, content.language:בואינג, content.language:בואו, content.language:בואקום, content.language:בואהבת])]), content.language:ספר, spanOr([spanNear([content.language:זוהר, content.language:עם], 0, true), content.language:זוהר, spanNear([content.language:ספר, content.language:זוהר], 0, true), content.language:זוהר])], 0, true) in 2) [PerFieldSimilarity], result of:"
...

elasticmachine · 2019-06-18T03:00:03Z

Pinging @elastic/es-search

jimczi · 2019-06-19T19:50:07Z

We use span queries here in order to avoid the combinatorial explosion on multi-word synonyms (see https://issues.apache.org/jira/browse/LUCENE-7699 for some context).
The score for span queries is computed from all terms that appear in the query, even if only a small portion of them match in the document. So it is expected that a query with a lot of multi-word synonyms has a bigger score than one without. However, I don't consider this an issue, since the score should be comparable between documents and we don't want to give higher scores to documents that match an alternative path. Can you explain why this is a problem for you and what you think the expected behavior would be?
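
To make the scoring argument above concrete, here is a minimal sketch, assuming a simplified model in which every distinct term of the span query contributes its BM25 IDF to the query weight whether or not it matches the document. The term counts (4 and 18) are the distinct terms visible in the two explain descriptions in the report; the document frequency of 1 per term is an assumption, and the helper names are hypothetical. It is not the Lucene implementation and does not try to reproduce the exact scores.

```python
import math
from typing import Sequence


def bm25_idf(doc_freq: int, doc_count: int) -> float:
    """IDF term of BM25: ln(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def summed_idf(doc_freqs: Sequence[int], doc_count: int) -> float:
    """Sum the IDF of every term in the query, matched or not."""
    return sum(bm25_idf(df, doc_count) for df in doc_freqs)


DOC_COUNT = 13        # documents indexed in the reproduction
ASSUMED_DOC_FREQ = 1  # assumption: every term occurs in exactly one document

PLAIN_TERMS = 4      # distinct terms in the first explain description
EXPANDED_TERMS = 18  # distinct terms in the second explain description

plain = summed_idf([ASSUMED_DOC_FREQ] * PLAIN_TERMS, DOC_COUNT)
expanded = summed_idf([ASSUMED_DOC_FREQ] * EXPANDED_TERMS, DOC_COUNT)
print(f"plain={plain:.3f} expanded={expanded:.3f} ratio={expanded / plain:.2f}")
```

Under these assumptions the weight grows linearly with the number of terms in the query, which is why a query rewritten to many alternatives scores higher even though the matching document is unchanged.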

bbfsdev · 2019-06-23T17:39:52Z

Hello @jimczi, I read your comment and the Lucene issue. I cannot say that I fully understand combinatorial explosion and span queries, but a few things in this case do not work as I would expect:

The explain output of the query shows that the high score (far higher than expected) results from summing matches between the multi-phrase query and the document. But the words shown in the explain output appear neither in the document nor in the query. In other words, Elasticsearch gives a higher score for matching a query and a document on words that appear in neither of them. This is the unexpected behavior. Those words appear in other documents. Please see the steps to reproduce.
On Elasticsearch 6.5.4 the behavior is not reproducible: it returns a sensible score, as expected. On 7.1.1 and 6.7.2 the behavior reproduces. If this were not a bug, I would expect the same behavior in all versions.

If this is expected, I would love to better understand why, and how I should work with it.

bbfsdev · 2019-07-03T16:18:28Z

Pinging @elastic/es-search

bbfsdev · 2019-07-03T16:19:15Z

@jimczi Friendly ping.

jimczi · 2019-07-03T20:21:24Z

Sorry for the late reply @bbfsdev, I am able to reproduce the issue and found the bug. We're expanding every position with multiple terms (different stemming for the same term, for instance) to span prefix queries, which explains why the final query is so big.
This is the version of the query with the bug:

GET test/_validate/query?explain
>>
{
    "_shards": {
        "total": 1,
        "successful": 1,
        "failed": 0
    },
    "valid": true,
    "explanations": [
        {
            "index": "test",
            "valid": true,
            "explanation": "spanNear([spanOr([SpanMultiTermQueryWrapper(content.language:מבוא*), SpanMultiTermQueryWrapper(content.language:בוא*)]), content.language:ספר, spanOr([spanNear([content.language:זוהר, content.language:עם], 0, true), content.language:זוהר, spanNear([content.language:ספר, content.language:זוהר], 0, true), content.language:זוהר])], 0, true)"
        }
    ]
}

I opened #43941 to fix the bug, thanks for reporting!

Disjunction over two individual terms in a phrase query with multi-word synonyms wrongly applies a prefix query to each of these terms. This change fixes this bug by inversing the logic to use prefixes on `phrase_prefix` queries only. Closes #43308
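
The effect of the bug described above can be sketched as follows. This is an illustration only, not the Elasticsearch code: `expand_prefix` and `rewrite_position` are hypothetical helpers, and the term dictionary is assumed to contain exactly the terms listed in the explain output of the report.

```python
# Terms copied from the explain output in the report (assumed to be the
# complete term dictionary of content.language for this illustration).
TERM_DICTIONARY = {
    "מבואנוס", "מבואו", "מבוארות", "מבואה", "מבואסים", "מבוארים", "מבוא",
    "בוא", "בואנו", "בואה", "בואם", "בואינג", "בואו", "בואקום", "בואהבת",
    "ספר", "זוהר", "עם",
}


def expand_prefix(prefix, term_dictionary):
    """Return every indexed term that starts with `prefix`, sorted."""
    return sorted(t for t in term_dictionary if t.startswith(prefix))


def rewrite_position(terms, term_dictionary, use_prefix):
    """Build the alternatives for one query position.

    use_prefix=True mimics the buggy behavior (each term is treated as a
    prefix); use_prefix=False keeps only the analyzed terms themselves.
    """
    if not use_prefix:
        return sorted(terms)
    expanded = set()
    for term in terms:
        expanded.update(expand_prefix(term, term_dictionary))
    return sorted(expanded)


first_position = ["מבוא", "בוא"]  # the two stems produced for the first word
buggy = rewrite_position(first_position, TERM_DICTIONARY, use_prefix=True)
fixed = rewrite_position(first_position, TERM_DICTIONARY, use_prefix=False)
print(len(buggy), len(fixed))
```

With prefix expansion the first position pulls in terms that only exist in other documents, which matches the long spanOr lists in the second explain output; without it, only the two stems of the query word remain.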

nmdoliveira · 2019-07-12T14:12:46Z

Hi @jimczi, I don't know if this is the right place to ask, but I believe I have a very similar problem in version 6.7.0 with a multi_match phrase_prefix query with max_expansions set to 100.

Every expansion used to match the document (which I'm assuming is analogous to synonyms) contributes a small score, but then all of them are summed to get the final score, which makes some documents score really high just because they needed more expansions to match.

I indexed some fields with index_prefixes to improve this query (as made possible by #37436), and this seems to fix their score, but they get much smaller scores than the fields that do not have index_prefixes and match because of prefix expansions.

Do you think #43941 is going to fix this case too?

Thanks, and please let me know if I should open a separate issue.

dnhatn added the :Search Relevance/Analysis label Jun 18, 2019

jimczi added the feedback_needed label Jun 19, 2019

jimczi mentioned this issue Jul 3, 2019

Fix wrong logic in match_phrase query with multi-word synonyms #43941

Merged

jimczi closed this as completed in #43941 Jul 4, 2019

javanna added the Team:Search Relevance label Jul 16, 2024
