Synonym graph causes strange score on match_phrase query #43308
Comments
Pinging @elastic/es-search
We use
Hello @jimczi, I read your comment and the Lucene issue. I can't say that I fully understand the combinatorial explosion and span queries, but there are a few things in this case that don't work as I would expect:
If that is expected, I would love to better understand why, and how I should work with this.
Pinging @elastic/es-search
@jimczi Friendly ping.
Disjunction over two individual terms in a phrase query with multi-word synonyms wrongly applies a prefix query to each of these terms. This change fixes the bug by inverting the logic so that prefixes are used on `phrase_prefix` queries only. Closes elastic#43308
Sorry for the late reply @bbfsdev. I was able to reproduce the issue and found the bug: we're expanding every position that has multiple terms (different stems of the same term, for instance) into span prefix queries, which explains why the final query is so big.
I opened #43941 to fix the bug. Thanks for reporting!
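To make the description above concrete, here is a toy model in Python. It is not the Elasticsearch code: `build_position` and its parameters are hypothetical names, and it assumes the bug amounted to an inverted condition, as the wording of the fix ("use prefixes on `phrase_prefix` queries only") suggests.

```python
def build_position(terms, is_phrase_prefix, buggy=False):
    """Render one phrase position that holds several alternative terms.

    Hypothetical helper: returns a string that mimics the span query
    structure shown in the explain output, nothing more.
    """
    # Assumed bug: the prefix flag was applied with the opposite condition.
    use_prefix = (not is_phrase_prefix) if buggy else is_phrase_prefix
    clauses = [f"prefix({t}*)" if use_prefix else f"term({t})" for t in terms]
    return "spanOr([" + ", ".join(clauses) + "])"


# A plain match_phrase position with two stems of the same word.
fixed = build_position(["stem_a", "stem_b"], is_phrase_prefix=False)
broken = build_position(["stem_a", "stem_b"], is_phrase_prefix=False, buggy=True)
print(fixed)
print(broken)
```

In this model the fixed variant keeps exact terms for `match_phrase`, while the buggy variant turns each stem into a prefix clause, which is what makes the final query grow.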
Hi @jimczi, I don't know if this is the right place to ask, but I believe I have a very similar problem in version 6.7.0 with a
Every expansion used to match the document (which I'm assuming is analogous to synonyms) contributes a small score, but then all of them are summed to get the final score, which makes some documents score really high just because they needed more expansions to match. I indexed some fields with
Do you think #43941 is going to fix this case too? Thanks, and please let me know if I should open a separate issue.
Elasticsearch version (`bin/elasticsearch --version`): 7.1.1 (also on 6.7.2).
FYI, the bug doesn't exist in 6.5.4.
Plugins installed: No plugins.
JVM version (`java -version`): openjdk version "1.8.0_212"
OpenJDK Runtime Environment (build 1.8.0_212-b04)
OpenJDK 64-Bit Server VM (build 25.212-b04, mixed mode)
OS version (`uname -a` if on a Unix-like system): Linux bbdev6.local 3.10.0-957.12.1.el7.x86_64 #1 SMP Mon Apr 29 14:59:59 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
centos-release-7-6.1810.2.el7.centos.x86_64
Description of the problem including expected versus actual behavior:
When a synonym graph is used together with hunspell for Hebrew and applied to a specific query that uses `match_phrase`, the score of the query is for some reason much larger, due to tokens from unrelated documents.
Steps to reproduce:
Just copy/paste those curl commands:
Provide logs (if relevant):
In the logs above you can see two queries.
The first query was run with an empty synonym list. The score is small (8.5) and the result is reasonable.
The second query was run with the synonym list "זוהר לעם,זוהר,ספר הזוהר,הזוהר". The synonyms might add something to the score, but the score is disproportionately large and, more interestingly, depends on other documents that are related neither to the query nor to the synonyms (this can be seen in the explanation of the second query):
...
"description" : "weight(spanNear([spanOr([spanOr([content.language:מבואנוס, content.language:מבואו, content.language:מבוארות, content.language:מבואה, content.language:מבואסים, content.language:מבוארים, content.language:מבוא]), spanOr([content.language:בוא, content.language:בואנו, content.language:בואה, content.language:בואם, content.language:בואינג, content.language:בואו, content.language:בואקום, content.language:בואהבת])]), content.language:ספר, spanOr([spanNear([content.language:זוהר, content.language:עם], 0, true), content.language:זוהר, spanNear([content.language:ספר, content.language:זוהר], 0, true), content.language:זוהר])], 0, true) in 2) [PerFieldSimilarity], result of:"
...
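The two inner `spanOr` groups in the explanation look like prefix expansions: every term in the first group starts with מבוא and every term in the second starts with בוא. A minimal Python sketch of that effect follows. It assumes a prefix clause is rewritten against the field's term dictionary. `expand_prefix` and `indexed_terms` are hypothetical names, and this is not how Lucene implements it internally.

```python
def expand_prefix(prefix, indexed_terms):
    """Return every indexed term that starts with `prefix`, sorted."""
    return sorted(t for t in indexed_terms if t.startswith(prefix))


# Hypothetical term dictionary for the field. It holds terms contributed
# by ALL indexed documents, not only the ones relevant to the query.
indexed_terms = {"מבוא", "מבואו", "מבואה", "מבוארים", "ספר", "זוהר"}

matches = expand_prefix("מבוא", indexed_terms)
print(len(matches), "terms pulled into the spanOr")

# Indexing one more unrelated document that contains another word with
# the same prefix changes the expanded query, and with it the score.
more_matches = expand_prefix("מבוא", indexed_terms | {"מבואסים"})
print(len(more_matches), "terms after indexing one more document")
```

Under this assumption, the set of clauses depends on whatever happens to be indexed, which would be consistent with the score changing when unrelated documents are added.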