-
Notifications
You must be signed in to change notification settings - Fork 24.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add minimal sanity checks to custom/scripted similarities. #33564
Conversation
Lucene 8 introduced more constraints on similarities, in particular: - scores must not be negative, - scores must not decrease when term freq increases, - scores must not increase when norm (interpreted as an unsigned long) increases. We can't check every single case, but could at least run some sanity checks. Relates elastic#33309
Pinging @elastic/es-search-aggs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. I wonder if this is sufficient for custom similarities though. Should we also check scores at runtime to catch all negative scores ?
Can we check at runtime without slowing things down massively though? |
@jimczi @romseygeek I added runtime checks for negative scores. This is only done for custom types and scripted similarities to keep the overhead contained, and only adds one float comparison otherwise. I also added an escape hatch which truncates scores so that you are not doomed to reindex in order to change your similarity in case you happen to discover corner-cases eg. in a scripted similarity. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jpountz, the escape hatch is a great idea.
This job triggered CI during a migration of the master. Kicking off an additional build for you manually... Jenkins, test this please. |
Jenkins, test this please. |
* master: (46 commits) Fixing assertions in integration test (elastic#33833) [CCR] Rename idle_shard_retry_delay to poll_timout in auto follow patterns (elastic#33821) HLRC: Delete ML calendar (elastic#33775) Move DocsStats into Engine (elastic#33835) [Docs] Clarify accessing Date methods in painless (elastic#33560) add elasticsearch-shard tool (elastic#32281) Cut over to unwrap segment reader (elastic#33843) SQL: Fix issue with options for QUERY() and MATCH(). (elastic#33828) Emphasize that filesystem-level backups don't work (elastic#33102) Use the global doc id to generate a random score (elastic#33599) Add minimal sanity checks to custom/scripted similarities. (elastic#33564) Profiler: Don’t profile NEXTDOC for ConstantScoreQuery. (elastic#33196) [CCR] Change FollowIndexAction.Request class to be more user friendly (elastic#33810) SQL: day and month name functions tests locale providers enforcement (elastic#33653) TESTS: Set SO_LINGER = 0 for MockNioTransport (elastic#32560) Test: Relax jarhell gradle test (elastic#33787) [CCR] Fail with a descriptive error if leader index does not exist (elastic#33797) Add ES version 6.4.2 (elastic#33831) MINOR: Remove Some Dead Code in Scripting (elastic#33800) Ensure realtime `_get` and `_termvectors` don't run on the network thread (elastic#33814) ...
…#33893) Add minimal sanity checks to custom/scripted similarities. Lucene 8 introduced more constraints on similarities, in particular: - scores must not be negative, - scores must not decrease when term freq increases, - scores must not increase when norm (interpreted as an unsigned long) increases. We can't check every single case, but could at least run some sanity checks. Backport of #33564
Regarding the first and last statement i completely understand why you want to avoid those cases, but regarding 'scores must not decrease when term freq increases' can i ask why it shall be checked. Is there any issue with the performance if this case happens? You states that Lucence 8 forced this sanity check, so even though if somehow i managed to skip this specific sanity check, then will throw some exception? If i understand well now if you search for 'abc' on specific field and you find a document, field of which value is 'abc' and another document with 'abcabc' , then 'abcabc' must always have to be declared more relevant than 'abc' by the custom similarity scoring process. This does not seems rights for some cases. For example i use ElasticSearch for text analysis and in most of the cases text analysis prefers to find the most matching/fitting term, so it prefers to find (has a higher score) 'abc' over 'abcabc'. How shall i overcome this issue? Is there any way to configure it in the settings to skip this particular check? Can you add a settings to skip this check? I want to upgrade from 6.7 to 7.0, but this way it seems impossible. Or maybe i misunderstand something? |
It has to do with the fact that we try to record the maximum score per block since 8.0. Since we can't actually store scores, which depend on index statistics that we don't have at index time such as doc frequency (total number of documents that contain a term across all segments), we actually store pairs of (term_frequency, norm). But then we don't want to record every unique pair that occurs within a block, so we automatically remove those that trigger less competitive scores than others and introduced the rule that scores can't decrease when term_frequency increases of norm decreases for that purpose. This way if one document has a (term_frequency, norm) pair of If you don't meet this requirement, Elasticsearch might fail to match some hits, because it will assume that their score is not competitive.
If your term is
Are you actually hitting this check? Can you share the definition of your scripted similarity? |
Sorry for the late reply and thank you for the information! So you try to store pairs of (term_frequency, norm). According to this article (https://www.elastic.co/guide/en/elasticsearch/reference/7.1/text.html) at section 'norms' you can disable norms for a field. What happens if i disable a norms for a field and i use a similarity on that field where score increases when norm increases (so it fails at your sanity check). Can i update the schema this way? Or it will throw this sanity error if i try to put the index settings and mappings? Since if you disabled norms, then i assume ES/Lucene can not calculate store these pairs, so this sanity check should not happen in that case. You asked whether i am hitting this check. Reinstalled ES 7.0.0 and tried to put the schema again and it turned out i was wrong, i bumped into the 3rd check : "scores must not increase when norm (interpreted as an unsigned long) increases.". Regarding the similarity script:
|
The latter, this will be rejected. Disabled norms are treated as if all documents had a norm value of 1. In that case postings only record a single pair, which is |
Lucene 8 introduced more constraints on similarities, in particular:
increases.
We can't check every single case, but could at least run some sanity checks.
Relates #33309