Matching difference between keyword and wildcard field types #78391
Comments
Pinging @elastic/es-search (Team:Search)
I can see what's happening. The regex parsing logic tries to optimise for the common match all scenario of This optimisation is probably dangerous logic so I suggest we only ever optimise for making the simple
Agreed, so let's remove the optimization entirely? I doubt that users run into the optimized cases; they are more for hypothetical bad patterns?
Not necessarily "bad" - perhaps they're just unsimplified, in the same way that some JSON for Booleans presented to Elasticsearch contains unnecessary levels of wrapping that can be simplified. Another perhaps common scenario is someone using a regex like If we don't optimise these the verification costs are pretty high (decompress all docs, scan with regex).

I found the bug in our "simplify" logic. It happens only when there are single-character string clauses like the example. We parse the regex into a BooleanQuery equivalent and then simplify. We rewrite the Boolean to match all if there's any match_all and all other clauses are optional. While looking for any non-optional clauses we only considered the concrete Prefix/TermQuery clauses - we have another clause type "MatchAllButRequireVerification" which is used as a catch-all "unknown" marker when we don't have an equivalent Term/PrefixQuery for a parsed regex clause (in this case We'll need to make a call on the trade off between safety vs performance and decide if we keep the optimisation.
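The bug described above can be illustrated with a toy model. This is a minimal sketch, not the Elasticsearch code: the `Kind`, `Clause`, `buggy_can_match_all` and `fixed_can_match_all` names are hypothetical, and the clause list is an assumed parse result chosen only to show how ignoring the "requires verification" marker lets the rewrite collapse to match all.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    TERM = "term"            # concrete Term/PrefixQuery on the ngram index
    MATCH_ALL = "match_all"  # clause known to match everything, e.g. from .*
    VERIFY = "verify"        # stand-in for MatchAllButRequireVerification


@dataclass(frozen=True)
class Clause:
    kind: Kind
    required: bool


def buggy_can_match_all(clauses):
    """Mimics the described bug: only concrete clauses count as required."""
    has_match_all = any(c.kind is Kind.MATCH_ALL for c in clauses)
    has_required = any(c.required and c.kind is Kind.TERM for c in clauses)
    return has_match_all and not has_required


def fixed_can_match_all(clauses):
    """Also treats required 'verify' clauses as blocking the rewrite."""
    has_match_all = any(c.kind is Kind.MATCH_ALL for c in clauses)
    has_required = any(
        c.required and c.kind is not Kind.MATCH_ALL for c in clauses
    )
    return has_match_all and not has_required


# Assumed shape: a required single-character clause that has no ngram
# equivalent, next to an optional match-all clause.
clauses = [
    Clause(Kind.VERIFY, required=True),
    Clause(Kind.MATCH_ALL, required=False),
]

print(buggy_can_match_all(clauses))  # True: wrongly rewritten to match all
print(fixed_can_match_all(clauses))  # False: verification is still needed
```

In this model the only difference between the two functions is whether a required "verify" clause is allowed to veto the match-all rewrite.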
Don’t revert to match_all when the query consists only of required clauses that can’t be expressed as queries on the ngram index. Closes elastic#78391
Fix for wildcard field query optimisation that rewrites to a match all for regexes like .*

A bug was found in this complex rewrite logic, so we have simplified the detection of .* type regexes by examining the Automaton for the regex rather than our parsed form of it, which is expressed as a Lucene BooleanQuery. The old logic relied on a recursive "simplify" function on the BooleanQuery, which has now been removed. We now rely on Lucene's query rewrite logic to simplify expressions at query time, and consequently some of the tests had to change to do some of this rewriting before running test comparisons. Closes #78391
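The idea of detecting .* type regexes from the automaton rather than from the parsed query can be sketched as follows. This is an illustration under stated assumptions, not the code from the fix: the DFA is hand-built as a plain dictionary, and `accepts_everything` is a hypothetical helper. It relies on the fact that a deterministic automaton accepts every string exactly when every reachable state is accepting and has a transition for every symbol.

```python
from collections import deque


def accepts_everything(start, accepting, transitions, alphabet):
    """Return True if the DFA accepts every string over `alphabet`.

    `transitions` maps (state, symbol) -> next state; a missing entry is
    treated as a move to an implicit rejecting dead state.
    """
    seen = {start}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if state not in accepting:
            return False  # some string ending in this state is rejected
        for symbol in alphabet:
            nxt = transitions.get((state, symbol))
            if nxt is None:
                return False  # some string falls off the automaton here
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return True


ALPHABET = "ab"

# Hand-built DFA for .* over {a, b}: one accepting state looping on itself.
dot_star = accepts_everything(0, {0}, {(0, "a"): 0, (0, "b"): 0}, ALPHABET)

# Hand-built DFA for a[a]*[a]+ (two or more 'a'): 0 -a-> 1 -a-> 2 -a-> 2.
two_or_more_a = accepts_everything(
    0, {2}, {(0, "a"): 1, (1, "a"): 2, (2, "a"): 2}, ALPHABET
)

print(dot_star)       # True
print(two_or_more_a)  # False
```

Because the check works on the language the automaton accepts, it does not depend on how the regex happened to be written or on how it was split into clauses.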
A regexp query can yield different results depending on the type of the field it's run against, keyword vs. wildcard.
For example, a wildcard field value of bb matches the regexp query a[a]*[a]+, while the same value in a keyword field does not.
In fact, it seems that as soon as the query contains a * in a leading or trailing group, plus another group with a repetition specifier (+, ?, {}, *), as in the previous example, the query will simply match any wildcard field value (including the empty string), as if just the * group had been provided.
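As a reference point for the expected behaviour, the pattern from the example can be checked outside Elasticsearch. This sketch assumes Python's `re.fullmatch` as a stand-in for a regexp query, which is anchored to the whole field value; for this particular pattern the Python and Lucene syntaxes coincide. The `regexp_matches` helper is hypothetical.

```python
import re

# Pattern from the report: 'a', then zero or more 'a', then one or more 'a'.
PATTERN = re.compile(r"a[a]*[a]+")


def regexp_matches(value):
    """Whole-value match, mirroring the anchored semantics of a regexp query."""
    return PATTERN.fullmatch(value) is not None


results = {value: regexp_matches(value) for value in ["", "a", "aa", "aaa", "bb"]}
print(results)
# {'': False, 'a': False, 'aa': True, 'aaa': True, 'bb': False}
```

Only values made of two or more a characters should match, which is the keyword field behaviour described in the report; bb and the empty string should not.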