Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Matching difference between keyword and wildcard field types #78391

Closed
bpintea opened this issue Sep 28, 2021 · 4 comments · Fixed by #78839
Closed

Matching difference between keyword and wildcard field types #78391

bpintea opened this issue Sep 28, 2021 · 4 comments · Fixed by #78839
Labels
>bug :Search/Search Search-related issues that do not fall into other categories Team:Search Meta label for search team

Comments

@bpintea
Copy link
Contributor

bpintea commented Sep 28, 2021

A regexp query can yield different results depending on the type of the field it's run against, keyword vs. wildcard.
For example, a wildcard field value of bb matches a [a]*[a]+ regexp query, while a keyword field won't.
In fact, it seems that as soon as the query contains a *, in a leading or trailing group, plus another group with a repetition specifier (+, ?, {}, *), like in the prev example, the query will simply match any wildcard field value (including the empty string), as if just the * group was provided.

@bpintea bpintea added >bug :Search/Search Search-related issues that do not fall into other categories labels Sep 28, 2021
@elasticmachine elasticmachine added the Team:Search Meta label for search team label Sep 28, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search (Team:Search)

@markharwood
Copy link
Contributor

markharwood commented Sep 30, 2021

I can see what's happening. The regex parsing logic tries to optimise for the common match all scenario of .* but also other more complex expressions like (foo | bar)|.* where one of the optional items is a .*. In these cases it suggests there's no need to run an ngram query to accelerate matches or any verification query to check the regex against stored values and just opts for a much faster "match all" query.
This example shows that if there is a bug in the regex parsing and acceleration logic (which is complex and has to deal with all manner of unknown regexes) then it is possible for it to over-match.

This optimisation is probably dangerous logic so I suggest we only ever optimise for making the simple .* regex run a match-all and all other more complex forms of regex have to run the expensive verification step.

@jimczi
Copy link
Contributor

jimczi commented Sep 30, 2021

This optimisation is probably dangerous logic so I suggest we only ever optimise for making the simple .* regex run a match-all and all other more complex forms of regex have to run the expensive verification step.

Agreed so let's remove the optimization entirely ? I doubt that users run into the optimized case, they are more for hypothetical bad patterns ?

@markharwood
Copy link
Contributor

markharwood commented Sep 30, 2021

I doubt that users run into the optimized case, they are more for hypothetical bad patterns ?

Not necessarily "bad" - perhaps they're just unsimplified in the same way some JSON for Booleans presented to elasticsearch contains unnecessary levels of wrapping that can be simplified. Another perhaps common scenario is someone using a regex like a* which is a match all.

If we don't optimise these the verification costs are pretty high (decompress all docs, scan with regex).

I found the bug in our "simplify" logic. It happens only when there are single-character string clauses like the example.

We parse the regex into a BooleanQuery equivalent and then simplify. We rewrite the Boolean to match all if there's any match_all and all other clauses are optional. While looking for any non-optional clauses we only considered the concrete Prefix/TermQuery clauses - we have another clause type "MatchAllButRequireVerification" which is used as a catch-all "unknown" marker when we don't have an equivalent Term/PrefixQuery for a parsed regex clause (in this case a+ is a mandatory single-char and for performance reasons we chose not to search the ngram index for those). With no concrete Term/PrefixQuery in this parsed regex the simplification logic reverted to a match all.
The fix is simply to account for MatchAllButRequireVerification clauses even when there's no other concrete clauses.

We'll need to make a call on the trade off between safety vs performance and decide if we keep the optimisation.
Having spotted the bug I feel happier about the optimisation logic now and having considered some of the simple regexes that might be used in the wild like a* I'm now of the view we should fix the optimisation rather than remove.

markharwood added a commit to markharwood/elasticsearch that referenced this issue Oct 1, 2021
Don’t revert to match_all when query only exists of required clauses that can’t be expressed as queries on ngram index.

Closes elastic#78391
markharwood added a commit to markharwood/elasticsearch that referenced this issue Oct 11, 2021
markharwood added a commit that referenced this issue Oct 11, 2021
Fix for wildcard field query optimisation that rewrites to a match all for regexes like .*

A bug was found in this complex rewrite logic so we have simplified the detection of .* type regexes by examining the Automaton for the regex rather than our parsed form of it which is expressed as a Lucene BooleanQuery. The old logic relied on a recursive "simplify" function on the BooleanQuery which has now been removed. We now rely on Lucene's query rewrite logic to simplify expressions at query time and consequently some of the tests had to change to do some of this rewriting before running test comparisons.

Closes #78391
markharwood added a commit to markharwood/elasticsearch that referenced this issue Oct 11, 2021
Fix for wildcard field query optimisation that rewrites to a match all for regexes like .*

A bug was found in this complex rewrite logic so we have simplified the detection of .* type regexes by examining the Automaton for the regex rather than our parsed form of it which is expressed as a Lucene BooleanQuery. The old logic relied on a recursive "simplify" function on the BooleanQuery which has now been removed. We now rely on Lucene's query rewrite logic to simplify expressions at query time and consequently some of the tests had to change to do some of this rewriting before running test comparisons.

Closes elastic#78391
markharwood added a commit that referenced this issue Oct 11, 2021
Fix for wildcard field query optimisation that rewrites to a match all for regexes like .*

A bug was found in this complex rewrite logic so we have simplified the detection of .* type regexes by examining the Automaton for the regex rather than our parsed form of it which is expressed as a Lucene BooleanQuery. The old logic relied on a recursive "simplify" function on the BooleanQuery which has now been removed. We now rely on Lucene's query rewrite logic to simplify expressions at query time and consequently some of the tests had to change to do some of this rewriting before running test comparisons.

Closes #78391
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Search/Search Search-related issues that do not fall into other categories Team:Search Meta label for search team
Projects
None yet
4 participants