Matching difference between keyword and wildcard field types #78391

bpintea · 2021-09-28T14:09:04Z

A regexp query can yield different results depending on the type of the field it's run against, keyword vs. wildcard.
For example, a wildcard field value of bb matches a [a]*[a]+ regexp query, while a keyword field won't.
In fact, it seems that as soon as the query contains a *, in a leading or trailing group, plus another group with a repetition specifier (+, ?, {}, *), like in the prev example, the query will simply match any wildcard field value (including the empty string), as if just the * group was provided.

The text was updated successfully, but these errors were encountered:

elasticmachine · 2021-09-28T14:09:07Z

Pinging @elastic/es-search (Team:Search)

markharwood · 2021-09-30T11:14:42Z

I can see what's happening. The regex parsing logic tries to optimise for the common match all scenario of .* but also other more complex expressions like (foo | bar)|.* where one of the optional items is a .*. In these cases it suggests there's no need to run an ngram query to accelerate matches or any verification query to check the regex against stored values and just opts for a much faster "match all" query.
This example shows that if there is a bug in the regex parsing and acceleration logic (which is complex and has to deal with all manner of unknown regexes) then it is possible for it to over-match.

This optimisation is probably dangerous logic so I suggest we only ever optimise for making the simple .* regex run a match-all and all other more complex forms of regex have to run the expensive verification step.

jimczi · 2021-09-30T12:09:57Z

This optimisation is probably dangerous logic so I suggest we only ever optimise for making the simple .* regex run a match-all and all other more complex forms of regex have to run the expensive verification step.

Agreed so let's remove the optimization entirely ? I doubt that users run into the optimized case, they are more for hypothetical bad patterns ?

markharwood · 2021-09-30T14:22:18Z

I doubt that users run into the optimized case, they are more for hypothetical bad patterns ?

Not necessarily "bad" - perhaps they're just unsimplified in the same way some JSON for Booleans presented to elasticsearch contains unnecessary levels of wrapping that can be simplified. Another perhaps common scenario is someone using a regex like a* which is a match all.

If we don't optimise these the verification costs are pretty high (decompress all docs, scan with regex).

I found the bug in our "simplify" logic. It happens only when there are single-character string clauses like the example.

We parse the regex into a BooleanQuery equivalent and then simplify. We rewrite the Boolean to match all if there's any match_all and all other clauses are optional. While looking for any non-optional clauses we only considered the concrete Prefix/TermQuery clauses - we have another clause type "MatchAllButRequireVerification" which is used as a catch-all "unknown" marker when we don't have an equivalent Term/PrefixQuery for a parsed regex clause (in this case a+ is a mandatory single-char and for performance reasons we chose not to search the ngram index for those). With no concrete Term/PrefixQuery in this parsed regex the simplification logic reverted to a match all.
The fix is simply to account for MatchAllButRequireVerification clauses even when there's no other concrete clauses.

We'll need to make a call on the trade off between safety vs performance and decide if we keep the optimisation.
Having spotted the bug I feel happier about the optimisation logic now and having considered some of the simple regexes that might be used in the wild like a* I'm now of the view we should fix the optimisation rather than remove.

Don’t revert to match_all when query only exists of required clauses that can’t be expressed as queries on ngram index. Closes elastic#78391

…l for regexes like .* Closes elastic#78391

Fix for wildcard field query optimisation that rewrites to a match all for regexes like .* A bug was found in this complex rewrite logic so we have simplified the detection of .* type regexes by examining the Automaton for the regex rather than our parsed form of it which is expressed as a Lucene BooleanQuery. The old logic relied on a recursive "simplify" function on the BooleanQuery which has now been removed. We now rely on Lucene's query rewrite logic to simplify expressions at query time and consequently some of the tests had to change to do some of this rewriting before running test comparisons. Closes #78391

Fix for wildcard field query optimisation that rewrites to a match all for regexes like .* A bug was found in this complex rewrite logic so we have simplified the detection of .* type regexes by examining the Automaton for the regex rather than our parsed form of it which is expressed as a Lucene BooleanQuery. The old logic relied on a recursive "simplify" function on the BooleanQuery which has now been removed. We now rely on Lucene's query rewrite logic to simplify expressions at query time and consequently some of the tests had to change to do some of this rewriting before running test comparisons. Closes elastic#78391

Fix for wildcard field query optimisation that rewrites to a match all for regexes like .* A bug was found in this complex rewrite logic so we have simplified the detection of .* type regexes by examining the Automaton for the regex rather than our parsed form of it which is expressed as a Lucene BooleanQuery. The old logic relied on a recursive "simplify" function on the BooleanQuery which has now been removed. We now rely on Lucene's query rewrite logic to simplify expressions at query time and consequently some of the tests had to change to do some of this rewriting before running test comparisons. Closes #78391

bpintea added >bug :Search/Search Search-related issues that do not fall into other categories labels Sep 28, 2021

elasticmachine added the Team:Search Meta label for search team label Sep 28, 2021

markharwood mentioned this issue Sep 30, 2021

Wildcard field regex query fix. #78520

Closed

markharwood added a commit to markharwood/elasticsearch that referenced this issue Oct 1, 2021

Wildcard field regex query fix.

3858d56

Don’t revert to match_all when query only exists of required clauses that can’t be expressed as queries on ngram index. Closes elastic#78391

markharwood mentioned this issue Oct 7, 2021

Wildcard field regex query fix #78839

Merged

markharwood added a commit to markharwood/elasticsearch that referenced this issue Oct 11, 2021

Fix for wildcard field query optimisation that rewrites to a match al…

a81c8e9

…l for regexes like .* Closes elastic#78391

markharwood closed this as completed in #78839 Oct 11, 2021

markharwood mentioned this issue Oct 11, 2021

Wildcard field regex query fix (#78839) #78925

Merged

This was referenced Dec 6, 2021

[Rule Tuning] Whitespace Padding in Process Command Line - Re-add index and add min_stack_version elastic/detection-rules#1648

Merged

[Bug] Whitespace Padding in Process Command Line rule matches every process elastic/detection-rules#1641

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Matching difference between keyword and wildcard field types #78391

Matching difference between keyword and wildcard field types #78391

bpintea commented Sep 28, 2021

elasticmachine commented Sep 28, 2021

markharwood commented Sep 30, 2021 •

edited

Loading

jimczi commented Sep 30, 2021

markharwood commented Sep 30, 2021 •

edited

Loading

Matching difference between keyword and wildcard field types #78391

Matching difference between keyword and wildcard field types #78391

Comments

bpintea commented Sep 28, 2021

elasticmachine commented Sep 28, 2021

markharwood commented Sep 30, 2021 • edited Loading

jimczi commented Sep 30, 2021

markharwood commented Sep 30, 2021 • edited Loading

markharwood commented Sep 30, 2021 •

edited

Loading

markharwood commented Sep 30, 2021 •

edited

Loading