-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Wildcard field optimised for wildcard queries #49993
Conversation
Pinging @elastic/es-search (:Search/Mapping) |
9b297fd
to
452987a
Compare
This PR is currently missing support for arrays. Should it add that? |
414b3de
to
49bcf70
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to support arrays and require wildcards to match within array elements, like for keyword
fields. I left some comments, I'm curious whether you've had a chance to check the space overhead compared to a keyword
field?
x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/XPackSettings.java
Outdated
Show resolved
Hide resolved
.../wildcard/src/main/java/org/elasticsearch/xpack/wildcard/mapper/TaperedNgramTokenFilter.java
Outdated
Show resolved
Hide resolved
...ugin/wildcard/src/main/java/org/elasticsearch/xpack/wildcard/mapper/WildcardFieldMapper.java
Show resolved
Hide resolved
...ugin/wildcard/src/main/java/org/elasticsearch/xpack/wildcard/mapper/WildcardFieldMapper.java
Outdated
Show resolved
Hide resolved
...ugin/wildcard/src/main/java/org/elasticsearch/xpack/wildcard/mapper/WildcardFieldMapper.java
Outdated
Show resolved
Hide resolved
...ugin/wildcard/src/main/java/org/elasticsearch/xpack/wildcard/mapper/WildcardFieldMapper.java
Outdated
Show resolved
Hide resolved
fields.add(field); | ||
Field dvField = new BinaryDocValuesField(fieldType().name(), new BytesRef(value)); | ||
fields.add(dvField); | ||
if (fieldType().omitNorms()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that this is what we would do for a field that doesn't have doc values, like text
. This field requires doc values anyway so we can skip this if statement entirely and create exists
queries from doc values.
...ugin/wildcard/src/main/java/org/elasticsearch/xpack/wildcard/mapper/WildcardFieldMapper.java
Show resolved
Hide resolved
3073697
to
96c5484
Compare
public class WildcardOnDvQuery extends Query { | ||
|
||
private String field; | ||
private String wildcardPattern; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make them private?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry I meant final
|
||
@Override | ||
public boolean matches() throws IOException { | ||
if (values.advanceExact(approximation.docID())) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
other doc values queries instead use values
as an approximation, which removes the need to call advanceExact
here
long ord = values.nextOrd(); | ||
while (ord != SortedSetDocValues.NO_MORE_ORDS) { | ||
BytesRef value = values.lookupOrd(ord); | ||
if (bytesMatcher.run(value.bytes, 0, value.length)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please use value.offset
even if it is always 0 in practice with the current codec
Thanks for the review, @jpountz
|
Actually our general recommendation is rather to avoid integration tests. In that case, we'd do extensive testing about whether it matches wildcard queries correctly in unit tests, and have a YAML test mostly to ensure that everything is wired together correctly. And no EsIntegTestCase. |
I'm also curious to get some more performance/disk-overhead data before doing a more in-depth review, in order to know whether we need to go back to the drawing board or whether we want to move forward with this approach? |
Thinking more about the performance comparison, I suspect we will want to check at least two scenarios:
I'm expecting the wildcard_keyword field to be sometimes faster sometimes slower for 1 (depending on the size of the terms dict and whether there is a leading wildcard) but consistently faster for 2? |
Some preliminary results from the current wildcard-on-DV approach.
Most of the pattern matches are rare but when a search matches many docs eg the 7,886 result we can see an increase in the cost of the wildcard approach. |
I wonder if we can eliminate some of the costs involved in storing/scanning machine-generated strings (e.g. stacktraces) if we can identify common subsequences at index time. The significant text aggregation has some logic to do this in the DeDuplicatingTokenFilter. Sequences of 6 words that are repeated in a stream of docs are identified using a low-cost lookup. The significant text aggregation removes whole sections of text that are seen as duplicate but we could potentially use the same structure to store and search more compact representations of near-duplicate strings. |
As for disk usage - my test log index was a total of 366mb for keyword only |
Possible issue with position-based matching when we have array fields (at least in terms of behaviour compared to wildcard search on |
Hmm this is a good point. I believe using positions would also have the same downside as sloppy phrase queries have about missing matches, since they don't (and maybe can't realistically because of the combinatorial explosion) explore all matches. |
Unless @jimczi feels otherwise, then I guess positions are not really an option and we should just explore the path about indexing ngrams with docs only (no freqs or positions) and using doc values to verify? |
I'm hoping to get some numbers tomorrow on position-based search speeds so we can see what we are throwing away if we decide to not go down this route |
Ah, if you are close to having some numbers that would be great indeed! |
I've not got 100% compatibility with keyword+wildcard matching behaviour but close enough for benchmarks I expect |
We only need an ordered query so I am not sure about the combinatorial explosion ? Phrase queries with slop allows some reordering so that's out of scope for this field. Intervals query would work well I think but we need to fix the case where there are duplicates in the query first:
That's a good point indeed and maybe something that we can restrict or explain in the documentation but imo this shouldn't eliminate this option completely.
Speed and index size, that's my main motivation to test this since doc values seem overkill if we don't want to restrict this to small fields only. |
Ah, you are right, I got confused. The fact that terms within the phrase must be ordered avoids the combinatorial explosion issue that sloppy phrase queries have indeed. |
I got confused too ;), the issue in apache/lucene-solr#1097 does not affect strictly ordered query so using Intervals to validate our ngrams with positions should work. |
Index size comparisonsAll force-merged copies of the same elasticsearch.log strings.
|
…in Builder for invalid options,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks @markharwood for all the iterations on this!
Indexes values using size 3 ngrams and also stores the full original as a binary doc value. Wildcard queries operate by using a cheap approximation query on the ngram field followed up by a more expensive verification query using an automaton on the binary doc values. Also supports aggregations and sorting.
* New wildcard field optimised for wildcard queries (#49993) Indexes values using size 3 ngrams and also stores the full original as a binary doc value. Wildcard queries operate by using a cheap approximation query on the ngram field followed up by a more expensive verification query using an automaton on the binary doc values. Also supports aggregations and sorting.
Relates: elastic/elasticsearch#49993 Co-authored-by: Russ Cam <[email protected]>
Relates: elastic/elasticsearch#49993 Co-authored-by: Russ Cam <[email protected]>
First cut at the wildcard field type.
Closes #48852