-
Notifications
You must be signed in to change notification settings - Fork 24.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Faster wildcard search #48852
Comments
Pinging @elastic/es-search (:Search/Search) |
I think we can try to extend the optimization to handle leading wildcard efficiently. It seems for example that this solution would allow to transform any query in the form of |
Oh I like this idea. Maybe instead of changing the gram size, we could index suffixes with multiple gram sizes. For instance under my first proposal,
|
Given this is intended to be supportive of EQL can we do some analysis of existing rules out there? |
+1 it would be useful to know the distribution of the number of chars between wildcards |
I had a scan of EQL rules they fell into these groups
|
So would this be an |
This is a good question. One difference with this way of indexing unlike indexing prefixes or shingles is that it can also help find exact matches. And I think that a number of users would rather like to save some space by only indexing ngrams, instead of indexing both the original keyword and its ngrams. So I'm leaning towards exposing a different field that only indexes ngrams? |
Fair enough. Will there be a specialised new query type for this too? Would we reject traditional token-based queries like |
I'm viewing this as a specialized keyword field for wildcard search, so I think it shouldn't have a specialized query type but reuse the existing ones based on the approach outlined above. And like keywords, it wouldn't support highlighting. |
Would interval queries be efficient here for checking the sequences of these expressions? |
I think we can't escape from having to run wildcards against doc values in some cases. For instance, if the query looks like |
OK - I'll code up the DV approach for all cases for now and we can optimise later for cases with longer patterns if we think intervals would be a big improvement. |
This sounds good to me. I'd like us to try both approaches to see how they compare, but my gut feeling is that the storage overhead of positions is going to be huge, and probably a barrier to adoption for many users. So only indexing ngrams with |
For the verification phase we may have to consider some of the flags we see in regex queries to do with "greedy" evaluation. |
Can we reuse the same logic as |
Is there mileage in thinking of this feature more broadly as a form of index-backed regex with its own query type? I guess we could add the field type for now and dream up new query syntax later. It might make us want to reconsider the name of the field type in the first instance though. |
I was thinking that this field would be a good fit for regexp or fuzzy queries indeed. Why would you create a custom query type, we could reuse the existing regexp and fuzzy queries? |
I was thinking that if we opt for the pos-based implementation the regex functionality we could offer might be constrained by what Interval queries can support. This would create a divergence in functionality between "real" regex and what this new field type could support |
OK, let's focus on the positions vs. doc-values discussion first then. I'll write some thoughts about this on the PR. |
I just have one question... Why can I not find public documentation ANYWHERE on this new field? |
Thats a good one, and a difficult one to answer because the documentation is right here. Hope that helps. |
While wildcard queries are often a symptom that content hasn't been properly analyzed, there are also cases when users really need the flexibility of wildcard queries. Could we leverage ngrams to make this kind of queries faster, similarly to how we made
prefix
queries faster ontext
fields by indexing edge ngrams?I tried to think about how this could work and here is an idea:
*foobar
) and prefix queries faster (foobar*
).Here are some example queries and how they would run internally with this setup:
*foobar*
fooba AND oobar
foobar*
\0foob AND fooba AND oobar
*foobar
fooba AND oobar AND obar\0
*foo*
*foo*
foo*
\0foo*
*foo
*foo\0
foobar
\0foob AND fooba AND oobar AND obar\0
*foobar*quux*
fooba AND oobar
*quux*
, the approximation we have might be selective enough alreadyNotes:
Questions:
keyword
indexing?text
/keyword
fields or a new field? This way of indexing allows to run exact queries too so we wouldn't need to index as a keyword too in addition to these ngrams.The text was updated successfully, but these errors were encountered: