-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use index-prefix fields for terms of length min_chars - 1 #36703
Conversation
Pinging @elastic/es-search |
@elasticmachine retest this please |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a great idea! I left some suggestions.
if (strValue.length() >= minChars) { | ||
return super.termQuery(value, context); | ||
} | ||
WildcardQuery query = new WildcardQuery(new Term(name(), value + "?")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it'd be safer to create an automaton manually and then instantiate an AutomatonQuery, otherwise there could be surprises if value
contains ?
or *
.
@@ -360,7 +361,7 @@ PrefixFieldType setAnalyzer(NamedAnalyzer delegate) { | |||
} | |||
|
|||
boolean accept(int length) { | |||
return length >= minChars && length <= maxChars; | |||
return length >= minChars - 1 && length <= maxChars; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's go even further, change this to just length <= maxChars
, and then append minChars - prefixTerm.length
wildcards to the automaton that is used for querying?
Now that I'm thinking about it again... I'm afraid that this refactoring will fail to match terms whose length is exactly |
It probably also means that we should not try to support prefixes whose length is less than minChars - 1 like I suggested above. |
What about text fields that don't have |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had a couple questions for my knowledge, to help understand the trade-offs we're making: in what circumstances would users adjust the min_chars
setting, and why does it default to 2 as opposed to 1?
@@ -370,6 +375,23 @@ void doXContent(XContentBuilder builder) throws IOException { | |||
builder.endObject(); | |||
} | |||
|
|||
public Query termQuery(Object value, MultiTermQuery.RewriteMethod method, QueryShardContext context) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should rename this method, now that it is not a pure term query (for example prefixQuery
could make more sense)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made this just override prefixQuery
, it makes much more sense!
q = fieldType.prefixQuery("internationalisatio", CONSTANT_SCORE_REWRITE, queryShardContext); | ||
assertEquals(new PrefixQuery(new Term("field", "internationalisatio")), q); | ||
|
||
q = fieldType.prefixQuery("g", CONSTANT_SCORE_REWRITE, queryShardContext); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like these detailed query construction tests would fit better in a unit test like TextFieldTypeTests
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
++
Thanks @jtibshirani, have pushed some changes to address your comments.
The reasoning is that each extra ngram length adds to index size; so min_chars of 1 will end up with a very large index indeed, and 2 seems to be a reasonable default. But if you know that you will only ever do prefix searches of length 4 or more, for example, then you can up the min_chars setting to save on disk space. |
@romseygeek I'm wondering if you addressed @jpountz's comment above?
|
I had missed that, thank you! Will open a separate PR to deal with it. |
I can't find the Github issue now, but it has been occasionally asked that we add a flag that allows to disable slow queries entirely, such as multi-term queries that match lots of terms. We could have a switch for all queries rather than only prefix queries, eg. by enforcing a rewrite method that fails if more than X terms match? And |
The default
index_prefix
settings will index prefixes of between 2 and 5 characters in length. Currently, if a prefix search falls outside of this range at either end we fall back to a standard prefix expansion, which is still very expensive for single character prefixes. However, we have an option here to use a wildcard expansion rather than a prefix expansion, so that a query ofa*
gets remapped toa?
against the_index_prefix
field - likely to be a very small set of terms, and certain to be much smaller thana*
against the whole index.This pull request adds this extra level of mapping for any prefix term whose length is one less than the
min_chars
parameter of theindex_prefixes
field.A possible follow-up could be to disallow single-character wildcards against a field unless
index_prefix
is enabled with amin_char
settings of 2.