Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use index-prefix fields for terms of length min_chars - 1 #36703

Merged
merged 8 commits into from
Dec 19, 2018

Conversation

romseygeek
Copy link
Contributor

@romseygeek romseygeek commented Dec 17, 2018

The default index_prefix settings will index prefixes of between 2 and 5 characters in length. Currently, if a prefix search falls outside of this range at either end we fall back to a standard prefix expansion, which is still very expensive for single character prefixes. However, we have an option here to use a wildcard expansion rather than a prefix expansion, so that a query of a* gets remapped to a? against the _index_prefix field - likely to be a very small set of terms, and certain to be much smaller than a* against the whole index.

This pull request adds this extra level of mapping for any prefix term whose length is one less than the min_chars parameter of the index_prefixes field.

A possible follow-up could be to disallow single-character wildcards against a field unless index_prefix is enabled with a min_char settings of 2.

@romseygeek romseygeek added >enhancement :Search Foundations/Mapping Index mappings, including merging and defining field types v7.0.0 v6.7.0 labels Dec 17, 2018
@romseygeek romseygeek self-assigned this Dec 17, 2018
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search

@romseygeek
Copy link
Contributor Author

@elasticmachine retest this please

Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great idea! I left some suggestions.

if (strValue.length() >= minChars) {
return super.termQuery(value, context);
}
WildcardQuery query = new WildcardQuery(new Term(name(), value + "?"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it'd be safer to create an automaton manually and then instantiate an AutomatonQuery, otherwise there could be surprises if value contains ? or *.

@@ -360,7 +361,7 @@ PrefixFieldType setAnalyzer(NamedAnalyzer delegate) {
}

boolean accept(int length) {
return length >= minChars && length <= maxChars;
return length >= minChars - 1 && length <= maxChars;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's go even further, change this to just length <= maxChars, and then append minChars - prefixTerm.length wildcards to the automaton that is used for querying?

@jpountz
Copy link
Contributor

jpountz commented Dec 17, 2018

Now that I'm thinking about it again... I'm afraid that this refactoring will fail to match terms whose length is exactly minChars-1? my_text_field:a* should actually be parsed to something like my_text_field:a OR my_text_field.index_prefix:a??

@jpountz
Copy link
Contributor

jpountz commented Dec 17, 2018

It probably also means that we should not try to support prefixes whose length is less than minChars - 1 like I suggested above.

@romseygeek
Copy link
Contributor Author

It probably also means that we should not try to support prefixes whose length is less than minChars - 1 like I suggested above.

What about text fields that don't have index_prefixes set on them? Should we disallow prefix queries below a certain length entirely there, and say that if you want prefix search for short prefixes you need to use index_prefixes?

Copy link
Contributor

@jtibshirani jtibshirani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a couple questions for my knowledge, to help understand the trade-offs we're making: in what circumstances would users adjust the min_chars setting, and why does it default to 2 as opposed to 1?

@@ -370,6 +375,23 @@ void doXContent(XContentBuilder builder) throws IOException {
builder.endObject();
}

public Query termQuery(Object value, MultiTermQuery.RewriteMethod method, QueryShardContext context) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should rename this method, now that it is not a pure term query (for example prefixQuery could make more sense)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made this just override prefixQuery, it makes much more sense!

q = fieldType.prefixQuery("internationalisatio", CONSTANT_SCORE_REWRITE, queryShardContext);
assertEquals(new PrefixQuery(new Term("field", "internationalisatio")), q);

q = fieldType.prefixQuery("g", CONSTANT_SCORE_REWRITE, queryShardContext);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like these detailed query construction tests would fit better in a unit test like TextFieldTypeTests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++

@romseygeek
Copy link
Contributor Author

Thanks @jtibshirani, have pushed some changes to address your comments.

in what circumstances would users adjust the min_chars setting, and why does it default to 2 as opposed to 1?

The reasoning is that each extra ngram length adds to index size; so min_chars of 1 will end up with a very large index indeed, and 2 seems to be a reasonable default. But if you know that you will only ever do prefix searches of length 4 or more, for example, then you can up the min_chars setting to save on disk space.

@romseygeek romseygeek merged commit dd540ef into elastic:master Dec 19, 2018
@romseygeek romseygeek deleted the single-char-prefix branch December 19, 2018 08:55
@jtibshirani
Copy link
Contributor

@romseygeek I'm wondering if you addressed @jpountz's comment above?

I'm afraid that this refactoring will fail to match terms whose length is exactly minChars-1? my_text_field:a* should actually be parsed to something like my_text_field:a OR my_text_field.index_prefix:a??

@romseygeek
Copy link
Contributor Author

I'm wondering if you addressed @jpountz's comment above?

I had missed that, thank you! Will open a separate PR to deal with it.

@jpountz
Copy link
Contributor

jpountz commented Dec 20, 2018

Should we disallow prefix queries below a certain length entirely there, and say that if you want prefix search for short prefixes you need to use index_prefixes?

I can't find the Github issue now, but it has been occasionally asked that we add a flag that allows to disable slow queries entirely, such as multi-term queries that match lots of terms. We could have a switch for all queries rather than only prefix queries, eg. by enforcing a rewrite method that fails if more than X terms match? And index_prefixes would be a way to avoid hitting this limit for prefix queries?

@colings86 colings86 removed the v7.0.0 label Feb 7, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>enhancement :Search Foundations/Mapping Index mappings, including merging and defining field types v6.7.0 v7.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants