Use index-prefix fields for terms of length min_chars - 1 #36703

romseygeek · 2018-12-17T12:22:17Z

The default index_prefixes settings index prefixes of between 2 and 5 characters in length. Currently, if a prefix search falls outside this range at either end, we fall back to a standard prefix expansion, which is still very expensive for single-character prefixes. However, we have the option of using a wildcard expansion rather than a prefix expansion, so that a query of a* is remapped to a? against the _index_prefix field. That is likely to be a very small set of terms, and certain to be much smaller than a* against the whole index.

This pull request adds this extra level of mapping for any prefix term whose length is one less than the min_chars parameter of the index_prefixes field.
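The remapping described above can be sketched as a small decision function. This is an illustrative, self-contained sketch and not the code in TextFieldMapper: the class PrefixRouting, the Route enum and the route method are hypothetical names, and the values 2 and 5 are simply the defaults mentioned in the description.

```java
public class PrefixRouting {
    /** Hypothetical labels for the three ways a prefix query could be executed. */
    public enum Route { PREFIX_FIELD_TERM, PREFIX_FIELD_WILDCARD, STANDARD_PREFIX_EXPANSION }

    /**
     * Chooses a route for a prefix of the given length.
     * minChars and maxChars mirror the index_prefixes settings.
     */
    public static Route route(int prefixLength, int minChars, int maxChars) {
        if (prefixLength >= minChars && prefixLength <= maxChars) {
            // The prefix itself was indexed as a term in the prefix field.
            return Route.PREFIX_FIELD_TERM;
        }
        if (prefixLength > 0 && prefixLength == minChars - 1) {
            // One character short: pad with a single-character wildcard (a* becomes a?).
            return Route.PREFIX_FIELD_WILDCARD;
        }
        // Anything else falls back to expanding the prefix against the main field.
        return Route.STANDARD_PREFIX_EXPANSION;
    }

    public static void main(String[] args) {
        for (int len = 1; len <= 6; len++) {
            System.out.println(len + " -> " + route(len, 2, 5));
        }
    }
}
```

With the defaults, only a one-character prefix takes the wildcard route; a six-character prefix still falls back to the standard expansion.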

A possible follow-up could be to disallow single-character wildcards against a field unless index_prefixes is enabled with a min_chars setting of 2.

elasticmachine · 2018-12-17T12:22:19Z

Pinging @elastic/es-search

romseygeek · 2018-12-17T15:06:53Z

@elasticmachine retest this please

jpountz

This is a great idea! I left some suggestions.

jpountz · 2018-12-17T15:12:11Z

server/src/main/java/org/elasticsearch/index/mapper/TextFieldMapper.java

+            if (strValue.length() >= minChars) {
+                return super.termQuery(value, context);
+            }
+            WildcardQuery query = new WildcardQuery(new Term(name(), value + "?"));


I think it'd be safer to create an automaton manually and then instantiate an AutomatonQuery, otherwise there could be surprises if value contains ? or *.
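The kind of surprise being described can be shown without Lucene on the classpath. The sketch below uses java.util.regex as a stand-in for wildcard matching, so it is an analogy and not the WildcardQuery or AutomatonQuery code path; the class WildcardPitfall and both method names are hypothetical.

```java
import java.util.regex.Pattern;

public class WildcardPitfall {
    /** Naive approach: the whole string is a wildcard pattern (? = one char, * = any run). */
    public static Pattern naive(String value) {
        StringBuilder regex = new StringBuilder();
        for (char c : (value + "?").toCharArray()) {
            if (c == '?') {
                regex.append('.');
            } else if (c == '*') {
                regex.append(".*");
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }

    /** Safer approach: the user's value is always literal; only the padding is a wildcard. */
    public static Pattern literalPlusAnyChar(String value) {
        return Pattern.compile(Pattern.quote(value) + ".", Pattern.DOTALL);
    }

    public static void main(String[] args) {
        String value = "a*"; // user input that happens to contain a metacharacter
        System.out.println(naive(value).matcher("abcdef").matches());              // true
        System.out.println(literalPlusAnyChar(value).matcher("abcdef").matches()); // false
        System.out.println(literalPlusAnyChar(value).matcher("a*b").matches());    // true
    }
}
```

Building the matcher from a literal part plus an explicit any-character part is the same idea as concatenating a string automaton with an any-character automaton: the user's input never gets a chance to be interpreted as syntax.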

jpountz · 2018-12-17T15:20:40Z

server/src/main/java/org/elasticsearch/index/mapper/TextFieldMapper.java

@@ -360,7 +361,7 @@ PrefixFieldType setAnalyzer(NamedAnalyzer delegate) {
        }

        boolean accept(int length) {
-            return length >= minChars && length <= maxChars;
+            return length >= minChars - 1 && length <= maxChars;


Let's go even further, change this to just length <= maxChars, and then append minChars - prefixTerm.length wildcards to the automaton that is used for querying?

jpountz · 2018-12-17T17:10:50Z

Now that I'm thinking about it again... I'm afraid that this refactoring will fail to match terms whose length is exactly minChars-1? my_text_field:a* should actually be parsed to something like my_text_field:a OR my_text_field.index_prefix:a??
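The gap can be reproduced with a toy model of the prefix subfield. This is a simplified in-memory sketch under stated assumptions, not how the mapper or Lucene stores terms: the class ShortTermGap and its methods are hypothetical, and the prefix field is modelled as a plain set of edge n-grams.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class ShortTermGap {
    /** Edge n-grams of each term, lengths minChars..maxChars, as a prefix subfield might hold them. */
    public static Set<String> prefixTerms(List<String> terms, int minChars, int maxChars) {
        Set<String> prefixes = new TreeSet<>();
        for (String term : terms) {
            for (int len = minChars; len <= Math.min(maxChars, term.length()); len++) {
                prefixes.add(term.substring(0, len));
            }
        }
        return prefixes;
    }

    /** True if some indexed prefix is the given prefix plus exactly one more character (the a? query). */
    public static boolean matchesPaddedPrefix(Set<String> prefixTerms, String prefix) {
        for (String p : prefixTerms) {
            if (p.length() == prefix.length() + 1 && p.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // A document whose only term is "a" produces no prefix of length 2 or more ...
        Set<String> onlyShort = prefixTerms(List.of("a"), 2, 5);
        System.out.println(onlyShort.isEmpty());                 // true
        // ... so the a? query against the prefix field misses it.
        System.out.println(matchesPaddedPrefix(onlyShort, "a")); // false
        // Adding a term clause on the main field closes the gap.
        boolean mainFieldHasTerm = List.of("a").contains("a");
        System.out.println(mainFieldHasTerm || matchesPaddedPrefix(onlyShort, "a")); // true
    }
}
```

The final line corresponds to the suggested rewrite: an exact term match on the main field OR-ed with the padded wildcard against the prefix field.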

jpountz · 2018-12-17T17:11:59Z

It probably also means that we should not try to support prefixes whose length is less than minChars - 1 like I suggested above.

romseygeek · 2018-12-17T19:43:40Z

> It probably also means that we should not try to support prefixes whose length is less than minChars - 1 like I suggested above.

What about text fields that don't have index_prefixes set on them? Should we disallow prefix queries below a certain length entirely there, and say that if you want prefix search for short prefixes you need to use index_prefixes?

jtibshirani

I had a couple questions for my knowledge, to help understand the trade-offs we're making: in what circumstances would users adjust the min_chars setting, and why does it default to 2 as opposed to 1?

jtibshirani · 2018-12-17T19:30:10Z

server/src/main/java/org/elasticsearch/index/mapper/TextFieldMapper.java

@@ -370,6 +375,23 @@ void doXContent(XContentBuilder builder) throws IOException {
            builder.endObject();
        }

+        public Query termQuery(Object value, MultiTermQuery.RewriteMethod method, QueryShardContext context) {


Maybe we should rename this method, now that it is not a pure term query (for example prefixQuery could make more sense)?

romseygeek · 2018-12-18

I made this just override prefixQuery; it makes much more sense!

jtibshirani · 2018-12-17T19:51:00Z

server/src/test/java/org/elasticsearch/index/mapper/TextFieldMapperTests.java

            q = fieldType.prefixQuery("internationalisatio", CONSTANT_SCORE_REWRITE, queryShardContext);
            assertEquals(new PrefixQuery(new Term("field", "internationalisatio")), q);

+            q = fieldType.prefixQuery("g", CONSTANT_SCORE_REWRITE, queryShardContext);


It seems like these detailed query construction tests would fit better in a unit test like TextFieldTypeTests.

romseygeek · 2018-12-18T11:07:47Z

Thanks @jtibshirani, have pushed some changes to address your comments.

> in what circumstances would users adjust the min_chars setting, and why does it default to 2 as opposed to 1?

The reasoning is that each extra ngram length adds to index size; so min_chars of 1 will end up with a very large index indeed, and 2 seems to be a reasonable default. But if you know that you will only ever do prefix searches of length 4 or more, for example, then you can up the min_chars setting to save on disk space.
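As a rough illustration of the trade-off, the number of prefix tokens emitted per term can be counted directly. This sketch only counts tokens; it says nothing about postings sizes, which is where very short prefixes are likely to cost the most. The class PrefixCount and the method prefixesPerTerm are hypothetical names.

```java
public class PrefixCount {
    /** Number of prefix tokens emitted for one term of the given length. */
    public static int prefixesPerTerm(int termLength, int minChars, int maxChars) {
        int longest = Math.min(termLength, maxChars);
        return Math.max(0, longest - minChars + 1);
    }

    public static void main(String[] args) {
        int len = "internationalisation".length(); // 20 characters
        System.out.println(prefixesPerTerm(len, 1, 5)); // 5
        System.out.println(prefixesPerTerm(len, 2, 5)); // 4
        System.out.println(prefixesPerTerm(len, 4, 5)); // 2
    }
}
```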

jtibshirani · 2018-12-19T21:58:58Z

@romseygeek I'm wondering if you addressed @jpountz's comment above?

> I'm afraid that this refactoring will fail to match terms whose length is exactly minChars-1? my_text_field:a* should actually be parsed to something like my_text_field:a OR my_text_field.index_prefix:a??

romseygeek · 2018-12-20T09:23:05Z

> I'm wondering if you addressed @jpountz's comment above?

I had missed that, thank you! Will open a separate PR to deal with it.

jpountz · 2018-12-20T10:34:28Z

> Should we disallow prefix queries below a certain length entirely there, and say that if you want prefix search for short prefixes you need to use index_prefixes?

I can't find the GitHub issue now, but it has occasionally been asked that we add a flag that disables slow queries entirely, such as multi-term queries that match lots of terms. We could have a switch for all queries rather than only prefix queries, e.g. by enforcing a rewrite method that fails if more than X terms match? And index_prefixes would be a way to avoid hitting this limit for prefix queries?
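One way to picture a rewrite that fails past a term limit is the toy expansion below. It is a hypothetical sketch over an in-memory sorted set, not an existing Elasticsearch or Lucene rewrite method; the class BoundedExpansion, the expand method and the maxTerms parameter are all assumed names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

public class BoundedExpansion {
    /** Collects terms starting with the prefix, failing once more than maxTerms match. */
    public static List<String> expand(SortedSet<String> terms, String prefix, int maxTerms) {
        List<String> matches = new ArrayList<>();
        // tailSet jumps to the first term >= prefix; matching terms are contiguous from there.
        for (String term : terms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break;
            }
            if (matches.size() == maxTerms) {
                throw new IllegalStateException(
                    "prefix [" + prefix + "] expands to more than " + maxTerms + " terms");
            }
            matches.add(term);
        }
        return matches;
    }

    public static void main(String[] args) {
        SortedSet<String> terms = new TreeSet<>(List.of("a", "ab", "abc", "b"));
        System.out.println(expand(terms, "a", 3)); // [a, ab, abc]
        try {
            expand(terms, "a", 2);
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Under such a limit, a short prefix answered from a small prefix field would stay within bounds where the same prefix against the main field might not.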

Use index-prefix fields for terms of length n-1

d83e34b

romseygeek added the >enhancement, :Search Foundations/Mapping (Index mappings, including merging and defining field types), v7.0.0 and v6.7.0 labels Dec 17, 2018

romseygeek self-assigned this Dec 17, 2018

Yaml syntax

2eda0b2

romseygeek requested review from jpountz and jtibshirani December 17, 2018 13:26

jpountz reviewed Dec 17, 2018

romseygeek added 2 commits December 17, 2018 15:47

feedback

475a87b

Better min length test

76c67aa

Switch back to n-1 at most

e964ced

checkstyle

f4d6105

jtibshirani reviewed Dec 17, 2018

romseygeek added 2 commits December 18, 2018 11:04

feedback

274cebd

Merge remote-tracking branch 'origin/master' into single-char-prefix

86480e7

romseygeek merged commit dd540ef into elastic:master Dec 19, 2018

romseygeek deleted the single-char-prefix branch December 19, 2018 08:55

romseygeek mentioned this pull request Jan 15, 2019

Allow field types to optimize phrase prefix queries #37436

Merged

colings86 added the v7.0.0-beta1 label Feb 7, 2019

colings86 removed the v7.0.0 label Feb 7, 2019
