Numeric terms support #94048
The inverted terms index can be orders of magnitude faster for term & terms queries than Lucene Points (i.e. BKD trees). BKD trees are optimized for range-style queries but not for exact matches.
Pinging @elastic/es-search (Team:Search)
This would be really nice to have. Not storing numerics in the inverted index is a huge regression for us. Can we look into this? We get that you can store them as keywords instead, but due to limitations in our existing client-side infrastructure it is too much of a lift to index all numerics as keyword+number multi-fields in order to support both fast terms and range queries. Choosing client-side which of the two fields to query would be a massive overhaul for our ecosystem.
Pinging @elastic/es-search-foundations (Team:Search Foundations)
Issue: #94047
Background
Before ES 5.x, numerics & dates were indexed into an inverted terms index. However, an earlier PR changed the underlying data structure to Lucene Points (implemented as block KD trees, i.e. BKD trees). Lucene Points offer performance gains for range queries. However, they are expected to perform worse for exact-match queries (i.e. term/terms queries). This poses a problem when numeric identifiers (whose query patterns tend to be term-heavy) are stored as numbers.
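To make the distinction concrete, here is a toy sketch of the two access patterns. It is not how Lucene implements postings or points; the function names and the sorted-list stand-in for a points structure are assumptions made purely for illustration, and it says nothing about real-world performance.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


def build_inverted_index(values_by_doc):
    """Map each numeric term to the sorted list of doc ids that contain it."""
    postings = defaultdict(list)
    for doc_id in sorted(values_by_doc):
        postings[values_by_doc[doc_id]].append(doc_id)
    return dict(postings)


def terms_query(postings, terms):
    """Exact match: one dictionary lookup per term, then a union of postings."""
    hits = set()
    for term in terms:
        hits.update(postings.get(term, ()))
    return sorted(hits)


def build_sorted_points(values_by_doc):
    """Toy stand-in for a points structure: (value, doc_id) pairs sorted by value."""
    return sorted((value, doc_id) for doc_id, value in values_by_doc.items())


def range_query(points, low, high):
    """Inclusive range: two binary searches, then one contiguous slice."""
    start = bisect_left(points, (low, float("-inf")))
    end = bisect_right(points, (high, float("inf")))
    return sorted(doc_id for _, doc_id in points[start:end])


# Hypothetical documents: doc id -> numeric identifier
docs = {0: 42, 1: 7, 2: 42, 3: 100, 4: 7}
postings = build_inverted_index(docs)
points = build_sorted_points(docs)
```

With these toy structures, `terms_query(postings, [42, 100])` touches only the postings of the requested terms, while `range_query(points, 7, 42)` reads one contiguous run of the value-sorted data.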
Elastic's recommendation from ES 5.x forward is to store numeric identifiers as a stringified `keyword` in order to get the query performance boost for term/terms queries. This poses several limitations:
- It is not intuitive to store a numeric identifier as a `keyword` instead of as a `long`, `integer`, etc. without having an intimate knowledge of modern ES's data structures, especially as other databases/stores (like MySQL or just about any other structured DB) would recommend using a numeric type, and because the guidance prior to ES 5.x was to use a numeric identifier. Furthermore, the guidance in the ES documentation is somewhat hidden and hard to find (for new ES users and for old ES users who are accustomed to pre-ES 5.x functionality).
- Using a `keyword` means we lose the ability to do range queries while preserving the numeric ordering of the terms. Of course, this can be overcome by indexing the number into a multi-field as both a numeric and a `keyword`. However, doing so requires changing thousands of lines of client-side code to choose between the fields for terms vs range queries.
- Using a `keyword` field also means we must eagerly initialize global ordinals to support fast terms aggregations, because they require global ordinals that are lazily loaded rather than the doc values that numeric fields have. Eagerly initialized global ordinals have demonstrated a 10-15x query speed improvement, but also lead to increased indexing overhead and memory utilization. We do have certain high-cardinality numeric identifiers where we perform routine terms aggregations (e.g. for customer identifiers, which can reach into the millions of unique terms). Again, perhaps executing the terms aggregation on a numeric within a multi-field will solve this, but it leads to heavy client-side code changes.
- Using a `keyword` means that `terms` queries get split into a Boolean query with separate disjunctive clauses for each term. This causes the query to fail if the term count is higher than the configured `max_clause_count` setting, even though it may be well within the configured `max_terms_count` index setting of 65,536.

All in all, for our application we created a plugin that re-implements the native ES numeric fields, indexing the terms into Lucene Points AND into an inverted terms index. Doing so has provided several benefits:
- Large `terms` queries no longer fail the way they can on a `keyword` due to `max_clause_count` (whose default is far less than the `max_terms_count` for a terms query).

It would be much simpler if we could index the data into the appropriate data structures within a single field, where the user can opt in to enabling a terms index, similar to how doc values can be enabled & disabled. In this approach, by default, numerics will continue to operate as they always have. However, the user may specify a field setting called `terms: (true|false)` on the index mapping to define whether the numeric should additionally be indexed into an inverted terms index. I've applied the change to all numerics and `date` fields.

Spot the improvement in our query response times when we switched to using a numeric terms index to fulfill our term/terms queries:
I am posting this PR more as a conversation starter. I understand if this may not be the Elastic community's preferred approach, but I do hope the community will consider some of these changes given the limitations of the current system and the substantial benefits we have seen on our own workloads as described above.
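
For readers comparing the two approaches discussed above, the following sketch builds the mapping and query bodies as plain Python dictionaries. The field name `customer_id` and all helper names are hypothetical. The `terms: true` mapping parameter is the one proposed in this PR, not an option in released Elasticsearch, so treat that part as an assumption about how the setting would be used.

```python
FIELD = "customer_id"  # hypothetical numeric identifier field


def multifield_mapping():
    """Current workaround: a long for range queries plus a keyword sub-field for terms."""
    return {
        "properties": {
            FIELD: {"type": "long", "fields": {"keyword": {"type": "keyword"}}}
        }
    }


def proposed_mapping():
    """Single field using the opt-in `terms` parameter proposed in this PR (assumed shape)."""
    return {"properties": {FIELD: {"type": "long", "terms": True}}}


def field_for_query(query_type, use_multifield):
    """With the workaround, clients must route term/terms queries to the sub-field."""
    if use_multifield and query_type in ("term", "terms"):
        return f"{FIELD}.keyword"
    return FIELD


def terms_query(ids, use_multifield):
    """Build a terms query body; the keyword sub-field is matched with stringified ids."""
    field = field_for_query("terms", use_multifield)
    values = [str(i) for i in ids] if use_multifield else list(ids)
    return {"query": {"terms": {field: values}}}


def range_query(low, high, use_multifield):
    """Build a range query body; ranges always target the numeric field."""
    field = field_for_query("range", use_multifield)
    return {"query": {"range": {field: {"gte": low, "lte": high}}}}
```

The `field_for_query` branch is the client-side routing that the multi-field workaround forces on every caller. With the proposed single-field mapping, both query types target `customer_id` directly.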