Numeric terms support #94048

rkophs · 2023-02-23T04:29:52Z

Issue: #94047

Background

Before Elasticsearch 5.x, numerics and dates were indexed into an inverted terms index. In 5.x the underlying data structure changed to Lucene points (implemented as BKD trees, i.e. block k-d trees). Lucene points offer performance gains for range queries, but they are expected to perform worse for exact-match queries (i.e. term/terms queries). This poses a problem when numeric identifiers, whose query patterns tend to be term-heavy, are stored as numbers.

Elastic's recommendation from ES 5.x forward is to store numeric identifiers as a stringified keyword in order to get the query performance boost for term/terms queries. This poses several limitations:

It is not intuitive that a numeric identifier should be stored as a keyword rather than as a long, integer, etc. unless the user has intimate knowledge of modern ES data structures. This is especially true because other stores (MySQL and just about any other structured database) would recommend a numeric type, and because the guidance prior to ES 5.x was to use a numeric type. Furthermore, the guidance in the ES documentation is hard to find, both for new ES users and for long-time users who are accustomed to pre-5.x functionality.
Using a keyword means we lose the ability to run range queries that preserve the numeric ordering of the terms. This can be overcome by indexing the number into a multi-field as both a numeric and a keyword. However, doing so requires changing thousands of lines of client-side code to choose between the two fields for terms vs. range queries.
Using a keyword field also means we must eagerly initialize global ordinals to support fast terms aggregations, because keyword aggregations rely on global ordinals that are otherwise loaded lazily, whereas numeric fields aggregate directly on doc values. Eagerly initialized global ordinals have demonstrated a 10-15x query speed improvement, but they also increase indexing overhead and memory utilization. We have certain high-cardinality numeric identifiers on which we routinely run terms aggregations (e.g. customer identifiers, which can reach millions of unique terms). Again, executing the terms aggregation on the numeric half of a multi-field might solve this, but it leads to heavy client-side code changes.
Using a keyword means that terms queries get split into a Boolean query with a separate disjunctive clause for each term. This causes the query to fail if the term count is higher than the configured max_clause_count setting, even though it may be well within the configured max_terms_count index setting of 65,536.
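To make the client-side burden of the multi-field workaround concrete, here is a minimal sketch in Python. The field name `customer_id`, the sub-field name `keyword`, and the helper functions are illustrative assumptions, not code from this PR; only the multi-field mapping shape and the term/range query bodies follow standard Elasticsearch conventions.

```python
# Hypothetical mapping for the multi-field workaround: the same value is
# indexed once as a long (points + doc values) and once as a keyword
# (inverted index). `customer_id` is an illustrative field name.
MAPPING = {
    "properties": {
        "customer_id": {
            "type": "long",
            "fields": {
                "keyword": {"type": "keyword"},
            },
        }
    }
}


def field_for(query_type: str, field: str = "customer_id") -> str:
    """Pick the (sub-)field a client has to target for a given query type."""
    if query_type in ("term", "terms"):
        return f"{field}.keyword"  # exact matches go to the inverted index
    return field  # range queries go to the numeric points


def terms_query(values):
    # Keyword terms are strings, so numeric identifiers are stringified.
    return {"terms": {field_for("terms"): [str(v) for v in values]}}


def range_query(gte, lte):
    return {"range": {field_for("range"): {"gte": gte, "lte": lte}}}
```

Every call site that builds a query has to make this choice, which is the source of the client-side changes described above.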

All in all, for our application we created a plugin that re-implements the native ES numeric fields, indexing the terms into Lucene points AND into an inverted terms index. Doing so has provided several benefits:

We have seen query speeds for term/terms queries improve 10-100x for some of our larger workloads.
No client-side code changes were necessary to handle a multi-field.
Doc-values are used for the terms aggregations (no need to eagerly initialize global ordinals) leading to faster response times for terms aggregations, faster indexing speed and much smaller memory overhead.
We no longer have failing terms queries on a keyword due to max_clause_count (whose default is far less than the max_terms_count for a terms query).
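The idea of indexing one field into two structures can be illustrated with a toy model. This is a conceptual sketch only: the class and method names are made up, a dictionary of postings stands in for the inverted terms index, and a sorted list stands in for Lucene points. It does not reflect how the plugin or Lucene is actually implemented.

```python
import bisect
from collections import defaultdict


class DualIndexedNumericField:
    """Toy field that indexes each value into two structures."""

    def __init__(self):
        self._postings = defaultdict(list)  # value -> doc ids (inverted index stand-in)
        self._values = []  # sorted values (points stand-in)
        self._docs = []  # doc ids, kept parallel to _values

    def index(self, doc_id, value):
        self._postings[value].append(doc_id)
        i = bisect.bisect_right(self._values, value)
        self._values.insert(i, value)
        self._docs.insert(i, doc_id)

    def terms(self, values):
        # Exact matches: one direct lookup per term.
        hits = set()
        for v in values:
            hits.update(self._postings.get(v, ()))
        return sorted(hits)

    def range(self, gte, lte):
        # Range matches: two binary searches over the sorted values.
        lo = bisect.bisect_left(self._values, gte)
        hi = bisect.bisect_right(self._values, lte)
        return sorted(self._docs[lo:hi])
```

The caller queries a single field name in both cases; the routing to the right structure happens inside the field, which is why no client-side changes are needed.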

It would be much simpler if we could index the data into the appropriate data structures within a single field, where the user can opt in to a terms index similar to how doc values can be enabled and disabled. In this approach, by default, numerics will continue to operate as they always have. However, the user may specify a field setting called terms: (true|false) on the index mapping to define whether the numeric should additionally be indexed into an inverted terms index. I've applied the change to all numeric and date fields.
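As a rough illustration of the opt-in described above, the sketch below builds a mapping with the proposed `terms` setting. The `terms` parameter is the one proposed by this PR; the field names `customer_id` and `created_at` and the helper function are assumptions made for the example.

```python
def numeric_field_mapping(field_type="long", terms=False, doc_values=True):
    """Build a field mapping; `terms` is the opt-in setting proposed by this PR."""
    mapping = {"type": field_type, "doc_values": doc_values}
    if terms:
        # Additionally index the value into an inverted terms index.
        mapping["terms"] = True
    return mapping


# Hypothetical index mapping: only the identifier opts in to the terms index.
INDEX_MAPPING = {
    "properties": {
        "customer_id": numeric_field_mapping("long", terms=True),
        "created_at": numeric_field_mapping("date"),
    }
}

# Queries keep targeting the same field name for both query types.
TERMS_QUERY = {"terms": {"customer_id": [101, 202, 303]}}
RANGE_QUERY = {"range": {"customer_id": {"gte": 100, "lte": 300}}}
```

Fields that omit the setting keep the current behavior, so existing mappings are unaffected.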

Spot the improvement in our query response times when we switched to using a numeric terms index to fulfill our term/terms queries:

I am posting this PR more as a conversation starter. I understand if this may not be the Elastic community's preferred approach, but I do hope the community will consider some of these changes given the limitations of the current system and the substantial benefits we have seen on our own workloads as described above.

The inverted terms index can be orders of magnitude faster for term & terms queries when compared with Lucene points (i.e. BKD trees). BKD trees are optimized for range-style queries but not for exact matches.

elasticsearchmachine · 2023-02-23T12:11:55Z

Pinging @elastic/es-search (Team:Search)

iamkeyur · 2023-12-14T19:04:32Z

This would be really nice to have -- not storing numerics in the inverted index is a huge regression for us. Can we look into this? We get that you can store them as keywords instead but due to limitations in our existing client-side infrastructure it is too much of a lift to index all numerics as keyword+number multi-fields in order to both support fast terms & range queries. Choosing which of the two fields client-side to query would be a massive overhaul for our ecosystem.

elasticsearchmachine · 2024-07-04T09:12:34Z

Pinging @elastic/es-search-foundations (Team:Search Foundations)

rkophs added 2 commits February 22, 2023 15:20

add support for terms lookup for numeric fields

77b2e1d

add unit tests

7ce0d72

elasticsearchmachine added needs:triage, external-contributor, v8.8.0 labels Feb 23, 2023

iverase added :Search Foundations/Mapping and removed needs:triage labels Feb 23, 2023

elasticsearchmachine added the Team:Search label Feb 23, 2023

gmarouli added v8.9.0 and removed v8.8.0 labels Apr 26, 2023

stu-elastic added the >enhancement label Apr 27, 2023

pugnascotia added v8.10.0 and removed v8.9.0 labels Jun 22, 2023

benwtrent mentioned this pull request Jul 28, 2023

Optimize massive lookups #97947

Open

quux00 added v8.11.0 and removed v8.10.0 labels Aug 16, 2023

mattc58 added v8.12.0 and removed v8.11.0 labels Oct 4, 2023

brianseeders added v8.13.0 and removed v8.12.0 labels Dec 6, 2023

javanna mentioned this pull request Jan 9, 2024

Add back a terms index for numeric fields #94047

Open

elasticsearchmachine added v8.14.0 and removed v8.13.0 labels Feb 14, 2024

elasticsearchmachine added v8.15.0 and removed v8.14.0 labels Apr 17, 2024

elasticsearchmachine added v8.16.0, Team:Search Foundations and removed v8.15.0 labels Jul 4, 2024

elasticsearchmachine removed the Team:Search label Jul 4, 2024

mark-vieira added v9.0.0 and removed v8.16.0 labels Sep 11, 2024
