Numeric terms support #94048

Open: wants to merge 2 commits into base: main


@rkophs rkophs commented Feb 23, 2023

Issue: #94047

Background

Before Elasticsearch 5.x, numerics and dates were indexed into an inverted terms index. However, this PR changed the underlying data structure to Lucene points (implemented as block KD trees, also called BKD trees). Lucene points offer performance gains for range queries, but they are expected to perform worse for exact-match queries (i.e. term/terms queries). This poses a problem when numeric identifiers, whose query patterns tend to be term-heavy, are stored as numbers.
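To make the distinction concrete, here is a toy sketch of the two access patterns. It is not how Lucene implements either structure: the inverted index is modelled as a plain dictionary of postings lists and the points structure as a flat sorted list (a real BKD tree is block-based). The data and function names are made up for illustration.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict

# Toy data: (doc_id, numeric value) pairs. Purely illustrative.
docs = [(0, 42), (1, 7), (2, 42), (3, 100), (4, 7)]

# Inverted terms index: each distinct value maps directly to its postings list.
inverted = defaultdict(list)
for doc_id, value in docs:
    inverted[value].append(doc_id)

# Stand-in for a points structure: entries sorted by value.
# A real BKD tree is a block-based tree, not a flat sorted list.
points = sorted((value, doc_id) for doc_id, value in docs)
values = [value for value, _ in points]


def term_lookup(value):
    """Exact match: a single lookup returns the postings list."""
    return inverted.get(value, [])


def range_lookup(low, high):
    """Inclusive range match: two binary searches bound a contiguous slice."""
    start = bisect_left(values, low)
    end = bisect_right(values, high)
    return sorted(doc_id for _, doc_id in points[start:end])
```

The exact-match path is one lookup per term, while the sorted structure is a natural fit for a contiguous range; this toy model says nothing about real-world timings.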

Elastic's recommendation from ES 5.x onward is to store numeric identifiers as a stringified keyword in order to get the query performance boost for term/terms queries. This poses several limitations:

  • It is not intuitive that a numeric identifier should be stored as a keyword rather than as a long, integer, etc. unless the user has intimate knowledge of modern ES's data structures. This is especially true because other databases/stores (MySQL, or just about any other structured DB) would recommend using a numeric type, and because the guidance prior to ES 5.x was to use a numeric identifier. Furthermore, the guidance in the ES documentation is somewhat hidden and hard to find, both for new ES users and for long-time users who are accustomed to pre-5.x behavior.
  • Using a keyword means we lose the ability to do range queries while preserving the numeric ordering of the terms. Of course, this can be overcome by indexing the number into a multi-field as both a numeric and a keyword. However, doing so requires changing thousands of lines of client-side code to choose between the fields for terms vs range queries.
  • Using a keyword field also means we must eagerly initialize global ordinals to support fast terms aggregations, because keyword fields rely on global ordinals that are lazily loaded, rather than on the doc values that numeric fields use. Eagerly initialized global ordinals have demonstrated a 10-15x query speed improvement, but they also lead to increased indexing overhead and memory utilization. We do have certain high-cardinality numeric identifiers on which we perform routine terms aggregations (e.g. customer identifiers, which can reach into the millions of unique terms). Again, executing the terms aggregation on the numeric field within a multi-field might solve this, but it leads to heavy client-side code changes.
  • Using a keyword means that terms queries get split into a Boolean query with a separate disjunctive clause for each term. This causes the query to fail if the term count is higher than the configured max_clause_count setting, even though it may be well within the configured max_terms_count index setting of 65,536.
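The multi-field workaround mentioned in the list above, and the client-side branching it forces, can be sketched as follows. The field name `customer_id` and the helper functions are hypothetical; the mapping uses the standard `fields` multi-field parameter, expressed here as a Python dict that would be sent as the JSON mapping body.

```python
# Hypothetical field name used for illustration only.
FIELD = "customer_id"

# Multi-field mapping: the numeric parent serves range queries,
# the keyword sub-field serves term/terms queries.
multi_field_mapping = {
    "properties": {
        FIELD: {
            "type": "long",
            "fields": {
                "keyword": {"type": "keyword"},
            },
        }
    }
}


def field_for(query_type):
    """Pick the field a client has to target for a given query type."""
    if query_type in ("term", "terms"):
        return f"{FIELD}.keyword"
    return FIELD


def build_query(query_type, body):
    """Build a query body against whichever field suits the query type."""
    return {"query": {query_type: {field_for(query_type): body}}}
```

Every call site that builds a query has to go through something like `field_for`, which is the client-side change the description refers to.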

All in all, for our application we created a plugin that re-implements the native ES numeric fields, indexing the terms into Lucene points AND into an inverted terms index. Doing so has provided several benefits:

  • We have seen query speeds for term/terms queries improve 10-100x for some of our larger workloads.
  • No client-side code changes were necessary to handle a multi-field.
  • Doc-values are used for the terms aggregations (no need to eagerly initialize global ordinals) leading to faster response times for terms aggregations, faster indexing speed and much smaller memory overhead.
  • We no longer have failing terms queries on a keyword due to max_clause_count (whose default is far less than the max_terms_count for a terms query).
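The clause-limit failure described earlier comes down to simple arithmetic, sketched below. It assumes, as the description states, that each term becomes one Boolean clause. The function name is made up, and the clause limit is passed in explicitly because its effective value depends on the cluster configuration; the 1024 used in the usage note is only an example value.

```python
def terms_query_failure(term_count, max_clause_count, max_terms_count=65_536):
    """Return the name of the limit a terms query would exceed, or None.

    Assumes one Boolean clause per term, as described in this PR.
    """
    if term_count > max_terms_count:
        return "max_terms_count"
    if term_count > max_clause_count:
        return "max_clause_count"
    return None
```

With an example clause limit of 1024, `terms_query_failure(5000, 1024)` reports `"max_clause_count"` even though 5000 terms is well under the 65,536 terms limit.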

It would be much simpler if we could index the data into the appropriate data structures within a single field, where the user can opt in to a terms index much as doc values can be enabled and disabled. In this approach, numerics continue to operate by default as they always have. However, the user may specify a field setting called terms: (true|false) in the index mapping to define whether the numeric should additionally be indexed into an inverted terms index. I've applied the change to all numeric and date fields.
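A mapping using the proposed opt-in might look like the sketch below. The `terms` setting is the one proposed in this PR and should not be assumed to exist outside it; the field names and the helper function are hypothetical. The mapping is again a Python dict standing in for the JSON body.

```python
def numeric_field(field_type, terms=False):
    """Build a field mapping; `terms` is the opt-in setting proposed in this PR."""
    mapping = {"type": field_type}
    if terms:
        # Proposed setting: additionally index into an inverted terms index.
        mapping["terms"] = True
    return mapping


# Hypothetical index mapping: one opted-in identifier, one default date field.
index_mapping = {
    "properties": {
        "customer_id": numeric_field("long", terms=True),
        "created_at": numeric_field("date"),
    }
}

# Both query types target the same field, so no client-side branching is needed.
terms_query = {"query": {"terms": {"customer_id": [1, 2, 3]}}}
range_query = {"query": {"range": {"customer_id": {"gte": 1, "lte": 3}}}}
```

Fields that omit the setting keep the current behavior, which matches the default described above.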

Spot the improvement in our query response times when we switched to using a numeric terms index to fulfill our term/terms queries:
[Image: Screenshot 2023-02-22 at 23 25 11]

I am posting this PR more as a conversation starter. I understand if this may not be the Elastic community's preferred approach, but I do hope the community will consider some of these changes given the limitations of the current system and the substantial benefits we have seen on our own workloads as described above.

The inverted terms index can be orders of magnitude faster for
term & terms queries when compared with Lucene Points (i.e. bkd trees)

bkd trees are optimized for range-style queries but not for exact matches
@elasticsearchmachine elasticsearchmachine added needs:triage Requires assignment of a team area label external-contributor Pull request authored by a developer outside the Elasticsearch team v8.8.0 labels Feb 23, 2023
@iverase iverase added :Search Foundations/Mapping Index mappings, including merging and defining field types and removed needs:triage Requires assignment of a team area label labels Feb 23, 2023
@elasticsearchmachine elasticsearchmachine added the Team:Search Meta label for search team label Feb 23, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search (Team:Search)

@iamkeyur

This would be really nice to have -- not storing numerics in the inverted index is a huge regression for us. Can we look into this? We get that you can store them as keywords instead, but due to limitations in our existing client-side infrastructure it is too much of a lift to index all numerics as keyword+number multi-fields in order to support both fast terms queries and range queries. Choosing client-side which of the two fields to query would be a massive overhaul for our ecosystem.

@elasticsearchmachine elasticsearchmachine added v8.16.0 Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch and removed v8.15.0 labels Jul 4, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search-foundations (Team:Search Foundations)

@elasticsearchmachine elasticsearchmachine removed the Team:Search Meta label for search team label Jul 4, 2024
Labels
>enhancement external-contributor Pull request authored by a developer outside the Elasticsearch team :Search Foundations/Mapping Index mappings, including merging and defining field types Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch v9.0.0