_id-less indices #48699
Comments
Pinging @elastic/es-distributed (:Distributed/Distributed) |
We discussed two special kinds of indices that could work without ids:
Read-only indices would be much simpler to implement and would fit nicely into index lifecycle management. Their main drawback is that savings are delayed to a later time, so this might not address the perception that Elasticsearch uses a lot of disk space, and Elasticsearch would still perform poorly disk-usage-wise in benchmarks unless turned into a read-only index at the end of the benchmark. Append-only indices would address these points but also bring more complexity in order to avoid duplicates because of client-side retries, or because of retries on the coordinating node. There is potential for better compression of these fields, but reducing by 20% would already be huge and would only reduce the overall index size by 10% assuming that Since the relative overhead of Next steps:
|
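The arithmetic behind the "20% better compression gives only about 10% overall" remark can be sketched as follows. The sentence stating the underlying assumption is truncated in this thread, so the 50% share used here is an assumption, borrowed from the issue description's "more than 50% of the index size" figure for tiny documents; the function name is purely illustrative.

```python
def overall_reduction(field_share: float, field_reduction: float) -> float:
    """Fraction of the total index size saved when one field shrinks.

    field_share:     fraction of the index occupied by the field (0..1)
    field_reduction: fraction by which that field's size is reduced (0..1)
    """
    for value in (field_share, field_reduction):
        if not 0.0 <= value <= 1.0:
            raise ValueError("fractions must be between 0 and 1")
    return field_share * field_reduction


# ASSUMPTION: the field accounts for 50% of the index (tiny documents).
ASSUMED_SHARE = 0.5

# Compressing the field 20% better saves 0.5 * 0.2 of the whole index.
better_compression = overall_reduction(ASSUMED_SHARE, 0.2)

# Dropping the field entirely saves its whole share.
dropped = overall_reduction(ASSUMED_SHARE, 1.0)

print(f"better compression: {better_compression:.0%}, dropped: {dropped:.0%}")
```

Under that assumed share, better compression recovers a fifth of what dropping the field would, which is one way to read the motivation for the id-less approach.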
Dropping the I think there is a place for both kinds of Both kinds of implementations would be centered around a merge policy that drops the |
Some data points regarding how much we can expect to save with this change:
|
I'm curious about this one because the _id percentage seems very large. Are Elasticsearch logs ingested with the Elasticsearch module in Beats or directly ingested with just the |
@tsg The data can't be made public but I'll share more details privately with you. |
One aspect that I had not considered until now is that dropping |
I am trying to understand the full impact of that change, and there is one thing not entirely clear to me yet: If we'd have We rely in Kibana in a lot of places on uniquely identifying objects by A non-exhaustive list of features currently using
|
What @timroes mentioned for discover applies similarly to the Logs UI. We use the
Even if we had a different way to uniquely fetch specific docs it would ideally have to be applicable across all types of indices (datastream-backing or not). Otherwise separate code paths could represent an increased maintenance burden for many such single-document use-cases. |
@timroes @weltenwort This is the hardest part of this change indeed. Removing We have made good progress on other major contributors to storage-induced costs (introduction of the Cold and Frozen tiers, better compression of stored fields, introduction of In my opinion, the high disk footprint of There have been a few ideas for how we could make this change easier on applications built on top of Elasticsearch. In some cases, documents can be uniquely identified through a few other fields, e.g. metrics samples can be uniquely identified by the combination of their |
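As an illustration of identifying a document through a combination of other fields, here is a minimal sketch that derives a deterministic identifier by hashing a chosen set of field values. This is not Elasticsearch's implementation. The comment above is truncated and does not say which fields would be used, so the field names (`host`, `metric`, `@timestamp`) and the helper `derive_id` are hypothetical.

```python
import hashlib
import json


def derive_id(doc: dict, key_fields: list[str]) -> str:
    """Derive a stable identifier from the values of key_fields.

    Two documents with the same values for key_fields get the same
    identifier, regardless of their other fields.
    """
    missing = [f for f in key_fields if f not in doc]
    if missing:
        raise KeyError(f"missing key fields: {missing}")
    # Canonical form: fields in sorted order, compact JSON encoding.
    canonical = json.dumps(
        [[f, doc[f]] for f in sorted(key_fields)],
        separators=(",", ":"),
        sort_keys=True,
    )
    # Truncated SHA-256 hex digest; the length is an arbitrary choice here.
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:32]


# Hypothetical metrics sample and hypothetical identifying fields.
KEY_FIELDS = ["host", "metric", "@timestamp"]
sample = {
    "host": "web-1",
    "metric": "cpu.user",
    "@timestamp": "2021-06-01T12:00:00Z",
    "value": 0.42,
}

print(derive_id(sample, KEY_FIELDS))
```

A consumer that needs to fetch a single document could recompute the same identifier from the stored field values, so nothing extra has to be kept per document. Whether that is practical depends on the fields really being unique together, which this sketch does not check.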
While the argument for omitting ids from a storage perspective is quite clear, the story around reading id-less documents is quite murky. I'd like to turn this around - "I want to read data (documents??) and I DON'T WANT IDS!" - how does that work? What is the user story? |
Commenting from our (profiling) perspective: The usual case for profiling data is write-once, retain for a given time period (perhaps 90 days), and read during that time. Data that is a few days old would never need to be changed, but storage efficiency for individual events is crucial. |
To add more details to what @thomasdullien is pointing out: For profiling events we currently are at ~45 bytes per event. Without |
So tsdb has started thinking a lot about So we build an And most things should just work with it. Kibana can search by Such a path may be an alternative, at least in the short term, to fully For tsdb I've been wondering if we can go further and drop the inverted index on the |
Various experiments over time have highlighted that the `_id` field, and to some extent the `_seq_no` field, use non-negligible disk space. When indexing tiny documents, these two fields combined can end up using more than 50% of the index size. Are there conditions in which we could enable users to drop the `_id` field in order to save resources?