ILM Phase Execution on Index Count, Aggregate Size, or FIFO #47764

Open
woodchalk opened this issue Oct 9, 2019 · 15 comments
Labels
:Data Management/ILM+SLM Index and Snapshot lifecycle management >feature high hanging fruit Team:Data Management Meta label for data/management team

Comments

@woodchalk

ILM phases outside of hot rely exclusively on min_age for execution. There is currently no way to execute phases on any other criteria, which leaves Elasticsearch susceptible to out-of-space emergencies when indexes increase slowly over time. Age-based execution may be advantageous to policy (keep abc logs for xyz months), but it is not useful for resource maximization (I want to use 90% of disk space).

Executing phases based on the count of indexes or aggregate sizes promotes better resource usage. I’m more interested in keeping as many indexes as my infrastructure will allow. I see a few ways to achieve that.

Execute phases based on index count. This model would allow you to define fixed index counts within each policy. The advantage is that this is easy. For example: I’d like to rollover hot after 10GB, and keep 9 indexes in warm. This policy would never grow past 100GB (one 10GB hot index plus nine 10GB warm indexes).
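
As a rough illustration of the count-based model (this is not how ILM works today), the selection logic might look like the sketch below. `IndexInfo`, `indices_over_count`, and the index names are hypothetical, and the sketch assumes the creation time of every index in the series is already known.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexInfo:
    name: str        # hypothetical index name
    created_ms: int  # creation time in epoch milliseconds


def indices_over_count(indices, keep):
    """Return the indices beyond the newest `keep`, oldest last."""
    if keep < 0:
        raise ValueError("keep must be non-negative")
    newest_first = sorted(indices, key=lambda i: i.created_ms, reverse=True)
    return newest_first[keep:]


# Twelve warm indices; the policy keeps only the nine newest.
warm = [IndexInfo(f"logs-{n:06d}", created_ms=n * 1000) for n in range(1, 13)]
expired = [i.name for i in indices_over_count(warm, keep=9)]
```

Here `expired` holds the three oldest indices, which would be the ones to move to the next phase.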

Execute phases based on aggregate size. This model would allow you to define cumulative index sizes within a phase. The advantage is that this is also easy, but covers more corner cases than a simple count. For example: I’d like to rollover hot after 10GB or 2 days, and keep 90GB of indexes in warm. This policy would keep as much data as possible within the aggregate bounds defined. Perhaps the daily indexes grow to 10GB but the weekend indexes grow to only 4GB; an aggregate limit would ensure you keep as much data in the policy as possible.
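
A minimal sketch of the aggregate-size model, under the assumption that each index's size and creation time are available. `IndexInfo`, `indices_over_budget`, and the sample data are hypothetical and only illustrate the idea; nothing here is an existing ILM API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexInfo:
    name: str        # hypothetical index name
    created_ms: int  # creation time in epoch milliseconds
    size_gb: float   # on-disk size of the index


def indices_over_budget(indices, budget_gb):
    """Keep the newest indices that fit in `budget_gb`; return the rest."""
    newest_first = sorted(indices, key=lambda i: i.created_ms, reverse=True)
    kept_gb = 0.0
    expired = []
    for index in newest_first:
        # Once one index no longer fits, every older index expires too,
        # so the retained data stays a contiguous time window.
        if not expired and kept_gb + index.size_gb <= budget_gb:
            kept_gb += index.size_gb
        else:
            expired.append(index)
    return expired


# Oldest first: five weekday indices, two weekend indices, five more weekdays.
sizes_gb = [10, 10, 10, 10, 10, 4, 4, 10, 10, 10, 10, 10]
warm = [IndexInfo(f"logs-{n:06d}", n * 1000, size)
        for n, size in enumerate(sizes_gb, start=1)]
expired = [i.name for i in indices_over_budget(warm, budget_gb=90)]
```

With a 90GB budget this sample keeps the ten newest indices (88GB), whereas a fixed count of nine would keep only 78GB of the same data.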

Execute phases based on FIFO. At a high level, remove the oldest indexes within the cluster on a first in first out basis. You define an operating threshold for a cluster and enforce a delete phase when you reach it. The advantage is that this is truly disaster-proof (i.e. no more read_only_allow_delete!!). For example: My 1TB cluster should remove the oldest index when my indexes use more than 90% disk space or 900GB.
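
The FIFO idea could be sketched roughly as follows. `fifo_evictions` and the sample data are hypothetical, and the sketch makes a simplifying assumption: disk usage is taken to be the sum of the index sizes, ignoring replicas, per-node watermarks, and anything else on disk.

```python
def fifo_evictions(indices, capacity_gb, max_used_fraction=0.9):
    """Pick the oldest indices to delete until usage is under the threshold.

    `indices` is a list of (name, created_ms, size_gb) tuples.
    """
    limit_gb = capacity_gb * max_used_fraction
    used_gb = sum(size for _, _, size in indices)
    evicted = []
    # Walk the indices oldest first and stop as soon as usage is acceptable.
    for name, _, size in sorted(indices, key=lambda t: t[1]):
        if used_gb <= limit_gb:
            break
        evicted.append(name)
        used_gb -= size
    return evicted


# Ten 95GB indices (950GB) on a 1TB cluster with a 90% threshold.
cluster = [(f"logs-{n:06d}", n * 1000, 95) for n in range(1, 11)]
evicted = fifo_evictions(cluster, capacity_gb=1000)
```

In this sample only the single oldest index is selected, because deleting it brings usage from 950GB down to 855GB.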

@jimczi jimczi added :Data Management/ILM+SLM Index and Snapshot lifecycle management >feature labels Oct 9, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-core-features (:Core/Features/ILM+SLM)

@hamishforbes

hamishforbes commented Oct 9, 2019

Just came here to post this exact issue.

Index-count-based phases would be incredibly useful.
I'm in the process of moving various services over from Cloudwatch logging to logging into ELK.
This means my log volume is wildly variable.

Index rollover happens on xGB, this is great. My indices are all always the same size.
However I constantly have out of disk space issues because my indices are deleted after n days, but on Monday I might create 5 indices of x GB and on Tuesday only 1 index.

Setting retention based on days is causing all kinds of problems for me.
I either set my retention really low and have loads of (paid for) wasted disk space, or set it correctly for the current workload and have it blow up at 2am because my workload has changed, and losing a bunch of logs until I can fix it.

@hamishforbes

Hi, has there been any progress on this issue?
It's come up for us again because we've seen ~30-40% increase in traffic this week (COVID-19 related) which has caused a corresponding increase in log data and our elastic cluster blew up at 4am due to running out of disk, again.

I'm curious why @jakelandis added the 'high hanging fruit' label, could someone elaborate on why this is difficult?
It seems like adding a count condition as well as max_age would be very simple, but I'm not familiar with the codebase.

It looks like there's been a couple of other issues posted relating to this as well #49392 #52308

@jasontedor
Member

@hamishforbes It is high-hanging fruit because of the architecture of ILM, which is oriented around managing a single index at a time, but the request here is to manage a group of indices (e.g., whose names share a common prefix). Since that would be a fundamental rearchitecture and requires an investment in new infrastructure in the codebase, there isn't a quick win here.

@hamishforbes

Ah I see, because an ILM policy can apply to multiple groups of indices. That makes sense, thanks for the insight!

@0xtf

0xtf commented Apr 23, 2020

Just wanted to add a +1 to this.

Defining an ILM policy solely on size is absolutely critical for inconsistent workloads.

There are many examples/scenarios, but one I personally experience is how hard it is to size intake for network-related data. Even if I can make a rough estimate for one site, additional sites will hardly follow the same pattern (number of people, type of traffic, site function (datacenter vs. office), and many other factors).

The possibility of FIFO would ease things even further.

At this point I know of many deployments that still haven't found a balance between amount of data to keep and availability, so they are wasting resources yet still getting rid of data too soon, for fear of a sudden intake causing downtime.

@hamishforbes

FWIW I have since disabled the delete phase in my ILM policy and switched back to elasticsearch-curator using an index prefix and count for retention

Here's a graph of % free space across my logging cluster, I don't think I need to point out which day I made the switch on :)
[Screenshot 2020-04-24 at 08:47:32: graph of % free space across the logging cluster]

@0xtf

0xtf commented Apr 24, 2020 via email

@cataclysdom

+1

ILM needs to handle the overall index lifecycle instead of a single index at a time. Curator is an option, but just another workaround for functionality that should exist as a core component.

@fbaligand
Contributor

At least, having a condition based on index count would be great, and not so complicated I think.
ILM is based on an alias, so it can know how many indices are linked to the alias (BTW, Kibana ILM policy management shows this info). Then, choose the oldest index linked to the alias, when index count is over the limit, to perform the phase action.

@rjernst rjernst added the Team:Data Management Meta label for data/management team label May 4, 2020
@georgettica

hey! what is the state of this issue? curious to know and help if I can

@krejkrejkrej

Apart from a "me too" on this request, let me also add my use case for this functionality. Similar, but slightly different.

We have one collection of time-based log index series that are important; there must always be space available in the cluster to ingest new logs for these series. There are also other less-important log index series in the cluster. I want to set a hard limit on the size of the less-important indices, to make sure that a badly behaving less-important service cannot fill up disk space and cause the important indices to go read-only. Number of indices in a series, or the cumulative size, in bytes or documents, of all indices in a series - any one would do. When this limit has been reached - let ILM execute an action like "delete the oldest index" or "reject writes to the indices".
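
To make the limit described above concrete, here is a small sketch of a per-series quota check. `SeriesUsage`, `SeriesQuota`, and `quota_exceeded` are hypothetical names, not part of ILM; the sketch assumes the totals for a series have already been collected, and it leaves the resulting action (delete the oldest index, or reject writes) to the caller.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SeriesUsage:
    index_count: int  # number of indices in the series
    total_bytes: int  # cumulative size of all indices
    total_docs: int   # cumulative document count


@dataclass(frozen=True)
class SeriesQuota:
    # Any limit left as None is not enforced.
    max_indices: Optional[int] = None
    max_bytes: Optional[int] = None
    max_docs: Optional[int] = None


def quota_exceeded(usage, quota):
    """Return True if the series is over any configured limit."""
    checks = [
        (usage.index_count, quota.max_indices),
        (usage.total_bytes, quota.max_bytes),
        (usage.total_docs, quota.max_docs),
    ]
    return any(limit is not None and value > limit for value, limit in checks)


GIB = 1024 ** 3
usage = SeriesUsage(index_count=5, total_bytes=60 * GIB, total_docs=1_000_000)
over_limit = quota_exceeded(usage, SeriesQuota(max_bytes=50 * GIB))
```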

@berglh
Contributor

berglh commented Nov 17, 2022

I'm getting started on ILM with our ECE deployment, and I was surprised to find that I am unable to trigger a phase change based on the number of indices sitting behind a data stream in the hot tier. We have around 40 data streams in our legacy cluster which uses date-based suffixes on yearly, monthly, weekly and daily rotation strategies. I managed to classify these into 6 different ILM policies based on size of index for rollover and number of indices to keep in hot and frozen tiers. However, if I am limited by age, I will need to create 40 different ILM policies to get a similar effect. I am fine not basing retention on age, as the limiting factor of the cluster is the storage; using age to define retention seems short-sighted when there are practical limits to RAM-to-storage ratios on licensed capacity. By using size, we can predictably create safe limits for data streams that will stay within the storage constraints of our architecture.

@tylerperk tylerperk self-assigned this Sep 5, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@nicpenning

There are some interesting use cases here for sure!

I can see where it's not always about retention days because that does not typically answer how much data from a storage perspective is being retained.

As a security engineer of data resources in the stack, I may have new log sources or ones that are unpredictable in data consumption. Because of this, I would like to set a max storage consumption on a data stream that allows the oldest index or indices to be removed once the respective threshold in GB is reached.
