ILM Phase Execution on Index Count, Aggregate Size, or FIFO #47764
Comments
Pinging @elastic/es-core-features (:Core/Features/ILM+SLM)
Just came here to post this exact issue. Index-count-based phases would be incredibly useful. Index rollover happens at xGB, which is great, and my indices are always the same size, but setting retention based on days is causing all kinds of problems for me.
Hi, has there been any progress on this issue? I'm curious why @jakelandis added the 'high hanging fruit' label; could someone elaborate on why this is difficult? It looks like there have been a couple of other issues posted relating to this as well: #49392 #52308
@hamishforbes It is high-hanging fruit because of the architecture of ILM, which is oriented around managing a single index at a time, whereas the request here is to manage a group of indices (e.g., whose names share a common prefix). Since that requires a fundamental rearchitecture/an investment in new infrastructure in the codebase, there isn't a quick win here.
Ah I see, because an ILM policy can apply to multiple groups of indices. That makes sense, thanks for the insight!
Just wanted to add a +1 to this. Defining an ILM policy solely on size is absolutely critical for inconsistent workloads. There are many examples/scenarios, but one I personally experience is how hard it is to size intake for network-related data. Even if I can roughly estimate one site, additional sites rarely follow the same principles (number of people, type of traffic, site function (datacenter vs. office), and many other factors). The possibility of FIFO would ease things even further. At this point I know of many deployments that still haven't found a balance between the amount of data to keep and availability, so deployments are wasting resources yet still getting rid of data too soon, for fear of a sudden intake causing downtime.
That's interesting and definitely seems like a viable alternative! :) Unfortunately, as a customer of Elastic Cloud, Curator is not an option (unless I have it running somewhere else, which kind of defeats the purpose of EC in the first place).
On Fri, 24 Apr 2020 at 08:50, Hamish Forbes wrote:
FWIW I have since disabled the delete phase in my ILM policy and switched back to elasticsearch-curator, using an index prefix and count for retention. Here's a graph of % free space across my logging cluster; I don't think I need to point out which day I made the switch :)
[image: Screenshot 2020-04-24 at 08 47 32]
<https://user-images.githubusercontent.com/1282135/80187881-48f38280-8608-11ea-849d-45bb9a8f4b5f.png>
+1. ILM needs to handle the overall index lifecycle instead of a single index at a time. Curator is an option, but it is just another workaround for functionality that should exist as a core component.
At the very least, having a condition based on index count would be great, and I don't think it would be too complicated.
hey! what is the state of this issue? curious to know and help if I can |
Apart from a "me too" on this request, let me also add my use case for this functionality. Similar, but slightly different. We have one collection of time-based log index series that is important: there must always be space available in the cluster to ingest new logs for these series. There are also other, less-important log index series in the cluster. I want to set a hard limit on the size of the less-important indices, to make sure that a badly behaving less-important service cannot fill up disk space and cause the important indices to go read-only. The number of indices in a series, or the cumulative size (in bytes or documents) of all indices in a series: any one would do. When this limit has been reached, let ILM execute an action like "delete the oldest index" or "reject writes to the indices".
I'm getting started on ILM with our ECE deployment, and I was surprised to find that I am unable to trigger a phase change based on the number of indices sitting behind a data stream in the hot tier. We have around 40 data streams in our legacy cluster, which uses date-based suffixes on yearly, monthly, weekly, and daily rotation strategies. I managed to classify these into 6 different ILM policies based on index size for rollover and the number of indices to keep in the hot and frozen tiers. However, if I am limited to age, I will need to create 40 different ILM policies to get a similar effect. I am fine not basing retention on age, as the limiting factor of the cluster is storage; using age to define retention seems short-sighted when there are practical limits to RAM-to-storage ratios on licensed capacity. By using size, we can predictably create safe limits for data streams that stay within the storage constraints of our architecture.
Pinging @elastic/es-data-management (Team:Data Management)
There are some interesting use cases here for sure! I can see why it's not always about retention days, because that does not typically answer how much data, from a storage perspective, is being retained. As a security engineer responsible for data resources in the stack, I may have new log sources or ones that are unpredictable in data consumption. Because of this, I would like to set a maximum storage consumption on a data stream that allows the oldest index or indices to be removed once the respective threshold in GB is reached.
ILM phases outside of hot rely exclusively on `min_age` for execution. There is currently no way to execute phases on any other criteria, which leaves Elasticsearch susceptible to out-of-space emergencies when indexes grow slowly over time. Age-based execution may be advantageous for policy reasons (keep abc logs for xyz months), but it is not useful for resource maximization (I want to use 90% of disk space).

Executing phases based on index count or aggregate size promotes better resource usage. I'm more interested in keeping as many indexes as my infrastructure will allow. I see a few ways to achieve that.
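For context, a delete phase today can only be gated on `min_age`, roughly like the policy below. This is a sketch of a standard ILM policy; verify the exact field names (e.g., `max_size` vs. newer rollover conditions) against your Elasticsearch version:

```json
PUT _ilm/policy/logs
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_size": "10gb" }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": { "delete": {} }
      }
    }
  }
}
```

Nothing in this structure lets the delete phase consider how many indexes exist or how much space they occupy, which is exactly the gap this issue describes.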
Execute phases based on index count. This model would allow you to define fixed index counts within each policy. The advantage is that this is easy. For example: I'd like to roll over hot after 10GB and keep 9 indexes in warm. This policy would never grow past 100GB.
Execute phases based on aggregate size. This model would allow you to define cumulative index sizes within a phase. The advantage is that this is also easy, but covers more corner cases than a simple count. For example: I'd like to roll over hot after 10GB or 2 days, and keep 90GB of indexes in warm. This policy would keep as much data as possible within the defined aggregate bounds. Perhaps the daily indexes grow to 10GB but the weekend indexes grow to only 4GB; this would ensure you keep as much data in the policy as possible.
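The first two approaches boil down to the same selection logic: walk the series oldest-first and delete until both the count limit and the aggregate-size limit are satisfied. A minimal sketch of that logic in plain Python (this is not ILM code; the function name and signature are invented for illustration):

```python
from typing import List, Optional, Tuple

def indices_to_delete(
    indices: List[Tuple[str, int]],        # (name, size_in_bytes), oldest first
    max_count: Optional[int] = None,        # hypothetical index-count limit
    max_total_bytes: Optional[int] = None,  # hypothetical aggregate-size limit
) -> List[str]:
    """Return the oldest indices to delete so that the remaining series
    satisfies both the count limit and the aggregate-size limit."""
    keep = list(indices)
    doomed: List[str] = []
    while keep:
        too_many = max_count is not None and len(keep) > max_count
        total = sum(size for _, size in keep)
        too_big = max_total_bytes is not None and total > max_total_bytes
        if not (too_many or too_big):
            break
        # Remove the oldest index first (FIFO order within the series).
        name, _ = keep.pop(0)
        doomed.append(name)
    return doomed
```

For example, with four 10GB/4GB indexes totalling 34GB and `max_total_bytes=25`, only the oldest index needs to go; with `max_count=2` as well, the two oldest are deleted.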
Execute phases based on FIFO. At a high level, remove the oldest indexes within the cluster on a first-in, first-out basis. You define an operating threshold for the cluster and enforce a delete phase when you reach it. The advantage is that this is truly disaster-proof (i.e., no more `read_only_allow_delete`!!). For example: my 1TB cluster should remove the oldest index when my indexes use more than 90% of disk space, or 900GB.
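Put into policy form, the FIFO idea might look something like the sketch below. To be clear, `delete_when` and `max_cluster_disk_percent` do not exist in ILM today; this is purely hypothetical syntax to illustrate the request:

```json
PUT _ilm/policy/logs-fifo
{
  "policy": {
    "phases": {
      "hot": {
        "actions": { "rollover": { "max_size": "10gb" } }
      },
      "delete": {
        "delete_when": { "max_cluster_disk_percent": 90 },
        "actions": { "delete": {} }
      }
    }
  }
}
```

The key difference from today's policies is that the delete phase would be triggered by a cluster-level storage condition rather than by each index's own `min_age`.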