Rework ILM to not Require Inspecting all Indices on every Cluster State Update #80407

original-brownbear · 2021-11-05T11:21:31Z

At the moment ILM scales somewhat poorly as we move to very large numbers of indices. The reason for this is that org.elasticsearch.xpack.ilm.IndexLifecycleService#clusterChanged does a full inspection of all indices in the cluster state to see if there is work to be done by ILM.

This inspection of all the indices is itself fairly expensive because it requires (repeatedly) parsing per-index metadata into LifecycleExecutionState and, more importantly, calls the expensive org.elasticsearch.xpack.ilm.IndexLifecycleRunner#getCurrentStep(org.elasticsearch.xpack.ilm.PolicyStepsRegistry, java.lang.String, org.elasticsearch.cluster.metadata.IndexMetadata, org.elasticsearch.xpack.core.ilm.LifecycleExecutionState) in a hot loop.
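To make the cost concrete, here is a minimal, self-contained sketch of the scan-everything pattern described above. It is not the actual ILM code: `FullScanSketch`, `parseExecutionState` and the map-based "cluster state" are hypothetical stand-ins, and the "parsing" is reduced to a counter so that only the shape of the loop is visible.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration only; none of these names exist in Elasticsearch. */
public class FullScanSketch {

    /** Counts how often per-index metadata gets "parsed". */
    static long parseCount = 0;

    /** Stand-in for turning custom index metadata into an execution state. */
    static String parseExecutionState(Map<String, String> customMetadata) {
        parseCount++;
        return customMetadata.getOrDefault("step", "new");
    }

    /** Stand-in listener: visits every index on every update, changed or not. */
    static void clusterChanged(Map<String, Map<String, String>> indices) {
        for (Map<String, String> metadata : indices.values()) {
            parseExecutionState(metadata);
        }
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> indices = new LinkedHashMap<>();
        for (int i = 0; i < 10_000; i++) {
            indices.put("index-" + i, Map.of("step", "some-step"));
        }
        // 100 unrelated cluster state updates still trigger 100 full scans.
        for (int update = 0; update < 100; update++) {
            clusterChanged(indices);
        }
        System.out.println(parseCount); // 1000000
    }
}
```

In this toy model the work per update grows linearly with the number of indices, even when none of them changed.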

Ideally, ILM should be refactored into something more similar to the SnapshotService which will only do a full inspection of all snapshots+shards on a master failover, but otherwise keeps track of its internal state directly on the master node.
Concretely, this would mean that when an index moves from one state to another, the required actions would just be chained logically through a series of callbacks (rollover-step -> do rollover -> next-step), instead of the current model where the step transitions are triggered by the changes in the cluster state that the previous step caused.

This would make the work ILM does per cluster state update pretty much O(1) in the number of indices outside of the master-failover scenario.
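A rough sketch of what callback-chained steps could look like follows. It is only an illustration of the idea, under the assumption that each step can signal completion through a callback: `ChainedLifecycleSketch`, `Step` and `Tracker` are hypothetical names, not existing ILM or SnapshotsService classes, and the steps complete immediately instead of doing asynchronous work.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; none of these names exist in Elasticsearch. */
public class ChainedLifecycleSketch {

    /** A step does its work and then invokes the callback to hand over to the next step. */
    interface Step {
        String name();
        void execute(String index, Runnable onDone);
    }

    /** Builds a trivial step that completes immediately. */
    static Step step(String name) {
        return new Step() {
            @Override public String name() { return name; }
            @Override public void execute(String index, Runnable onDone) {
                // A real step would perform (asynchronous) work before calling back.
                onDone.run();
            }
        };
    }

    /** Drives one index through its steps without looking at any other index. */
    static final class Tracker {
        private final List<Step> steps;
        final List<String> log = new ArrayList<>();

        Tracker(List<Step> steps) {
            this.steps = steps;
        }

        void start(String index) {
            runFrom(index, 0);
        }

        private void runFrom(String index, int i) {
            if (i >= steps.size()) {
                log.add(index + ":complete");
                return;
            }
            Step current = steps.get(i);
            log.add(index + ":" + current.name());
            // The completion callback triggers the next step directly,
            // rather than waiting for a cluster state change to be observed.
            current.execute(index, () -> runFrom(index, i + 1));
        }
    }

    public static void main(String[] args) {
        Tracker tracker = new Tracker(
            List.of(step("rollover-step"), step("do-rollover"), step("next-step")));
        tracker.start("logs-000001");
        System.out.println(tracker.log);
    }
}
```

In this sketch, advancing one index touches only that index's position in its step list. A full pass over all indices would only be needed to rebuild the tracker's state, for example after a master failover.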

Relates #77466

elasticmachine · 2021-11-05T11:21:34Z

Pinging @elastic/es-data-management (Team:Data Management)

PDTCCLF · 2022-06-13T04:20:29Z

I'd like to send a pull request for this issue.

original-brownbear added >enhancement :Data Management/ILM+SLM Index and Snapshot lifecycle management labels Nov 5, 2021

elasticmachine added the Team:Data Management Meta label for data/management team label Nov 5, 2021

original-brownbear assigned joegallo Nov 5, 2021

original-brownbear mentioned this issue Nov 5, 2021

Fix Large Shard Count Scalability Issues #77466

Open

97 tasks

original-brownbear added the >refactoring label Nov 5, 2021

PDTCCLF mentioned this issue Jun 13, 2022

ILM: Moving full inspection of all indices to MasterFailOver condition #87616

Open

gmarouli assigned gmarouli and unassigned joegallo Oct 31, 2022

original-brownbear mentioned this issue Feb 12, 2024

Performance Regression for every CS update from ILM's org.elasticsearch.cluster.metadata.Metadata#isIndexManagedByILM #98992

Open

nielsbauman mentioned this issue Nov 19, 2024

Optimize IndexLifecycleMetadata#getPolicies #116988

Merged
