Rework ILM to not Require Inspecting all Indices on every Cluster State Update #80407

Open
Tracked by #77466
original-brownbear opened this issue Nov 5, 2021 · 2 comments
Labels
:Data Management/ILM+SLM Index and Snapshot lifecycle management >enhancement >refactoring Team:Data Management Meta label for data/management team

Comments

@original-brownbear
Member

At the moment, ILM scales poorly as clusters grow to very large numbers of indices. The reason is that org.elasticsearch.xpack.ilm.IndexLifecycleService#clusterChanged inspects every index in the cluster state on each cluster state update to see whether ILM has work to do.

This inspection of all indices is itself fairly expensive: it requires parsing per-index metadata into LifecycleExecutionState (repeatedly) and, more importantly, it calls the expensive org.elasticsearch.xpack.ilm.IndexLifecycleRunner#getCurrentStep(org.elasticsearch.xpack.ilm.PolicyStepsRegistry, java.lang.String, org.elasticsearch.cluster.metadata.IndexMetadata, org.elasticsearch.xpack.core.ilm.LifecycleExecutionState) in a hot loop.
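
To make the cost concrete, here is a toy model of the pattern described above, not the actual ILM code. FullScanSketch, onClusterStateUpdate and parseStep are hypothetical names standing in for the cluster state listener and the per-index metadata parsing; the only point is that every update touches every index, even when almost none of them has work pending.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model only: none of these names are Elasticsearch classes or methods.
final class FullScanSketch {
    // index name -> raw (unparsed) lifecycle step, standing in for per-index metadata
    private final Map<String, String> rawStepByIndex = new LinkedHashMap<>();

    // counts how many per-index inspections have been performed so far
    long inspections = 0;

    void addIndex(String name, String rawStep) {
        rawStepByIndex.put(name, rawStep);
    }

    // Mimics a clusterChanged-style hook: each update walks all indices.
    // Returns how many indices still have lifecycle work to do.
    int onClusterStateUpdate() {
        int needingWork = 0;
        for (String rawStep : rawStepByIndex.values()) {
            inspections++;
            String step = parseStep(rawStep); // re-parsed on every update
            if (!step.equals("complete")) {
                needingWork++;
            }
        }
        return needingWork;
    }

    // Stand-in for the (much more expensive) metadata parsing and step lookup.
    private static String parseStep(String raw) {
        return raw.trim();
    }

    public static void main(String[] args) {
        FullScanSketch ilm = new FullScanSketch();
        for (int i = 0; i < 10_000; i++) {
            ilm.addIndex("idx-" + i, i == 0 ? "rollover" : "complete");
        }
        for (int update = 0; update < 5; update++) {
            ilm.onClusterStateUpdate();
        }
        System.out.println(ilm.inspections);
    }
}
```

With 10,000 indices and 5 updates, the counter reaches 50,000 inspections although only one index ever needed attention.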

Ideally, ILM should be refactored into something closer to the SnapshotService, which only does a full inspection of all snapshots and shards on a master failover and otherwise keeps track of its internal state directly on the master node.
Concretely, this would mean that when an index moves from one state to another, the required actions would be chained logically through a series of callbacks (rollover-step -> do rollover -> next-step), instead of the current model where step transitions are triggered by the cluster state changes that the previous step caused.
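
A minimal sketch of what such callback chaining could look like, under loud assumptions: ChainedStepsSketch, Step, start and runFrom are invented for illustration and do not correspond to any existing ILM or SnapshotService API. Each step receives a completion callback, and that callback directly starts the next step for the same index, so the work per transition does not depend on how many other indices exist.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration: these types are not part of Elasticsearch.
final class ChainedStepsSketch {
    // A step does its work for one index, then signals completion via onDone.
    interface Step {
        String name();
        void execute(String index, Runnable onDone);
    }

    private final List<Step> policy;
    // In-memory tracking of where each index is in its policy (kept on the master).
    private final Map<String, Integer> position = new HashMap<>();
    // Records the order in which steps ran, for demonstration purposes.
    final List<String> log = new ArrayList<>();

    ChainedStepsSketch(List<Step> policy) {
        this.policy = policy;
    }

    void start(String index) {
        runFrom(index, 0);
    }

    private void runFrom(String index, int i) {
        if (i >= policy.size()) {
            position.remove(index);
            log.add(index + ":done");
            return;
        }
        position.put(index, i);
        Step step = policy.get(i);
        log.add(index + ":" + step.name());
        // The completion callback chains straight into the next step;
        // no scan over other indices is involved.
        step.execute(index, () -> runFrom(index, i + 1));
    }

    // Helper creating a step that completes immediately (real steps would be asynchronous).
    static Step step(String name) {
        return new Step() {
            @Override public String name() { return name; }
            @Override public void execute(String index, Runnable onDone) { onDone.run(); }
        };
    }

    public static void main(String[] args) {
        ChainedStepsSketch ilm = new ChainedStepsSketch(
            List.of(step("rollover-step"), step("do-rollover"), step("next-step")));
        ilm.start("logs-000001");
        ilm.log.forEach(System.out::println);
    }
}
```

The sketch leaves out everything that makes the real problem hard: steps that wait on a condition, failures and retries, and rebuilding the in-memory position map from the cluster state after a master failover, which is the one place a full scan would still be needed.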

This would make ILM's cost per cluster state update roughly O(1) in the number of indices, outside of the master-failover scenario.

Relates #77466

@original-brownbear original-brownbear added >enhancement :Data Management/ILM+SLM Index and Snapshot lifecycle management labels Nov 5, 2021
@elasticmachine elasticmachine added the Team:Data Management Meta label for data/management team label Nov 5, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@PDTCCLF

PDTCCLF commented Jun 13, 2022

I'd like to send a pull request for this issue.

5 participants