ILM Can Create an Unlimited Number of Pending Clusterstate Updates on Slow Master Nodes #78246

original-brownbear · 2021-09-23T12:20:45Z

ILM's `org.elasticsearch.xpack.ilm.IndexLifecycleService#triggerPolicies` can queue up an unlimited number of cluster state updates on slow master nodes. This method is invoked on every cluster state application.
For every index on which it decides work needs to be done, it submits a task with priority NORMAL. So the following can easily happen under load:

- The master works through a number of tasks with higher than NORMAL priority.
- Each of them triggers an ILM task at priority NORMAL for each index that has outstanding work (without checking for duplicates).

=> As long as there is outstanding higher-priority work, the master uses more and more memory for queued ILM tasks while it works through that higher-priority work.
=> Even if and when the master gets to the NORMAL priority tasks, each of them yet again triggers all policies, adding more duplicate work and eventually leading to runaway task counts if things slow down enough.

ILM needs to limit and deduplicate tasks to avoid running into this. I will see if I can find a quick fix for this situation to unblock benchmarking, but a complete solution seems quite involved.
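
To illustrate the kind of deduplication described above, here is a minimal Java sketch. It is not the actual ILM change: `IlmTaskKey` and `DedupingTaskQueue` are hypothetical names, and the assumption that a task is identified by its index name plus the step it wants to run is made up for this example. The idea is only that a task with value-based `equals`/`hashCode` is rejected while an equal task is still pending, so repeated triggers cannot grow the queue.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Objects;
import java.util.Queue;
import java.util.Set;

/** Hypothetical task identity: the index it acts on plus the step it wants to run. */
final class IlmTaskKey {
    final String index;
    final String step;

    IlmTaskKey(String index, String step) {
        this.index = Objects.requireNonNull(index);
        this.step = Objects.requireNonNull(step);
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof IlmTaskKey)) return false;
        IlmTaskKey other = (IlmTaskKey) o;
        return index.equals(other.index) && step.equals(other.step);
    }

    @Override
    public int hashCode() {
        return Objects.hash(index, step);
    }
}

/** Hypothetical queue that refuses a task while an equal one is still pending. */
final class DedupingTaskQueue {
    private final Set<IlmTaskKey> pending = new HashSet<>();
    private final Queue<IlmTaskKey> queue = new ArrayDeque<>();

    /** Returns true if the task was enqueued, false if an equal task is already pending. */
    synchronized boolean submit(IlmTaskKey key) {
        if (!pending.add(key)) {
            return false; // duplicate: do not grow the queue
        }
        queue.add(key);
        return true;
    }

    /** Removes and returns the next task, or null if the queue is empty. */
    synchronized IlmTaskKey poll() {
        IlmTaskKey next = queue.poll();
        if (next != null) {
            pending.remove(next); // an equal task may be submitted again from now on
        }
        return next;
    }

    /** Number of tasks currently waiting. */
    synchronized int size() {
        return queue.size();
    }
}
```

With this shape, submitting the same (index, step) pair on every cluster state application leaves at most one pending entry per pair, so the queue is bounded by the number of distinct pairs rather than by the number of triggers. It does not address limiting the total amount of work, which the comment above also asks for.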


elasticmachine · 2021-09-23T12:20:48Z

Pinging @elastic/es-data-management (Team:Data Management)

dakrone · 2021-09-23T19:08:11Z

We can actually leave out a lot of the regular ILM policy execution on cluster state change if we want, letting those pick up during the regular interval execution (every 10m by default). This would mean things would take slightly longer, but with the benefit of not spawning so many cluster state updates on a new cluster state.

I'd be happy to brainstorm some solutions to this if you would like.

Prevent duplicate ILM tasks from being enqueued to fix the most immediate issues around #78246. The ILM logic should be further improved though. I did not include `MoveToErrorStepUpdateTask` in this change yet as I wasn't entirely sure how valid/safe hashing/comparing arbitrary `Exception`s would be. That could be looked into in a follow-up as well. Relates #77466 Closes #78246


original-brownbear added the >bug and :Data Management/ILM+SLM (Index and Snapshot lifecycle management) labels Sep 23, 2021

original-brownbear self-assigned this Sep 23, 2021

elasticmachine added the Team:Data Management (Meta label for data/management team) label Sep 23, 2021

original-brownbear mentioned this issue Sep 23, 2021

Fix Large Shard Count Scalability Issues #77466

Open

97 tasks

original-brownbear changed the title ~~ILM Can Create an Unlimited Number Pending Clusterstate Updates on Slow Master Nodes~~ ILM Can Create an Unlimited Number of Pending Clusterstate Updates on Slow Master Nodes Sep 23, 2021

This was referenced Sep 27, 2021

[WIP] ILM Refactoring to Fix Task Explosion #78300

Closed

Prevent Duplicate ILM Cluster State Updates from Being Created #78390

Merged

original-brownbear closed this as completed in #78390 Sep 29, 2021

original-brownbear mentioned this issue Sep 29, 2021

Prevent Duplicate ILM Cluster State Updates from Being Created (#78390) #78427

Merged

This was referenced Sep 6, 2022

Duplicate ILM Cluster State Updates when policy is deleted #89831

Open

Prevent Duplicate ILM Cluster State Updates when policy is deleted #89832

Open
