ILM Can Create an Unlimited Number of Pending Clusterstate Updates on Slow Master Nodes #78246

Closed
Tracked by #77466
original-brownbear opened this issue Sep 23, 2021 · 2 comments · Fixed by #78390
Labels
>bug :Data Management/ILM+SLM Index and Snapshot lifecycle management Team:Data Management Meta label for data/management team

Comments

@original-brownbear
Member

ILM's org.elasticsearch.xpack.ilm.IndexLifecycleService#triggerPolicies can queue up an unlimited number of cluster state updates on slow master nodes. This method is invoked on every cluster state application.
It submits a task at priority NORMAL for every index that it decides needs work, so the following can easily happen under load:

  • master works through a number of tasks with higher-than-NORMAL priority
  • each of them triggers an ILM task at priority NORMAL for each index that has outstanding work (without checking for duplicates)
    => as long as there is outstanding higher-priority work, master uses more and more memory for queued ILM tasks
    => even once master gets to the NORMAL priority tasks, each of them again triggers all policies and adds more duplicate work, eventually leading to runaway task counts if things slow down enough

ILM needs to limit and deduplicate tasks to avoid running into this. I will see if I can find a quick fix for this situation to unblock benchmarking, but a complete solution seems quite involved.
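
To make the growth pattern concrete, here is a minimal sketch of a task queue with and without deduplication. `DedupTaskQueue`, `TaskKey` and `simulate` are hypothetical names for illustration only, not Elasticsearch's master service API, and the step name is just a placeholder value.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.List;
import java.util.Objects;
import java.util.Queue;
import java.util.Set;

/** Hypothetical stand-in for a pending-task queue; not Elasticsearch's actual API. */
final class DedupTaskQueue {

    /** Identifies one unit of ILM work by index and step (value-based equality). */
    static final class TaskKey {
        final String index;
        final String step;

        TaskKey(String index, String step) {
            this.index = index;
            this.step = step;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof TaskKey)) return false;
            TaskKey other = (TaskKey) o;
            return index.equals(other.index) && step.equals(other.step);
        }

        @Override
        public int hashCode() {
            return Objects.hash(index, step);
        }
    }

    private final Queue<TaskKey> pending = new ArrayDeque<>();
    private final Set<TaskKey> queuedKeys = new HashSet<>();
    private final boolean deduplicate;

    DedupTaskQueue(boolean deduplicate) {
        this.deduplicate = deduplicate;
    }

    /** Enqueues the task unless deduplication is on and an equal task is already pending. */
    boolean submit(TaskKey key) {
        if (deduplicate && !queuedKeys.add(key)) {
            return false;
        }
        pending.add(key);
        return true;
    }

    /** Removes the next task and frees its key so the same work can be queued again later. */
    TaskKey runNext() {
        TaskKey key = pending.poll();
        if (key != null) {
            queuedKeys.remove(key);
        }
        return key;
    }

    int size() {
        return pending.size();
    }

    /** Simulates cluster state applications that each trigger all policies while nothing is drained. */
    static int simulate(boolean deduplicate, List<String> indices, int applications) {
        DedupTaskQueue queue = new DedupTaskQueue(deduplicate);
        for (int i = 0; i < applications; i++) {
            for (String index : indices) {
                queue.submit(new TaskKey(index, "check-rollover-ready"));
            }
        }
        return queue.size();
    }
}
```

Without deduplication the pending count in this sketch grows with indices × applications; with it, the count is capped at one pending task per index and step.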

@original-brownbear original-brownbear added >bug :Data Management/ILM+SLM Index and Snapshot lifecycle management labels Sep 23, 2021
@original-brownbear original-brownbear self-assigned this Sep 23, 2021
@elasticmachine elasticmachine added the Team:Data Management Meta label for data/management team label Sep 23, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@original-brownbear original-brownbear changed the title from "ILM Can Create an Unlimited Number Pending Clusterstate Updates on Slow Master Nodes" to "ILM Can Create an Unlimited Number of Pending Clusterstate Updates on Slow Master Nodes" Sep 23, 2021
@dakrone
Member

dakrone commented Sep 23, 2021

We can actually leave out a lot of the regular ILM policy execution on cluster state change if we want, and let those policies be picked up during the regular interval execution (every 10m by default). Things would take slightly longer, but we would avoid spawning so many cluster state updates on each new cluster state.

I'd be happy to brainstorm some solutions to this if you would like.
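
As a rough illustration of the trade-off in the suggestion above, the sketch below counts policy runs when cluster state applications either do or do not trigger them. `PolicyTrigger` and its methods are hypothetical names, not part of ILM; the periodic tick stands in for the regular interval execution.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical trigger that can ignore cluster state applications; not ILM's actual code. */
final class PolicyTrigger {
    private final boolean runOnClusterStateChange;
    private final AtomicInteger policyRuns = new AtomicInteger();

    PolicyTrigger(boolean runOnClusterStateChange) {
        this.runOnClusterStateChange = runOnClusterStateChange;
    }

    /** Called for every applied cluster state; only runs policies if configured to. */
    void onClusterStateApplied() {
        if (runOnClusterStateChange) {
            triggerPolicies();
        }
    }

    /** Called by the periodic scheduler; always runs policies. */
    void onPeriodicTick() {
        triggerPolicies();
    }

    private void triggerPolicies() {
        // In this sketch a "run" is just counted; a real run would submit per-index tasks.
        policyRuns.incrementAndGet();
    }

    int policyRuns() {
        return policyRuns.get();
    }
}
```

With the flag off, a burst of cluster state applications between two ticks causes no extra policy runs, at the cost of waiting for the next tick before work is picked up.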

original-brownbear added a commit that referenced this issue Sep 29, 2021
Prevent duplicate ILM tasks from being enqueued to fix the most immediate issues around #78246. The ILM logic should be further improved though. I did not include `MoveToErrorStepUpdateTask` in this change yet as I wasn't entirely sure how valid/safe hashing/comparing arbitrary `Exception`s would be. That could be looked into in a follow-up as well.

Relates #77466 

Closes #78246
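
One reason comparing arbitrary exceptions is not straightforward: `java.lang.Throwable` does not override `equals`/`hashCode`, so two separately created exceptions with the same type and message are distinct keys. The sketch below shows this with a hypothetical `ExceptionDedupDemo` helper; the type-and-message comparison is only one possible alternative, shown as an assumption, not what the commit does.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper contrasting two ways to key error tasks; not part of the actual change. */
final class ExceptionDedupDemo {

    /** Keys on (index, exception instance); Throwable equality is by identity. */
    static int distinctByIdentity(String index, Exception... causes) {
        Set<List<Object>> keys = new HashSet<>();
        for (Exception cause : causes) {
            keys.add(Arrays.<Object>asList(index, cause));
        }
        return keys.size();
    }

    /** Keys on (index, exception class, message); an assumed alternative that ignores stack traces and causes. */
    static int distinctByTypeAndMessage(String index, Exception... causes) {
        Set<List<Object>> keys = new HashSet<>();
        for (Exception cause : causes) {
            keys.add(Arrays.<Object>asList(index, cause.getClass(), cause.getMessage()));
        }
        return keys.size();
    }
}
```

Keying by identity never merges two separately thrown failures, while keying by type and message may merge failures that differ in ways that matter, which is the safety question raised in the commit message.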
original-brownbear added a commit that referenced this issue Sep 29, 2021
… (#78427)

Prevent duplicate ILM tasks from being enqueued to fix the most immediate issues around #78246. The ILM logic should be further improved though. I did not include `MoveToErrorStepUpdateTask` in this change yet as I wasn't entirely sure how valid/safe hashing/comparing arbitrary `Exception`s would be. That could be looked into in a follow-up as well.

Relates #77466 

Closes #78246
3 participants