ILM Can Create an Unlimited Number of Pending Clusterstate Updates on Slow Master Nodes #78246
Comments
Pinging @elastic/es-data-management (Team:Data Management)
We can actually leave out a lot of the regular ILM policy execution on cluster state change if we want, letting those pick up during the regular interval execution (every 10m by default). This would mean things would take slightly longer, but with the benefit of not spawning so many cluster state updates on a new cluster state. I'd be happy to brainstorm some solutions to this if you would like.
Prevent duplicate ILM tasks from being enqueued to fix the most immediate issues around #78246. The ILM logic should be further improved though. I did not include `MoveToErrorStepUpdateTask` in this change yet as I wasn't entirely sure how valid/safe hashing/comparing arbitrary `Exception`s would be. That could be looked into in a follow-up as well. Relates #77466 Closes #78246
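The deduplication idea described above can be sketched roughly as follows. This is a minimal illustration, not the code from the linked change: `IndexStepTask` and `DedupTaskQueue` are hypothetical names, and the sketch assumes a task is fully identified by an index name and a step name so that `equals`/`hashCode` can be used to detect an already pending duplicate.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Objects;
import java.util.Queue;
import java.util.Set;

/** Hypothetical task key: one pending step execution for one index. */
final class IndexStepTask {
    final String index;
    final String step;

    IndexStepTask(String index, String step) {
        this.index = index;
        this.step = step;
    }

    // Two tasks are duplicates when they target the same index and step.
    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof IndexStepTask)) return false;
        IndexStepTask other = (IndexStepTask) o;
        return index.equals(other.index) && step.equals(other.step);
    }

    @Override
    public int hashCode() {
        return Objects.hash(index, step);
    }
}

/** Hypothetical queue that refuses a task equal to one that is already pending. */
final class DedupTaskQueue {
    private final Set<IndexStepTask> pending = new HashSet<>();
    private final Queue<IndexStepTask> queue = new ArrayDeque<>();

    /** Returns true if the task was enqueued, false if an equal task is already pending. */
    synchronized boolean submit(IndexStepTask task) {
        if (!pending.add(task)) {
            return false; // duplicate: drop it instead of growing the queue
        }
        queue.add(task);
        return true;
    }

    /** Removes and returns the next task, or null if none; an equal task may be submitted again afterwards. */
    synchronized IndexStepTask poll() {
        IndexStepTask task = queue.poll();
        if (task != null) {
            pending.remove(task);
        }
        return task;
    }

    synchronized int size() {
        return queue.size();
    }
}
```

This kind of check only works for tasks whose fields have meaningful equality. `java.lang.Throwable` does not override `equals` or `hashCode`, so a task that carries an arbitrary `Exception` would compare by identity unless a custom comparison is defined, which is consistent with the caution about `MoveToErrorStepUpdateTask` above.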
ILM's `org.elasticsearch.xpack.ilm.IndexLifecycleService#triggerPolicies` can queue up an unlimited number of cluster state updates on slow master nodes. This method is invoked on every cluster state application. It submits tasks for every index that it decides work needs to be done on with priority `NORMAL`. So the following can happen easily under load:

`NORMAL` priority tasks

=> as master works through the higher priority tasks it uses up more and more memory for queued ILM tasks as long as there's outstanding higher priority work

=> even if and when master gets to working through the `NORMAL` priority tasks, each of them will yet again trigger all policies adding more duplicate work, eventually leading to runaway task counts if things slow down enough

ILM needs to make sure to limit and deduplicate tasks to avoid running into this. I will see if I can find a quick fix to this situation to unblock benchmarking, but it seems a complete solution is quite involved.
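The feedback loop can be made concrete with a toy model. This is not Elasticsearch code: `TriggerPoliciesModel` and its methods are hypothetical, and the model assumes that every index needs work on every cluster state application and that every processed `NORMAL` task produces a new cluster state. Under those assumptions the undeduplicated queue grows with every task processed, while a deduplicated one stays bounded by the number of indices.

```java
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

/** Toy model of queue growth; all names and assumptions here are illustrative only. */
final class TriggerPoliciesModel {

    /** Queue length when every cluster state application enqueues one task per index. */
    static long withoutDedup(int indices, int highPriorityUpdates, int normalTasksProcessed) {
        long queued = 0;
        // Higher priority updates are applied first; each application enqueues tasks.
        for (int i = 0; i < highPriorityUpdates; i++) {
            queued += indices;
        }
        // Each processed NORMAL task yields a new cluster state, which enqueues tasks again.
        for (int i = 0; i < normalTasksProcessed && queued > 0; i++) {
            queued -= 1;
            queued += indices;
        }
        return queued;
    }

    /** Same sequence of events, but a task for an index is only enqueued if none is pending. */
    static long withDedup(int indices, int highPriorityUpdates, int normalTasksProcessed) {
        Set<Integer> pending = new HashSet<>();
        for (int i = 0; i < highPriorityUpdates; i++) {
            enqueueAll(pending, indices);
        }
        for (int i = 0; i < normalTasksProcessed && !pending.isEmpty(); i++) {
            Iterator<Integer> it = pending.iterator();
            it.next();
            it.remove(); // process one pending task
            enqueueAll(pending, indices);
        }
        return pending.size();
    }

    private static void enqueueAll(Set<Integer> pending, int indices) {
        for (int index = 0; index < indices; index++) {
            pending.add(index); // no-op if a task for this index is already pending
        }
    }
}
```

With 100 indices, 50 higher priority updates, and 1000 processed `NORMAL` tasks, `withoutDedup` returns 104000 queued tasks while `withDedup` returns 100.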