Automatic retries for ILM rollover action #44135
Labels: :Data Management/ILM+SLM (Index and Snapshot lifecycle management)

Comments
Pinging @elastic/es-core-features
+100. ILM currently makes it possible to break your infra when it fails (i.e. end up with an immense index that then requires a split).
That's how we handle timeouts or closed-indices race conditions atm.
The ILM rollover action can fail for a variety of reasons. Two examples:

1. The cluster state update that creates the new index and swaps the alias times out (e.g. "failed to process cluster event (index-aliases) within 30s").
2. The flood stage disk watermark is hit and write blocks are applied to the indices.
Without automatic retries, any failure of the rollover action can lead to undesirable behavior: the current index keeps growing far beyond the configured max_size, with shards ending up hundreds of GBs in size, unless an administrator notices in time and intervenes manually.
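For context, this is roughly what a hot-phase policy with such a size-based rollover condition looks like; the cluster address, policy name, and 50gb threshold below are illustrative assumptions, shown via the plain REST API from Python rather than any particular client:

```python
import requests

ES = "http://localhost:9200"  # assumed local cluster address

# Hypothetical policy: roll the write index over once it reaches 50gb.
policy = {
    "policy": {
        "phases": {
            "hot": {
                "actions": {
                    "rollover": {"max_size": "50gb"}
                }
            }
        }
    }
}

# PUT _ilm/policy/<name> registers (or replaces) the lifecycle policy.
requests.put(f"{ES}/_ilm/policy/logs-policy", json=policy).raise_for_status()
```

If the rollover step fails and is never retried, the size condition is effectively ignored and the write index grows unbounded.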
Depending on the type of failure, manual intervention may also require more than just calling the ILM _retry endpoint. For example, if the failure is

failed to process cluster event (index-aliases) within 30s

the new index may have been created without the alias being updated, so the index is never used. The admin has to realize this and clean up the orphaned index before attempting the _retry call.

It would be great if ILM could retry automatically (and perform any cleanup required). For the first use case above, this means detecting that an index was created without a successful alias update and cleaning it up automatically before retrying. For the second use case above, this means that once the admin has addressed the flood stage watermark being hit (i.e. freed disk capacity and removed the write blocks from the indices), ILM retries automatically, without the admin having to remember to go back to ILM and issue a retry.
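For reference, a sketch of the manual intervention described above, assuming a hypothetical logs alias whose write index is logs-000001 and whose orphaned rollover target is logs-000002 (all index names and the cluster address are made up for illustration):

```python
import requests

ES = "http://localhost:9200"  # assumed local cluster address

# First case: the index-aliases cluster state update timed out, leaving a new
# index that the write alias never points to. Remove the orphaned index so the
# retried rollover can recreate it and swap the alias in one step.
requests.delete(f"{ES}/logs-000002").raise_for_status()

# Second case: the flood stage watermark was hit. After freeing disk space,
# the read_only_allow_delete block must be cleared from the affected indices
# (setting it to null restores the default).
requests.put(
    f"{ES}/logs-*/_settings",
    json={"index.blocks.read_only_allow_delete": None},
).raise_for_status()

# Either way, ILM stays in the ERROR step until a retry is issued explicitly.
requests.post(f"{ES}/logs-000001/_ilm/retry").raise_for_status()
```

With automatic retries (and the corresponding cleanup) built into ILM, none of these steps would have to be remembered and performed by the operator.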
Discussed with @gwbrown. We decided to open this separate issue to special-case rollover for automatic retries, given the implications of an unnoticed failed rollover causing other performance issues in the cluster (due to unbounded index growth until the flood stage is hit), before implementing a more generic retry strategy that works for all ILM actions.