
Automatic retries for ILM rollover action #44135

Closed
ppf2 opened this issue Jul 9, 2019 · 3 comments


ppf2 commented Jul 9, 2019

The ILM rollover action can fail for a variety of reasons. Here are two examples:

  1. Cluster state publishing/master timeout
  2. Flood stage threshold/write blocks on indices

Without automatic retries, any failure in the rollover action can result in undesirable behavior: the current index continues to grow many times beyond the configured max_size, with shards ending up hundreds of GB in size, unless the administrator detects the problem in time to intervene manually.

Depending on the type of failure, manual intervention may also require more than just calling the _retry endpoint for ILM. For example, if the failure is failed to process cluster event (index-aliases) within 30s, the result is a new index created without the alias being updated, so the new index is never used. The admin has to recognize this and clean up that index before attempting the _retry call.
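
For illustration, here is a minimal sketch of that manual cleanup, assuming the rollover alias is named logs-write and the generation indices follow a logs-* naming pattern (both names, and the detection logic, are hypothetical):

#!/usr/bin/env python3
import requests

ES = "http://localhost:9200"   # assumed local cluster
WRITE_ALIAS = "logs-write"     # hypothetical rollover alias
INDEX_PATTERN = "logs-*"       # hypothetical index naming pattern

# Indices currently attached to the rollover alias.
aliased = set(requests.get("%s/_alias/%s" % (ES, WRITE_ALIAS)).json().keys())

# All generation indices matching the pattern; the zero-padded suffixes
# (-000001, -000002, ...) make a plain lexicographic sort chronological.
generations = sorted(
    row["index"]
    for row in requests.get("%s/_cat/indices/%s?h=index&format=json" % (ES, INDEX_PATTERN)).json()
)

# A half-finished rollover leaves the newest generation outside the alias:
# delete the orphan, then ask ILM to retry the failed step on the stuck index.
if generations and generations[-1] not in aliased:
    requests.delete("%s/%s" % (ES, generations[-1]))
    if aliased:
        requests.post("%s/%s/_ilm/retry" % (ES, max(aliased)))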

It would be great if ILM could retry automatically (and perform any cleanup required). For the first case above, this means detecting that an index was created without a successful alias update and cleaning it up automatically before retrying. For the second case, it means that once the admin has addressed the flood-stage watermark (i.e., freed disk capacity and removed the write blocks from the indices), ILM retries automatically, without requiring the admin to remember to go back to ILM and issue a retry after removing the write blocks.
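
For the second case, the manual fix today looks roughly like the sketch below: after freeing disk space, clear the index.blocks.read_only_allow_delete block that Elasticsearch applies when the flood-stage watermark is exceeded, then remember to retry ILM on the stuck index (the index name logs-000042 is hypothetical; this illustrates the manual steps, not the proposed ILM behavior):

#!/usr/bin/env python3
import requests

ES = "http://localhost:9200"   # assumed local cluster

# After freeing disk space, remove the flood-stage write block from all
# indices; setting it to null resets the block Elasticsearch applied.
requests.put(
    "%s/_all/_settings" % ES,
    json={"index.blocks.read_only_allow_delete": None},
)

# The step the proposal would automate: retry the failed rollover step
# on the stuck index so ILM resumes.
requests.post("%s/logs-000042/_ilm/retry" % ES)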

Discussed with @gwbrown. We decided to create this separate issue to special-case rollover for automatic retries, given the implications of an unnoticed failed rollover causing other performance issues in the cluster (due to unbounded index growth until the flood stage is hit), before implementing a more generic retry strategy that works for all ILM actions.

ppf2 added the :Data Management/ILM+SLM Index and Snapshot lifecycle management label Jul 9, 2019
@elasticmachine

Pinging @elastic/es-core-features


gboddin commented Sep 18, 2019

+100. ILM currently makes it possible to break your infrastructure when it fails (i.e., end up with an immense index that requires a split).

#!/usr/bin/env python3
import requests
import os
import time
import logging

logging.basicConfig(level=logging.INFO, format='[%(asctime)s] %(levelname)s - %(message)s')
logging.info("IlmReactor started")
url = os.getenv("ES_URL", "http://localhost:9200")
# Sanity check: fail fast if the cluster is unreachable.
es_server_info = requests.get(url).json()

# Poll ILM state forever.
while True:
    ilm_details = requests.get(url + '/*/_ilm/explain').json()
    for index_name, index_details in ilm_details['indices'].items():
        if index_details["managed"] and index_details.get("step") == "ERROR":
            error_type = index_details["step_info"].get("type")
            logging.info("Found index %s with error %s" % (index_name, error_type))
            if error_type == "process_cluster_event_timeout_exception":
                # Wait until the cluster is green and no shards are relocating;
                # each request blocks for up to the 10s timeout.
                while True:
                    status = requests.get(
                        "%s/_cluster/health?wait_for_status=green&timeout=10s&wait_for_no_relocating_shards=true" % url
                    )
                    if status.status_code == 200:
                        # Cluster is healthy; leave the wait loop.
                        break
                    logging.warning("Server is unhealthy or currently moving indices ...")
                # Cluster is OK, retry ILM on this index.
                logging.info("Retrying ILM for index %s" % index_name)
                logging.info(requests.post("%s/%s/_ilm/retry" % (url, index_name)).json())
    # Avoid hammering the server.
    time.sleep(10)

That's how we handle timeouts and closed-index race conditions at the moment.

@andreidan

With #50718 and #51235 merged, we can close this issue, as all the steps in the rollover action are now retryable.
