
Automatic retries for ILM rollover action #44135

Closed
ppf2 opened this issue Jul 9, 2019 · 3 comments


ppf2 commented Jul 9, 2019

The ILM rollover action can fail for a variety of reasons. Here are two examples:

  1. Cluster state publishing/master timeout
  2. Flood stage threshold/write blocks on indices

Without automatic retries, any failure in the rollover action can result in undesirable behavior: the current index continues to grow many times beyond the configured max_size, with shards ending up hundreds of GB in size, unless the administrator detects the problem in time to intervene manually.

Depending on the type of failure, manual intervention may also require more than just calling the _retry endpoint for ILM. For example, if the failure is failed to process cluster event (index-aliases) within 30s, the result is a new index created without the alias being updated, so the new index is never used. The admin has to recognize this and clean up that index before attempting the _retry call.
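
For illustration, here is a minimal sketch of that manual cleanup, assuming the rollover alias is named logs-write and the generation indices follow a logs-* naming pattern (both names, and the detection logic, are hypothetical):

#!/usr/bin/env python3
import requests

ES = "http://localhost:9200"   # assumed local cluster
WRITE_ALIAS = "logs-write"     # hypothetical rollover alias
INDEX_PATTERN = "logs-*"       # hypothetical index naming pattern

# Indices currently attached to the rollover alias.
aliased = set(requests.get("%s/_alias/%s" % (ES, WRITE_ALIAS)).json().keys())

# All generation indices matching the pattern; the zero-padded suffixes
# (-000001, -000002, ...) make a plain lexicographic sort chronological.
generations = sorted(
    row["index"]
    for row in requests.get("%s/_cat/indices/%s?h=index&format=json" % (ES, INDEX_PATTERN)).json()
)

# A half-finished rollover leaves the newest generation outside the alias:
# delete the orphan, then ask ILM to retry the failed step on the stuck index.
if generations and generations[-1] not in aliased:
    requests.delete("%s/%s" % (ES, generations[-1]))
    if aliased:
        requests.post("%s/%s/_ilm/retry" % (ES, max(aliased)))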

It would be great if ILM could retry automatically (and perform any cleanup required). For the first case above, this means detecting that an index was created without a successful alias update and cleaning it up automatically before retrying. For the second case, it means that once the admin has addressed the flood-stage watermark (i.e., freed disk capacity and removed the write blocks from the indices), ILM retries automatically, without requiring the admin to remember to go back to ILM and issue a retry after removing the write blocks.
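
For the second case, the manual fix today looks roughly like the sketch below: after freeing disk space, clear the index.blocks.read_only_allow_delete block that Elasticsearch applies when the flood-stage watermark is exceeded, then remember to retry ILM on the stuck index (the index name logs-000042 is hypothetical; this illustrates the manual steps, not the proposed ILM behavior):

#!/usr/bin/env python3
import requests

ES = "http://localhost:9200"   # assumed local cluster

# After freeing disk space, remove the flood-stage write block from all
# indices; setting it to null resets the block Elasticsearch applied.
requests.put(
    "%s/_all/_settings" % ES,
    json={"index.blocks.read_only_allow_delete": None},
)

# The step the proposal would automate: retry the failed rollover step
# on the stuck index so ILM resumes.
requests.post("%s/logs-000042/_ilm/retry" % ES)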

Discussed with @gwbrown. We decided to create this separate issue to special-case rollover for automatic retries, given the implications of an unnoticed failed rollover causing other performance issues in the cluster (due to unbounded index growth until the flood stage is hit), before implementing a more generic retry strategy that works for all ILM actions.

ppf2 added the :Data Management/ILM+SLM Index and Snapshot lifecycle management label Jul 9, 2019
@elasticmachine

Pinging @elastic/es-core-features


gboddin commented Sep 18, 2019

+100. ILM currently makes it possible to break your infrastructure when it fails (i.e., end up with an immense index that requires a split).

#!/usr/bin/env python3
import requests
import os
import time
import logging

logging.basicConfig(level=logging.INFO, format='[%(asctime)s] %(levelname)s - %(message)s')
logging.info("IlmReactor started")
url = os.getenv("ES_URL", "http://localhost:9200")
# Sanity check: fail fast if the cluster is unreachable.
es_server_info = requests.get(url).json()

# Poll ILM state forever.
while True:
    ilm_details = requests.get(url + '/*/_ilm/explain').json()
    for index_name, index_details in ilm_details['indices'].items():
        if index_details["managed"] and index_details.get("step") == "ERROR":
            error_type = index_details["step_info"].get("type")
            logging.info("Found index %s with error %s" % (index_name, error_type))
            if error_type == "process_cluster_event_timeout_exception":
                # Wait until the cluster is green and no shards are relocating;
                # each request blocks for up to the 10s timeout.
                while True:
                    status = requests.get(
                        "%s/_cluster/health?wait_for_status=green&timeout=10s&wait_for_no_relocating_shards=true" % url
                    )
                    if status.status_code == 200:
                        # Cluster is healthy; leave the wait loop.
                        break
                    logging.warning("Server is unhealthy or currently moving indices ...")
                # Cluster is OK, retry ILM on this index.
                logging.info("Retrying ILM for index %s" % index_name)
                logging.info(requests.post("%s/%s/_ilm/retry" % (url, index_name)).json())
    # Avoid hammering the server.
    time.sleep(10)

That's how we handle timeouts and closed-index race conditions at the moment.

@andreidan

With #50718 and #51235 merged, we can close this issue, as all the steps in the rollover action are now retryable.
