migrations v2: check _cluster/allocation/explain when waitForIndexStatusYellow times out #118934
Labels
Feature:Migrations
impact:needs-assessment
Product and/or Engineering needs to evaluate the impact of the change.
loe:medium
Medium Level of Effort
project:ResilientSavedObjectMigrations
Reduce Kibana upgrade failures by making saved object migrations more resilient
Team:Core
Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc
Several migration steps / actions create a new blank index or create an index by cloning an existing one.
For both of these actions, if the create/clone API call reaches a timeout, we use `waitForIndexStatusYellow` to wait for the index to become allocated. If the status is still not yellow after the timeout, `waitForIndexStatusYellow` will throw a `retryable_es_client_error` so that migrations will retry the current step indefinitely.

The problem with this is that Kibana keeps retrying and eventually fails to complete the migration, but never surfaces the underlying cause to the user. Instead, if `waitForIndexStatusYellow` reaches a timeout, we should call `GET _cluster/allocation/explain?index=${targetIndex}` and log the response before retrying, so that it's clear to users what might be causing this action to fail continuously.

Based on https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-allocation-explain.html and on sample output from the API, it's not easy to parse the response into a single human-readable message. Sometimes it returns `"allocate_explanation"`, sometimes `"rebalance_explanation"`, and other times `"move_explanation"`, so it seems better to just log the entire API output.

This will most likely affect users with a cluster where disk usage exceeds the low watermark, but could also help pinpoint the problem that's preventing migrations from succeeding in other unhealthy clusters.
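A rough sketch of the proposed behaviour, to make the idea concrete. Everything here is hypothetical and not Kibana's actual code: the `Deps` interface, the `explainAllocation` wrapper (assumed to call the cluster allocation explain API for the given index), the logger shape, and the function names are all assumptions for illustration. The whole response is serialized rather than picking one `*_explanation` field, since which field is present varies.

```typescript
// Hypothetical shapes for illustration only; not Kibana's real types.
type AllocationExplainResponse = Record<string, unknown>;

interface RetryableEsClientError {
  type: 'retryable_es_client_error';
  message: string;
}

interface Deps {
  // Assumed wrapper around GET _cluster/allocation/explain for one index.
  explainAllocation: (index: string) => Promise<AllocationExplainResponse>;
  // Assumed minimal logger; only `warn` is needed here.
  log: { warn: (message: string) => void };
}

// Log the entire response instead of trying to pick one of
// allocate_explanation / rebalance_explanation / move_explanation.
function formatAllocationExplain(
  index: string,
  response: AllocationExplainResponse
): string {
  return (
    `Timed out waiting for index [${index}] to reach yellow status. ` +
    `Allocation explanation: ${JSON.stringify(response)}`
  );
}

// Called when waiting for yellow status has timed out: log diagnostics,
// then return the retryable error so the current step is retried.
async function onWaitForYellowTimeout(
  index: string,
  deps: Deps
): Promise<RetryableEsClientError> {
  try {
    const response = await deps.explainAllocation(index);
    deps.log.warn(formatAllocationExplain(index, response));
  } catch (e) {
    // Diagnostics are best-effort; a failure here must not mask the timeout.
    deps.log.warn(
      `Could not fetch allocation explanation for [${index}]: ${String(e)}`
    );
  }
  return {
    type: 'retryable_es_client_error',
    message: `Timeout waiting for index [${index}] to become yellow`,
  };
}
```

Catching errors from the explain call keeps the retry behaviour unchanged even if the cluster cannot answer the diagnostic request.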