migrations v2: check _cluster/allocation/explain when waitForIndexStatusYellow times out #118934

rudolf · 2021-11-17T19:51:49Z

The following steps / actions create a new blank index or create an index by cloning an existing one.

step	action
CLONE_TEMP_TO_TARGET	cloneIndex
CREATE_REINDEX_TEMP	createIndex

For both of these actions, if the create/clone API call times out, we use waitForIndexStatusYellow to wait for the index to become allocated. If the status is still not yellow after the timeout, waitForIndexStatusYellow throws a retryable_es_client_error so that migrations retry the current step indefinitely.

The problem is that Kibana keeps retrying and eventually fails to complete the migration, but never surfaces the underlying cause to the user. Instead, if waitForIndexStatusYellow times out, we should call GET _cluster/allocation/explain?index=${targetIndex} and log the response before retrying, so that users can see what might be causing this action to fail continuously.

Based on https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-allocation-explain.html and output like the example below, it's not easy to parse this API's response into a single human-readable message. Sometimes it returns "allocate_explanation", sometimes "rebalance_explanation", and other times "move_explanation", so it seems better to just log the entire API output.

{
  "index" : ".kibana_8.1.0_001",
  "shard" : 0,
  "primary" : true,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "INDEX_CREATED",
    "at" : "2021-11-17T16:50:07.473Z",
    "last_allocation_status" : "no"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
  "node_allocation_decisions" : [
    {
      "node_id" : "RL3Eiz1RSpmzvAsdsSw3JQ",
      "node_name" : "node-01",
      "transport_address" : "127.0.0.1:9300",
      "node_attributes" : {
        "ml.machine_memory" : "34359738368",
        "xpack.installed" : "true",
        "ml.max_jvm_size" : "1610612736"
      },
      "node_decision" : "no",
      "weight_ranking" : 1,
      "deciders" : [
        {
          "decider" : "disk_threshold",
          "decision" : "NO",
          "explanation" : "the node is above the low watermark cluster setting [cluster.routing.allocation.disk.watermark.low=85%], using more disk space than the maximum allowed [85.0%], actual free: [11.692661332965082%]"
        }
      ]
    }
  ]
}

This will most likely affect users whose cluster's disk usage exceeds the low watermark, but it could also help pinpoint the problem that's preventing migrations from succeeding in other unhealthy clusters.
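The proposal above could be sketched roughly as follows. This is a minimal illustration, not Kibana's actual migration code: `waitForYellowOrExplain`, the `AllocationExplainClient` and `Logger` interfaces, and the shape of the retryable error are all hypothetical stand-ins, and the sketch assumes the wait helper signals its timeout by throwing an error tagged `retryable_es_client_error`.

```typescript
// Hypothetical stand-in for a client that calls GET _cluster/allocation/explain
// for a single index. This is not the real Elasticsearch client API.
interface AllocationExplainClient {
  allocationExplain(index: string): Promise<unknown>;
}

// Hypothetical minimal logger.
interface Logger {
  warn(message: string): void;
}

// Assumed shape of the retryable error thrown when the wait times out.
interface RetryableEsClientError {
  type: 'retryable_es_client_error';
  message: string;
}

function isRetryableEsClientError(error: unknown): error is RetryableEsClientError {
  return (
    typeof error === 'object' &&
    error !== null &&
    (error as { type?: unknown }).type === 'retryable_es_client_error'
  );
}

// Waits for the index to become yellow. On a retryable timeout, logs the full
// allocation explanation, then rethrows so the caller still retries the step.
async function waitForYellowOrExplain(
  index: string,
  waitForIndexStatusYellow: (index: string) => Promise<void>,
  client: AllocationExplainClient,
  logger: Logger
): Promise<void> {
  try {
    await waitForIndexStatusYellow(index);
  } catch (error) {
    if (!isRetryableEsClientError(error)) {
      throw error;
    }
    try {
      const explanation = await client.allocationExplain(index);
      // Log the entire response rather than trying to pick one explanation field.
      logger.warn(
        `Index [${index}] did not become yellow. Allocation explanation:\n` +
          JSON.stringify(explanation, null, 2)
      );
    } catch (explainError) {
      // A failure of the diagnostic call must not hide the original timeout.
      logger.warn(`Could not fetch allocation explanation for [${index}]: ${String(explainError)}`);
    }
    throw error;
  }
}
```

The original error is rethrown unchanged in every case, so the retry behaviour described above stays the same and only the logging is added.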


elasticmachine · 2021-11-17T19:51:50Z

Pinging @elastic/kibana-core (Team:Core)

rudolf · 2022-04-06T11:16:39Z

Because the output of this API isn't easy to understand and act on, and because the JSON output would be even harder to read when it isn't pretty-printed as part of a server log line, I don't think this would really help users.

Instead, linking to documentation that asks users to call the cluster allocation explain API themselves could be more helpful, since they'll see the pretty-printed output.
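As a rough illustration of this alternative, the timeout message itself could point users at the documentation. The function name and message wording below are hypothetical; the URL is the one linked in the issue description, and the first sentence mirrors the error text quoted in the title of #128585.

```typescript
// Docs URL as linked in the issue description.
const ALLOCATION_EXPLAIN_DOCS =
  'https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-allocation-explain.html';

// Hypothetical helper: builds a timeout message that tells users how to
// investigate, instead of embedding the raw API response in a log line.
function yellowTimeoutMessage(index: string): string {
  return (
    `Timeout waiting for the status of the [${index}] index to become 'yellow'. ` +
    `Call the cluster allocation explain API for this index to see why it is unassigned: ` +
    ALLOCATION_EXPLAIN_DOCS
  );
}
```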

pgayvallet · 2022-04-08T09:17:12Z

I agree. Should we close this current issue then, given documenting the problem is part of #128585?

rudolf · 2022-04-11T09:00:07Z

👍 Closing in favour of #128585

rudolf added the Team:Core and project:ResilientSavedObjectMigrations labels Nov 17, 2021

rudolf mentioned this issue Nov 19, 2021

Revert "migrations: handle 200 response code from _cluster/health API… #119136

Merged

9 tasks

This was referenced Jan 14, 2022

Add error logs when preventing index creation because of low disk space elastic/elasticsearch#82617

Open

[UA] Check and validate node disk space #123040

Closed

This was referenced Jan 31, 2022

migrationsv2 fail when replica allocation is disabled #124139

Closed

Alert user with critical upgrade warning if the cluster disk space is low elastic/elasticsearch#82807

Closed

rudolf added the Feature:Migrations label Feb 8, 2022

exalate-issue-sync bot added the impact:needs-assessment, loe:small, and loe:medium labels and removed the loe:small label Feb 8, 2022

rudolf mentioned this issue Feb 23, 2022

Kibana upgrade timesout in migration. #123847

Closed

pgayvallet mentioned this issue Mar 28, 2022

migrations fail with "Timeout waiting for the status of the [.kibana_{VERSION}_reindex_temp] index to become 'yellow'" #128585

Closed

rudolf closed this as completed Apr 11, 2022
