migrations v2: check _cluster/allocation/explain when waitForIndexStatusYellow times out #118934

Closed
rudolf opened this issue Nov 17, 2021 · 4 comments
Labels
Feature:Migrations impact:needs-assessment Product and/or Engineering needs to evaluate the impact of the change. loe:medium Medium Level of Effort project:ResilientSavedObjectMigrations Reduce Kibana upgrade failures by making saved object migrations more resilient Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc

Comments

rudolf commented Nov 17, 2021

The following steps / actions create a new blank index or create an index by cloning an existing one.

| Step | Action |
| --- | --- |
| CLONE_TEMP_TO_TARGET | cloneIndex |
| CREATE_REINDEX_TEMP | createIndex |

For both of these actions, if the create/clone API call times out, we use waitForIndexStatusYellow to wait for the index to become allocated. If the status is still not yellow after the timeout, waitForIndexStatusYellow throws a retryable_es_client_error so that migrations retry the current step indefinitely.

The problem is that Kibana keeps retrying and eventually fails to complete the migration, but never surfaces the underlying cause to the user. Instead, if waitForIndexStatusYellow reaches a timeout, we should call GET _cluster/allocation/explain?index=${targetIndex} and log the response before retrying, so that users can see why this action keeps failing.
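A rough sketch of that flow, under stated assumptions: `EsLikeClient`, `waitForYellow`, `allocationExplain`, and `waitForYellowOrExplain` are hypothetical names invented for illustration and are not the real Kibana or Elasticsearch client APIs. The retryable error is returned as a value here rather than thrown, purely to keep the sketch short.

```typescript
// Hypothetical minimal client shape; the real Elasticsearch client differs.
interface EsLikeClient {
  // Resolves to true once the index is at least yellow, false on timeout.
  waitForYellow(index: string, timeout: string): Promise<boolean>;
  // Wraps a call to the cluster allocation explain API for the given index.
  allocationExplain(index: string): Promise<unknown>;
}

interface RetryableError {
  type: 'retryable_es_client_error';
  message: string;
}

type WaitResult = { ok: true } | { ok: false; error: RetryableError };

async function waitForYellowOrExplain(
  client: EsLikeClient,
  index: string,
  log: (message: string) => void,
  timeout: string = '30s' // assumed default, not taken from Kibana
): Promise<WaitResult> {
  if (await client.waitForYellow(index, timeout)) {
    return { ok: true };
  }
  // On timeout, log the whole explain response before handing back
  // the retryable error, so the cause shows up next to each retry.
  try {
    const explanation = await client.allocationExplain(index);
    log(
      `Index [${index}] is not yellow after ${timeout}. ` +
        `Allocation explain: ${JSON.stringify(explanation)}`
    );
  } catch (e) {
    // A failing explain call must not hide the original timeout.
    log(`Could not fetch allocation explain for [${index}]: ${String(e)}`);
  }
  return {
    ok: false,
    error: {
      type: 'retryable_es_client_error',
      message: `Timeout waiting for [${index}] to become yellow`,
    },
  };
}
```

Wrapping the explain call in its own try/catch means an unhealthy cluster that also rejects the explain request still ends up on the normal retry path.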

Based on https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-allocation-explain.html and output like the example below, it's not easy to condense this API's response into a single human-readable message. Sometimes it returns "allocate_explanation", sometimes "rebalance_explanation", and at other times "move_explanation", so it seems better to just log the entire API output.

{
  "index" : ".kibana_8.1.0_001",
  "shard" : 0,
  "primary" : true,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "INDEX_CREATED",
    "at" : "2021-11-17T16:50:07.473Z",
    "last_allocation_status" : "no"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
  "node_allocation_decisions" : [
    {
      "node_id" : "RL3Eiz1RSpmzvAsdsSw3JQ",
      "node_name" : "node-01",
      "transport_address" : "127.0.0.1:9300",
      "node_attributes" : {
        "ml.machine_memory" : "34359738368",
        "xpack.installed" : "true",
        "ml.max_jvm_size" : "1610612736"
      },
      "node_decision" : "no",
      "weight_ranking" : 1,
      "deciders" : [
        {
          "decider" : "disk_threshold",
          "decision" : "NO",
          "explanation" : "the node is above the low watermark cluster setting [cluster.routing.allocation.disk.watermark.low=85%], using more disk space than the maximum allowed [85.0%], actual free: [11.692661332965082%]"
        }
      ]
    }
  ]
}
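For comparison with logging the full output, here is a minimal sketch of what a condensed summary could look like. It only assumes the fields visible in the sample above plus the two alternative explanation fields mentioned earlier; the interface and the function name `summarizeAllocationExplain` are hypothetical, and real responses may contain shapes this does not cover.

```typescript
// Types cover only the fields seen in the sample response; everything
// that can vary between responses is optional.
interface Decider {
  decider: string;
  decision: string;
  explanation: string;
}

interface NodeAllocationDecision {
  node_name: string;
  deciders?: Decider[];
}

interface AllocationExplain {
  index: string;
  shard: number;
  primary: boolean;
  current_state: string;
  allocate_explanation?: string;
  rebalance_explanation?: string;
  move_explanation?: string;
  node_allocation_decisions?: NodeAllocationDecision[];
}

function summarizeAllocationExplain(r: AllocationExplain): string[] {
  // Use whichever top-level explanation field this response carries.
  const top =
    r.allocate_explanation ?? r.move_explanation ?? r.rebalance_explanation;
  const role = r.primary ? 'primary' : 'replica';
  const lines = [
    `[${r.index}][${r.shard}] ${role} is ${r.current_state}` +
      (top ? `: ${top}` : ''),
  ];
  // Add one line per decider that refused allocation on a node.
  for (const node of r.node_allocation_decisions ?? []) {
    for (const d of node.deciders ?? []) {
      if (d.decision === 'NO') {
        lines.push(`${node.node_name}: ${d.decider}: ${d.explanation}`);
      }
    }
  }
  return lines;
}
```

Even this small summary has to guess which fields matter, which illustrates why logging the raw response is the more robust choice.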

This will most likely affect users whose cluster's disk usage exceeds the low watermark, but it could also help pinpoint what is preventing migrations from succeeding in other unhealthy clusters.

@rudolf rudolf added Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc project:ResilientSavedObjectMigrations Reduce Kibana upgrade failures by making saved object migrations more resilient labels Nov 17, 2021
@elasticmachine

Pinging @elastic/kibana-core (Team:Core)

rudolf commented Apr 6, 2022

Because the output of this API isn't easy to understand and act on, and because the JSON output would be even harder to read when it's not pretty printed as part of a server log line, I don't think this would really help users.

Instead, linking to documentation that asks users to call the cluster allocation explain API for themselves could be more helpful since they'll see the pretty printed output.
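As an illustration only, a log message along those lines might be assembled like this. The function name `buildYellowTimeoutMessage` and the message wording are hypothetical, the link is a stand-in (the Elasticsearch reference page cited earlier in this issue, not a Kibana documentation page), and `shard: 0` with `primary: true` is assumed to match the sample output above.

```typescript
// Stand-in link: the Elasticsearch reference page cited earlier in this issue.
const ALLOCATION_EXPLAIN_DOCS =
  'https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-allocation-explain.html';

function buildYellowTimeoutMessage(
  index: string,
  docsUrl: string = ALLOCATION_EXPLAIN_DOCS
): string {
  // Request the user can paste into a console to see pretty-printed output.
  // Assumes the primary of shard 0 is the one that is unassigned.
  const request = [
    'GET _cluster/allocation/explain',
    JSON.stringify({ index, shard: 0, primary: true }, null, 2),
  ].join('\n');

  return [
    `Timed out waiting for index [${index}] to reach yellow status.`,
    'To find out why its primary shard is unassigned, run:',
    request,
    `See ${docsUrl}`,
  ].join('\n');
}
```

This keeps the server log line short and leaves the verbose JSON to a tool that can pretty print it.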

@pgayvallet

I agree. Should we close this current issue then, given documenting the problem is part of #128585?

rudolf commented Apr 11, 2022

👍 Closing in favour of #128585

@rudolf rudolf closed this as completed Apr 11, 2022