
migrations fail with "Timeout waiting for the status of the [.kibana_{VERSION}_reindex_temp] index to become 'yellow'" #128585

Closed
Tracked by #129016
pgayvallet opened this issue Mar 28, 2022 · 11 comments
Labels: Feature:Migrations, impact:needs-assessment, loe:medium, project:ResilientSavedObjectMigrations, Team:Core

Comments

pgayvallet commented Mar 28, 2022

Part of #129016

We've observed that some Kibana upgrades to 7.17+ can fail with:

[.kibana_task_manager] CREATE_REINDEX_TEMP -> CREATE_REINDEX_TEMP. took: 124009ms
[.kibana_task_manager] Action failed with 'Timeout waiting for the status of the [.kibana_task_manager_7.17.1_reindex_temp] index to become 'yellow''. Retrying attempt 11 in 64 seconds.

Note that whether indices get allocated is outside of Kibana's control, so this issue mostly exists to track the problem.

We know of two potential root causes:

  • Cluster hit the low watermark for disk usage #116616
  • Clusters have routing allocation disabled #124139

For both of these issues we cannot fix or work around the problem ourselves; the best we can do is make sure our logs clearly explain it so that users can fix it without opening a support ticket. One idea we have is to log the output of _cluster/allocation/explain when waitForIndexStatusYellow times out.
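
Roughly something like the following sketch (purely illustrative, assuming the 8.x Elasticsearch JS client; logAllocationExplanation is not an existing action):

```ts
import { Client } from '@elastic/elasticsearch';

// Hypothetical helper (not the actual migration action): when waiting for an
// index to become 'yellow' times out, ask Elasticsearch why the shard is
// unassigned and log the explanation alongside the timeout error.
async function logAllocationExplanation(client: Client, index: string): Promise<void> {
  try {
    const explanation = await client.cluster.allocationExplain({
      index,
      shard: 0, // assuming a single primary shard
      primary: true,
    });
    console.error(
      `Timeout waiting for the status of the [${index}] index to become 'yellow'. ` +
        `_cluster/allocation/explain: ${JSON.stringify(explanation)}`
    );
  } catch (e) {
    // The explain API can itself fail (e.g. when there is no unassigned shard),
    // so don't let it mask the original timeout error.
    console.error(`Could not fetch the allocation explanation for [${index}]: ${e}`);
  }
}
```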

We could also potentially try to surface the problem either in the Upgrade Assistant or in the health API.

pgayvallet added the Team:Core, project:ResilientSavedObjectMigrations, and Feature:Migrations labels Mar 28, 2022
@elasticmachine

Pinging @elastic/kibana-core (Team:Core)

@pgayvallet

From #129016:

We want to:

In any case:

  • Be able to identify it, and to assign a unique error code to it
  • Add online documentation describing how to fix, or work around, the failure
    • it can either be one page per failure or one page listing all the failures, TBD
  • Surface the error code, and the link to the documentation, in the failure's log

When the failure's cause can be predetermined:

  • Fail fast during the migration
  • Surface the problem in Upgrade Assistant

pgayvallet commented Mar 31, 2022

Part of it was already done in #126612: we're failing fast when cluster allocation is disabled.

We still need to add the proper error code and documentation, and surface the problem in UA.

Plus, we would need to handle the low watermark scenario.

Note: disabled allocation and low watermark should each have their own error code, as the resolutions are different.
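
For context, a minimal sketch of what the routing-allocation fail-fast check could look like (illustrative only, not the actual #126612 implementation; assumes the 8.x Elasticsearch JS client):

```ts
import { Client } from '@elastic/elasticsearch';

// Hypothetical pre-flight check, run before the migration starts:
// abort early if cluster routing allocation is not compatible with migrations.
async function checkRoutingAllocationEnabled(client: Client): Promise<void> {
  const settings = await client.cluster.getSettings({ flat_settings: true });
  const value =
    settings.transient?.['cluster.routing.allocation.enable'] ??
    settings.persistent?.['cluster.routing.allocation.enable'];

  // Any explicit value other than 'all' means new indices may not get allocated,
  // which would leave the migration stuck waiting for a 'yellow' status.
  if (value !== undefined && value !== 'all') {
    throw new Error(
      `Unable to run saved object migrations: [cluster.routing.allocation.enable: ${value}] ` +
        `prevents index allocation. Remove the setting (or set it to 'all') and restart Kibana.`
    );
  }
}
```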

exalate-issue-sync bot added the impact:needs-assessment and loe:medium labels Mar 31, 2022
exalate-issue-sync bot added the loe:large and loe:medium labels and removed the loe:medium and loe:large labels Apr 5, 2022
TinaHeiligers commented Apr 5, 2022

Still need to add the proper error code, documentation and surface the problem in UA

I commented in #129016 but will mention it here again: we need a strategy for assigning error codes; have we already aligned on one?

Also, linking to the online documentation should be easier once we've handled #126864. For now, the best we can do is link to the docs we already have or, if they've yet to be written, link to the ES docs and come back to update the logs once we have written ours.
Then again, we could just write the docs first and then tackle the log improvements.

TinaHeiligers commented Apr 5, 2022

I'm trying to figure out the best approach here to cover the most ground and thinking "aloud" about the implementation.

From how I see this, we already fail fast when there's an issue with cluster routing allocation.
That should also cover the call that creates the temporary index, or do we need an additional pre-emptive check before trying to create the index? I can't think of a way the migrations could be interrupted and cluster routing allocation changed midway.

I'll add a new action for _cluster/allocation/explain and combine that (somehow) into the existing create_index task.
@pgayvallet @rudolf How does that sound?

pgayvallet commented Apr 6, 2022

We need a strategy for assigning error codes, have we already aligned on one

Replied in #129016 (comment).

From how I'm seeing this, we already fail fast when there's an issue with cluster routing allocation.
That should be sufficient to cover the call to create the temporary index too, or do we need to also do a pre-emptive check before trying to create the index?

Imho that's sufficient. The only thing is, I think we need a mechanism to make sure that if a problem is detected, either proactively in our 'fail fast' logic OR later during the actual migration steps, we will always be able to identify the error.

E.g. for cluster allocation, we ideally want to be sure that the error will be properly identified (and the same error code used/surfaced) in both scenarios (a possible shape is sketched after the list below):

  • If we detect that cluster allocation is disabled in our 'fail fast' step
  • or if (for any reason) the error occurs during the next stages of the migration
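
One possible shape for that mechanism (purely illustrative; none of these names exist yet):

```ts
// Hypothetical shared catalogue of known migration failure causes, consulted
// both by the fail-fast checks and by the later migration steps, so the same
// identifier and doc link are surfaced regardless of where the problem is detected.
type MigrationErrorCode = 'routing_allocation_disabled' | 'index_not_yellow_timeout';

interface MigrationErrorInfo {
  code: MigrationErrorCode;
  message: string;
  docLink: string;
}

const MIGRATION_ERRORS: Record<MigrationErrorCode, Omit<MigrationErrorInfo, 'code'>> = {
  routing_allocation_disabled: {
    message: 'Cluster routing allocation is disabled; the migration cannot create new indices.',
    docLink: 'https://www.elastic.co/guide/...', // placeholder until the doc page exists
  },
  index_not_yellow_timeout: {
    message: "Timed out waiting for an index to become 'yellow'; shards could not be allocated.",
    docLink: 'https://www.elastic.co/guide/...', // placeholder until the doc page exists
  },
};

// Used from the fail-fast step *and* from the retry/timeout handlers.
const resolveMigrationError = (code: MigrationErrorCode): MigrationErrorInfo => ({
  code,
  ...MIGRATION_ERRORS[code],
});
```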

Also, linking to the online documentation should be easier once we've handled #126864

Wiring up / using the new server-side docLink service from within the migration should be trivial (it's just a matter of providing it as a dependency to the migration system), so I was thinking of doing that in the scope of this/these issues. But that's not mandatory; hardcoded links can also work temporarily.
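
A minimal sketch of what that could look like once it's wired in (the dependency shape and the link key below are hypothetical):

```ts
// Hypothetical shape of the dependency: the real server-side docLinks service
// exposes generated links under `docLinks.links`, but the key used below is
// illustrative only.
interface MigratorDeps {
  docLinks: { links: Record<string, string> };
}

function indexNotYellowTimeoutMessage({ docLinks }: MigratorDeps, index: string): string {
  const docLink = docLinks.links.resolveMigrationFailures; // illustrative key
  return (
    `Timeout waiting for the status of the [${index}] index to become 'yellow'. ` +
    `Refer to ${docLink} for instructions on how to resolve the issue.`
  );
}
```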

For now, the best we can start with is link to the docs we already have or, if they're yet to be written

Not sure I follow. Writing docs that explain how to fix or work around the problem is part of the expected outcome of these issues, see #128585 (comment):

- Add online documentation describing how to fix, or work around, the failure

rudolf commented Apr 6, 2022

I'm not sure adding an error code has much value. Error codes are usually used as a way to look up more information about an error, but if we include a doc link, that should solve the same problem. I believe there was an internal Elasticsearch discussion around introducing error codes and they decided against it.

Understanding why a shard is unassigned (i.e. why an index doesn't become yellow) is actually very hard; the ES team hopes to address that through the health API (there are some early internal notes about this).

Because the output of this API is rather hard to read, I'm not sure we would be adding much value by automatically calling it. So for the time being I think we can link to documentation that explains common reasons, like disk usage, and suggests calling the cluster allocation explain API. Once the ES team has a more useful API, we can either call that or update our documentation.

TinaHeiligers commented Apr 6, 2022

Thank you both! For now, I'm going to integrate the doclinks service and use that to point to the online documentation.

Writing docs to explain how to fix or work around the problem is part of the expected outcome of these issues

What I meant was that this might have to happen in two steps for any new doc section covering this issue, because we can't link to an online doc that doesn't exist online yet (unless that's changed). The last time I tried to add docs and link to them in the same issue, it didn't work. I'll take care of it.

output of this API is rather hard to read

So true! After spending a couple of hours adding the new task for calling the API and trying to figure out a good way to output the result, I found that the response depends on too many cluster state conditions. It's hard to predict and display in a human-readable manner.

pgayvallet commented Apr 7, 2022

Error codes are usually used as a way to lookup more information about an error, but if we include a doc link that should fix that problem. I believe there was an internal Elasticsearch discussion around introducing error codes and they decided against it

I'm not sure I see any potential downsides to having error codes, tbh (but maybe you have some?), given that we will be implementing the logic to uniquely identify them anyway in order to point to the correct documentation.

I can only think of upsides, for instance:

  • An easier way to categorize errors once we're using EBT, and better analytics (what else would we use to identify or group by failure cause in our telemetry cluster? The full error label/message?)
  • Allowing users to retrieve the documentation if for any reason they only have the failure message and not the link (e.g. 'eh Marc, please help me, I found error XXX on my cluster, what can I do?' 'well, Frank, LMGTFY')

Overall, that's a best practice widely adopted across the industry, and, as the link shows, it can even help us add proper SEO information to our documentation pages for search engines to use.

But I don't really mind either way; we can follow the usual 'consistency across the stack' philosophy here if we want to. Should the documentation team maybe weigh in here?

rudolf commented Apr 7, 2022

I was mostly trying to avoid having to come up with an error id/code naming scheme, maintain a database of unique error codes, and come up with a way to select/create new IDs.

I managed to find the discussion https://github.com/elastic/elasticsearch-adrs/pull/54/files and related design doc https://github.com/elastic/elasticsearch-adrs/issues/22

Even if ES doesn't use error codes, they do often have error labels (they call it the error type), like snapshot_in_progress_exception. Although it's possible to categorize by URL, I can also see the value in having a short piece of text that summarises what went wrong, like index_not_yellow_timeout.

ECS contains an error.id field, so we should see whether we can leverage that in our logs so that error logs can be categorized as well.
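
For instance, something along these lines (a sketch only; it assumes the migration logger accepts ECS error fields in its log meta, and index_not_yellow_timeout is just the example label from above):

```ts
import type { Logger } from '@kbn/logging';

// Sketch: surface a short error label through the ECS `error.id` field in the
// log meta, so migration failures can be grouped/queried by cause.
function logIndexNotYellowTimeout(logger: Logger, index: string): void {
  logger.error(
    `[${index}] Timeout waiting for the index status to become 'yellow'.`,
    { error: { id: 'index_not_yellow_timeout' } }
  );
}
```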

@pgayvallet

Even if ES doesn't use error codes, they do often have error labels (they call it error type) like snapshot_in_progress_exception

This would be perfectly fine with me; what I really meant by 'error id' is a unique identifier for our errors (see #129016 (comment)). Whether that identifier is a number or a label doesn't seem that relevant (actually, an error label is likely better, as it's more explicit than a number).
