
migrations fail with "Timeout waiting for the status of the [.kibana_{VERSION}_reindex_temp] index to become 'yellow'" #128585

Closed
Tracked by #129016
pgayvallet opened this issue Mar 28, 2022 · 11 comments
Labels: Feature:Migrations, impact:needs-assessment, loe:medium, project:ResilientSavedObjectMigrations, Team:Core

Comments

pgayvallet commented Mar 28, 2022

Part of #129016

We've observed that some Kibana upgrades to 7.17+ can fail with:

[.kibana_task_manager] CREATE_REINDEX_TEMP -> CREATE_REINDEX_TEMP. took: 124009ms
[.kibana_task_manager] Action failed with 'Timeout waiting for the status of the [.kibana_task_manager_7.17.1_reindex_temp] index to become 'yellow''. Retrying attempt 11 in 64 seconds.

Note that whether indices get allocated is outside of Kibana's control, so this issue mostly exists to track the problem.

We know of two potential root causes:

  • Cluster hit the low watermark for disk usage #116616
  • Clusters have routing allocation disabled #124139

For both of these issues we cannot fix or work around the problem ourselves; the best we can do is make sure our logs clearly explain it so that users can fix it without opening a support ticket. One idea we have is to log the output of _cluster/allocation/explain when waitForIndexStatusYellow times out.
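
Roughly something like the following sketch (purely illustrative, assuming the 8.x Elasticsearch JS client; logAllocationExplanation is not an existing action):

```ts
import { Client } from '@elastic/elasticsearch';

// Hypothetical helper (not the actual migration action): when waiting for an
// index to become 'yellow' times out, ask Elasticsearch why the shard is
// unassigned and log the explanation alongside the timeout error.
async function logAllocationExplanation(client: Client, index: string): Promise<void> {
  try {
    const explanation = await client.cluster.allocationExplain({
      index,
      shard: 0, // assuming a single primary shard
      primary: true,
    });
    console.error(
      `Timeout waiting for the status of the [${index}] index to become 'yellow'. ` +
        `_cluster/allocation/explain: ${JSON.stringify(explanation)}`
    );
  } catch (e) {
    // The explain API can itself fail (e.g. when there is no unassigned shard),
    // so don't let it mask the original timeout error.
    console.error(`Could not fetch the allocation explanation for [${index}]: ${e}`);
  }
}
```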

We could also potentially try to surface the problem either in the Upgrade Assistant or in the health API.

pgayvallet added the Team:Core, project:ResilientSavedObjectMigrations, and Feature:Migrations labels Mar 28, 2022
@elasticmachine

Pinging @elastic/kibana-core (Team:Core)

@pgayvallet

From #129016:

We want to:

In any case:

  • Be able to identify it, and to assign a unique error code to it
  • Add online documentation describing how to fix, or work around, the failure
    • it can either be one page per failure or one page listing all the failures, TBD
  • Surface the error code, and the link to the documentation, in the failure's log

When the failure's cause can be predetermined:

  • Fail fast during the migration
  • Surface the problem in Upgrade Assistant

pgayvallet commented Mar 31, 2022

Part of it was already done in #126612: we're failing fast when cluster allocation is disabled.

We still need to add the proper error code and documentation, and surface the problem in UA.

Plus, we would need to handle the low watermark scenario.

Note: disabled allocation and low watermark should each have their own error code, as the resolutions are different.
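
For context, a minimal sketch of what the routing-allocation fail-fast check could look like (illustrative only, not the actual #126612 implementation; assumes the 8.x Elasticsearch JS client):

```ts
import { Client } from '@elastic/elasticsearch';

// Hypothetical pre-flight check, run before the migration starts:
// abort early if cluster routing allocation is not compatible with migrations.
async function checkRoutingAllocationEnabled(client: Client): Promise<void> {
  const settings = await client.cluster.getSettings({ flat_settings: true });
  const value =
    settings.transient?.['cluster.routing.allocation.enable'] ??
    settings.persistent?.['cluster.routing.allocation.enable'];

  // Any explicit value other than 'all' means new indices may not get allocated,
  // which would leave the migration stuck waiting for a 'yellow' status.
  if (value !== undefined && value !== 'all') {
    throw new Error(
      `Unable to run saved object migrations: [cluster.routing.allocation.enable: ${value}] ` +
        `prevents index allocation. Remove the setting (or set it to 'all') and restart Kibana.`
    );
  }
}
```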

exalate-issue-sync bot added the impact:needs-assessment and loe:medium labels Mar 31, 2022
exalate-issue-sync bot added the loe:large and loe:medium labels and removed the loe:medium and loe:large labels Apr 5, 2022
TinaHeiligers commented Apr 5, 2022

Still need to add the proper error code, documentation and surface the problem in UA

I commented in #129016 but will mention it here again: we need a strategy for assigning error codes; have we already aligned on one?

Also, linking to the online documentation should be easier once we've handled #126864. For now, the best we can do is link to the docs we already have or, if they've yet to be written, link to the ES docs and come back to update the logs once we have written ours.
Then again, we could just write the docs first and then tackle the log improvements.

TinaHeiligers commented Apr 5, 2022

I'm trying to figure out the best approach here to cover the most ground and thinking "aloud" about the implementation.

From how I see this, we already fail fast when there's an issue with cluster routing allocation.
That should also cover the call that creates the temporary index, or do we need an additional pre-emptive check before trying to create the index? I can't think of a way the migrations could be interrupted and cluster routing allocation changed midway.

I'll add a new action for _cluster/allocation/explain and combine that (somehow) into the existing create_index task.
@pgayvallet @rudolf How does that sound?

pgayvallet commented Apr 6, 2022

We need a strategy for assigning error codes, have we already aligned on one

Replied in #129016 (comment).

From how I'm seeing this, we already fail fast when there's an issue with cluster routing allocation.
That should be sufficient to cover the call to create the temporary index too, or do we need to also do a pre-emptive check before trying to create the index?

Imho that's sufficient. The only thing is, I think we need a mechanism to make sure that if a problem is detected, either proactively in our 'fail fast' logic OR later during the actual migration steps, we will always be able to identify the error.

E.g. for cluster allocation, we ideally want to be sure that the error will be properly identified (and the same error code used/surfaced) in both scenarios (a possible shape is sketched after the list below):

  • If we detect that cluster allocation is disabled in our 'fail fast' step
  • or if (for any reason) the error occurs during the next stages of the migration
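
One possible shape for that mechanism (purely illustrative; none of these names exist yet):

```ts
// Hypothetical shared catalogue of known migration failure causes, consulted
// both by the fail-fast checks and by the later migration steps, so the same
// identifier and doc link are surfaced regardless of where the problem is detected.
type MigrationErrorCode = 'routing_allocation_disabled' | 'index_not_yellow_timeout';

interface MigrationErrorInfo {
  code: MigrationErrorCode;
  message: string;
  docLink: string;
}

const MIGRATION_ERRORS: Record<MigrationErrorCode, Omit<MigrationErrorInfo, 'code'>> = {
  routing_allocation_disabled: {
    message: 'Cluster routing allocation is disabled; the migration cannot create new indices.',
    docLink: 'https://www.elastic.co/guide/...', // placeholder until the doc page exists
  },
  index_not_yellow_timeout: {
    message: "Timed out waiting for an index to become 'yellow'; shards could not be allocated.",
    docLink: 'https://www.elastic.co/guide/...', // placeholder until the doc page exists
  },
};

// Used from the fail-fast step *and* from the retry/timeout handlers.
const resolveMigrationError = (code: MigrationErrorCode): MigrationErrorInfo => ({
  code,
  ...MIGRATION_ERRORS[code],
});
```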

Also, linking to the online documentation should be easier once we've handled #126864

Wiring up / using the new server-side docLink service from within the migration should be trivial (it's just a matter of providing it as a dependency to the migration system), so I was thinking of doing that in the scope of this/these issues. But that's not mandatory; hardcoded links can also work temporarily.
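
A minimal sketch of what that could look like once it's wired in (the dependency shape and the link key below are hypothetical):

```ts
// Hypothetical shape of the dependency: the real server-side docLinks service
// exposes generated links under `docLinks.links`, but the key used below is
// illustrative only.
interface MigratorDeps {
  docLinks: { links: Record<string, string> };
}

function indexNotYellowTimeoutMessage({ docLinks }: MigratorDeps, index: string): string {
  const docLink = docLinks.links.resolveMigrationFailures; // illustrative key
  return (
    `Timeout waiting for the status of the [${index}] index to become 'yellow'. ` +
    `Refer to ${docLink} for instructions on how to resolve the issue.`
  );
}
```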

For now, the best we can start with is link to the docs we already have or, if they're yet to be written

Not sure I follow. Writing docs that explain how to fix or work around the problem is part of the expected outcome of these issues, see #128585 (comment):

- Add online documentation describing how to fix, or work around, the failure

rudolf commented Apr 6, 2022

I'm not sure adding an error code has much value. Error codes are usually used as a way to look up more information about an error, but if we include a doc link, that should solve the same problem. I believe there was an internal Elasticsearch discussion around introducing error codes and they decided against it.

Understanding why a shard is unassigned (i.e. why an index doesn't become yellow) is actually very hard; the ES team hopes to address that through the health API (there are some early internal notes about this).

Because the output of this API is rather hard to read, I'm not sure we would be adding much value by automatically calling it. So for the time being I think we can link to documentation that explains common reasons, like disk usage, and suggests calling the cluster allocation explain API. Once the ES team has a more useful API, we can either call that or update our documentation.

TinaHeiligers commented Apr 6, 2022

Thank you both! For now, I'm going to integrate the doclinks service and use that to point to the online documentation.

Writing docs to explain how to fix or work around the problem is part of the expected outcome of these issues

What I meant was that this might have to happen in two steps for any new doc section covering this issue, because we can't link to an online doc that doesn't exist online yet (unless that's changed). The last time I tried to add docs and link to them in the same issue, it didn't work. I'll take care of it.

output of this API is rather hard to read

So true! After spending a couple of hours adding the new task for calling the API and trying to figure out a good way to output the result, I found that the response depends on too many cluster state conditions. It's hard to predict and display in a human-readable manner.

pgayvallet commented Apr 7, 2022

Error codes are usually used as a way to lookup more information about an error, but if we include a doc link that should fix that problem. I believe there was an internal Elasticsearch discussion around introducing error codes and they decided against it

I'm not sure I see any potential downsides to having error codes, tbh (but maybe you have some?), given that we will be implementing the logic to uniquely identify them anyway in order to point to the correct documentation.

I can only think of upsides, for instance:

  • An easier way to categorize errors once we're using EBT, and better analytics (what else would we use to identify or group by failure cause in our telemetry cluster? The full error label/message?)
  • Allowing users to retrieve the documentation if for any reason they only have the failure message and not the link (e.g. 'eh Marc, please help me, I found error XXX on my cluster, what can I do?' 'well, Frank, LMGTFY')

Overall, that's a best practice widely adopted across the industry, and, as the link shows, it can even help us add proper SEO information to our documentation pages for search engines to use.

But I don't really mind either way; we can follow the usual 'consistency across the stack' philosophy here if we want to. Should the documentation team maybe weigh in here?

rudolf commented Apr 7, 2022

I was mostly trying to avoid having to come up with an error id/code naming scheme, maintain a database of unique error codes, and come up with a way to select/create new IDs.

I managed to find the discussion https://github.com/elastic/elasticsearch-adrs/pull/54/files and related design doc https://github.com/elastic/elasticsearch-adrs/issues/22

Even if ES doesn't use error codes, they do often have error labels (they call it the error type), like snapshot_in_progress_exception. Although it's possible to categorize by URL, I can also see the value in having a short piece of text that summarises what went wrong, like index_not_yellow_timeout.

ECS contains an error.id field, so we should see whether we can leverage that in our logs so that error logs can be categorized as well.
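
For instance, something along these lines (a sketch only; it assumes the migration logger accepts ECS error fields in its log meta, and index_not_yellow_timeout is just the example label from above):

```ts
import type { Logger } from '@kbn/logging';

// Sketch: surface a short error label through the ECS `error.id` field in the
// log meta, so migration failures can be grouped/queried by cause.
function logIndexNotYellowTimeout(logger: Logger, index: string): void {
  logger.error(
    `[${index}] Timeout waiting for the index status to become 'yellow'.`,
    { error: { id: 'index_not_yellow_timeout' } }
  );
}
```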

@pgayvallet

Even if ES doesn't use error codes, they do often have error labels (they call it error type) like snapshot_in_progress_exception

This would be perfectly fine with me; what I really meant by 'error id' is a unique identifier for our errors (see #129016 (comment)). Whether that identifier is a number or a label doesn't seem that relevant (actually, an error label is likely better, as it's more explicit than a number).
