migrations fail with "Timeout waiting for the status of the [.kibana_{VERSION}_reindex_temp] index to become 'yellow'" #128585
Pinging @elastic/kibana-core (Team:Core)
From #129016:
Part of it was already done in #126612: we're failing fast when cluster allocation is disabled. We still need to add the proper error code and documentation, and surface the problem in UA. Plus, we would need to handle the low-watermark scenario. Note: disabled allocation and low watermark should each have their own error code, as the resolutions are different.
I commented in #129016 but will mention it here again: we need a strategy for assigning error codes; have we already aligned on one? Also, linking to the online documentation should be easier once we've handled #126864. For now, the best we can do is link to the docs we already have or, if they're yet to be written, link to the ES docs and come back to update the logs once we've written them.
I'm trying to figure out the best approach here to cover the most ground, and I'm thinking "aloud" about the implementation. As I see it, we already fail fast when there's an issue with cluster routing allocation. I'll add a new action for
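For context, the existing fail-fast check boils down to inspecting the cluster settings for a restrictive allocation value. A minimal sketch of that kind of check, assuming a `flat_settings`-style response from `GET _cluster/settings` (the function name and types here are illustrative, not Kibana's actual implementation):

```typescript
// Illustrative sketch, not Kibana's actual code: detect a restrictive
// `cluster.routing.allocation.enable` value from a
// `GET _cluster/settings?flat_settings=true` response.
interface ClusterSettingsResponse {
  transient: Record<string, string>;
  persistent: Record<string, string>;
}

const ALLOCATION_SETTING = 'cluster.routing.allocation.enable';

// Transient settings take precedence over persistent ones; when the setting
// is absent, Elasticsearch's default is 'all', which allows allocation.
function isAllocationDisabled(settings: ClusterSettingsResponse): boolean {
  const value =
    settings.transient[ALLOCATION_SETTING] ??
    settings.persistent[ALLOCATION_SETTING] ??
    'all';
  return value !== 'all';
}
```

Anything other than `'all'` (e.g. `'primaries'`, `'new_primaries'`, `'none'`) is treated as restricted here, since the migration needs to create and allocate new indices.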
replied in #129016 (comment)
Imho that's sufficient. The only thing is, I think we need a mechanism to make sure that if a problem is detected, either proactively in our 'fail fast' logic OR later during the actual migration steps, we will always be able to identify the error. E.g. for cluster allocation, we ideally want to be sure that the error will be properly identified (and the same error code used/surfaced) in both scenarios:
Wiring up / using the new server-side docLink service from within the migration should be trivial (it's just a matter of providing it as a dependency to the migration system), so I was thinking of doing that in the scope of this/these issues. But that's not mandatory, hardcoded links can also work temporarily.
Not sure I follow. Writing docs that explain how to fix or work around the problem is part of the expected outcome of these issues, see #128585 (comment)
I'm not sure adding an error code has much value. Error codes are usually used as a way to look up more information about an error, but if we include a doc link, that solves the same problem. I believe there was an internal Elasticsearch discussion around introducing error codes, and they decided against it. Understanding why a shard is unassigned (why an index doesn't become yellow) is actually very hard; the ES team hopes to address that through the health API (some early internal notes about this). Because the output of this API is rather hard to read, I'm not sure we would add much value by automatically calling it. So for the time being I think we can link to documentation that explains common reasons, like disk usage, and suggests calling the cluster allocation explain API. Once the ES team has a more useful API, we can either call that or update our documentation.
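To make the explain output more digestible in logs, one option is to pull out only the human-readable fields rather than dumping the whole payload. A sketch assuming the documented shape of the `_cluster/allocation/explain` response (the helper name and exact subset of fields are my own choice, not an agreed design):

```typescript
// Sketch: extract the key human-readable fields from a
// `GET _cluster/allocation/explain` response.
interface AllocationDecision {
  node_name: string;
  deciders?: Array<{ decider: string; decision: string; explanation: string }>;
}

interface AllocationExplainResponse {
  allocate_explanation?: string;
  node_allocation_decisions?: AllocationDecision[];
}

function summarizeAllocationExplain(response: AllocationExplainResponse): string[] {
  const lines: string[] = [];
  // Top-level summary, e.g. "cannot allocate because ..."
  if (response.allocate_explanation) lines.push(response.allocate_explanation);
  // Per-node decider explanations; e.g. the `disk_threshold` decider
  // explains low-watermark failures.
  for (const node of response.node_allocation_decisions ?? []) {
    for (const d of node.deciders ?? []) {
      lines.push(`[${node.node_name}] ${d.decider}: ${d.explanation}`);
    }
  }
  return lines;
}
```

Even this subset illustrates the problem raised above: the interesting information lives several levels deep and varies with cluster state.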
Thank you both! For now, I'm going to integrate the doclinks service and use that to point to the online documentation.
What I meant was that this might have to happen in two steps for the new docs section that handles this issue, because we can't link to an online doc if it doesn't exist online yet (unless that's changed). The last time I tried to add docs and link to them in the same issue, it didn't work. I'll take care of it.
So true! After spending a couple of hours adding the new task for calling the API and trying to figure out a good way to output it, I found that the response depends on too many cluster-state conditions. It's hard to predict and display in a human-readable manner.
I'm not sure I see any potential downsides of having error codes, tbh (but maybe you have some?), given that we will be implementing the logic to uniquely identify errors anyway in order to point to the correct documentation. I can only think of upsides, for instance:
Overall, that's a best practice largely adopted by the industry, and, as the link shows, it can even help add proper SEO info to our documentation pages for indexing engines to use. But I don't really mind either way; we can follow the regular 'consistency across the stack' philosophy here if we want to. Should the documentation team weigh in here, maybe?
I was mostly trying to avoid having to come up with an error id/code naming scheme, maintain a database of unique error codes, and devise a way to select/create new ids. I managed to find the discussion (https://github.com/elastic/elasticsearch-adrs/pull/54/files) and the related design doc (https://github.com/elastic/elasticsearch-adrs/issues/22). Even if ES doesn't use error codes, they do often have error labels (they call it the error type), like ECS contains an
This would be perfectly fine with me; what I really meant by 'error id' is a unique identifier for our error (see #129016 (comment)). Whether that identifier is a number or a label doesn't seem that relevant (actually, an error label is likely better, as it's more explicit than a number).
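To make the "same identifier in both scenarios" requirement concrete, here is a hypothetical sketch of a single classification helper that both the fail-fast logic and the later migration steps could share, using explicit labels rather than numeric codes. All names here are made up for illustration, not an agreed scheme:

```typescript
// Hypothetical sketch: one classifier that both the proactive 'fail fast'
// check and the migration's timeout handler call, so the same identifier
// is surfaced no matter where the problem is detected.
type MigrationErrorCode =
  | 'CLUSTER_ROUTING_ALLOCATION_DISABLED'
  | 'LOW_DISK_WATERMARK_EXCEEDED'
  | 'UNKNOWN';

interface ClusterConditions {
  allocationEnabled: boolean;     // derived from cluster settings
  lowWatermarkExceeded: boolean;  // derived from disk usage stats
}

// Each root cause gets its own label because the resolutions differ.
function classifyAllocationFailure(conditions: ClusterConditions): MigrationErrorCode {
  if (!conditions.allocationEnabled) return 'CLUSTER_ROUTING_ALLOCATION_DISABLED';
  if (conditions.lowWatermarkExceeded) return 'LOW_DISK_WATERMARK_EXCEEDED';
  return 'UNKNOWN';
}
```

The label could then drive both the log message and the doc link lookup, avoiding any separate code registry.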
Part of #129016
We've observed that some Kibana upgrades to 7.17+ can fail with a "Timeout waiting for the status of the [.kibana_{VERSION}_reindex_temp] index to become 'yellow'" error. Note that it's outside of Kibana's control if indices aren't allocated, so this issue is mostly here to track the problem.
We know of two potential root causes:
For both of these root causes we cannot fix or work around the problem; the best we can do is make sure our logs clearly explain the problem so that users can fix it without opening a support ticket. One idea we have is to log the output of _cluster/allocation/explain when waitForIndexStatusYellow times out.
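A rough sketch of that idea, wrapping the wait so the explain output is logged before re-throwing the timeout. All function names here are hypothetical, not the actual migration API:

```typescript
// Hypothetical sketch: when waiting for the temp index to become yellow
// times out, call `_cluster/allocation/explain` and log its output before
// failing, so users can self-diagnose (e.g. disabled allocation or an
// exceeded low disk watermark).
async function waitForYellowOrExplain(
  waitForIndexStatusYellow: () => Promise<void>,
  fetchAllocationExplain: () => Promise<unknown>,
  log: (msg: string) => void
): Promise<void> {
  try {
    await waitForIndexStatusYellow();
  } catch (e) {
    const explanation = await fetchAllocationExplain();
    log(`Index did not become yellow. Allocation explain: ${JSON.stringify(explanation)}`);
    throw e; // still fail the migration; we only enriched the logs
  }
}
```

The migration itself still fails, since Kibana cannot fix allocation; the wrapper only makes the failure self-explanatory.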
We could also potentially try to surface the problem either in the upgrade assistant or in the health API.