
Measure impact of enabling UA on upgrade failures #148765

Closed
exalate-issue-sync bot opened this issue Jan 11, 2023 · 4 comments
Assignees
Labels
Feature:Upgrade Assistant Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc

Comments


exalate-issue-sync bot commented Jan 11, 2023

Issue scope:

  1. Determine success metrics
  2. Add missing measurements through EBT, telemetry, and logs
  3. Assess the impact on 8.6+ upgrades
  4. Recommendation and follow-ups

Determining success metrics: How do we know we’ve achieved the outcomes here?

Assess the impact on 8.6+ upgrades: create dashboards + show results

  • Use current upgrade failures Kibana dashboards
  • Build new visualizations for UA impact on upgrade failures
  • On staging: ingest EBT metrics and create EBT visualizations

Add missing measurements: EBT + snapshot telemetry

  • Is there a benefit to using EBT to report on the number of blocked upgrades caught by UA?
  • Do Upgrade Assistant checks show up in proxy logs? Could we use the status code to detect the number of blocked upgrades as a positive validation?
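A positive validation via proxy logs could, as a sketch, key off the Upgrade Assistant status endpoint that the UI calls. The endpoint path is from the Kibana Upgrade Assistant API; the response shown is illustrative, not a captured log:

```console
# Kibana Upgrade Assistant readiness check (Dev Tools syntax; the
# "kbn:" prefix routes the request to Kibana rather than Elasticsearch).
GET kbn:/api/upgrade_assistant/status

# Illustrative response when critical deprecations remain:
# {
#   "readyForUpgrade": false,
#   "details": "..."
# }
```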

Recommendations and follow-ups: What's next?

  • On Cloud, use of the Upgrade Assistant is enforced for minor upgrades; how do we educate on-prem users on the benefit of checking UA before a minor upgrade?
  • Do we need to take further action to reduce failures, or is this satisfactory as we move further into serverless?
@botelastic botelastic bot added the needs-team (Issues missing a team label) label Jan 11, 2023
@rayafratkina rayafratkina added the Team:Core (Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc) and Epic:KBNA-59 labels Jan 11, 2023
@elasticmachine
Contributor

Pinging @elastic/kibana-core (Team:Core)


rudolf commented May 2, 2023

I validated this by looking at the service constructor logs for upgrades with current_stack_version >= 8.6.1. There were 6 upgrade failures, and all of them failed because the cluster exceeded the shard limit. So this has not achieved the desired outcome.

Looking at the code, Kibana assumes both conditions have a "critical" level, but this is only true for the disk space watermarks; the shard limit creates a "warning" level deprecation: https://github.com/elastic/elasticsearch/blob/main/x-pack/plugin/deprecation/src/test/java/org/elasticsearch/xpack/deprecation/ClusterDeprecationChecksTests.java#L76
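For reference, the deprecation level is visible in the Elasticsearch deprecation info API that UA consumes. A sketch of what a shard-limited cluster might return (the response excerpt is trimmed and illustrative; only the field names and the `level` values come from the deprecation info API):

```console
GET /_migration/deprecations

# Illustrative excerpt: the shard-limit check surfaces as "warning",
# not "critical", so UA does not block the upgrade on it.
# {
#   "cluster_settings": [
#     { "level": "warning", "message": "...", ... }
#   ]
# }
```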

We should fix this bug by adopting the new shard limit indicator in the health API #153051 and repeat the validation.
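The shards capacity indicator mentioned in #153051 can be queried directly from the health API (available since ES 8.8); a minimal sketch:

```console
# Ask the health API only for the shards_capacity indicator, which
# reports green/yellow/red status against the cluster shard limits.
GET /_health_report/shards_capacity
```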


rayafratkina commented Apr 1, 2024

@rudolf @Bamieh have we checked if the new implementation worked as expected? Looks like the last change went out with 8.10, so we should have enough data to make this determination

@rudolf rudolf assigned rudolf and unassigned Bamieh Jul 23, 2024

rudolf commented Jul 25, 2024

I've analysed upgrades from 8.10 to 8.x in the last 3 months. We have not had any cluster_shard_limit_exceeded failures so we have achieved the outcome we aimed for 🥳

If an upgrade fails due to insufficient disk space, migrations fail with an unavailable shards error, but that error can have other causes too. We continue to see many unavailable shards errors, but in all analysed failures these came from indices that existed before the upgrade. For example, a user upgrades from 8.10 to 8.14 and we re-use the .kibana_analytics_8.6.0_001 index; that index has an unassigned shard from before the upgrade, which causes the upgrade to fail.

It's quite hard to establish the cause of an unassigned shard once it has been resolved, but in none of the cases I analysed did ES report high disk watermark warnings. So I'm reasonably confident that we achieved that outcome too.
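For future triage, while a shard is still unassigned its cause can be inspected with the cluster allocation explain API; a sketch against the index from the example above (the index/shard values are illustrative):

```console
GET /_cluster/allocation/explain
{
  "index": ".kibana_analytics_8.6.0_001",
  "shard": 0,
  "primary": true
}
# The response's "unassigned_info.reason" and per-node allocation
# deciders show why the shard cannot be allocated
# (e.g. disk watermark exceeded).
```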

@rudolf rudolf closed this as completed Jul 25, 2024