
Measure impact of enabling UA on upgrade failures #148765

Closed
exalate-issue-sync bot opened this issue Jan 11, 2023 · 4 comments
Assignees
Labels
Feature:Upgrade Assistant Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc

Comments


exalate-issue-sync bot commented Jan 11, 2023

Issue scope:

  1. Determine success metrics
  2. Add missing measurements through EBT, telemetry, and logs
  3. Assess the impact on 8.6+ upgrades
  4. Recommendation and follow-ups

Determining success metrics: How do we know we’ve achieved the outcomes here?

Assess the impact on 8.6+ upgrades: create dashboards + show results

  • Use current upgrade failures Kibana dashboards
  • Build new visualizations for UA impact on upgrade failures
  • On staging: ingest EBT metrics and create EBT visualizations

Add missing measurements: EBT + snapshot telemetry

  • Is there a benefit to using EBT to report on the number of blocked upgrades caught by UA?
  • Do Upgrade Assistant checks show up in proxy logs? Could we use the status code to detect the number of blocked upgrades as a positive validation?
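A positive validation via proxy logs could, as a sketch, key off the Upgrade Assistant status endpoint that the UI calls. The endpoint path is from the Kibana Upgrade Assistant API; the response shown is illustrative, not a captured log:

```console
# Kibana Upgrade Assistant readiness check (Dev Tools syntax; the
# "kbn:" prefix routes the request to Kibana rather than Elasticsearch).
GET kbn:/api/upgrade_assistant/status

# Illustrative response when critical deprecations remain:
# {
#   "readyForUpgrade": false,
#   "details": "..."
# }
```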

Recommendations and follow-ups: What's next?

  • On Cloud, use of the Upgrade Assistant is enforced for minor upgrades; how do we educate on-prem users on the benefit of checking UA before a minor upgrade?
  • Do we need to take further action to reduce failures, or is this satisfactory as we move further into serverless?
@botelastic botelastic bot added the needs-team (Issues missing a team label) label Jan 11, 2023
@rayafratkina rayafratkina added the Team:Core (Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc) and Epic:KBNA-59 labels Jan 11, 2023
@elasticmachine
Contributor

Pinging @elastic/kibana-core (Team:Core)


rudolf commented May 2, 2023

I validated this by looking at the service constructor logs for upgrades with current_stack_version >= 8.6.1. There were 6 upgrade failures, and all of them failed because the cluster exceeded the shard limit. So this has not achieved the desired outcome.

Looking at the code, Kibana assumes both conditions have a "critical" level, but this is only true for the disk space watermarks; the shard limit creates a "warning" level deprecation: https://github.com/elastic/elasticsearch/blob/main/x-pack/plugin/deprecation/src/test/java/org/elasticsearch/xpack/deprecation/ClusterDeprecationChecksTests.java#L76
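For reference, the deprecation level is visible in the Elasticsearch deprecation info API that UA consumes. A sketch of what a shard-limited cluster might return (the response excerpt is trimmed and illustrative; only the field names and the `level` values come from the deprecation info API):

```console
GET /_migration/deprecations

# Illustrative excerpt: the shard-limit check surfaces as "warning",
# not "critical", so UA does not block the upgrade on it.
# {
#   "cluster_settings": [
#     { "level": "warning", "message": "...", ... }
#   ]
# }
```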

We should fix this bug by adopting the new shard limit indicator in the health API #153051 and repeat the validation.
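The shards capacity indicator mentioned in #153051 can be queried directly from the health API (available since ES 8.8); a minimal sketch:

```console
# Ask the health API only for the shards_capacity indicator, which
# reports green/yellow/red status against the cluster shard limits.
GET /_health_report/shards_capacity
```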


rayafratkina commented Apr 1, 2024

@rudolf @Bamieh have we checked if the new implementation worked as expected? Looks like the last change went out with 8.10, so we should have enough data to make this determination

@rudolf rudolf assigned rudolf and unassigned Bamieh Jul 23, 2024

rudolf commented Jul 25, 2024

I've analysed upgrades from 8.10 to 8.x in the last 3 months. We have not had any cluster_shard_limit_exceeded failures so we have achieved the outcome we aimed for 🥳

If an upgrade fails due to insufficient disk space, migrations fail with an unavailable shards error, but that error can have other causes too. We continue to see many unavailable shards errors, but in all analysed failures these came from indices that existed before the upgrade. For example, a user upgrades from 8.10 to 8.14 and we re-use the .kibana_analytics_8.6.0_001 index; that index has an unassigned shard from before the upgrade, which causes the upgrade to fail.

It's quite hard to establish the cause of an unassigned shard once it has been resolved, but in none of the cases I analysed did ES report high disk watermark warnings. So I'm reasonably confident that we achieved that outcome too.
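For future triage, while a shard is still unassigned its cause can be inspected with the cluster allocation explain API; a sketch against the index from the example above (the index/shard values are illustrative):

```console
GET /_cluster/allocation/explain
{
  "index": ".kibana_analytics_8.6.0_001",
  "shard": 0,
  "primary": true
}
# The response's "unassigned_info.reason" and per-node allocation
# deciders show why the shard cannot be allocated
# (e.g. disk watermark exceeded).
```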

@rudolf rudolf closed this as completed Jul 25, 2024