Scale upgrade migrations to millions of saved objects #144035

Closed
rudolf opened this issue Oct 26, 2022 · 3 comments
Labels: Epic:ScaleMigrations, Feature:Migrations, loe:x-large (Extra Large Level of Effort), Team:Core (Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc.)

Comments

@rudolf
Contributor

rudolf commented Oct 26, 2022

When designing the v2 migration algorithm, our objective was to keep downtime under 10 minutes for 100k saved objects.

We have since learned that some customers have clusters with millions of saved objects, approaching 10m saved objects (11k spaces with 3.7m visualisations).

Compounding the problem, some of these customers use alerting in a way that makes them very sensitive to downtime.

This issue describes our plans for reducing the downtime of upgrade and startup migrations for clusters at this scale.

Phases:

  1. Reduce startup time by not running "patching migrations" every time Kibana is started.
    We will do this by comparing the md5sums of the fields for each of the saved object types. If the md5sums match, we will not perform the UPDATE_TARGET_MAPPINGS step and the associated updateAndPickupMappings action. (We will still run the OUTDATED_DOCUMENTS_* steps because documents can sometimes be outdated even when the mappings weren't changed.) See "Reduce startup time by skipping update mappings step" #145743, and the sketch after this list.
  2. Prevent convertToMultiNamespaceType migrations on versions > 8.0.0. See "Prevent future convertToMultiNamespaceType migrations" #147344.
  3. Don't run a "full migration" on every upgrade. If the mappings haven't changed, we don't need a new index and can just run the OUTDATED_DOCUMENTS_* steps. See "Only migrate an index if necessary" #124946.
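A minimal sketch of the hash check behind (1), assuming the stored hashes come from `.kibana`'s `_mapping._meta.migrationMappingPropertyHashes` and the current hashes are computed from the mappings of the registered saved object types; the names and hashing details are illustrative, not the actual Kibana implementation:

```ts
import { createHash } from 'crypto';

// Hash one type's field mappings (illustrative; the real hashes cover each
// root-level mappings property of the saved objects index).
const md5OfMappings = (fieldMappings: unknown): string =>
  createHash('md5').update(JSON.stringify(fieldMappings)).digest('hex');

// If every registered type's hash matches the hash stored in the index _meta,
// the mappings are unchanged and UPDATE_TARGET_MAPPINGS (and the expensive
// updateAndPickupMappings action) can be skipped.
const canSkipUpdateTargetMappings = (
  storedHashes: Record<string, string>,
  currentMappings: Record<string, unknown>
): boolean =>
  Object.entries(currentMappings).every(
    ([type, mappings]) => storedHashes[type] === md5OfMappings(mappings)
  );
```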

Before:
[screenshot, 2023-05-02 at 20:22]

After:
[screenshot, 2023-05-02 at 20:24]

Further proposed changes are covered by:

  4. Split some saved object types out of the .kibana index into separate indices (KBNA-4545: https://elasticco.atlassian.net/browse/KBNA-4545; "[dot-kibana-split] Allow relocating SO to different indices during migration" #154846).

While (2) can reduce downtime for some upgrades, if just one saved object type defines a migration, all saved objects still need to be migrated. E.g. there might be no cases migrations defined, but there is a dashboard migration, which then requires us to migrate all 1m cases. This change would mean we only migrate the 1m cases if there is a cases migration defined. While this reduces the average downtime per upgrade, it does introduce unpredictability for users, where some upgrades are fast and others cause 10 minutes of downtime.

  5. Only reindex into a new index if there were incompatible mappings changes between releases (KBNA-9053: https://elasticco.atlassian.net/browse/KBNA-9053; #149326).

  6. Only run the pickup-mappings update_by_query on the saved object types that had mappings changes ("[Migrations] Only pickup updated SO types when performing a compatible migration" #159962); see the sketch below.
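As a rough sketch of (6), not the actual implementation: the pick-up step boils down to an `update_by_query` that can be limited to the saved object types whose mappings changed, instead of touching every document in the index. The client call uses the standard `@elastic/elasticsearch` v8 API; the function and variable names are illustrative.

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Pick up new mappings only for the types that changed, instead of running
// the update_by_query over every document in the index.
async function pickupChangedTypes(index: string, changedTypes: string[]) {
  if (changedTypes.length === 0) return; // nothing changed, nothing to pick up
  await client.updateByQuery({
    index,
    query: { terms: { type: changedTypes } }, // saved objects store their type in the `type` field
    conflicts: 'proceed',
    refresh: true,
  });
}

// With only a dashboard mappings change, the 1m cases documents are left untouched:
// await pickupChangedTypes('.kibana', ['dashboard']);
```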
rudolf added the Team:Core and Feature:Migrations labels Oct 26, 2022
@elasticmachine
Contributor

Pinging @elastic/kibana-core (Team:Core)

rudolf added the loe:x-large (Extra Large Level of Effort) label Oct 26, 2022
gsoldevila added a commit that referenced this issue Nov 28, 2022
Reduce startup time by skipping update mappings step when possible (#145604)

The goal of this PR is to reduce the startup time of the Kibana server by
improving the migration logic.

Fixes #145743
Related to #144035

The migration logic runs systematically at startup, whether or not the
customer is upgrading.
Historically, these steps have been very quick, but we recently learned
that some customers have more than **one million** saved objects stored,
making the overall startup process slow even when there are no migrations
to perform.

This PR specifically targets the case where there are no migrations to
perform, i.e. a Kibana node is started against an ES cluster that is
already up to date with respect to the stack version and the list of plugins.

In this scenario, we aim to skip the `UPDATE_TARGET_MAPPINGS` step of the
migration logic, which internally runs the `updateAndPickupMappings`
method; that method turns out to be expensive when the system indices
contain lots of saved objects.


I also tested the following scenarios locally:

- **Fresh install.** The step is not even run, as the `.kibana` index
did not exist ✅
- **Stack version + list of plugins up to date.** Simply restarting
Kibana after the fresh install. The step is run and leads to `DONE`, as
the md5 hashes match those stored in `.kibana._mapping._meta` ✅
- **Faking re-enabling an old plugin.** I manually removed one of the
MD5 hashes from the stored .kibana._mapping._meta through `curl`, and
then restarted Kibana. The step is run and leads to
`UPDATE_TARGET_MAPPINGS` as it used to before the PR ✅
- **Faking updating a plugin.** Same as the previous one, but altering
an existing md5 stored in the metas. ✅

This is the curl command used to tamper with the stored `_meta`:
```bash
curl -X PUT "kibana:changeme@localhost:9200/.kibana/_mapping?pretty" -H 'Content-Type: application/json' -d'
{
  "_meta": {
      "migrationMappingPropertyHashes": {
        "references": "7997cf5a56cc02bdc9c93361bde732b0",
      }
  }
}
'
```
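For reference, the stored hashes that the skip check compares against can be read back from the index metadata. A small sketch using the Elasticsearch JS client, mirroring the index name and credentials from the curl example above (not part of the PR itself):

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({
  node: 'http://localhost:9200',
  auth: { username: 'kibana', password: 'changeme' },
});

// Print the per-type MD5 hashes stored in .kibana's mapping _meta.
async function printStoredHashes() {
  const response = await client.indices.getMapping({ index: '.kibana' });
  for (const [index, record] of Object.entries(response)) {
    console.log(index, record.mappings?._meta?.migrationMappingPropertyHashes);
  }
}

printStoredHashes().catch(console.error);
```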
gsoldevila referenced this issue Nov 30, 2022
Reduce startup time by skipping update mappings step when possible (#145604) (#146637)

# Backport

This will backport the following commits from `main` to `8.6`:
- [Reduce startup time by skipping update mappings step when possible (#145604)](#145604)

<!--- Backport version: 8.9.7 -->

### Questions ?
Please refer to the [Backport tool
documentation](https://github.com/sqren/backport)

<!--BACKPORT [{"author":{"name":"Gerard
Soldevila","email":"[email protected]"},"sourceCommit":{"committedDate":"2022-11-28T14:34:58Z","message":"Reduce
startup time by skipping update mappings step when possible
(#145604)\n\nThe goal of this PR is to reduce the startup times of
Kibana server by\r\nimproving the migration logic.\r\n\r\nFixes
https://github.com/elastic/kibana/issues/145743\r\nRelated
https://github.com/elastic/kibana/issues/144035)\r\n\r\nThe migration
logic is run systematically at startup, whether the\r\ncustomers are
upgrading or not.\r\nHistorically, these steps have been very quick, but
we recently found\r\nout about some customers that have more than **one
million** Saved\r\nObjects stored, making the overall startup process
slow, even when there\r\nare no migrations to perform.\r\n\r\nThis PR
specifically targets the case where there are no migrations
to\r\nperform, aka a Kibana node is started against an ES cluster that
is\r\nalready up to date wrt stack version and list of
plugins.\r\n\r\nIn this scenario, we aim at skipping the
`UPDATE_TARGET_MAPPINGS` step\r\nof the migration logic, which
internally runs the\r\n`updateAndPickupMappings` method, which turns out
to be expensive if the\r\nsystem indices contain lots of
SO.\r\n\r\n\r\nI locally tested the following scenarios too:\r\n\r\n-
**Fresh install.** The step is not even run, as the `.kibana`
index\r\ndid not exist ✅\r\n- **Stack version + list of plugins up to
date.** Simply restarting\r\nKibana after the fresh install. The step is
run and leads to `DONE`, as\r\nthe md5 hashes match those stored in
`.kibana._mapping._meta` ✅\r\n- **Faking re-enabling an old plugin.** I
manually removed one of the\r\nMD5 hashes from the stored
.kibana._mapping._meta through `curl`, and\r\nthen restarted Kibana. The
step is run and leads to\r\n`UPDATE_TARGET_MAPPINGS` as it used to
before the PR ✅\r\n- **Faking updating a plugin.** Same as the previous
one, but altering\r\nan existing md5 stored in the metas. ✅\r\n\r\nAnd
that is the curl command used to tamper with the stored
_meta:\r\n```bash\r\ncurl -X PUT
\"kibana:changeme@localhost:9200/.kibana/_mapping?pretty\" -H
'Content-Type: application/json' -d'\r\n{\r\n \"_meta\": {\r\n
\"migrationMappingPropertyHashes\": {\r\n \"references\":
\"7997cf5a56cc02bdc9c93361bde732b0\",\r\n }\r\n
}\r\n}\r\n'\r\n```","sha":"b1e18a0414ed99456706119d15173b687c6e7366","branchLabelMapping":{"^v8.7.0$":"main","^v(\\d+).(\\d+).\\d+$":"$1.$2"}},"sourcePullRequest":{"labels":["Team:Core","enhancement","release_note:skip","Feature:Migrations","backport:prev-minor","v8.7.0"],"number":145604,"url":"https://github.com/elastic/kibana/pull/145604","mergeCommit":{"message":"Reduce
startup time by skipping update mappings step when possible
(#145604)\n\nThe goal of this PR is to reduce the startup times of
Kibana server by\r\nimproving the migration logic.\r\n\r\nFixes
https://github.com/elastic/kibana/issues/145743\r\nRelated
https://github.com/elastic/kibana/issues/144035)\r\n\r\nThe migration
logic is run systematically at startup, whether the\r\ncustomers are
upgrading or not.\r\nHistorically, these steps have been very quick, but
we recently found\r\nout about some customers that have more than **one
million** Saved\r\nObjects stored, making the overall startup process
slow, even when there\r\nare no migrations to perform.\r\n\r\nThis PR
specifically targets the case where there are no migrations
to\r\nperform, aka a Kibana node is started against an ES cluster that
is\r\nalready up to date wrt stack version and list of
plugins.\r\n\r\nIn this scenario, we aim at skipping the
`UPDATE_TARGET_MAPPINGS` step\r\nof the migration logic, which
internally runs the\r\n`updateAndPickupMappings` method, which turns out
to be expensive if the\r\nsystem indices contain lots of
SO.\r\n\r\n\r\nI locally tested the following scenarios too:\r\n\r\n-
**Fresh install.** The step is not even run, as the `.kibana`
index\r\ndid not exist ✅\r\n- **Stack version + list of plugins up to
date.** Simply restarting\r\nKibana after the fresh install. The step is
run and leads to `DONE`, as\r\nthe md5 hashes match those stored in
`.kibana._mapping._meta` ✅\r\n- **Faking re-enabling an old plugin.** I
manually removed one of the\r\nMD5 hashes from the stored
.kibana._mapping._meta through `curl`, and\r\nthen restarted Kibana. The
step is run and leads to\r\n`UPDATE_TARGET_MAPPINGS` as it used to
before the PR ✅\r\n- **Faking updating a plugin.** Same as the previous
one, but altering\r\nan existing md5 stored in the metas. ✅\r\n\r\nAnd
that is the curl command used to tamper with the stored
_meta:\r\n```bash\r\ncurl -X PUT
\"kibana:changeme@localhost:9200/.kibana/_mapping?pretty\" -H
'Content-Type: application/json' -d'\r\n{\r\n \"_meta\": {\r\n
\"migrationMappingPropertyHashes\": {\r\n \"references\":
\"7997cf5a56cc02bdc9c93361bde732b0\",\r\n }\r\n
}\r\n}\r\n'\r\n```","sha":"b1e18a0414ed99456706119d15173b687c6e7366"}},"sourceBranch":"main","suggestedTargetBranches":[],"targetPullRequestStates":[{"branch":"main","label":"v8.7.0","labelRegex":"^v8.7.0$","isSourceBranch":true,"state":"MERGED","url":"https://github.com/elastic/kibana/pull/145604","number":145604,"mergeCommit":{"message":"Reduce
startup time by skipping update mappings step when possible
(#145604)\n\nThe goal of this PR is to reduce the startup times of
Kibana server by\r\nimproving the migration logic.\r\n\r\nFixes
https://github.com/elastic/kibana/issues/145743\r\nRelated
https://github.com/elastic/kibana/issues/144035)\r\n\r\nThe migration
logic is run systematically at startup, whether the\r\ncustomers are
upgrading or not.\r\nHistorically, these steps have been very quick, but
we recently found\r\nout about some customers that have more than **one
million** Saved\r\nObjects stored, making the overall startup process
slow, even when there\r\nare no migrations to perform.\r\n\r\nThis PR
specifically targets the case where there are no migrations
to\r\nperform, aka a Kibana node is started against an ES cluster that
is\r\nalready up to date wrt stack version and list of
plugins.\r\n\r\nIn this scenario, we aim at skipping the
`UPDATE_TARGET_MAPPINGS` step\r\nof the migration logic, which
internally runs the\r\n`updateAndPickupMappings` method, which turns out
to be expensive if the\r\nsystem indices contain lots of
SO.\r\n\r\n\r\nI locally tested the following scenarios too:\r\n\r\n-
**Fresh install.** The step is not even run, as the `.kibana`
index\r\ndid not exist ✅\r\n- **Stack version + list of plugins up to
date.** Simply restarting\r\nKibana after the fresh install. The step is
run and leads to `DONE`, as\r\nthe md5 hashes match those stored in
`.kibana._mapping._meta` ✅\r\n- **Faking re-enabling an old plugin.** I
manually removed one of the\r\nMD5 hashes from the stored
.kibana._mapping._meta through `curl`, and\r\nthen restarted Kibana. The
step is run and leads to\r\n`UPDATE_TARGET_MAPPINGS` as it used to
before the PR ✅\r\n- **Faking updating a plugin.** Same as the previous
one, but altering\r\nan existing md5 stored in the metas. ✅\r\n\r\nAnd
that is the curl command used to tamper with the stored
_meta:\r\n```bash\r\ncurl -X PUT
\"kibana:changeme@localhost:9200/.kibana/_mapping?pretty\" -H
'Content-Type: application/json' -d'\r\n{\r\n \"_meta\": {\r\n
\"migrationMappingPropertyHashes\": {\r\n \"references\":
\"7997cf5a56cc02bdc9c93361bde732b0\",\r\n }\r\n
}\r\n}\r\n'\r\n```","sha":"b1e18a0414ed99456706119d15173b687c6e7366"}}]}]
BACKPORT-->
rudolf added the Epic:ScaleMigrations label Jan 17, 2023
@exalate-issue-sync exalate-issue-sync bot changed the title Scale upgrade migrations to millions of saved objects Scale upgrade migrations to millions of saved objects: phase 1, limit migrations Feb 10, 2023
@exalate-issue-sync exalate-issue-sync bot changed the title Scale upgrade migrations to millions of saved objects: phase 1, limit migrations Scale upgrade migrations to millions of saved objects: phase 1-3, limit migrations Feb 10, 2023
@lukeelmers
Member

Is this safe to close now that (4) and (5) are done?

@rudolf rudolf changed the title Scale upgrade migrations to millions of saved objects: phase 1-3, limit migrations Scale upgrade migrations to millions of saved objects Sep 23, 2023
@rudolf
Contributor Author

rudolf commented Sep 23, 2023

Yes, once users upgrade to 8.8, subsequent upgrade migrations should be a lot faster and more scalable.
