Scale upgrade migrations to millions of saved objects #144035

Closed
rudolf opened this issue Oct 26, 2022 · 3 comments
Labels: Epic:ScaleMigrations, Feature:Migrations, loe:x-large (Extra Large Level of Effort), Team:Core (Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc.)

Comments

@rudolf
Contributor

rudolf commented Oct 26, 2022

When designing the v2 migration algorithm, our objective was to keep downtime under 10 minutes for 100k saved objects.

We have since learned that some customers have clusters with millions of saved objects, approaching 10m saved objects (11k spaces with 3.7m visualisations).

Compounding the problem, some of these customers use alerting in a way that makes them very sensitive to downtime.

This issue describes our plans for reducing the downtime of upgrade and startup migrations for clusters at this scale.

Phases:

  1. Reduce startup time by not running "patching migrations" every time Kibana is started.
    We will do this by comparing the md5sums of the fields for each of the saved object types. If the md5sums match, we will not perform the UPDATE_TARGET_MAPPINGS step and the associated updateAndPickupMappings action. (We will still run the OUTDATED_DOCUMENTS_* steps because documents can sometimes be outdated even when the mappings weren't changed.) See "Reduce startup time by skipping update mappings step" #145743, and the sketch after this list.
  2. Prevent convertToMultiNamespaceType migrations on versions > 8.0.0. See "Prevent future convertToMultiNamespaceType migrations" #147344.
  3. Don't run a "full migration" on every upgrade. If the mappings haven't changed, we don't need a new index and can just run the OUTDATED_DOCUMENTS_* steps. See "Only migrate an index if necessary" #124946.
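A minimal sketch of the hash check behind (1), assuming the stored hashes come from `.kibana`'s `_mapping._meta.migrationMappingPropertyHashes` and the current hashes are computed from the mappings of the registered saved object types; the names and hashing details are illustrative, not the actual Kibana implementation:

```ts
import { createHash } from 'crypto';

// Hash one type's field mappings (illustrative; the real hashes cover each
// root-level mappings property of the saved objects index).
const md5OfMappings = (fieldMappings: unknown): string =>
  createHash('md5').update(JSON.stringify(fieldMappings)).digest('hex');

// If every registered type's hash matches the hash stored in the index _meta,
// the mappings are unchanged and UPDATE_TARGET_MAPPINGS (and the expensive
// updateAndPickupMappings action) can be skipped.
const canSkipUpdateTargetMappings = (
  storedHashes: Record<string, string>,
  currentMappings: Record<string, unknown>
): boolean =>
  Object.entries(currentMappings).every(
    ([type, mappings]) => storedHashes[type] === md5OfMappings(mappings)
  );
```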

Before:
[screenshot, 2023-05-02 at 20:22]

After:
[screenshot, 2023-05-02 at 20:24]

Further proposed changes are covered by:

  4. Split some saved object types out of the .kibana index into separate indices (KBNA-4545: https://elasticco.atlassian.net/browse/KBNA-4545; "[dot-kibana-split] Allow relocating SO to different indices during migration" #154846).

While (2) can reduce downtime for some upgrades, if just one saved object type defines a migration, all saved objects still need to be migrated. E.g. there might be no cases migrations defined, but there is a dashboard migration, which then requires us to migrate all 1m cases. This change would mean we only migrate the 1m cases if there is a cases migration defined. While this reduces the average downtime per upgrade, it does introduce unpredictability for users, where some upgrades are fast and others cause 10 minutes of downtime.

  5. Only reindex into a new index if there were incompatible mappings changes between releases (KBNA-9053: https://elasticco.atlassian.net/browse/KBNA-9053; #149326).

  6. Only run the pickup-mappings update_by_query on the saved object types that had mappings changes ("[Migrations] Only pickup updated SO types when performing a compatible migration" #159962); see the sketch below.
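As a rough sketch of (6), not the actual implementation: the pick-up step boils down to an `update_by_query` that can be limited to the saved object types whose mappings changed, instead of touching every document in the index. The client call uses the standard `@elastic/elasticsearch` v8 API; the function and variable names are illustrative.

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Pick up new mappings only for the types that changed, instead of running
// the update_by_query over every document in the index.
async function pickupChangedTypes(index: string, changedTypes: string[]) {
  if (changedTypes.length === 0) return; // nothing changed, nothing to pick up
  await client.updateByQuery({
    index,
    query: { terms: { type: changedTypes } }, // saved objects store their type in the `type` field
    conflicts: 'proceed',
    refresh: true,
  });
}

// With only a dashboard mappings change, the 1m cases documents are left untouched:
// await pickupChangedTypes('.kibana', ['dashboard']);
```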
rudolf added the Team:Core and Feature:Migrations labels Oct 26, 2022
@elasticmachine
Contributor

Pinging @elastic/kibana-core (Team:Core)

rudolf added the loe:x-large (Extra Large Level of Effort) label Oct 26, 2022
gsoldevila added a commit that referenced this issue Nov 28, 2022
Reduce startup time by skipping update mappings step when possible (#145604)

The goal of this PR is to reduce the startup time of the Kibana server by
improving the migration logic.

Fixes #145743
Related to #144035

The migration logic runs systematically at startup, whether or not the
customer is upgrading.
Historically, these steps have been very quick, but we recently learned
that some customers have more than **one million** saved objects stored,
making the overall startup process slow even when there are no migrations
to perform.

This PR specifically targets the case where there are no migrations to
perform, i.e. a Kibana node is started against an ES cluster that is
already up to date with respect to the stack version and the list of plugins.

In this scenario, we aim to skip the `UPDATE_TARGET_MAPPINGS` step of the
migration logic, which internally runs the `updateAndPickupMappings`
method; that method turns out to be expensive when the system indices
contain lots of saved objects.


I also tested the following scenarios locally:

- **Fresh install.** The step is not even run, as the `.kibana` index
did not exist ✅
- **Stack version + list of plugins up to date.** Simply restarting
Kibana after the fresh install. The step is run and leads to `DONE`, as
the md5 hashes match those stored in `.kibana._mapping._meta` ✅
- **Faking re-enabling an old plugin.** I manually removed one of the
MD5 hashes from the stored .kibana._mapping._meta through `curl`, and
then restarted Kibana. The step is run and leads to
`UPDATE_TARGET_MAPPINGS` as it used to before the PR ✅
- **Faking updating a plugin.** Same as the previous one, but altering
an existing md5 stored in the metas. ✅

This is the curl command used to tamper with the stored `_meta`:
```bash
curl -X PUT "kibana:changeme@localhost:9200/.kibana/_mapping?pretty" -H 'Content-Type: application/json' -d'
{
  "_meta": {
      "migrationMappingPropertyHashes": {
        "references": "7997cf5a56cc02bdc9c93361bde732b0",
      }
  }
}
'
```
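For reference, the stored hashes that the skip check compares against can be read back from the index metadata. A small sketch using the Elasticsearch JS client, mirroring the index name and credentials from the curl example above (not part of the PR itself):

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({
  node: 'http://localhost:9200',
  auth: { username: 'kibana', password: 'changeme' },
});

// Print the per-type MD5 hashes stored in .kibana's mapping _meta.
async function printStoredHashes() {
  const response = await client.indices.getMapping({ index: '.kibana' });
  for (const [index, record] of Object.entries(response)) {
    console.log(index, record.mappings?._meta?.migrationMappingPropertyHashes);
  }
}

printStoredHashes().catch(console.error);
```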
gsoldevila referenced this issue Nov 30, 2022
Reduce startup time by skipping update mappings step when possible (#145604) (#146637)

# Backport

This will backport the following commits from `main` to `8.6`:
- [Reduce startup time by skipping update mappings step when possible (#145604)](#145604)

<!--- Backport version: 8.9.7 -->

### Questions ?
Please refer to the [Backport tool
documentation](https://github.com/sqren/backport)

<!--BACKPORT [{"author":{"name":"Gerard
Soldevila","email":"[email protected]"},"sourceCommit":{"committedDate":"2022-11-28T14:34:58Z","message":"Reduce
startup time by skipping update mappings step when possible
(#145604)\n\nThe goal of this PR is to reduce the startup times of
Kibana server by\r\nimproving the migration logic.\r\n\r\nFixes
https://github.com/elastic/kibana/issues/145743\r\nRelated
https://github.com/elastic/kibana/issues/144035)\r\n\r\nThe migration
logic is run systematically at startup, whether the\r\ncustomers are
upgrading or not.\r\nHistorically, these steps have been very quick, but
we recently found\r\nout about some customers that have more than **one
million** Saved\r\nObjects stored, making the overall startup process
slow, even when there\r\nare no migrations to perform.\r\n\r\nThis PR
specifically targets the case where there are no migrations
to\r\nperform, aka a Kibana node is started against an ES cluster that
is\r\nalready up to date wrt stack version and list of
plugins.\r\n\r\nIn this scenario, we aim at skipping the
`UPDATE_TARGET_MAPPINGS` step\r\nof the migration logic, which
internally runs the\r\n`updateAndPickupMappings` method, which turns out
to be expensive if the\r\nsystem indices contain lots of
SO.\r\n\r\n\r\nI locally tested the following scenarios too:\r\n\r\n-
**Fresh install.** The step is not even run, as the `.kibana`
index\r\ndid not exist ✅\r\n- **Stack version + list of plugins up to
date.** Simply restarting\r\nKibana after the fresh install. The step is
run and leads to `DONE`, as\r\nthe md5 hashes match those stored in
`.kibana._mapping._meta` ✅\r\n- **Faking re-enabling an old plugin.** I
manually removed one of the\r\nMD5 hashes from the stored
.kibana._mapping._meta through `curl`, and\r\nthen restarted Kibana. The
step is run and leads to\r\n`UPDATE_TARGET_MAPPINGS` as it used to
before the PR ✅\r\n- **Faking updating a plugin.** Same as the previous
one, but altering\r\nan existing md5 stored in the metas. ✅\r\n\r\nAnd
that is the curl command used to tamper with the stored
_meta:\r\n```bash\r\ncurl -X PUT
\"kibana:changeme@localhost:9200/.kibana/_mapping?pretty\" -H
'Content-Type: application/json' -d'\r\n{\r\n \"_meta\": {\r\n
\"migrationMappingPropertyHashes\": {\r\n \"references\":
\"7997cf5a56cc02bdc9c93361bde732b0\",\r\n }\r\n
}\r\n}\r\n'\r\n```","sha":"b1e18a0414ed99456706119d15173b687c6e7366","branchLabelMapping":{"^v8.7.0$":"main","^v(\\d+).(\\d+).\\d+$":"$1.$2"}},"sourcePullRequest":{"labels":["Team:Core","enhancement","release_note:skip","Feature:Migrations","backport:prev-minor","v8.7.0"],"number":145604,"url":"https://github.com/elastic/kibana/pull/145604","mergeCommit":{"message":"Reduce
startup time by skipping update mappings step when possible
(#145604)\n\nThe goal of this PR is to reduce the startup times of
Kibana server by\r\nimproving the migration logic.\r\n\r\nFixes
https://github.com/elastic/kibana/issues/145743\r\nRelated
https://github.com/elastic/kibana/issues/144035)\r\n\r\nThe migration
logic is run systematically at startup, whether the\r\ncustomers are
upgrading or not.\r\nHistorically, these steps have been very quick, but
we recently found\r\nout about some customers that have more than **one
million** Saved\r\nObjects stored, making the overall startup process
slow, even when there\r\nare no migrations to perform.\r\n\r\nThis PR
specifically targets the case where there are no migrations
to\r\nperform, aka a Kibana node is started against an ES cluster that
is\r\nalready up to date wrt stack version and list of
plugins.\r\n\r\nIn this scenario, we aim at skipping the
`UPDATE_TARGET_MAPPINGS` step\r\nof the migration logic, which
internally runs the\r\n`updateAndPickupMappings` method, which turns out
to be expensive if the\r\nsystem indices contain lots of
SO.\r\n\r\n\r\nI locally tested the following scenarios too:\r\n\r\n-
**Fresh install.** The step is not even run, as the `.kibana`
index\r\ndid not exist ✅\r\n- **Stack version + list of plugins up to
date.** Simply restarting\r\nKibana after the fresh install. The step is
run and leads to `DONE`, as\r\nthe md5 hashes match those stored in
`.kibana._mapping._meta` ✅\r\n- **Faking re-enabling an old plugin.** I
manually removed one of the\r\nMD5 hashes from the stored
.kibana._mapping._meta through `curl`, and\r\nthen restarted Kibana. The
step is run and leads to\r\n`UPDATE_TARGET_MAPPINGS` as it used to
before the PR ✅\r\n- **Faking updating a plugin.** Same as the previous
one, but altering\r\nan existing md5 stored in the metas. ✅\r\n\r\nAnd
that is the curl command used to tamper with the stored
_meta:\r\n```bash\r\ncurl -X PUT
\"kibana:changeme@localhost:9200/.kibana/_mapping?pretty\" -H
'Content-Type: application/json' -d'\r\n{\r\n \"_meta\": {\r\n
\"migrationMappingPropertyHashes\": {\r\n \"references\":
\"7997cf5a56cc02bdc9c93361bde732b0\",\r\n }\r\n
}\r\n}\r\n'\r\n```","sha":"b1e18a0414ed99456706119d15173b687c6e7366"}},"sourceBranch":"main","suggestedTargetBranches":[],"targetPullRequestStates":[{"branch":"main","label":"v8.7.0","labelRegex":"^v8.7.0$","isSourceBranch":true,"state":"MERGED","url":"https://github.com/elastic/kibana/pull/145604","number":145604,"mergeCommit":{"message":"Reduce
startup time by skipping update mappings step when possible
(#145604)\n\nThe goal of this PR is to reduce the startup times of
Kibana server by\r\nimproving the migration logic.\r\n\r\nFixes
https://github.com/elastic/kibana/issues/145743\r\nRelated
https://github.com/elastic/kibana/issues/144035)\r\n\r\nThe migration
logic is run systematically at startup, whether the\r\ncustomers are
upgrading or not.\r\nHistorically, these steps have been very quick, but
we recently found\r\nout about some customers that have more than **one
million** Saved\r\nObjects stored, making the overall startup process
slow, even when there\r\nare no migrations to perform.\r\n\r\nThis PR
specifically targets the case where there are no migrations
to\r\nperform, aka a Kibana node is started against an ES cluster that
is\r\nalready up to date wrt stack version and list of
plugins.\r\n\r\nIn this scenario, we aim at skipping the
`UPDATE_TARGET_MAPPINGS` step\r\nof the migration logic, which
internally runs the\r\n`updateAndPickupMappings` method, which turns out
to be expensive if the\r\nsystem indices contain lots of
SO.\r\n\r\n\r\nI locally tested the following scenarios too:\r\n\r\n-
**Fresh install.** The step is not even run, as the `.kibana`
index\r\ndid not exist ✅\r\n- **Stack version + list of plugins up to
date.** Simply restarting\r\nKibana after the fresh install. The step is
run and leads to `DONE`, as\r\nthe md5 hashes match those stored in
`.kibana._mapping._meta` ✅\r\n- **Faking re-enabling an old plugin.** I
manually removed one of the\r\nMD5 hashes from the stored
.kibana._mapping._meta through `curl`, and\r\nthen restarted Kibana. The
step is run and leads to\r\n`UPDATE_TARGET_MAPPINGS` as it used to
before the PR ✅\r\n- **Faking updating a plugin.** Same as the previous
one, but altering\r\nan existing md5 stored in the metas. ✅\r\n\r\nAnd
that is the curl command used to tamper with the stored
_meta:\r\n```bash\r\ncurl -X PUT
\"kibana:changeme@localhost:9200/.kibana/_mapping?pretty\" -H
'Content-Type: application/json' -d'\r\n{\r\n \"_meta\": {\r\n
\"migrationMappingPropertyHashes\": {\r\n \"references\":
\"7997cf5a56cc02bdc9c93361bde732b0\",\r\n }\r\n
}\r\n}\r\n'\r\n```","sha":"b1e18a0414ed99456706119d15173b687c6e7366"}}]}]
BACKPORT-->
rudolf added the Epic:ScaleMigrations label Jan 17, 2023
@exalate-issue-sync exalate-issue-sync bot changed the title Scale upgrade migrations to millions of saved objects Scale upgrade migrations to millions of saved objects: phase 1, limit migrations Feb 10, 2023
@exalate-issue-sync exalate-issue-sync bot changed the title Scale upgrade migrations to millions of saved objects: phase 1, limit migrations Scale upgrade migrations to millions of saved objects: phase 1-3, limit migrations Feb 10, 2023
@lukeelmers
Member

Is this safe to close now that (4) and (5) are done?

@rudolf rudolf changed the title Scale upgrade migrations to millions of saved objects: phase 1-3, limit migrations Scale upgrade migrations to millions of saved objects Sep 23, 2023
@rudolf
Contributor Author

rudolf commented Sep 23, 2023

Yes, once users upgrade to 8.8, subsequent upgrade migrations should be a lot faster and more scalable.
