
Improve migrations performance by using distinct indices per SO type #104081

Closed
pgayvallet opened this issue Jul 1, 2021 · 7 comments · Fixed by #154888
Labels: discuss · Epic:ScaleMigrations (Scale upgrade migrations to millions of saved objects) · Feature:Saved Objects · loe:x-large (Extra Large Level of Effort) · Team:Core (Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc)

Comments

pgayvallet commented Jul 1, 2021

This is a follow-up on https://elasticco.atlassian.net/browse/KBNA-7838.

In this phase, we propose to split some saved object types out of the .kibana index into separate indices.

While prior work can reduce downtime for some upgrades, if just one saved object type defines a migration, all saved objects still need to be migrated. E.g. there might be no cases migrations defined, but a dashboard migration would still require us to migrate all 1M cases. This change would mean we only migrate the 1M cases if there is a cases migration defined. While this reduces the average downtime per upgrade, it does introduce unpredictability for users, where some upgrades are fast and others cause 10 minutes of downtime.

We have already thought a lot about splitting the .kibana index into multiple ones, either one index per type, or potentially one index per solution (#90817). If we were to go in that direction, we could potentially increase migration speed significantly by only performing the migration process on the indices containing documents that need to be migrated.

As an example, let's say we have 3 SO types, foo, bar and dolly, stored respectively in the .kibana_foo_8_0_0, .kibana_bar_8_0_0 and .kibana_dolly_8_0_0 indices.

During the migration from v8.0.0 to v8.2.1, we detect that we only have migrations to apply to the foo type. We could then apply the migration process to the .kibana_foo_8_0_0 index only, and just clone the other indices to their new version.

We could even go further: if we keep, inside an internal index, the current version (name) of each index, we could potentially avoid cloning the .kibana_bar_8_0_0 and .kibana_dolly_8_0_0 indices altogether, and just have the next version use the exact same indices.

This could drastically reduce the average migration time, especially if we choose to have one index per type (as opposed to one index per grouping of types).
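To illustrate, here is a minimal TypeScript sketch of the per-index decision this would enable. All the names (`IndexState`, `planIndexUpgrade`, etc.) are hypothetical and not part of the actual migration code:

```ts
// Hypothetical sketch of the per-index upgrade decision; not actual Kibana code.
interface IndexState {
  index: string; // e.g. '.kibana_foo_8_0_0'
  types: string[]; // SO types stored in this index
}

type UpgradeAction =
  | { kind: 'migrate' } // documents need transforms: run the full migration
  | { kind: 'clone' } // no transforms needed: clone into the new versioned index
  | { kind: 'reuse' }; // version tracked in an internal index: keep the same index

const planIndexUpgrade = (
  state: IndexState,
  typesWithPendingMigrations: Set<string>,
  versionTrackedInternally: boolean
): UpgradeAction => {
  const needsTransform = state.types.some((type) => typesWithPendingMigrations.has(type));
  if (needsTransform) {
    return { kind: 'migrate' };
  }
  // No transforms to apply: cloning to e.g. '.kibana_bar_8_2_1' is enough, or we can
  // skip even that if the current index name/version is recorded in an internal index.
  return versionTrackedInternally ? { kind: 'reuse' } : { kind: 'clone' };
};

// Example: only `foo` has pending migrations, so only '.kibana_foo_8_0_0' is migrated.
planIndexUpgrade({ index: '.kibana_bar_8_0_0', types: ['bar'] }, new Set(['foo']), false);
// => { kind: 'clone' }
```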

Note that this would have the additional advantage of addressing the problem of the ever-increasing number of registered fields (and dodging, once and for all, the Damocles sword of hitting the field limit), see #70471.

@pgayvallet pgayvallet added discuss Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc Feature:Saved Objects project:ResilientSavedObjectMigrations Reduce Kibana upgrade failures by making saved object migrations more resilient labels Jul 1, 2021
@elasticmachine
Pinging @elastic/kibana-core (Team:Core)

rudolf commented Jul 11, 2022

We should remember that the core usage collector for saved objects currently only collects saved objects per type for the "default saved objects index" https://github.com/elastic/kibana/blob/main/src/plugins/kibana_usage_collection/server/plugin.ts#L126

So we would need to refactor this collector to check the saved objects registry during start to find all the indices of all the types.
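
A rough sketch of what that refactor could look like, assuming the type registry exposes each registered type with its optional target index (the `getAllSavedObjectIndices` helper and the structural type are illustrative, not the real collector code):

```ts
// Illustrative only: resolve every saved objects index from the registered types
// during `start`, instead of hardcoding the default '.kibana' index.
interface RegisteredType {
  name: string;
  indexPattern?: string; // set when the type lives outside the default index (assumed field)
}

const getAllSavedObjectIndices = (
  typeRegistry: { getAllTypes(): RegisteredType[] },
  defaultIndex: string = '.kibana'
): string[] => {
  const indices = typeRegistry.getAllTypes().map((type) => type.indexPattern ?? defaultIndex);
  // Deduplicate so the collector queries each index only once.
  return Array.from(new Set(indices));
};
```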

@rudolf rudolf added the loe:x-large Extra Large Level of Effort label Oct 26, 2022
@exalate-issue-sync exalate-issue-sync bot changed the title [SO Migration] Improve performance by using distinct indices per type migrations: Improve performance by using distinct indices per type Nov 10, 2022
@exalate-issue-sync exalate-issue-sync bot changed the title migrations: Improve performance by using distinct indices per type Improve performance by using distinct indices per type Nov 11, 2022
@exalate-issue-sync exalate-issue-sync bot changed the title Improve performance by using distinct indices per type Design: Improve performance by using distinct indices per type Nov 11, 2022
rudolf commented Dec 21, 2022

Some initial thoughts around the design:
[design sketches: four screenshots, 2022-12-21]

There are several options to accomplish this. Some things could easily be done in one state machine by having the action target all indices at once, like the REINDEX_* steps. But this becomes complex when different indices require different actions, e.g. some indices need to be created for the first time, others need to be migrated from a legacy version, and yet others might already have an index with compatible mappings.

So it feels simplest to keep separate state machines per index, but allow each state machine to block before progressing to the next step until all the other state machines are past a certain step.
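
For illustration, a hedged sketch of such a blocking point between per-index state machines; `createBarrier` is a hypothetical helper, not an existing Kibana utility:

```ts
// Each per-index migrator awaits the barrier when it reaches the synchronization
// step; the promise only resolves once every migrator has reached that step.
const createBarrier = (participants: number) => {
  let arrived = 0;
  let release!: () => void;
  const allArrived = new Promise<void>((resolve) => {
    release = resolve;
  });
  return {
    arrive: (): Promise<void> => {
      arrived += 1;
      if (arrived === participants) {
        release();
      }
      return allArrived;
    },
  };
};

// Usage: one barrier shared by the migrators of '.kibana', '.kibana_task_manager', ...
// const barrier = createBarrier(migrators.length);
// ...inside each state machine, before the step that needs coordination:
// await barrier.arrive();
```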

pgayvallet commented Dec 22, 2022

Following our discussions from yesterday, especially what @gsoldevila proposed, here's my brainstorm on the kind of trade-offs and optimizations we could make:

If we look at the use cases, I think we have two main situations:

  1. we're migrating all types of a given index to the same index
  2. we're migrating some types of a given index to the same index, and some other types from that index to a non-existing index

Now, the main question I have is: Do we want/need to allow doing more than that?

E.g. would we want to support something like:

  • during 8.8, I move case-foo from .kibana to .kibana-cases (which is a new index I would be creating during this 8.8 migration)
  • during 8.9, I move case-bar from .kibana to .kibana-cases (which already exists)

My point being, if we define that:

  1. the action of creating a new index and moving types to it is 'atomic' and bound to a release (so basically you can't move types to an index that already exists)
  2. you can only move from one source index to a new index (e.g. we can't move types from both .kibana and .kibana_task_manager to .kibana_foo)

I think it would significantly lower the needs in terms of synchronization, as I think we could, as @gsoldevila suggested, have most of the work be performed in 'isolation' by a single migrator.

I'm not speaking in terms of technical implementation here (the migration could instantiate other migrators or something similar) but in terms of responsibilities/concerns, as each migrator wouldn't need to sync or interact with the others.

wdyt all? does that make sense?

gsoldevila commented Dec 22, 2022

If we look at the use cases, I think we have two main situations:

  1. we're migrating all types of a given index to the same index
  2. we're migrating some types of a given index to the same index, and some other types from that index to a non-existing index

If we think about how all that fits into serverless, neither of these will take place anymore.
We won't reindex, and we won't move SO types around to different indices either. So, in the end, we'll have a much simpler state machine once serverless is widespread.

I was imagining the "after-split" mode of operation as a sort of v3 migrator that would bring us very close to serverless: a simplified migrator that only knows how to update mappings in place.

With that in mind, I define 2 main scenarios:

  1. We have a new deployment, or one that is already in the v3 (split) state. In this scenario, we start one v3 migrator per "solution" index (or whatever SO type grouping we decide on).
  2. We have a v1 or v2 deployment. In this scenario, I'd tweak the v2 migrator to perform the split, using a single migrator in 'isolation'. This migrator would create all the indices of the split and fill each of them with the corresponding documents. This would be a one-off, expensive operation. After the migration to v3 is finished, consecutive restarts will always use the v3 logic, and no further breaking changes or splitting would be allowed.

This way, we can use the "splitting into separate indices" as an opportunity to take us a step closer to serverless. We'd also isolate the v3 logic in a separate "component" that is easier to maintain and update.
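
As a rough sketch of what that simplified, update-mappings-in-place v3 migrator could boil down to (assuming the 8.x Elasticsearch JS client; `runV3Migration` and its signature are hypothetical):

```ts
import { Client } from '@elastic/elasticsearch';

// A v3 migrator, in essence, would only know how to update mappings in place on an
// existing index: no reindex, no document relocation, compatible changes only.
const runV3Migration = async (
  esClient: Client,
  index: string, // e.g. '.kibana_analytics'
  mappingProperties: Record<string, any> // merged mappings of the SO types in this index
): Promise<void> => {
  await esClient.indices.putMapping({ index, properties: mappingProperties });
};
```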

I think this also fits well with Pierre's suggestion of having moves bound to a release. After giving it some thought, Pierre's proposal is more flexible, as it would allow different moves between indices in the future, which we could turn to if we face issues like mapping explosions. Nevertheless, that would come at the cost of some downtime, and we would also be keeping some (a priori) unnecessary complexity in the code.

rudolf commented Dec 23, 2022

Spoke to @gsoldevila and @jloleysens. We agreed that reducing the code complexity of the migrator (especially the model) would be good. At a high level, we could split up v1/v2/v3 migrations into their own state machines. It would also be useful to have a state machine of state machines, so that INIT and CHECK_MIGRATION_VERSION could benefit from all the retry logic and logging of the state-action machine. It could also help to split out some parts of, e.g., a v2 state machine into sub-state machines.

[illustration from @gsoldevila: screenshot, 2022-12-23]

When we do such a refactoring is debatable. If type splitting doesn't require a lot of risky changes to the state machine that would add a lot of additional complexity, it feels like we should push for reducing downtime first, and then refactor once that's released.

When it comes to what a v3 migrator might look like, this is a basic way to visualise the proposal: a single migrator instance that performs its actions for all the indices it manages, e.g. creates 3 temp indices, clones 3 temp indices, performs the MARK_VERSION_INDEX_READY for 3 indices, etc.

[screenshot: single-migrator proposal, 2022-12-23]

Let's call the independent-migrators proposal (1) and this proposal (2). Both approaches require injecting the typeRegistry so that REINDEX_SOURCE_TO_TEMP_INDEX_BULK can create bulkWrite operations with the correct destination index for each document.

In (1) the coordination happens between the migrators; in (2) the coordination happens within a migrator. But in both cases the coordination is only necessary inside a single Kibana instance; we're not talking about distributed synchronisation across instances.

I think (1) would require far fewer code changes though. We would leave all actions untouched but introduce two new steps, READY_TO_REINDEX_SYNC and DONE_REINDEXING_SYNC. We would also have to inject a function into the migrators that allows them to query the controlState of the other migrators; this would require some changes, but since it doesn't impact the algorithm, these are low-risk changes. Then, in the *_SYNC steps, we would wait indefinitely for all migrators to reach this state. To increase visibility, we would want to model this as retries every 60s, like READY_TO_REINDEX_SYNC -> READY_TO_REINDEX_SYNC, so that we can see the log activity instead of it looking like nothing is happening.
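
A sketch of what one of those *_SYNC steps could look like in the model (the shapes and the `hasEveryoneReachedSync` callback are hypothetical; the real model handles many more states and fields):

```ts
// Illustrative model step: stay in READY_TO_REINDEX_SYNC, retrying every 60s, until
// every other migrator has reached (or passed) the sync point, then move on.
interface MigratorState {
  controlState: 'READY_TO_REINDEX_SYNC' | 'REINDEX_SOURCE_TO_TEMP_OPEN_PIT';
  retryDelayMs: number;
}

const model = (
  state: MigratorState,
  hasEveryoneReachedSync: () => boolean // injected: inspects the other migrators' controlState
): MigratorState => {
  if (state.controlState !== 'READY_TO_REINDEX_SYNC') {
    return state;
  }
  if (hasEveryoneReachedSync()) {
    // All migrators are past the barrier: continue with the next reindexing step.
    return { controlState: 'REINDEX_SOURCE_TO_TEMP_OPEN_PIT', retryDelayMs: 0 };
  }
  // Loop on the same control state so the logs show that the wait is intentional.
  return { controlState: 'READY_TO_REINDEX_SYNC', retryDelayMs: 60_000 };
};
```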

The previous illustrations were somewhat mixing the state machine controlState with the writing of documents into indices, which made them harder to understand, so I updated that to hopefully make it clearer:
[screenshot: updated diagram, 2022-12-23]

I also agree with the previous comments that we don't need to design for a flexible splitting strategy that can change if we decide to split types differently in the future. This is out of scope, and if we can simplify the implementation without it, that'd be fine. But I think it's possible without any further change (illustrated here with the orange arrow, which moves a document out of the cases index and back to the .kibana index).

rudolf commented Dec 23, 2022

We also discussed why a clone is necessary even for a newly split index: it avoids lost deletes when there are multiple instances.

[screenshot: lost-deletes scenario, 2022-12-23]

@rudolf rudolf changed the title Design: Improve performance by using distinct indices per type Improve performance by using distinct indices per type Feb 6, 2023
@rudolf rudolf reopened this Feb 6, 2023
@exalate-issue-sync exalate-issue-sync bot changed the title Improve performance by using distinct indices per type Improve migrations performance by using distinct indices per SO type Feb 10, 2023
@gsoldevila gsoldevila self-assigned this Mar 7, 2023
@rudolf rudolf removed the project:ResilientSavedObjectMigrations Reduce Kibana upgrade failures by making saved object migrations more resilient label Apr 17, 2023
@rudolf rudolf added the Epic:ScaleMigrations Scale upgrade migrations to millions of saved objects label Apr 17, 2023
gsoldevila added a commit that referenced this issue Apr 25, 2023
## Description 

Fix #104081

This PR moves some of the SO types from the `.kibana` index into the following ones:
- `.kibana_alerting_cases`
- `.kibana_analytics`
- `.kibana_security_solution`
- `.kibana_ingest`

This split/reallocation will occur during the `8.8.0` Kibana upgrade (*meaning: from any version older than `8.8.0` to any version greater than or equal to `8.8.0`*).

**The main changes in this PR are:**
- implement the changes required in the SO migration algorithm to
support this reallocation
- update the FTR tools (looking at you esArchiver) to support these new
indices
- update hardcoded references to `.kibana` and usage of the
`core.savedObjects.getKibanaIndex()` to use new APIs to target the
correct index/indices
- update FTR datasets, tests and utilities accordingly

## To reviewers

**Overall estimated risk of regressions: low**

But, still, please take the time to review changes in your code. The
parts of the production code that were the most impacted are the
telemetry collectors, as most of them were performing direct requests
against the `.kibana` index, so we had to adapt them. Most other
contributor-owned changes are in FTR tests and datasets.

If you think a type is misplaced (either we missed some types that
should be moved to a specific index, or some types were moved and
shouldn't have been) please tell us, and we'll fix the reallocation
either in this PR or in a follow-up.

## `.kibana` split

The following new indices are introduced by this PR, with the following SO types being moved to them (any SO type not listed here stays in its current index).

Note: The complete **_type => index_** breakdown is available in [this
spreadsheet](https://docs.google.com/spreadsheets/d/1b_MG_E_aBksZ4Vkd9cVayij1oBpdhvH4XC8NVlChiio/edit#gid=145920788).

#### `.kibana_alerting_cases`
- action
- action_task_params
- alert
- api_key_pending_invalidation
- cases
- cases-comments
- cases-configure
- cases-connector-mappings
- cases-telemetry
- cases-user-actions
- connector_token
- rules-settings
- maintenance-window

#### `.kibana_security_solution`
- csp-rule-template
- endpoint:user-artifact
- endpoint:user-artifact-manifest
- exception-list
- exception-list-agnostic
- osquery-manager-usage-metric
- osquery-pack
- osquery-pack-asset
- osquery-saved-query
- security-rule
- security-solution-signals-migration
- siem-detection-engine-rule-actions
- siem-ui-timeline
- siem-ui-timeline-note
- siem-ui-timeline-pinned-event

#### `.kibana_analytics`

- canvas-element
- canvas-workpad-template
- canvas-workpad
- dashboard
- graph-workspace
- index-pattern
- kql-telemetry
- lens
- lens-ui-telemetry
- map
- search
- search-session
- search-telemetry
- visualization

#### `.kibana_ingest`

- epm-packages
- epm-packages-assets
- fleet-fleet-server-host
- fleet-message-signing-keys
- fleet-preconfiguration-deletion-record
- fleet-proxy
- ingest_manager_settings
- ingest-agent-policies
- ingest-download-sources
- ingest-outputs
- ingest-package-policies

## Tasks / PRs

### Sub-PRs

**Implementation**
- 🟣 #154846
- 🟣 #154892
- 🟣 #154882
- 🟣 #154884
- 🟣 #155155

**Individual index split**
- 🟣 #154897
- 🟣 #155129
- 🟣 #155140
- 🟣 #155130

### Improvements / follow-ups 

- 👷🏼 Extract logic into
[runV2Migration](#154151 (comment))
@gsoldevila
- Make `getCurrentIndexTypesMap` resilient to intermittent failures
#154151 (comment)
- 🚧 Build a more structured
[MigratorSynchronizer](#154151 (comment))
- 🟣 #155035
- 🟣 #155116
- 🟣 #155366
## Reallocation tweaks

Tweaks to the reallocation can be done after the initial merge, as long as they are done before the public release of 8.8:

- `url` should go back to `.kibana` (see
[comment](#154888 (comment)))

## Release Note

For performance purposes, Kibana is now using more system indices to
store its internal data.

The following system indices will be created when upgrading to `8.8.0`:

- `.kibana_alerting_cases`
- `.kibana_analytics`
- `.kibana_security_solution`
- `.kibana_ingest`

---------

Co-authored-by: pgayvallet <[email protected]>
Co-authored-by: Christos Nasikas <[email protected]>
Co-authored-by: kibanamachine <[email protected]>
Co-authored-by: Georgii Gorbachev <[email protected]>
rudolf pushed a commit that referenced this issue Jun 5, 2023
…locating SO documents (#158940)

Fixes #158733

The goal of this modification is to enforce migrators of all indices
involved in a relocation (e.g. as part of the [dot kibana
split](#104081)) to create the
index aliases in the same `updateAliases()` call.

This way, either:
* all the indices involved in the [dot kibana
split](#104081) relocation will
be completely upgraded (with the appropriate aliases).
* or none of them will.