RFC Improve saved object migrations algorithm #84333

rudolf · 2020-11-25T14:40:11Z

Summary

1. Reindex the legacy index

Because the task manager convertToAlias script rewrites documents' _id we cannot clone the legacy and transform the documents with a script, we have to reindex.

2. Reindex for all migrations

We previously tested that mappings updates between versions of Kibana were compatible i.e. the mappings of a 7.1 index can be upgraded to 7.10 without changing the mappings. However with our recent efforts to reduce the field count we've changed several mappings like (index: false) which means mappings can no longer be updated without a reindex (and we didn't retest our assumption 💥 ).

We could potentially just change the mappings, but I fear that there could be an incompatible mapping change lurking in any of the 6.x.x -> 7.x.x -> 7.11 upgrade paths. So it feels like the least risky option is to just reindex everything.

This has the downside that we're duplicating all data in .kibana on every patch whereas we would previously only do this when required (usually every minor). I feel like this is an acceptable tradeoff:

Users are used to their data growing, it will just grow slightly faster now if they upgrade frequently
Most kibana indices are small and users with large indices are likely already performing cleanup out of necessity
We could optimize the algorithm in a future release to check if any mappings have changed, if not, we can do a clone and if it has we can do a reindex. This would mean that for most patch releases we don't increase the storage costs and our solution has a similar "storage cost" to the old algorithm. However, for the 7.11.0 release this won't make a difference so this optimization can be targeted for 7.12.

3. Reindex + clone to prevent lost deletes

When multiple nodes are performing a reindex we could end up with lost deletes:

Node 1 starts a migration (incl reindex)
Node 2 starts a migration (incl reindex)
Node 1 finishes the migration and accepts a delete operation deleting the saved object from the index
Node 2's reindex will index the deleted document from step (3) without anyway to detect this lost delete

To prevent this lost deleted, I've change the reindex into a reindex + clone step. Deletes will only occur against the cloned index which prevents other reindex operations from adding back the deleted document.

Checklist

Delete any items that are not applicable to this PR.

Any text added follows EUI's writing guidelines, uses sentence case text and includes i18n support
Documentation was added for features that require explanation or tutorials
Unit or functional tests were updated or added to match the most common scenarios
Any UI touched in this PR is usable by keyboard only (learn more about keyboard accessibility)
Any UI touched in this PR does not create any new axe failures (run axe in browser: FF, Chrome)
This renders correctly on smaller devices using a responsive layout. (You can test this in your browser)
This was checked for cross-browser compatibility

For maintainers

This was checked for breaking API changes and was labeled appropriately

elasticmachine · 2020-11-25T14:40:27Z

Pinging @elastic/kibana-core (Team:Core)

rudolf · 2020-11-25T14:43:14Z

rfcs/text/0013_saved_object_migrations.md

            { "remove_index": { "index": ".kibana" } }
+            { "add":  { "index": ".kibana_pre6.5.0_001", "alias": ".kibana" } },


I don't think the order matters, both operations need to succeed or both fail. But just felt like this order makes it easier to read since we only want to add the alias if remove_index succeeds.

rudolf · 2020-11-25T14:44:23Z

rfcs/text/0013_saved_object_migrations.md

+9.  Mark the migration as complete. This is done as a single atomic operation
+    (requires https://github.com/elastic/elasticsearch/pull/58100) to
+    guarantee when multiple versions of Kibana are performing the migration in
+    parallel, only one version will win. E.g. if 7.11 and 7.12 are started in
+    parallel and migrate from a 7.9 index, either 7.11 or 7.12 should succeed
+    and accept writes, but not both.


I just fixed a grammar mistake guarantees -> guarantee 🤓 no real changes here

kobelb · 2020-11-30T17:59:42Z

rfcs/text/0013_saved_object_migrations.md

+      the legacy index. Use a fixed index name i.e `.kibana_pre6.5.0_001` or
+      `.kibana_task_manager_pre7.4.0_001`. Ignore index already exists errors.
+   3. Reindex the legacy index into the new source index with the
+      `convertToAlias` script if specified. Use `wait_for_completion: false`


Does the convertToAlias script use deterministic IDs already?

Yes, it's deterministic, we're just appending the type infront of the existing ID to make it SO compatible.

Would it be advantageous to only use reindexing when there is a migration with convertToAlias script that needs to be applied?

My understanding is that index cloning had some significant performance advantages for larger indices. This performance difference could be important as we try to accommodate more use cases in the SO index (eg. SIEM exception lists)

The reindex will only be applied once when a user upgrades from a legacy index and after that we'll always be able to clone the index. So this really only affects users the first time they upgrade from < 6.5 and won't have a future impact when in later versions our indices start growing.

It's possible to make this optimization, but it makes the algorithm somewhat more complex so it feels like the simplicity and consistency is worth it for the relatively small performance impact.

rudolf · 2020-12-14T15:02:29Z

rfcs/text/0013_saved_object_migrations.md

-      3. That belong to a type whose mappings were changed by comparing the `migrationMappingPropertyHashes`. (Metadata, unlike the mappings isn't commutative, so there is a small chance that the metadata hashes do not accurately reflect the latest mappings, however, this will just result in an less efficient query).
+6. Create a target index with `dynamic: false` on the top-level mappings so that any kind of document can be written to the index. This allows us to write untransformed documents to the index which might have fields which have been removed from the latest mappings defined by the plugin. Define `dynamic:true` mappings for the `migrationVersion` field so that we're still able to search for outdated documents that need to be transformed.
+   1. Ignore errors if the target index already exists.
+7. Reindex the source index into a the new target index. All nodes on the same version will use the same fixed index name e.g. `.kibana_7.10.0_001`. The `001` postfix isn't used by Kibana, but allows for re-indexing an index should this be required by an Elasticsearch upgrade. E.g. re-index `.kibana_7.10.0_001` into `.kibana_7.10.0_002` and point the `.kibana_7.10.0` alias to `.kibana_7.10.0_002`.


There is a risk of lost deletes here.

Two instances start the migration

Instance (1) finishes the migration
3 Instance (2) is still busy with a reindex

Instance (1) deletes a document (and acknowledges the delete to the client)

Instance (2)'s reindex operation re-creates the deleted document 💣 💥

Solved with reindex + clone

…ost deletes

…g other reindex operations is idempotent The first version of the reindex block had only the instance which was able to mark the migration as complete set and remove the write block. This means other instances couldn't know if any reindex operaitons were in progress if the migration was already marked as complete. It also meant that a failure in this critical step could result in a permanent write block.

…reventing other reindex operations is idempotent" This reverts commit 8baf9b1.

…revent lost deletes" This reverts commit d7237ca.

joshdover

The intermediate temporary index seems to solve the lost deletes problem well. Overall, this LGTM, thanks @rudolf!

rfcs/text/0013_saved_object_migrations.md

Co-authored-by: Josh Dover <[email protected]>

rudolf · 2021-02-01T18:02:22Z

@elasticmachine merge upstream

* master: (99 commits) [Fleet] Use Fleet Server indices in the search bar (elastic#90835) [Search Sessions] added an info flyout to session management (elastic#90559) [ILM] Revisit searchable snapshot field after new redesign (elastic#90793) [Alerting] License Errors on Alert List View (elastic#89920) RFC Improve saved object migrations algorithm (elastic#84333) [Lens] (Accessibility) Fix focus on drag and drop actions (elastic#90561) Use new shortcut links to Fleet discuss forums. (elastic#90786) Do not generate an ephemeral encryption key in production. (elastic#81511) [Fleet] Use staging registry for snapshot builds (elastic#90327) Actually deleting x-pack/tsconfig.refs.json (elastic#90898) Add deprecation warning to all Beats CM pages. (elastic#90741) skip flaky suite (elastic#90136) Revert "Revert "[Metrics UI] Add Metrics Anomaly Alert Type (elastic#89244)"" (elastic#90889) remove ref to removed tsconfig file [core.logging] Uses host timezone as default (elastic#90368) [Maps] remove maps_file_upload plugin and fold public folder into file_upload plugin (elastic#90292) Revert "[Metrics UI] Add Metrics Anomaly Alert Type (elastic#89244)" [dev-utils/ci-stats] support disabling ship errors (elastic#90851) Prefix with / (elastic#90836) [Metrics UI] Add Metrics Anomaly Alert Type (elastic#89244) ...

Instead of cloning, reindex legacy index

24e8f7f

rudolf added Feature:Saved Objects Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc labels Nov 25, 2020

rudolf changed the title ~~Instead of cloning, reindex the legacy index~~ RFC Improve saved object migrations: Instead of cloning, reindex the legacy index Nov 25, 2020

rudolf added backport:skip This commit does not require backporting v8.0.0 release_note:skip Skip the PR/issue when compiling release notes labels Nov 25, 2020

rudolf commented Nov 25, 2020

View reviewed changes

rudolf requested review from kobelb and joshdover November 25, 2020 14:49

rudolf added the project:ResilientSavedObjectMigrations Reduce Kibana upgrade failures by making saved object migrations more resilient label Nov 27, 2020

kobelb reviewed Nov 30, 2020

View reviewed changes

Reindex for every v2 migration

edacd24

rudolf changed the title ~~RFC Improve saved object migrations: Instead of cloning, reindex the legacy index~~ RFC Improve saved object migrations: Instead of cloning, reindex Dec 10, 2020

rudolf commented Dec 14, 2020

View reviewed changes

rudolf added 8 commits December 14, 2020 23:51

Merge branch 'master' into so-migrations-reindex-legacy

e92d90b

Use _reindex?require_alias=true and a write block toggle to prevent l…

d7237ca

…ost deletes

Revert "Use a ..._reindex_in_progress alias so that waiting for and p…

4ef8034

…reventing other reindex operations is idempotent" This reverts commit 8baf9b1.

Revert "Use _reindex?require_alias=true and a write block toggle to p…

8190b65

…revent lost deletes" This reverts commit d7237ca.

Use reindex + clone as a way to prevent lost deletes

d98faaa

Merge branch 'master' into so-migrations-reindex-legacy

09c7e25

Fix numbering and ignore index_not_found_exceptionfor temporary index

699d530

rudolf changed the title ~~RFC Improve saved object migrations: Instead of cloning, reindex~~ RFC Improve saved object migrations algorithm Jan 11, 2021

joshdover approved these changes Jan 12, 2021

View reviewed changes

rfcs/text/0013_saved_object_migrations.md Outdated Show resolved Hide resolved

rfcs/text/0013_saved_object_migrations.md Outdated Show resolved Hide resolved

Apply suggestions from code review

4639e47

Co-authored-by: Josh Dover <[email protected]>

rudolf mentioned this pull request Jan 14, 2021

Handle another node already having deleted the temporary index #88332

Merged

8 tasks

Merge branch 'master' into so-migrations-reindex-legacy

491e0a9

rudolf merged commit ce441bd into elastic:master Feb 10, 2021

rudolf deleted the so-migrations-reindex-legacy branch February 10, 2021 12:55

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RFC Improve saved object migrations algorithm #84333

RFC Improve saved object migrations algorithm #84333

rudolf commented Nov 25, 2020 •

edited

Loading

elasticmachine commented Nov 25, 2020

rudolf Nov 25, 2020

rudolf Nov 25, 2020

kobelb Nov 30, 2020

rudolf Dec 2, 2020

joshdover Dec 2, 2020

rudolf Dec 4, 2020

rudolf Dec 14, 2020

rudolf Jan 11, 2021

joshdover left a comment

rudolf commented Feb 1, 2021

		{ "remove_index": { "index": ".kibana" } }
		{ "add": { "index": ".kibana_pre6.5.0_001", "alias": ".kibana" } },

RFC Improve saved object migrations algorithm #84333

RFC Improve saved object migrations algorithm #84333

Conversation

rudolf commented Nov 25, 2020 • edited Loading

Summary

1. Reindex the legacy index

2. Reindex for all migrations

3. Reindex + clone to prevent lost deletes

Checklist

For maintainers

elasticmachine commented Nov 25, 2020

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

joshdover left a comment

Choose a reason for hiding this comment

rudolf commented Feb 1, 2021

rudolf commented Nov 25, 2020 •

edited

Loading