Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC Improve saved object migrations algorithm #84333

Merged
merged 12 commits into from
Feb 10, 2021

Conversation

rudolf
Copy link
Contributor

@rudolf rudolf commented Nov 25, 2020

Summary

1. Reindex the legacy index

Because the task manager convertToAlias script rewrites documents' _id we cannot clone the legacy and transform the documents with a script, we have to reindex.

2. Reindex for all migrations

We previously tested that mappings updates between versions of Kibana were compatible i.e. the mappings of a 7.1 index can be upgraded to 7.10 without changing the mappings. However with our recent efforts to reduce the field count we've changed several mappings like (index: false) which means mappings can no longer be updated without a reindex (and we didn't retest our assumption 💥 ).

We could potentially just change the mappings, but I fear that there could be an incompatible mapping change lurking in any of the 6.x.x -> 7.x.x -> 7.11 upgrade paths. So it feels like the least risky option is to just reindex everything.

This has the downside that we're duplicating all data in .kibana on every patch whereas we would previously only do this when required (usually every minor). I feel like this is an acceptable tradeoff:

  • Users are used to their data growing, it will just grow slightly faster now if they upgrade frequently
  • Most kibana indices are small and users with large indices are likely already performing cleanup out of necessity
  • We could optimize the algorithm in a future release to check if any mappings have changed, if not, we can do a clone and if it has we can do a reindex. This would mean that for most patch releases we don't increase the storage costs and our solution has a similar "storage cost" to the old algorithm. However, for the 7.11.0 release this won't make a difference so this optimization can be targeted for 7.12.

3. Reindex + clone to prevent lost deletes

When multiple nodes are performing a reindex we could end up with lost deletes:

  1. Node 1 starts a migration (incl reindex)
  2. Node 2 starts a migration (incl reindex)
  3. Node 1 finishes the migration and accepts a delete operation deleting the saved object from the index
  4. Node 2's reindex will index the deleted document from step (3) without anyway to detect this lost delete

To prevent this lost deleted, I've change the reindex into a reindex + clone step. Deletes will only occur against the cloned index which prevents other reindex operations from adding back the deleted document.

Checklist

Delete any items that are not applicable to this PR.

For maintainers

@rudolf rudolf added Feature:Saved Objects Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc labels Nov 25, 2020
@elasticmachine
Copy link
Contributor

Pinging @elastic/kibana-core (Team:Core)

@rudolf rudolf changed the title Instead of cloning, reindex the legacy index RFC Improve saved object migrations: Instead of cloning, reindex the legacy index Nov 25, 2020
@rudolf rudolf added backport:skip This commit does not require backporting v8.0.0 release_note:skip Skip the PR/issue when compiling release notes labels Nov 25, 2020
{ "remove_index": { "index": ".kibana" } }
{ "add": { "index": ".kibana_pre6.5.0_001", "alias": ".kibana" } },
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the order matters, both operations need to succeed or both fail. But just felt like this order makes it easier to read since we only want to add the alias if remove_index succeeds.

Comment on lines 271 to 276
9. Mark the migration as complete. This is done as a single atomic operation
(requires https://github.com/elastic/elasticsearch/pull/58100) to
guarantee when multiple versions of Kibana are performing the migration in
parallel, only one version will win. E.g. if 7.11 and 7.12 are started in
parallel and migrate from a 7.9 index, either 7.11 or 7.12 should succeed
and accept writes, but not both.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just fixed a grammar mistake guarantees -> guarantee 🤓 no real changes here

@rudolf rudolf requested review from kobelb and joshdover November 25, 2020 14:49
@rudolf rudolf added the project:ResilientSavedObjectMigrations Reduce Kibana upgrade failures by making saved object migrations more resilient label Nov 27, 2020
the legacy index. Use a fixed index name i.e `.kibana_pre6.5.0_001` or
`.kibana_task_manager_pre7.4.0_001`. Ignore index already exists errors.
3. Reindex the legacy index into the new source index with the
`convertToAlias` script if specified. Use `wait_for_completion: false`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the convertToAlias script use deterministic IDs already?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's deterministic, we're just appending the type infront of the existing ID to make it SO compatible.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be advantageous to only use reindexing when there is a migration with convertToAlias script that needs to be applied?

My understanding is that index cloning had some significant performance advantages for larger indices. This performance difference could be important as we try to accommodate more use cases in the SO index (eg. SIEM exception lists)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reindex will only be applied once when a user upgrades from a legacy index and after that we'll always be able to clone the index. So this really only affects users the first time they upgrade from < 6.5 and won't have a future impact when in later versions our indices start growing.

It's possible to make this optimization, but it makes the algorithm somewhat more complex so it feels like the simplicity and consistency is worth it for the relatively small performance impact.

@rudolf rudolf changed the title RFC Improve saved object migrations: Instead of cloning, reindex the legacy index RFC Improve saved object migrations: Instead of cloning, reindex Dec 10, 2020
3. That belong to a type whose mappings were changed by comparing the `migrationMappingPropertyHashes`. (Metadata, unlike the mappings isn't commutative, so there is a small chance that the metadata hashes do not accurately reflect the latest mappings, however, this will just result in an less efficient query).
6. Create a target index with `dynamic: false` on the top-level mappings so that any kind of document can be written to the index. This allows us to write untransformed documents to the index which might have fields which have been removed from the latest mappings defined by the plugin. Define `dynamic:true` mappings for the `migrationVersion` field so that we're still able to search for outdated documents that need to be transformed.
1. Ignore errors if the target index already exists.
7. Reindex the source index into a the new target index. All nodes on the same version will use the same fixed index name e.g. `.kibana_7.10.0_001`. The `001` postfix isn't used by Kibana, but allows for re-indexing an index should this be required by an Elasticsearch upgrade. E.g. re-index `.kibana_7.10.0_001` into `.kibana_7.10.0_002` and point the `.kibana_7.10.0` alias to `.kibana_7.10.0_002`.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a risk of lost deletes here.

  1. Two instances start the migration
  2. Instance (1) finishes the migration
    3 Instance (2) is still busy with a reindex
  3. Instance (1) deletes a document (and acknowledges the delete to the client)
  4. Instance (2)'s reindex operation re-creates the deleted document 💣 💥

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solved with reindex + clone

…g other reindex operations is idempotent

The first version of the reindex block had only the instance which was able to mark the migration as complete
set and remove the write block. This means other instances couldn't know if any reindex operaitons were in
progress if the migration was already marked as complete. It also meant that a failure in this critical step
could result in a permanent write block.
…reventing other reindex operations is idempotent"

This reverts commit 8baf9b1.
@rudolf rudolf changed the title RFC Improve saved object migrations: Instead of cloning, reindex RFC Improve saved object migrations algorithm Jan 11, 2021
Copy link
Contributor

@joshdover joshdover left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intermediate temporary index seems to solve the lost deletes problem well. Overall, this LGTM, thanks @rudolf!

rfcs/text/0013_saved_object_migrations.md Outdated Show resolved Hide resolved
rfcs/text/0013_saved_object_migrations.md Outdated Show resolved Hide resolved
@rudolf
Copy link
Contributor Author

rudolf commented Feb 1, 2021

@elasticmachine merge upstream

@rudolf rudolf merged commit ce441bd into elastic:master Feb 10, 2021
@rudolf rudolf deleted the so-migrations-reindex-legacy branch February 10, 2021 12:55
gmmorris added a commit to gmmorris/kibana that referenced this pull request Feb 10, 2021
* master: (99 commits)
  [Fleet] Use Fleet Server indices in the search bar (elastic#90835)
  [Search Sessions] added an info flyout to session management (elastic#90559)
  [ILM] Revisit searchable snapshot field after new redesign (elastic#90793)
  [Alerting] License Errors on Alert List View (elastic#89920)
  RFC Improve saved object migrations algorithm (elastic#84333)
  [Lens] (Accessibility) Fix focus on drag and drop actions (elastic#90561)
  Use new shortcut links to Fleet discuss forums. (elastic#90786)
  Do not generate an ephemeral encryption key in production. (elastic#81511)
  [Fleet] Use staging registry for snapshot builds (elastic#90327)
  Actually deleting x-pack/tsconfig.refs.json (elastic#90898)
  Add deprecation warning to all Beats CM pages. (elastic#90741)
  skip flaky suite (elastic#90136)
  Revert "Revert "[Metrics UI] Add Metrics Anomaly Alert Type (elastic#89244)"" (elastic#90889)
  remove ref to removed tsconfig file
  [core.logging] Uses host timezone as default (elastic#90368)
  [Maps] remove maps_file_upload plugin and fold public folder into file_upload plugin (elastic#90292)
  Revert "[Metrics UI] Add Metrics Anomaly Alert Type (elastic#89244)"
  [dev-utils/ci-stats] support disabling ship errors (elastic#90851)
  Prefix with / (elastic#90836)
  [Metrics UI] Add Metrics Anomaly Alert Type (elastic#89244)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport:skip This commit does not require backporting Feature:Saved Objects project:ResilientSavedObjectMigrations Reduce Kibana upgrade failures by making saved object migrations more resilient release_note:skip Skip the PR/issue when compiling release notes Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc v8.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants