Updating resources on a stack upgrade should be easier #103841
Comments
Pinging @elastic/kibana-core (Team:Core)
FWIW, this will always be the case. Even if we were to expose a specific API to register upgrade hooks or functions, these functions would have to be implemented in a way that takes into account that multiple Kibana nodes can be performing the operation concurrently. There is currently no synchronization between Kibana instances, and no real way to acquire a 'lock' from ES. The SO migration algorithm faces the same problem.
@pgayvallet why will this always be the case? We don't support rolling upgrades, no? So all instances go down, the first one to come back up takes care of the upgrade process, and only when it has completed successfully do the other instances come back up (without executing the upgrade process). What am I missing here?
This assumption is wrong, unfortunately (that would be way too easy). All Kibana instances are allowed to boot at the same time during a migration (this is a supported scenario), and we don't have any synchronization mechanism between instances, so each instance does have to take into consideration that other instances can be performing an upgrade at the same time. I tried to find the document where the whole 'idempotent versus lock' discussion occurred a while ago for SO Migv2, to add more context on all the challenges of introducing a lock mechanism, but I couldn't find it. @joshdover @kobelb maybe you have a better memory than I do?
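To make that constraint concrete, here is a minimal sketch of what an idempotent upgrade step looks like, assuming the v7 `@elastic/elasticsearch` client; the index name is illustrative and this is not actual Kibana code. N instances can race through it safely because "already exists" is treated as success:

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Idempotent "create index" step: if another Kibana instance already created
// the index, we treat that as success, since the resulting cluster state is
// exactly the one we wanted.
async function ensureIndexExists(index: string): Promise<void> {
  try {
    await client.indices.create({ index });
  } catch (err: any) {
    if (err?.meta?.body?.error?.type === 'resource_already_exists_exception') {
      return; // another instance won the race; nothing left to do
    }
    throw err;
  }
}
```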
Besides the necessity of introducing a consensus protocol, there is the problem of blocking Kibana startup - the very reason we deprecated async lifecycles. A plugin-specific async operation shouldn't block or prevent (in case of an exception) other Kibana plugins from starting.
@kobelb Can it benefit from the solution you are designing for the automatic upgrade of the Fleet packages? |
I think this is the most complete write-up we have: https://github.com/elastic/kibana/blob/master/rfcs/text/0013_saved_object_migrations.md#52-single-node-migrations-coordinated-through-a-leaselock Essentially, it's impossible to build a bullet-proof lease/lock on top of Elasticsearch as it is. So in order to use a lock, we'd need to either add Kibana node clustering & master election or work with the Elasticsearch team to provide a first-class lock mechanism.
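To illustrate why, a naive document-based lock looks deceptively simple to acquire; the sketch below (the `.kibana_upgrade_lock` index is made up) shows where it breaks, which is essentially the argument in the linked RFC section:

```ts
import { Client } from '@elastic/elasticsearch';

// Acquisition is the easy part: only one instance can create a document with
// a fixed id. Release is the unsolved part. If the holder crashes, the lock
// is held forever; if we add an expiry to work around that, a paused or
// GC-stalled holder can wake up after expiry and run concurrently with the
// new holder.
async function tryAcquireLock(client: Client, holder: string): Promise<boolean> {
  try {
    await client.create({
      index: '.kibana_upgrade_lock',
      id: 'upgrade',
      body: { holder, acquiredAt: new Date().toISOString() },
    });
    return true;
  } catch (err: any) {
    if (err?.meta?.body?.error?.type === 'version_conflict_engine_exception') {
      return false; // someone else holds the lock
    }
    throw err;
  }
}
```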
Given the above, I'm curious which of these operations would be problematic to build in an idempotent way that could be run on all Kibana nodes during startup.
Also, I'm not sure what's in these specific indices. Is this append-only immutable data, or are these stateful mutable documents? If it's the former, reindexing like this should be pretty safe and straightforward; otherwise, some more thought will need to be put into reindexing this data.
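If it is indeed append-only, a reindex-then-swap along these lines should be safe to retry (a sketch only: index and alias names are illustrative, and the create call would need the same already-exists handling as any step run concurrently):

```ts
import { Client } from '@elastic/elasticsearch';

// Reindex append-only data into a new versioned index, then move the alias
// in a single atomic update so readers and writers never see a gap.
async function migrateIndex(client: Client, alias: string, oldIndex: string, newIndex: string) {
  await client.indices.create({ index: newIndex }); // mappings come from the updated template
  await client.reindex({
    wait_for_completion: true,
    body: { source: { index: oldIndex }, dest: { index: newIndex } },
  });
  await client.indices.updateAliases({
    body: {
      actions: [
        { remove: { index: oldIndex, alias } },
        { add: { index: newIndex, alias, is_write_index: true } },
      ],
    },
  });
}
```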
Building on what @joshdover articulated: ideally, we'd be able to run these migration scripts "exactly once". However, this is a hard problem to solve when working with a distributed system, and we have a distributed system here because Kibana controls the API calls that need to be made against Elasticsearch. One of the common tricks for getting "exactly once" semantics is to couple "at least once" with idempotent operations, which is conceptually what @joshdover is recommending above.

In this situation, we want Kibana to perform the migrations "at least once", but we need idempotent operations in Elasticsearch to ensure that even though these API calls might be made multiple times, they leave Elasticsearch in the same state as if they were only made once.

Kibana can lazily achieve "at least once" by executing the code on literally every start-up, and this is possible right now. However, we could add some optimizations to Kibana to make this more efficient and, once we have a successful completion, no longer execute this code. That is really just a performance optimization though, as we'll still need to anticipate multiple Kibana instances running the migration code in parallel and multiple times consecutively.
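As a sketch of that pattern (names are illustrative, and `ensureIndexExists` is the helper from the earlier sketch): running something like this on every startup of every instance always converges to the same cluster state, because each call is either a plain idempotent PUT or tolerates "already exists":

```ts
import { Client } from '@elastic/elasticsearch';

// "At least once" + idempotent: safe to run on every Kibana startup, on
// every instance, any number of times.
async function setupAssets(client: Client): Promise<void> {
  // PUT _index_template is idempotent; repeating it is harmless.
  await client.indices.putIndexTemplate({
    name: 'my-plugin-template',
    body: {
      index_patterns: ['my-plugin-*'],
      template: {
        mappings: { properties: { '@timestamp': { type: 'date' } } },
      },
    },
  });
  await ensureIndexExists('my-plugin-000001');
}
```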
Even though the problem is still present, the constraints around its resolution have drastically changed (we do need to support rolling upgrades with serverless now), and AFAIK such needs must be addressed on a case-by-case basis (and we do have a few issues open for specific needs). I'll go ahead and close this; feel free to reopen with the updated requirements if necessary.
As part of the RAC project, we are installing various component and index templates, and creating indices/aliases that use these templates. When we roll out a new version of the stack, some of these templates might have changed, and we need to update the mappings of write indices, and rollover/migrate data when needed.
Currently, our only option is to use the setup or start lifecycle. However, these are executed on every Kibana instance, so any upgrade strategy needs to take into account that several Kibana instances might want to upgrade assets at the same time. We could use a task, but we also need to know when an asset upgrade has finished, because we need to block write operations until the upgrade has completed (this might be possible with a task; we're not sure).
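One way to implement that blocking, purely as a sketch assuming a v7-style `@elastic/elasticsearch` client (the `.my-plugin-status` index and document shape are made up, and the upgrade step would write this document last):

```ts
import { Client } from '@elastic/elasticsearch';

// Reject writes until the assets for the expected stack version are in place.
async function assertAssetsUpgraded(client: Client, expectedVersion: string): Promise<void> {
  let version: string | undefined;
  try {
    const res = await client.get({ index: '.my-plugin-status', id: 'assets' });
    version = (res.body._source as { version?: string })?.version;
  } catch (err: any) {
    if (err?.meta?.statusCode !== 404) throw err; // a missing doc just means "not upgraded yet"
  }
  if (version !== expectedVersion) {
    throw new Error('Assets are still being upgraded; write operations are blocked');
  }
}
```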
I'd like us to investigate whether we can make this easier, e.g. by providing a hook that is guaranteed to get executed on one Kibana instance, and an `afterUpgrade` hook that is called on each Kibana instance, or something that allows us to hook into the upgrade process that happens before Kibana starts, in the same vein as the SO migration process.
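For illustration, one possible shape for such an API; this is purely hypothetical, nothing like `registerUpgradeHooks` exists in Kibana core, and the context type is simplified:

```ts
import type { Client } from '@elastic/elasticsearch';

interface UpgradeContext {
  esClient: Client;
  fromVersion: string;
  toVersion: string;
}

interface UpgradeHooks {
  /** Guaranteed to run to completion on exactly one Kibana instance, before `start`. */
  onUpgrade(ctx: UpgradeContext): Promise<void>;
  /** Runs on every instance once `onUpgrade` has completed somewhere. */
  afterUpgrade(ctx: UpgradeContext): Promise<void>;
}

// Hypothetical registration from a plugin's setup phase:
// core.upgrades.registerUpgradeHooks('my-plugin', hooks);
```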