[Migrations] Update all aliases with a single updateAliases() when relocating SO documents #158940

gsoldevila · 2023-06-02T15:13:21Z

The goal of this change is to make the migrators of all indices involved in a relocation (e.g. as part of the dot kibana split) create their index aliases in the same updateAliases() call.

This way, either:

all the indices involved in the dot kibana split relocation will be completely upgraded (with the appropriate aliases),
or none of them will.

gsoldevila · 2023-06-02T15:57:05Z

packages/core/saved-objects/core-saved-objects-migration-server-internal/src/next.ts

@@ -242,6 +249,12 @@ export const nextActionMap = (
      }),
    MARK_VERSION_INDEX_READY: (state: MarkVersionIndexReady) =>
      Actions.updateAliases({ client, aliasActions: state.versionIndexReadyActions.value }),
+    MARK_VERSION_INDEX_READY_SYNC: (state: MarkVersionIndexReady) =>


Here's where the synchronization magic happens:

All migrators involved in a relocation will wait for the rest to reach this point.

They will then provide the aliases that they intend to update (in the payload property).

The updateRelocationAliases synchronisation object will then call client.indices.updateAliases exactly once.

Each of the migrators will receive the same response for the global update.
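
The steps above can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: `createWaitGroup`, `AliasAction`, `fakeUpdateAliases` and the index names are hypothetical stand-ins, whereas the real code goes through Actions.synchronizeMigrators and the Elasticsearch client.

```typescript
// Hypothetical alias action shape, for illustration only.
type AliasAction = { add: { index: string; alias: string } };

// A wait group resolves once every participant has contributed its payload,
// then runs the shared callback once and hands every participant the same result.
function createWaitGroup<T, U>(size: number, onAllReady: (payloads: T[]) => Promise<U>) {
  const payloads: T[] = [];
  let resolveAll!: (all: T[]) => void;
  const allReady = new Promise<T[]>((resolve) => {
    resolveAll = resolve;
  });
  // The callback is attached a single time, so it runs a single time.
  const result = allReady.then(onAllReady);
  return {
    join(payload: T): Promise<U> {
      payloads.push(payload);
      if (payloads.length === size) resolveAll(payloads);
      return result;
    },
  };
}

// Stand-in for the single updateAliases() request; it only counts calls.
let updateCalls = 0;
async function fakeUpdateAliases(actions: AliasAction[]): Promise<string> {
  updateCalls++;
  return `updated ${actions.length} aliases`;
}

const waitGroup = createWaitGroup<AliasAction[], string>(2, (all) =>
  // Flatten the per-migrator action lists into one batch.
  fakeUpdateAliases(([] as AliasAction[]).concat(...all))
);

// Each migrator joins with its own alias actions and awaits the shared response.
const responses = Promise.all([
  waitGroup.join([{ add: { index: 'index_a_001', alias: 'index_a' } }]),
  waitGroup.join([{ add: { index: 'index_b_001', alias: 'index_b' } }]),
]);
```

Because every participant awaits the same promise, a failure of the batched call would also surface in every migrator, which matches the "all or none" intent of the change.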

gsoldevila · 2023-06-02T15:57:55Z

packages/core/saved-objects/core-saved-objects-migration-server-internal/src/model/model.ts

@@ -1458,7 +1470,9 @@ export const model = (currentState: State, resW: ResponseType<AllActionStates>):
      // index.
      return {
        ...stateP,
-        controlState: 'MARK_VERSION_INDEX_READY',
+        controlState: stateP.mustRelocateDocuments
+          ? 'MARK_VERSION_INDEX_READY_SYNC'


All migrators involved in a relocation will update the aliases simultaneously.

gsoldevila · 2023-06-02T16:00:21Z

...core/saved-objects/core-saved-objects-migration-server-internal/src/kibana_migrator_utils.ts

+  return new Defer<T>();
+}
+
+export function createWaitGroupMap<T, U>(


Renamed all "defer" objects to "waitGroups" following Pierre's suggestion (less confusing).

afharo

LGTM! However, I'm not a migration expert yet. So I'd leave the LGTM to someone else :)

afharo · 2023-06-02T15:55:42Z

packages/core/saved-objects/core-saved-objects-migration-server-internal/src/model/model.ts

+          throwDelayMillis: 1000, // another migrator has failed for a reason, let it take Kibana down and log its problem
+        };
+      } else {
+        throwBadResponse(stateP, left);


Looking at the line below, should this be return throwBadResponse(...) (mind the return)?

Good point! I hadn't noticed, but the code has a few of each. I suppose it doesn't matter, because throwBadResponse throws an error anyway. We can choose one and make it consistent.

With the tight deadline, I think we should not try to decide which form we prefer and make it consistent in this PR; TypeScript has us covered either way.
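
For context on why both forms type-check: when a function is declared with an explicit `never` return type, TypeScript treats a call to it as the end of the code path, so the variant with `return` and the variant without behave the same. A minimal sketch, where `throwBad`, `SOME_STATE` and the `Left`/`Right` shapes are hypothetical stand-ins rather than the Kibana helpers:

```typescript
type Left = { _tag: 'Left'; left: { type: string } };
type Right = { _tag: 'Right'; right: string };

// Hypothetical stand-in: the explicit `never` return type is what lets the
// compiler treat a call to this function as unreachable afterwards.
function throwBad(controlState: string, res: unknown): never {
  throw new Error(`${controlState} received unexpected response: ${JSON.stringify(res)}`);
}

function withReturn(res: Left | Right): string {
  if (res._tag === 'Right') {
    return res.right;
  } else {
    return throwBad('SOME_STATE', res.left);
  }
}

function withoutReturn(res: Left | Right): string {
  if (res._tag === 'Right') {
    return res.right;
  } else {
    // No `return` here: this still compiles, because the call never completes normally.
    throwBad('SOME_STATE', res.left);
  }
}
```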

afharo · 2023-06-02T15:58:21Z

packages/core/saved-objects/core-saved-objects-migration-server-internal/src/next.ts

@@ -242,6 +249,12 @@ export const nextActionMap = (
      }),
    MARK_VERSION_INDEX_READY: (state: MarkVersionIndexReady) =>
      Actions.updateAliases({ client, aliasActions: state.versionIndexReadyActions.value }),
+    MARK_VERSION_INDEX_READY_SYNC: (state: MarkVersionIndexReady) =>
+      Actions.synchronizeMigrators({
+        waitGroup: updateRelocationAliases,


Q: Could it happen that 1 alias creation fails (while the others succeed) and it triggers a whole re-run of the migrations on the next restart?

Should we create all aliases in the same call?

afharo · 2023-06-02T16:00:41Z

...ages/core/saved-objects/core-saved-objects-migration-server-internal/src/run_v2_migration.ts

+      updateAliases({
+        client: options.elasticsearchClient,
+        aliasActions: allAliasActions.flat(),
+      })()


oh! this is the response to my previous comment, right? 😅

Yes, that's correct!

There's a single call to update them all.
In fact, they all await on something like:

(Promise.all([migrator1, migrator2, migratorN]).then(updateAliases))

This is one of the simplest solutions I came up with, but perhaps we can create a separate issue to refactor and make something more centralised, some sort of SynchronizationManager.

I had the same "Oh!" moment 😅 I would not come here to look for an updateAliases action call. I think it's fine to merge as-is and rather spend extra time on testing.

But I think it would be worth exploring if we could make this less "surprising". One way might be to let every migrator call the updateAliases action. Before we had 6 update aliases calls with each call doing one index/alias. Now we'd have 1 batch call and 5 no-ops.

Another option could be to use the synchronizeMigrators.then hook but only let the .kibana migrator do the updateAliases call, other migrators have a no-op then hook.

gsoldevila · 2023-06-02T16:03:04Z

packages/core/saved-objects/core-saved-objects-migration-server-internal/src/model/model.ts

@@ -1474,9 +1488,19 @@ export const model = (currentState: State, resW: ResponseType<AllActionStates>):
  } else if (stateP.controlState === 'CREATE_NEW_TARGET') {
    const res = resW as ExcludeRetryableEsError<ResponseType<typeof stateP.controlState>>;
    if (Either.isRight(res)) {
+      if (res.right === 'index_already_exists') {


This is the second fix of the PR:
If we are creating a new index but the index already exists:

it probably belongs to a previous failed upgrade

which managed to create the index

but failed before it could create the aliases

In this scenario, instead of simply completing the migration, we will attempt to update mappings first.
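
The recovery path described above can be sketched as a small state transition. This is an illustration under assumptions: the `Either` shape and the `nextControlState` helper are hypothetical, the state names are placeholders, and only the 'index_already_exists' value comes from the diff. The real model.ts handles many more cases.

```typescript
type Either<L, R> = { _tag: 'Left'; left: L } | { _tag: 'Right'; right: R };

// Assumed response values; only 'index_already_exists' appears in the diff.
type CreateIndexResponse = 'create_index_succeeded' | 'index_already_exists';

// Placeholder names for the follow-up states.
type NextState = 'UPDATE_MAPPINGS_FIRST' | 'CONTINUE_MIGRATION' | 'FATAL';

function nextControlState(res: Either<{ type: string }, CreateIndexResponse>): NextState {
  if (res._tag === 'Right') {
    if (res.right === 'index_already_exists') {
      // Leftover index from a failed attempt: its mappings may be stale,
      // so bring them up to date before the aliases are created.
      return 'UPDATE_MAPPINGS_FIRST';
    }
    // The index was freshly created: carry on with the usual flow.
    return 'CONTINUE_MIGRATION';
  }
  // Any Left response is treated as an error in this sketch.
  return 'FATAL';
}
```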

I was wondering why index_already_exists was a right response.
Strictly speaking, if the index exists and it shouldn't, then the previous migration failure would be an error state (i.e. a left response). But I see what we're doing here: migrations skip over index creation and move on. Hence, it's a recoverable 'unexpected' case.
It makes sense now, great!

gsoldevila · 2023-06-02T16:25:57Z

...ore/saved-objects/core-saved-objects-migration-server-internal/src/actions/update_aliases.ts

  () => {
+    if (!aliasActions || !aliasActions.length) throw Error('updating NO aliases!');


Oops, that was a debug statement, removing!

lukeelmers

Let's get some more eyes on it, but the sync process and the handling for index_already_exists all make sense to me. Thanks for jumping on this so quickly @gsoldevila ❤️

lukeelmers · 2023-06-02T22:00:02Z

packages/core/saved-objects/core-saved-objects-migration-server-internal/src/model/model.ts

        return { ...stateP, controlState: 'LEGACY_CREATE_REINDEX_TARGET' };
      } else {
-        // @ts-expect-error TS doesn't correctly narrow this type to never
-        return throwBadResponse(stateP, res);
+        throwBadResponse(stateP, left);


I assume it was just a mistake that we were previously throwing res here instead of res.left, and that we aren't losing anything important in our error?

In this branch of the conditional, res should not contain anything apart from left.

I looked at the code that can throw this exception:
packages/core/saved-objects/core-saved-objects-migration-server-internal/src/actions/set_write_block.ts

And I can confirm, the only possible Left responses are IndexNotFound | RetryableEsClientError, none of them have anything else.
Finally, the throwBadResponse only JSON.stringifies the received object, so I believe it's completely safe to delete.

yeah we're just helping TS see what we already know by giving it the const left = res.left; hint.

TinaHeiligers

Changes LGTM.

kibana-ci · 2023-06-04T16:40:51Z

💚 Build Succeeded

Buildkite Build
Commit: 23b4e39



This PR addresses remarks and feedback from #158940, which was part of an emergency release.

This PR adds #158733 to the list of known issues:
* issue: #158733
* pull: #158940

Co-authored-by: James Rodewig

#159221)

# Backport

This will backport the following commits from `8.8` to `main`:
- [[DOCS+] Add #158940 to the list of 8.8.0 known issues (#159197)](#159197)

### Questions ?
Please refer to the [Backport tool documentation](https://github.com/sqren/backport)

Co-authored-by: Gerard Soldevila

kibanamachine · 2024-05-22T21:48:47Z