[ML] Possibility to end up with .ml-state-write alias on multiple indices #68925

droberts195 · 2021-02-11T17:36:25Z

During QA upgrade testing from 6.8.14 to 7.11.1 we saw an error occur when we tried to have the .ml-state-write alias pointing at both .ml-state and .ml-state-000001. This is the relevant section of the Elasticsearch log:

It looks like the code in MlIndexAndAlias.createIndexAndAliasIfNecessary is working on the basis that no indices exist that match the pattern .ml-state*. The reason for this is the resolver.concreteIndexNames(clusterState, IndicesOptions.lenientExpandOpen(), indexPattern) call combined with the fact that the .ml-state index is temporarily unavailable during the upgrade. Lenient expand open deliberately ignores unavailable indices. For the purposes of determining whether an alias is already present on an index we should not ignore unavailable indices.


elasticmachine · 2021-02-11T17:36:27Z

Pinging @elastic/ml-core (Team:ML)

droberts195 · 2021-02-19T13:21:15Z

> The reason for this is the resolver.concreteIndexNames(clusterState, IndicesOptions.lenientExpandOpen(), indexPattern) call combined with the fact that the .ml-state index is temporarily unavailable during the upgrade. Lenient expand open deliberately ignores unavailable indices.

On further investigation this analysis is wrong.

The reason for the problem is that 3 threads simultaneously try to create the index, because 3 jobs need to be restarted on the node after the upgrade. This can be seen from the 3 "About to create first concrete index" log lines. We do account for this possibility by ignoring ResourceAlreadyExists exceptions. But after that, the logic for adjusting the alias assumes that the alias doesn't need moving.
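To make the race concrete, here is a minimal sketch in plain Java. It uses an in-memory stand-in for the cluster state, not Elasticsearch code, and the names `AliasRaceSketch`, `createIndex`, `moveAlias` and `ensureIndexAndAlias` are hypothetical. It assumes the safe pattern is for every caller to tolerate losing the creation race and then issue the same idempotent alias move, regardless of who created the index.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

// Hypothetical in-memory stand-in for cluster state; not Elasticsearch code.
class AliasRaceSketch {
    static final String LEGACY = ".ml-state";
    static final String FIRST = ".ml-state-000001";
    static final String ALIAS = ".ml-state-write";

    // index name -> aliases currently on that index
    private final Map<String, Set<String>> indices = new ConcurrentHashMap<>();

    AliasRaceSketch() {
        // Pre-upgrade layout: the legacy index carries the write alias.
        Set<String> aliases = ConcurrentHashMap.newKeySet();
        aliases.add(ALIAS);
        indices.put(LEGACY, aliases);
    }

    /** Returns true if this caller created the index, false if it already existed. */
    boolean createIndex(String name) {
        return indices.putIfAbsent(name, ConcurrentHashMap.newKeySet()) == null;
    }

    /** Removes the alias from every other index and adds it to target in one step. Idempotent. */
    synchronized void moveAlias(String alias, String target) {
        indices.forEach((index, aliases) -> {
            if (index.equals(target)) aliases.add(alias); else aliases.remove(alias);
        });
    }

    Set<String> indicesWithAlias(String alias) {
        Set<String> result = new TreeSet<>();
        indices.forEach((index, aliases) -> {
            if (aliases.contains(alias)) result.add(index);
        });
        return result;
    }

    /** What each starting job does: tolerate losing the race, then always fix the alias. */
    void ensureIndexAndAlias() {
        createIndex(FIRST);      // "already exists" is acceptable for the losers
        moveAlias(ALIAS, FIRST); // winner and losers issue the same full move
    }

    /** Starts three jobs at the same instant and reports which indices hold the alias. */
    static Set<String> runThreeJobs() {
        AliasRaceSketch cluster = new AliasRaceSketch();
        CountDownLatch start = new CountDownLatch(1);
        Thread[] jobs = new Thread[3];
        for (int i = 0; i < jobs.length; i++) {
            jobs[i] = new Thread(() -> {
                try {
                    start.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                cluster.ensureIndexAndAlias();
            });
            jobs[i].start();
        }
        start.countDown();
        try {
            for (Thread job : jobs) job.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return cluster.indicesWithAlias(ALIAS);
    }

    public static void main(String[] args) {
        System.out.println(runThreeJobs()); // prints [.ml-state-000001]
    }
}
```

Because the alias move does not depend on which thread won the creation race, the alias ends up on exactly one index whatever the interleaving.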

…69039)

When multiple jobs start up together on a node following an upgrade, each one of them will trigger a check that the .ml-state* indices are as expected and the .ml-state-write alias points to the correct index. There were a couple of flaws in the logic:

1. We were not considering the possibility that one or more existing .ml-state* indices might be hidden.
2. If multiple jobs tried to create a .ml-state-000001 index simultaneously all but the first would fail. We accounted for this, but then did not follow up with the correct alias update request for those index creation requests that failed. This could cause all but one of the jobs starting up on the node to spuriously fail.

Both these problems are fixed by this PR.

Fixes #68925
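The first flaw can be illustrated with a small plain-Java model of wildcard resolution. This is only a sketch: `HiddenIndexSketch`, `resolve` and the `expandHidden` flag are hypothetical names rather than Elasticsearch APIs, and treating `.ml-state` as hidden is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of wildcard resolution; these names are not Elasticsearch APIs.
class HiddenIndexSketch {
    /** Resolves a trailing-wildcard pattern such as ".ml-state*" against index name -> hidden flag. */
    static List<String> resolve(Map<String, Boolean> indexToHidden, String pattern, boolean expandHidden) {
        String prefix = pattern.substring(0, pattern.length() - 1); // drop the trailing '*'
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, Boolean> entry : indexToHidden.entrySet()) {
            boolean hidden = entry.getValue();
            // A hidden index only matches the wildcard when the caller asks for hidden indices.
            if (entry.getKey().startsWith(prefix) && (expandHidden || !hidden)) {
                matches.add(entry.getKey());
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        Map<String, Boolean> indices = new LinkedHashMap<>();
        indices.put(".ml-state", true); // assumed hidden for the sake of the example
        System.out.println(resolve(indices, ".ml-state*", false)); // prints []
        System.out.println(resolve(indices, ".ml-state*", true));  // prints [.ml-state]
    }
}
```

With hidden indices excluded, the check sees no existing state index and takes the "create first concrete index" path even though a state index already exists.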

droberts195 added >bug :ml Machine learning labels Feb 11, 2021

elasticmachine added the Team:ML Meta label for the ML team label Feb 11, 2021

droberts195 mentioned this issue Feb 16, 2021

[ML] Fix logic for moving .ml-state-write alias from legacy to new #69039

Merged

droberts195 closed this as completed in #69039 Feb 19, 2021
