[ML] Possibility to end up with .ml-state-write alias on multiple indices #68925

Closed
droberts195 opened this issue Feb 11, 2021 · 2 comments · Fixed by #69039
Labels
>bug :ml Machine learning Team:ML Meta label for the ML team

Comments

@droberts195
Contributor

During QA upgrade testing from 6.8.14 to 7.11.1 we saw an error caused by an attempt to point the .ml-state-write alias at both .ml-state and .ml-state-000001 at the same time. This is the relevant section of the Elasticsearch log:

[Screenshot of the Elasticsearch log, taken 2021-02-11 at 17:28:21]

It looks like the code in MlIndexAndAlias.createIndexAndAliasIfNecessary is working on the basis that no indices exist that match the pattern .ml-state*. The reason for this is the resolver.concreteIndexNames(clusterState, IndicesOptions.lenientExpandOpen(), indexPattern) call combined with the fact that the .ml-state index is temporarily unavailable during the upgrade. Lenient expand open deliberately ignores unavailable indices. For the purposes of determining whether an alias is already present on an index we should not ignore unavailable indices.

@droberts195 droberts195 added >bug :ml Machine learning labels Feb 11, 2021
@elasticmachine elasticmachine added the Team:ML Meta label for the ML team label Feb 11, 2021
@elasticmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

@droberts195
Contributor Author

> The reason for this is the resolver.concreteIndexNames(clusterState, IndicesOptions.lenientExpandOpen(), indexPattern) call combined with the fact that the .ml-state index is temporarily unavailable during the upgrade. Lenient expand open deliberately ignores unavailable indices.

On further investigation this analysis is wrong.

The actual cause is that three threads try to create the index simultaneously, because three jobs need to be restarted on the node after the upgrade. This can be seen from the three "About to create first concrete index" log lines. We do account for this possibility by ignoring ResourceAlreadyExists exceptions, but the logic that subsequently adjusts the alias assumes that the alias doesn't need moving.
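
To make the race easier to picture, here is a minimal sketch in plain Java. It is a toy in-memory model, not the Elasticsearch code or API: `FakeCluster`, `AlreadyExistsException`, `ensureIndexAndAlias` and `simulateUpgrade` are all hypothetical names invented for illustration. It assumes the safe pattern is for every caller to issue an idempotent "move the alias" request whether or not it won the index creation race; whether the eventual fix takes exactly this shape is an assumption.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.CountDownLatch;

// Toy model of the race; none of these types are Elasticsearch classes.
final class AliasRaceSketch {

    // Hypothetical stand-in for a "resource already exists" failure.
    static final class AlreadyExistsException extends RuntimeException {
        AlreadyExistsException(String index) {
            super("index [" + index + "] already exists");
        }
    }

    // Hypothetical in-memory "cluster": a set of indices plus alias -> indices.
    static final class FakeCluster {
        private final Set<String> indices = new HashSet<>();
        private final Map<String, Set<String>> aliasHolders = new HashMap<>();

        synchronized void createIndex(String index) {
            if (!indices.add(index)) {
                throw new AlreadyExistsException(index);
            }
        }

        synchronized void addAlias(String alias, String index) {
            aliasHolders.computeIfAbsent(alias, a -> new HashSet<>()).add(index);
        }

        // Atomically removes the alias from every other index and adds it to the target.
        synchronized void moveAlias(String alias, String target) {
            Set<String> holders = aliasHolders.computeIfAbsent(alias, a -> new HashSet<>());
            holders.clear();
            holders.add(target);
        }

        synchronized Set<String> indicesWithAlias(String alias) {
            return new TreeSet<>(aliasHolders.getOrDefault(alias, new HashSet<>()));
        }
    }

    // Each caller moves the alias even if another thread created the index first.
    static void ensureIndexAndAlias(FakeCluster cluster, String index, String alias) {
        try {
            cluster.createIndex(index);
        } catch (AlreadyExistsException e) {
            // Lost the creation race: the index exists, but the alias may still need moving.
        }
        cluster.moveAlias(alias, index);
    }

    // Starts with the alias on the old index, then lets `jobs` threads race.
    static Set<String> simulateUpgrade(int jobs) {
        FakeCluster cluster = new FakeCluster();
        cluster.createIndex(".ml-state");
        cluster.addAlias(".ml-state-write", ".ml-state");

        CountDownLatch start = new CountDownLatch(1);
        Thread[] threads = new Thread[jobs];
        for (int i = 0; i < jobs; i++) {
            threads[i] = new Thread(() -> {
                try {
                    start.await();
                    ensureIndexAndAlias(cluster, ".ml-state-000001", ".ml-state-write");
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            threads[i].start();
        }
        start.countDown();
        try {
            for (Thread t : threads) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for jobs", e);
        }
        return cluster.indicesWithAlias(".ml-state-write");
    }

    public static void main(String[] args) {
        System.out.println(simulateUpgrade(3));
    }
}
```

In this model the alias ends up on exactly one index regardless of which thread wins, because the move is issued unconditionally and is safe to repeat.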

droberts195 added a commit that referenced this issue Feb 19, 2021
…69039)

When multiple jobs start up together on a node following
an upgrade, each one of them will trigger a check that the
.ml-state* indices are as expected and the .ml-state-write
alias points to the correct index.

There were a couple of flaws in the logic:

1. We were not considering the possibility that one or more
   existing .ml-state* indices might be hidden.
2. If multiple jobs tried to create a .ml-state-000001 index
   simultaneously all but the first would fail.  We accounted
   for this, but then did not follow up with the correct alias
   update request for those index creation requests that
   failed.  This could cause all but one of the jobs starting
   up on the node to spuriously fail.

Both these problems are fixed by this PR.

Fixes #68925
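
The first flaw in the commit message can be illustrated with a similarly simplified sketch. This is a toy model, not Elasticsearch's index name resolver: `HiddenIndexSketch` and `resolve` are hypothetical names, and the choice of which index is hidden is an assumption made purely for the example. It shows how a pattern lookup that skips hidden indices can report "no matching indices" even though one exists.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Toy model of pattern resolution; not the Elasticsearch resolver.
final class HiddenIndexSketch {

    // `indices` maps index name -> hidden flag. Returns the sorted names that
    // start with `prefix`, optionally skipping hidden indices.
    static List<String> resolve(Map<String, Boolean> indices, String prefix, boolean includeHidden) {
        return indices.entrySet().stream()
            .filter(e -> e.getKey().startsWith(prefix))
            .filter(e -> includeHidden || !e.getValue())
            .map(Map.Entry::getKey)
            .sorted()
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Boolean> indices = new LinkedHashMap<>();
        indices.put(".ml-state", true); // assumed hidden, for illustration only

        // Skipping hidden indices makes it look as if nothing matches .ml-state*.
        System.out.println(resolve(indices, ".ml-state", false));
        // Including hidden indices finds the existing index.
        System.out.println(resolve(indices, ".ml-state", true));
    }
}
```

A check that decides whether to create the first concrete index would need the second form of lookup, otherwise it could act as though no .ml-state* index existed.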