Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CI] upgraded_cluster/30_ml_jobs_crud/Test open old jobs failed with .ml-state-write not an alias #59011

Closed
droberts195 opened this issue Jul 3, 2020 · 5 comments · Fixed by #59027
Labels
:ml Machine learning >test-failure Triaged test failures from CI

Comments

@droberts195
Copy link
Contributor

Build scan: https://gradle-enterprise.elastic.co/s/svde5ts33nfki

Repro line:

./gradlew ':x-pack:qa:rolling-upgrade:v7.6.2#upgradedClusterTest' --tests "org.elasticsearch.upgrades.UpgradeClusterClientYamlTestSuiteIT.test {p0=upgraded_cluster/30_ml_jobs_crud/Test open old jobs}" \
  -Dtests.seed=95C888B6E2499870 \
  -Dtests.security.manager=true \
  -Dtests.locale=ar-EG \
  -Dtests.timezone=Pacific/Gambier \
  -Druntime.java=8 \
  -Dtests.rest.suite=upgraded_cluster

Reproduces locally?: No

Applicable branches: 7.8 (upgraded from 7.6)

Failure history:

https://build-stats.elastic.co/app/kibana#/discover?_g=(refreshInterval:(pause:!t,value:0),time:(from:'2020-03-03T13:01:09.914Z',mode:absolute,to:'2020-07-03T12:01:15.600Z'))&_a=(columns:!(_source),index:b646ed00-7efc-11e8-bf69-63c8ef516157,interval:auto,query:(language:lucene,query:'%22Test%20open%20old%20jobs%22'),sort:!(process.time-start,desc))

There are many failures of this test over the years, but it has failed for many completely different reasons. The earliest I could find with this problem was on 5th March 2020.

Failure excerpt:

java.lang.AssertionError: Failure at [upgraded_cluster/30_ml_jobs_crud:110]: field [] doesn't have a true value
   Expected: not a string equal to "false" ignoring case |  
   but: was "false"

The relevant bit of the YAML file is:

  - do:
      indices.exists_alias:
        name: ".ml-state-write"
  - is_true: ''

We have seen the problem of .ml-state-write being a concrete index rather than an alias in clusters outside of CI. It seems that, very occasionally, it can happen in CI too.

@droberts195 droberts195 added >test-failure Triaged test failures from CI :ml Machine learning labels Jul 3, 2020
@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core (:ml)

@droberts195
Copy link
Contributor Author

Interestingly, in x-pack/qa/rolling-upgrade/src/test/resources/rest-api-spec/test/old_cluster/30_ml_jobs_crud.yml we have this:

setup:
  - do:
      index:
        index: .ml-state
        id: "dummy-document-to-make-index-creation-idempotent"
        body: >
          {
          }

  - do:
      cluster.health:
        index: [".ml-state"]
        wait_for_status: green

That strikes me as very dodgy, because it implies that our production code couldn't cope with possible race conditions with the creation of the .ml-state index.

More recently this won't even help, which probably explains why these tests sometimes fail now. The index we are actually using is .ml-state-000001 with its write alias .ml-state-write. .ml-state is tolerated as a possible leftover from old versions, but would not be expected to have the write alias .ml-state-write pointing at it in 7.7 and above.

Theory for what happened:

  • Start off in 7.6. .ml-state gets created and we wait for green status, so its replica is assigned and the index is guaranteed to survive a single node being replaced.
  • Upgrade some nodes to 7.8, but at least one still on 7.6. One of the 7.8 nodes decides to create .ml-state-000001 and moves alias .ml-state-write to point at it. The node that takes the primary is a 7.6 node. The replica still hasn't been assigned.
  • Upgrade remaining nodes to 7.8. The 7.6 node that took the .ml-state-000001 index is replaced with a 7.8 node. Since the replica had not yet been assigned the entire .ml-state-000001 index is lost. Since the alias is stored with the index, the .ml-state-write alias is also lost.

@droberts195
Copy link
Contributor Author

We also have quite a few open issues for "all shards failed" against our internal state or stats indices: #54887, #55221, #55807, #57102, and #58841

I am thinking that MlIndexAndAlias.createIndexAndAliasIfNecessary should wait for shards to be allocated for the index it creates before returning. I will create a PR for this.

droberts195 added a commit to droberts195/elasticsearch that referenced this issue Jul 3, 2020
There have been a few test failures that are likely caused by tests
performing actions that use ML indices immediately after the actions
that create those ML indices.  Currently this can result in attempts
to search the newly created index before its shards have initialized.

This change makes the method that creates the internal ML indices
that have been affected by this problem (state and stats) wait for
the shards to be initialized before returning.

Fixes elastic#54887
Fixes elastic#55221
Fixes elastic#55807
Fixes elastic#57102
Fixes elastic#58841
Fixes elastic#59011
@przemekwitek
Copy link
Contributor

That strikes me as very dodgy, because it implies that our production code couldn't cope with possible race conditions with the creation of the .ml-state index.

I've added this setup stanza when I was renaming .ml-state to .ml-state-000001 (see https://github.com/elastic/elasticsearch/pull/52510/files#diff-90e5b005c474a5a63dafd0aa6ce50cbb). Back then I thought the realistic setup for the old (i.e. before upgrade) cluster was to have the .ml-state index created from the template and in a green state.

@droberts195
Copy link
Contributor Author

Back then I thought the realistic setup for the old (i.e. before upgrade) cluster was to have the .ml-state index created from the template and in a green state.

Oh, I see. Yes, I guess it does make sense in the case where the old cluster is on a version that will create .ml-state-000001 itself, because it simulates the case of a .ml-state left over from a much older version. So on the master branch it's fine. But the complication is that in the 7.x/7.8/7.7 branches sometimes the old cluster is going to be on a version that uses .ml-state itself, and in this case the extra setup code is making the old cluster go through a different code path to what it would have gone through if it had created the index itself. The interesting thing is that the setup of the .ml-state index for the old cluster creates the index but not the alias. So this forces the old cluster code to go through the path where it just creates the alias, not the index. Usually when starting from scratch it would create the index and alias simultaneously. So I bet what this failure shows is that we have a race condition on the code path where we need to create the .ml-state-write alias but not the .ml-state index. (The .ml-state-write alias predates ILM on the state index - it was added in 6.7 by #37483.)

droberts195 added a commit that referenced this issue Jul 6, 2020
…#59027)

There have been a few test failures that are likely caused by tests
performing actions that use ML indices immediately after the actions
that create those ML indices.  Currently this can result in attempts
to search the newly created index before its shards have initialized.

This change makes the method that creates the internal ML indices
that have been affected by this problem (state and stats) wait for
the shards to be initialized before returning.

Fixes #54887
Fixes #55221
Fixes #55807
Fixes #57102
Fixes #58841
Fixes #59011
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:ml Machine learning >test-failure Triaged test failures from CI
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants