[CI] MlJobSnapshotUpgradeIT.testSnapshotUpgrader #69276

Closed

pugnascotia opened this issue Feb 19, 2021 · 9 comments · Fixed by #80347
Labels
:ml Machine learning Team:ML Meta label for the ML team >test-failure Triaged test failures from CI

Comments

@pugnascotia
Contributor

Build scan:

https://gradle-enterprise.elastic.co/s/eudgfems3awkq

Repro line:

./gradlew ':x-pack:qa:rolling-upgrade:v7.12.0#upgradedClusterTest' -Dtests.class="org.elasticsearch.upgrades.MlJobSnapshotUpgradeIT" -Dtests.method="testSnapshotUpgrader" -Dtests.seed=97DF6D86FBC6515D -Dtests.security.manager=true -Dtests.bwc=true -Dtests.locale=en-CA -Dtests.timezone=Africa/Nouakchott -Druntime.java=8

Reproduces locally?:

No

Applicable branches:

7.x

Failure history:

https://gradle-enterprise.elastic.co/scans/tests?search.relativeStartTime=P7D&search.timeZoneId=Europe/London&tests.container=org.elasticsearch.upgrades.MlJobSnapshotUpgradeIT&tests.sortField=FAILED&tests.test=testSnapshotUpgrader&tests.unstableOnly=true

Failure excerpt:

org.elasticsearch.ElasticsearchStatusException: Elasticsearch exception [type=status_exception, reason=Cannot upgrade job [ml-snapshots-upgrade-job] snapshot [1613739961] as there is currently a snapshot for this job being upgraded]
@pugnascotia pugnascotia added >test-failure Triaged test failures from CI :ml Machine learning v7.12.1 labels Feb 19, 2021
@elasticmachine elasticmachine added the Team:ML Meta label for the ML team label Feb 19, 2021
@elasticmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

@droberts195
Contributor

The exception occurred at 13:10:20:

13:10:20 org.elasticsearch.upgrades.MlJobSnapshotUpgradeIT > testSnapshotUpgrader FAILED
13:10:20     ElasticsearchStatusException[Elasticsearch exception [type=status_exception, reason=Cannot upgrade job [ml-snapshots-upgrade-job] snapshot [1613739961] as there is currently a snapshot for this job being upgraded]]

But the original upgrade was started at 13:09:20:

[2021-02-19T13:09:20,557][INFO ][o.e.x.m.a.TransportUpgradeJobModelSnapshotAction] [v7.12.0-0] [ml-snapshots-upgrade-job] [1613739961] sending start upgrade request
[2021-02-19T13:09:20,592][DEBUG][o.e.x.m.j.JobNodeSelector] [v7.12.0-0] Falling back to allocating job [ml-snapshots-upgrade-job] by job counts because its memory requirement was not available
[2021-02-19T13:09:20,592][DEBUG][o.e.x.m.j.JobNodeSelector] [v7.12.0-0] selected node [{v7.12.0-0}{x3-GV25RSgOwS_Snau5PVg}{hPtI9NQQTd2mMzqUaBUQWQ}{127.0.0.1}{127.0.0.1:44533}{cdfhilmrstw}{ml.machine_memory=101362774016, upgraded=true, xpack.installed=true, transform.node=true, testattr=test, ml.max_open_jobs=20, ml.max_jvm_size=518979584}] for job [ml-snapshots-upgrade-job]
[2021-02-19T13:09:20,666][INFO ][o.e.x.m.j.s.u.SnapshotUpgradeTaskExecutor] [v7.12.0-0] [ml-snapshots-upgrade-job] [1613739961] starting to execute task
[2021-02-19T13:09:20,704][INFO ][o.e.x.m.j.p.a.NativeAutodetectProcessFactory] [v7.12.0-0] Restoring quantiles for job 'ml-snapshots-upgrade-job'
[2021-02-19T13:09:20,705][DEBUG][o.e.x.m.p.NativeController] [v7.12.0-0] Command [1]: starting process with command [./autodetect, --lengthEncodedInput, --maxAnomalyRecords=500, --config=/var/lib/jenkins/workspace/elastic+elasticsearch+7.x+multijob-unix-compatibility/os/opensuse-15-1&&immutable/x-pack/qa/rolling-upgrade/build/testclusters/v7.12.0-0/tmp/config1045921840463316457.json, --quantilesState=/var/lib/jenkins/workspace/elastic+elasticsearch+7.x+multijob-unix-compatibility/os/opensuse-15-1&&immutable/x-pack/qa/rolling-upgrade/build/testclusters/v7.12.0-0/tmp/ml-snapshots-upgrade-job_quantiles_2348509639247300197080.json, --deleteStateFiles, --logPipe=/var/lib/jenkins/workspace/elastic+elasticsearch+7.x+multijob-unix-compatibility/os/opensuse-15-1&&immutable/x-pack/qa/rolling-upgrade/build/testclusters/v7.12.0-0/tmp/autodetect_ml-snapshots-upgrade-job-1613739961_log_614402, --input=/var/lib/jenkins/workspace/elastic+elasticsearch+7.x+multijob-unix-compatibility/os/opensuse-15-1&&immutable/x-pack/qa/rolling-upgrade/build/testclusters/v7.12.0-0/tmp/autodetect_ml-snapshots-upgrade-job-1613739961_input_614402, --inputIsPipe, --output=/var/lib/jenkins/workspace/elastic+elasticsearch+7.x+multijob-unix-compatibility/os/opensuse-15-1&&immutable/x-pack/qa/rolling-upgrade/build/testclusters/v7.12.0-0/tmp/autodetect_ml-snapshots-upgrade-job-1613739961_output_614402, --outputIsPipe, --restore=/var/lib/jenkins/workspace/elastic+elasticsearch+7.x+multijob-unix-compatibility/os/opensuse-15-1&&immutable/x-pack/qa/rolling-upgrade/build/testclusters/v7.12.0-0/tmp/autodetect_ml-snapshots-upgrade-job-1613739961_restore_614402, --restoreIsPipe, --persist=/var/lib/jenkins/workspace/elastic+elasticsearch+7.x+multijob-unix-compatibility/os/opensuse-15-1&&immutable/x-pack/qa/rolling-upgrade/build/testclusters/v7.12.0-0/tmp/autodetect_ml-snapshots-upgrade-job-1613739961_persist_614402, --persistIsPipe, --namedPipeConnectTimeout=10]
[2021-02-19T13:09:21,039][DEBUG][o.e.x.m.p.l.CppLogMessageHandler] [v7.12.0-0] [ml-snapshots-upgrade-job-1613739961] [autodetect/620067] [CLogger.cc@333] Logger is logging to named pipe /var/lib/jenkins/workspace/elastic+elasticsearch+7.x+multijob-unix-compatibility/os/opensuse-15-1&&immutable/x-pack/qa/rolling-upgrade/build/testclusters/v7.12.0-0/tmp/autodetect_ml-snapshots-upgrade-job-1613739961_log_614402
[2021-02-19T13:09:21,040][DEBUG][o.e.x.m.p.l.CppLogMessageHandler] [v7.12.0-0] [ml-snapshots-upgrade-job-1613739961] [autodetect/620067] [Main.cc@150] autodetect (64 bit): Version 7.13.0-SNAPSHOT (Build a113c9756dfdff) Copyright (c) 2021 Elasticsearch BV
[2021-02-19T13:09:21,040][DEBUG][o.e.x.m.p.l.CppLogMessageHandler] [v7.12.0-0] [ml-snapshots-upgrade-job-1613739961] [autodetect/620067] [CProcessPriority_Linux.cc@33] Successfully increased OOM killer adjustment via /proc/self/oom_score_adj
[2021-02-19T13:09:21,040][DEBUG][o.e.x.m.p.l.CppLogMessageHandler] [v7.12.0-0] [ml-snapshots-upgrade-job-1613739961] [autodetect/620067] [CSystemCallFilter_Linux.cc@141] Seccomp BPF filters available
[2021-02-19T13:09:21,041][DEBUG][o.e.x.m.p.l.CppLogMessageHandler] [v7.12.0-0] [ml-snapshots-upgrade-job-1613739961] [autodetect/620067] [CSystemCallFilter_Linux.cc@167] Seccomp BPF installed
[2021-02-19T13:09:21,123][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [v7.12.0-0] [ml-snapshots-upgrade-job-1613739961] [autodetect/620067] [CResourceMonitor.cc@77] Setting model memory limit to 1024 MB
[2021-02-19T13:09:21,123][DEBUG][o.e.x.m.p.l.CppLogMessageHandler] [v7.12.0-0] [ml-snapshots-upgrade-job-1613739961] [autodetect/620067] [CLengthEncodedInputParser.cc@48] Length encoded input parser input is not connected to stdin
[2021-02-19T13:09:21,123][DEBUG][o.e.x.m.p.l.CppLogMessageHandler] [v7.12.0-0] [ml-snapshots-upgrade-job-1613739961] [autodetect/620067] [CHierarchicalResultsNormalizer.cc@290] 0 influencer bucket normalizers, 0 influencer normalizers, 1 partition normalizers, 0 person normalizers and 1 leaf normalizers restored from JSON stream
[2021-02-19T13:09:21,134][TRACE][o.e.x.m.j.p.StateStreamer] [v7.12.0-0] ES API CALL: get ID ml-snapshots-upgrade-job_model_state_1613739961#1 from index .ml-state*
[2021-02-19T13:09:21,142][DEBUG][o.e.x.m.p.l.CppLogMessageHandler] [v7.12.0-0] [ml-snapshots-upgrade-job-1613739961] [autodetect/620067] [CStateDecompressor.cc@169] Explicit end-of-stream marker found in document with index 1
[2021-02-19T13:09:21,142][TRACE][o.e.x.m.j.p.StateStreamer] [v7.12.0-0] ES API CALL: get ID ml-snapshots-upgrade-job_categorizer_state#1 from index .ml-state*
[2021-02-19T13:09:21,142][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [v7.12.0-0] [ml-snapshots-upgrade-job-1613739961] [autodetect/620067] [CAnomalyJob.cc@893] Processing is already complete to time 1491037200
[2021-02-19T13:09:21,144][DEBUG][o.e.x.m.p.l.CppLogMessageHandler] [v7.12.0-0] [ml-snapshots-upgrade-job-1613739961] [autodetect/620067] [CAnomalyDetector.cc@115] CAnomalyDetector(): count by count for '', first time = 0, bucketLength = 3600, m_LastBucketEndTime = 0
[2021-02-19T13:09:21,144][DEBUG][o.e.x.m.p.l.CppLogMessageHandler] [v7.12.0-0] [ml-snapshots-upgrade-job-1613739961] [autodetect/620067] [CAnomalyJob.cc@1043] Restoring state for detector with key '0==individual count/1/0//count////'
[2021-02-19T13:09:21,144][DEBUG][o.e.x.m.p.l.CppLogMessageHandler] [v7.12.0-0] [ml-snapshots-upgrade-job-1613739961] [autodetect/620067] [CAnomalyDetector.cc@115] CAnomalyDetector(): mean value partition=series/foo for 'foo', first time = 0, bucketLength = 3600, m_LastBucketEndTime = 0
[2021-02-19T13:09:21,145][DEBUG][o.e.x.m.p.l.CppLogMessageHandler] [v7.12.0-0] [ml-snapshots-upgrade-job-1613739961] [autodetect/620067] [CAnomalyJob.cc@1043] Restoring state for detector with key '0==individual metric mean/0/0/value///series//foo'
[2021-02-19T13:09:21,146][DEBUG][o.e.x.m.p.l.CppLogMessageHandler] [v7.12.0-0] [ml-snapshots-upgrade-job-1613739961] [autodetect/620067] [CAnomalyJob.cc@829] Finished restoration, with 2 detectors
[2021-02-19T13:09:21,146][DEBUG][o.e.x.m.p.l.CppLogMessageHandler] [v7.12.0-0] [ml-snapshots-upgrade-job-1613739961] [autodetect/620067] [CAnomalyJob.cc@850] Setting lastBucketEndTime to 1491037200 in detector for 'mean value partition=series/foo'
[2021-02-19T13:09:21,147][DEBUG][o.e.x.m.p.l.CppLogMessageHandler] [v7.12.0-0] [ml-snapshots-upgrade-job-1613739961] [autodetect/620067] [CAnomalyJob.cc@850] Setting lastBucketEndTime to 1491037200 in detector for 'count by count'
[2021-02-19T13:09:21,156][DEBUG][o.e.x.m.j.p.a.JobModelSnapshotUpgrader] [v7.12.0-0] [ml-snapshots-upgrade-job] [1613739961] waiting for flush [1]
[2021-02-19T13:09:21,156][DEBUG][o.e.x.m.j.p.a.o.JobSnapshotUpgraderResultProcessor] [v7.12.0-0] [ml-snapshots-upgrade-job] [1613739961] Flush acknowledgement parsed from output for ID 1
[2021-02-19T13:09:21,157][DEBUG][o.e.x.m.j.p.a.JobModelSnapshotUpgrader] [v7.12.0-0] [ml-snapshots-upgrade-job] [1613739961] flush completed [1]
[2021-02-19T13:09:21,158][DEBUG][o.e.x.m.j.p.a.JobModelSnapshotUpgrader] [v7.12.0-0] [ml-snapshots-upgrade-job] [1613739961] flush [1] acknowledged requesting state write
[2021-02-19T13:09:21,235][DEBUG][o.e.x.m.p.l.CppLogMessageHandler] [v7.12.0-0] [ml-snapshots-upgrade-job-1613739961] [autodetect/620067] [CAnomalyJob.cc@1291] Persisted state for 'count by count'
[2021-02-19T13:09:21,235][DEBUG][o.e.x.m.p.l.CppLogMessageHandler] [v7.12.0-0] [ml-snapshots-upgrade-job-1613739961] [autodetect/620067] [CAnomalyJob.cc@1291] Persisted state for 'mean value partition=series/foo'
[2021-02-19T13:09:21,237][DEBUG][o.e.x.m.p.l.CppLogMessageHandler] [v7.12.0-0] [ml-snapshots-upgrade-job-1613739961] [autodetect/620067] [CModelSnapshotJsonWriter.cc@81] Wrote model snapshot report with ID 1613739961 for: State persisted due to job close at 2021-02-19T13:06:01+0000, latest final results at 1491033600
[2021-02-19T13:09:21,240][TRACE][o.e.x.m.j.p.JobResultsPersister] [v7.12.0-0] [ml-snapshots-upgrade-job] ES API CALL: to index .ml-anomalies-.write-ml-snapshots-upgrade-job with ID [ml-snapshots-upgrade-job_model_snapshot_1613739961]
[2021-02-19T13:09:21,251][TRACE][o.e.x.m.p.IndexingStateProcessor] [v7.12.0-0] [ml-snapshots-upgrade-job] Persisting job state document: index [.ml-state-000001], length [8006]

So it looks like:

  1. The first upgrade task didn't end despite appearing to have done everything it needed to do
  2. The test harness retried the request after 60 seconds, and this unexpected retry generated the exception that failed the test
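
To illustrate the two points above, here is a minimal, hypothetical sketch (made-up class and method names, not the production code): the first request registers an in-flight upgrade and never clears it, so the harness's retried request is rejected with the status exception from the failure excerpt.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical, simplified model of the failure mode. The real server tracks upgrade
// tasks in cluster state; a plain set stands in for that bookkeeping here.
public class SnapshotUpgradeRetry {

    private static final Set<String> inFlightUpgrades = ConcurrentHashMap.newKeySet();

    static void upgradeSnapshot(String jobId, String snapshotId) {
        if (inFlightUpgrades.add(jobId + "/" + snapshotId) == false) {
            throw new IllegalStateException("Cannot upgrade job [" + jobId + "] snapshot ["
                + snapshotId + "] as there is currently a snapshot for this job being upgraded");
        }
        // A well-behaved task would remove the entry when it finishes. Omitting the
        // removal mimics the task that never shuts down, so any retry is rejected.
    }

    public static void main(String[] args) {
        upgradeSnapshot("ml-snapshots-upgrade-job", "1613739961"); // first request: task starts but never ends
        upgradeSnapshot("ml-snapshots-upgrade-job", "1613739961"); // retried request: rejected
    }
}
```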

@BigPandaToo
Contributor

Looks like the same issue in 7.12: https://gradle-enterprise.elastic.co/s/oi3b5ucjmpnw4

@cbuescher
Member

Another one today on 7.13: https://gradle-enterprise.elastic.co/s/7n2o6ldfztrwk

@joegallo joegallo removed the v7.12.2 label Jun 30, 2021
@danhermann
Contributor

Another instance in 7.x: https://gradle-enterprise.elastic.co/s/yko2vg2p5pkko

@benwtrent benwtrent self-assigned this Nov 4, 2021
@benwtrent
Member

benwtrent commented Nov 4, 2021

OK, I have confirmed @droberts195's theory above:

[2021-11-04T03:40:42,127][TRACE][o.e.x.m.j.p.JobResultsPersister] [v7.10.2-2] [ml-snapshots-upgrade-job] ES API CALL: to index .ml-anomalies-.write-ml-snapshots-upgrade-job with ID [ml-snapshots-upgrade-job_model_snapshot_1635997051]
[2021-11-04T03:40:42,140][TRACE][o.e.x.m.p.IndexingStateProcessor] [v7.10.2-2] [ml-snapshots-upgrade-job] Persisting job state document: index [.ml-state-000001], length [14017]
[2021-11-04T03:42:12,302][INFO ][o.e.c.c.Coordinator      ] [v7.10.2-2] master node [{v7.10.2-0}{3tEugoPCR5u5aqWpQSQtKg}{jwTqX71yTUS-ogyPvTVPrw}{127.0.0.1}{127.0.0.1:41235}{cdfhilmrstw}{ml.machine_memory=101252943872, upgraded=true, xpack.installed=true, t

Note that the state document was persisted at [2021-11-04T03:40:42,140].

The API call was attempted again in this most recent test run at [2021-11-04T00:42:11,848], roughly 90 seconds later.

Logs from the master node show the task starting at 03:40:41, which lines up with the assigned node:

[2021-11-04T03:40:41,602][INFO ][o.e.x.m.a.TransportUpgradeJobModelSnapshotAction] [v7.10.2-0] [ml-snapshots-upgrade-job] [1635997051] sending start upgrade request
[2021-11-04T03:40:41,603][DEBUG][o.e.x.m.j.JobNodeSelector] [v7.10.2-0] selected node [{v7.10.2-2}{6YADyR23Q_qxiaOw9Izv2w}{5Bo4iGtYQWObWhcWzY1feA}{127.0.0.1}{127.0.0.1:41481}{cdfhilmrstw}{ml.machine_memory=101252943872, upgraded=true, xpack.installed=true, transform.node=true, testattr=test, ml.max_open_jobs=512, ml.max_jvm_size=518979584}] for job [ml-snapshots-upgrade-job]
[2021-11-04T03:42:12.252616337Z] [BUILD] Stopping node

There are no other logs that indicate any weird behavior.

This leaves the following options:

  • The process isn't finishing once the snapshot is stored
  • The listener isn't being notified once the process is finished

I will go through it with a fine-tooth comb to see what turns up.

The task close handler was never called. It logs "finished upgrading snapshot" when it is called, but that log message doesn't appear anywhere.

@benwtrent
Member

Looking at the logs and the code: the native process received the request to write the new state, and the new state was persisted.

So, we made it to this part in the task:

https://github.com/elastic/elasticsearch/blob/master/x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/job/process/autodetect/JobModelSnapshotUpgrader.java#L278-L289

State is persisted, so the callback should have fired on the utility thread.

It may be that we are tying up an executor, which causes an eventual timeout and a failure that isn't logged. Or waiting for the job to finish is genuinely taking that long (which it shouldn't).
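
As a rough, self-contained illustration of the "tying up an executor" idea (illustrative names only, not the production code): a callback queued behind a busy single-threaded pool simply waits, with nothing logged, until the blocker finishes.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Minimal illustration of how a tied-up executor could delay the shutdown callback
// past the test's retry window without anything being logged in the meantime.
public class StarvedCallback {

    public static void main(String[] args) {
        ExecutorService utility = Executors.newSingleThreadExecutor();
        long start = System.nanoTime();

        // Stand-in for whatever might be occupying the only worker thread.
        utility.execute(() -> {
            try {
                Thread.sleep(5_000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // The shutdown callback is queued behind the blocker, so it only runs once the
        // blocker finishes -- in the real failure that delay would exceed 60-90 seconds.
        utility.execute(() -> System.out.printf("shutdown ran after %d ms%n",
            TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start)));

        utility.shutdown();
    }
}
```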

@benwtrent
Member

If I change the line:

(aVoid, e) -> threadPool.executor(UTILITY_THREAD_POOL_NAME).execute(() -> shutdown(e))

to

(aVoid, e) -> shutdown(e)

then the failure reproduces reliably.

I do think there is some thread lock somewhere that causes the shutdown to fail, but we always push the shutdown out to the utility thread pool, so I am not sure where it's occurring.
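
One minimal way the shutdown could still be lost even though it is always pushed to the utility pool (illustrative only, not the production code): if the pool is torn down while the shutdown task is still queued, shutdownNow() discards it and completion is never signalled.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Minimal illustration of the race between scheduling a shutdown task on a pool and
// closing that pool: a queued task that has not started yet is simply discarded.
public class LostShutdownRace {

    public static void main(String[] args) throws InterruptedException {
        ExecutorService utility = Executors.newSingleThreadExecutor();

        // Keep the single worker busy so the shutdown task has to wait in the queue.
        utility.execute(() -> {
            try {
                Thread.sleep(500);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // The completion callback schedules the shutdown on the pool...
        utility.execute(() -> System.out.println("shutdown ran"));

        // ...but the pool is shut down first, so the queued shutdown task is discarded
        // and "shutdown ran" is never printed.
        utility.shutdownNow();
        utility.awaitTermination(1, TimeUnit.SECONDS);
    }
}
```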

benwtrent added a commit that referenced this issue Nov 4, 2021
It is possible that the model snapshot upgrader task is never fully shut down.

This could occur in the following two scenarios:

 - Setting the task state to failed fails (we should kill the task in that situation)
 - The executor doesn't run the scheduled shutdown before we shut the executor down. This is caused by a race condition between scheduling the shutdown and closing the executor.

To the user, the failure looks like a REST request that never returns. In tests, it looks like:

```
13:10:20     ElasticsearchStatusException[Elasticsearch exception [type=status_exception, reason=Cannot upgrade job [ml-snapshots-upgrade-job] snapshot [1613739961] as there is currently a snapshot for this job being upgraded]]
```

The request is then spuriously retried and fails because the previous task never shut down.

closes #69276
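
For the second scenario, one possible guard (a sketch only; the actual change is in #80347) is to fall back to running the shutdown inline when the pool rejects it because it has already been closed:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.RejectedExecutionException;

// Hypothetical helper: schedule the shutdown on the utility pool, but run it on the
// calling thread if the pool has already been closed and rejects the task. This covers
// the execute-after-close case; tasks already queued when shutdownNow() runs would need
// separate handling.
final class ShutdownScheduler {

    static void scheduleShutdown(ExecutorService utility, Runnable shutdown) {
        try {
            utility.execute(shutdown);
        } catch (RejectedExecutionException e) {
            shutdown.run();
        }
    }

    private ShutdownScheduler() {}
}
```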
benwtrent added a commit to benwtrent/elasticsearch that referenced this issue Nov 4, 2021
benwtrent added a commit to benwtrent/elasticsearch that referenced this issue Nov 4, 2021
benwtrent added a commit to benwtrent/elasticsearch that referenced this issue Nov 4, 2021
benwtrent added a commit that referenced this issue Nov 4, 2021
elasticsearchmachine pushed a commit that referenced this issue Nov 4, 2021
elasticsearchmachine pushed a commit that referenced this issue Nov 4, 2021