[CI] MlJobSnapshotUpgradeIT.testSnapshotUpgrader #69276

pugnascotia · 2021-02-19T14:44:54Z

Build scan:

https://gradle-enterprise.elastic.co/s/eudgfems3awkq

Repro line:

./gradlew ':x-pack:qa:rolling-upgrade:v7.12.0#upgradedClusterTest' -Dtests.class="org.elasticsearch.upgrades.MlJobSnapshotUpgradeIT" -Dtests.method="testSnapshotUpgrader" -Dtests.seed=97DF6D86FBC6515D -Dtests.security.manager=true -Dtests.bwc=true -Dtests.locale=en-CA -Dtests.timezone=Africa/Nouakchott -Druntime.java=8

Reproduces locally?:

No

Applicable branches:

7.x

Failure history:

https://gradle-enterprise.elastic.co/scans/tests?search.relativeStartTime=P7D&search.timeZoneId=Europe/London&tests.container=org.elasticsearch.upgrades.MlJobSnapshotUpgradeIT&tests.sortField=FAILED&tests.test=testSnapshotUpgrader&tests.unstableOnly=true

Failure excerpt:

org.elasticsearch.ElasticsearchStatusException: Elasticsearch exception [type=status_exception, reason=Cannot upgrade job [ml-snapshots-upgrade-job] snapshot [1613739961] as there is currently a snapshot for this job being upgraded]


elasticmachine · 2021-02-19T14:44:57Z

Pinging @elastic/ml-core (Team:ML)

droberts195 · 2021-02-19T16:28:35Z

The exception occurred at 13:10:20:

13:10:20 org.elasticsearch.upgrades.MlJobSnapshotUpgradeIT > testSnapshotUpgrader FAILED
13:10:20     ElasticsearchStatusException[Elasticsearch exception [type=status_exception, reason=Cannot upgrade job [ml-snapshots-upgrade-job] snapshot [1613739961] as there is currently a snapshot for this job being upgraded]]

But the original upgrade was started at 13:09:20:

[2021-02-19T13:09:20,557][INFO ][o.e.x.m.a.TransportUpgradeJobModelSnapshotAction] [v7.12.0-0] [ml-snapshots-upgrade-job] [1613739961] sending start upgrade request
[2021-02-19T13:09:20,592][DEBUG][o.e.x.m.j.JobNodeSelector] [v7.12.0-0] Falling back to allocating job [ml-snapshots-upgrade-job] by job counts because its memory requirement was not available
[2021-02-19T13:09:20,592][DEBUG][o.e.x.m.j.JobNodeSelector] [v7.12.0-0] selected node [{v7.12.0-0}{x3-GV25RSgOwS_Snau5PVg}{hPtI9NQQTd2mMzqUaBUQWQ}{127.0.0.1}{127.0.0.1:44533}{cdfhilmrstw}{ml.machine_memory=101362774016, upgraded=true, xpack.installed=true, transform.node=true, testattr=test, ml.max_open_jobs=20, ml.max_jvm_size=518979584}] for job [ml-snapshots-upgrade-job]
[2021-02-19T13:09:20,666][INFO ][o.e.x.m.j.s.u.SnapshotUpgradeTaskExecutor] [v7.12.0-0] [ml-snapshots-upgrade-job] [1613739961] starting to execute task
[2021-02-19T13:09:20,704][INFO ][o.e.x.m.j.p.a.NativeAutodetectProcessFactory] [v7.12.0-0] Restoring quantiles for job 'ml-snapshots-upgrade-job'
[2021-02-19T13:09:20,705][DEBUG][o.e.x.m.p.NativeController] [v7.12.0-0] Command [1]: starting process with command [./autodetect, --lengthEncodedInput, --maxAnomalyRecords=500, --config=/var/lib/jenkins/workspace/elastic+elasticsearch+7.x+multijob-unix-compatibility/os/opensuse-15-1&&immutable/x-pack/qa/rolling-upgrade/build/testclusters/v7.12.0-0/tmp/config1045921840463316457.json, --quantilesState=/var/lib/jenkins/workspace/elastic+elasticsearch+7.x+multijob-unix-compatibility/os/opensuse-15-1&&immutable/x-pack/qa/rolling-upgrade/build/testclusters/v7.12.0-0/tmp/ml-snapshots-upgrade-job_quantiles_2348509639247300197080.json, --deleteStateFiles, --logPipe=/var/lib/jenkins/workspace/elastic+elasticsearch+7.x+multijob-unix-compatibility/os/opensuse-15-1&&immutable/x-pack/qa/rolling-upgrade/build/testclusters/v7.12.0-0/tmp/autodetect_ml-snapshots-upgrade-job-1613739961_log_614402, --input=/var/lib/jenkins/workspace/elastic+elasticsearch+7.x+multijob-unix-compatibility/os/opensuse-15-1&&immutable/x-pack/qa/rolling-upgrade/build/testclusters/v7.12.0-0/tmp/autodetect_ml-snapshots-upgrade-job-1613739961_input_614402, --inputIsPipe, --output=/var/lib/jenkins/workspace/elastic+elasticsearch+7.x+multijob-unix-compatibility/os/opensuse-15-1&&immutable/x-pack/qa/rolling-upgrade/build/testclusters/v7.12.0-0/tmp/autodetect_ml-snapshots-upgrade-job-1613739961_output_614402, --outputIsPipe, --restore=/var/lib/jenkins/workspace/elastic+elasticsearch+7.x+multijob-unix-compatibility/os/opensuse-15-1&&immutable/x-pack/qa/rolling-upgrade/build/testclusters/v7.12.0-0/tmp/autodetect_ml-snapshots-upgrade-job-1613739961_restore_614402, --restoreIsPipe, --persist=/var/lib/jenkins/workspace/elastic+elasticsearch+7.x+multijob-unix-compatibility/os/opensuse-15-1&&immutable/x-pack/qa/rolling-upgrade/build/testclusters/v7.12.0-0/tmp/autodetect_ml-snapshots-upgrade-job-1613739961_persist_614402, --persistIsPipe, --namedPipeConnectTimeout=10]
[2021-02-19T13:09:21,039][DEBUG][o.e.x.m.p.l.CppLogMessageHandler] [v7.12.0-0] [ml-snapshots-upgrade-job-1613739961] [autodetect/620067] [CLogger.cc@333] Logger is logging to named pipe /var/lib/jenkins/workspace/elastic+elasticsearch+7.x+multijob-unix-compatibility/os/opensuse-15-1&&immutable/x-pack/qa/rolling-upgrade/build/testclusters/v7.12.0-0/tmp/autodetect_ml-snapshots-upgrade-job-1613739961_log_614402
[2021-02-19T13:09:21,040][DEBUG][o.e.x.m.p.l.CppLogMessageHandler] [v7.12.0-0] [ml-snapshots-upgrade-job-1613739961] [autodetect/620067] [Main.cc@150] autodetect (64 bit): Version 7.13.0-SNAPSHOT (Build a113c9756dfdff) Copyright (c) 2021 Elasticsearch BV
[2021-02-19T13:09:21,040][DEBUG][o.e.x.m.p.l.CppLogMessageHandler] [v7.12.0-0] [ml-snapshots-upgrade-job-1613739961] [autodetect/620067] [CProcessPriority_Linux.cc@33] Successfully increased OOM killer adjustment via /proc/self/oom_score_adj
[2021-02-19T13:09:21,040][DEBUG][o.e.x.m.p.l.CppLogMessageHandler] [v7.12.0-0] [ml-snapshots-upgrade-job-1613739961] [autodetect/620067] [CSystemCallFilter_Linux.cc@141] Seccomp BPF filters available
[2021-02-19T13:09:21,041][DEBUG][o.e.x.m.p.l.CppLogMessageHandler] [v7.12.0-0] [ml-snapshots-upgrade-job-1613739961] [autodetect/620067] [CSystemCallFilter_Linux.cc@167] Seccomp BPF installed
[2021-02-19T13:09:21,123][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [v7.12.0-0] [ml-snapshots-upgrade-job-1613739961] [autodetect/620067] [CResourceMonitor.cc@77] Setting model memory limit to 1024 MB
[2021-02-19T13:09:21,123][DEBUG][o.e.x.m.p.l.CppLogMessageHandler] [v7.12.0-0] [ml-snapshots-upgrade-job-1613739961] [autodetect/620067] [CLengthEncodedInputParser.cc@48] Length encoded input parser input is not connected to stdin
[2021-02-19T13:09:21,123][DEBUG][o.e.x.m.p.l.CppLogMessageHandler] [v7.12.0-0] [ml-snapshots-upgrade-job-1613739961] [autodetect/620067] [CHierarchicalResultsNormalizer.cc@290] 0 influencer bucket normalizers, 0 influencer normalizers, 1 partition normalizers, 0 person normalizers and 1 leaf normalizers restored from JSON stream
[2021-02-19T13:09:21,134][TRACE][o.e.x.m.j.p.StateStreamer] [v7.12.0-0] ES API CALL: get ID ml-snapshots-upgrade-job_model_state_1613739961#1 from index .ml-state*
[2021-02-19T13:09:21,142][DEBUG][o.e.x.m.p.l.CppLogMessageHandler] [v7.12.0-0] [ml-snapshots-upgrade-job-1613739961] [autodetect/620067] [CStateDecompressor.cc@169] Explicit end-of-stream marker found in document with index 1
[2021-02-19T13:09:21,142][TRACE][o.e.x.m.j.p.StateStreamer] [v7.12.0-0] ES API CALL: get ID ml-snapshots-upgrade-job_categorizer_state#1 from index .ml-state*
[2021-02-19T13:09:21,142][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [v7.12.0-0] [ml-snapshots-upgrade-job-1613739961] [autodetect/620067] [CAnomalyJob.cc@893] Processing is already complete to time 1491037200
[2021-02-19T13:09:21,144][DEBUG][o.e.x.m.p.l.CppLogMessageHandler] [v7.12.0-0] [ml-snapshots-upgrade-job-1613739961] [autodetect/620067] [CAnomalyDetector.cc@115] CAnomalyDetector(): count by count for '', first time = 0, bucketLength = 3600, m_LastBucketEndTime = 0
[2021-02-19T13:09:21,144][DEBUG][o.e.x.m.p.l.CppLogMessageHandler] [v7.12.0-0] [ml-snapshots-upgrade-job-1613739961] [autodetect/620067] [CAnomalyJob.cc@1043] Restoring state for detector with key '0==individual count/1/0//count////'
[2021-02-19T13:09:21,144][DEBUG][o.e.x.m.p.l.CppLogMessageHandler] [v7.12.0-0] [ml-snapshots-upgrade-job-1613739961] [autodetect/620067] [CAnomalyDetector.cc@115] CAnomalyDetector(): mean value partition=series/foo for 'foo', first time = 0, bucketLength = 3600, m_LastBucketEndTime = 0
[2021-02-19T13:09:21,145][DEBUG][o.e.x.m.p.l.CppLogMessageHandler] [v7.12.0-0] [ml-snapshots-upgrade-job-1613739961] [autodetect/620067] [CAnomalyJob.cc@1043] Restoring state for detector with key '0==individual metric mean/0/0/value///series//foo'
[2021-02-19T13:09:21,146][DEBUG][o.e.x.m.p.l.CppLogMessageHandler] [v7.12.0-0] [ml-snapshots-upgrade-job-1613739961] [autodetect/620067] [CAnomalyJob.cc@829] Finished restoration, with 2 detectors
[2021-02-19T13:09:21,146][DEBUG][o.e.x.m.p.l.CppLogMessageHandler] [v7.12.0-0] [ml-snapshots-upgrade-job-1613739961] [autodetect/620067] [CAnomalyJob.cc@850] Setting lastBucketEndTime to 1491037200 in detector for 'mean value partition=series/foo'
[2021-02-19T13:09:21,147][DEBUG][o.e.x.m.p.l.CppLogMessageHandler] [v7.12.0-0] [ml-snapshots-upgrade-job-1613739961] [autodetect/620067] [CAnomalyJob.cc@850] Setting lastBucketEndTime to 1491037200 in detector for 'count by count'
[2021-02-19T13:09:21,156][DEBUG][o.e.x.m.j.p.a.JobModelSnapshotUpgrader] [v7.12.0-0] [ml-snapshots-upgrade-job] [1613739961] waiting for flush [1]
[2021-02-19T13:09:21,156][DEBUG][o.e.x.m.j.p.a.o.JobSnapshotUpgraderResultProcessor] [v7.12.0-0] [ml-snapshots-upgrade-job] [1613739961] Flush acknowledgement parsed from output for ID 1
[2021-02-19T13:09:21,157][DEBUG][o.e.x.m.j.p.a.JobModelSnapshotUpgrader] [v7.12.0-0] [ml-snapshots-upgrade-job] [1613739961] flush completed [1]
[2021-02-19T13:09:21,158][DEBUG][o.e.x.m.j.p.a.JobModelSnapshotUpgrader] [v7.12.0-0] [ml-snapshots-upgrade-job] [1613739961] flush [1] acknowledged requesting state write
[2021-02-19T13:09:21,235][DEBUG][o.e.x.m.p.l.CppLogMessageHandler] [v7.12.0-0] [ml-snapshots-upgrade-job-1613739961] [autodetect/620067] [CAnomalyJob.cc@1291] Persisted state for 'count by count'
[2021-02-19T13:09:21,235][DEBUG][o.e.x.m.p.l.CppLogMessageHandler] [v7.12.0-0] [ml-snapshots-upgrade-job-1613739961] [autodetect/620067] [CAnomalyJob.cc@1291] Persisted state for 'mean value partition=series/foo'
[2021-02-19T13:09:21,237][DEBUG][o.e.x.m.p.l.CppLogMessageHandler] [v7.12.0-0] [ml-snapshots-upgrade-job-1613739961] [autodetect/620067] [CModelSnapshotJsonWriter.cc@81] Wrote model snapshot report with ID 1613739961 for: State persisted due to job close at 2021-02-19T13:06:01+0000, latest final results at 1491033600
[2021-02-19T13:09:21,240][TRACE][o.e.x.m.j.p.JobResultsPersister] [v7.12.0-0] [ml-snapshots-upgrade-job] ES API CALL: to index .ml-anomalies-.write-ml-snapshots-upgrade-job with ID [ml-snapshots-upgrade-job_model_snapshot_1613739961]
[2021-02-19T13:09:21,251][TRACE][o.e.x.m.p.IndexingStateProcessor] [v7.12.0-0] [ml-snapshots-upgrade-job] Persisting job state document: index [.ml-state-000001], length [8006]

So it looks like:

The first upgrade task didn't end despite appearing to have done everything it needed to do
The test harness retried the request after 60 seconds, and this unexpected retry generated the exception that failed the test
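The two points above can be sketched as a guard that allows only one in-flight upgrade per job. This is a minimal illustration, not the actual Elasticsearch code: the class `UpgradeGuard` and its methods `start` and `finish` are hypothetical names, and the real check works on persistent tasks rather than an in-memory set. It only shows why a retry fails when the first task never deregisters.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the "one snapshot upgrade per job" check.
final class UpgradeGuard {
    private final Set<String> inFlight = ConcurrentHashMap.newKeySet();

    /** Registers an upgrade for the job, or throws if one is already registered. */
    void start(String jobId, String snapshotId) {
        if (!inFlight.add(jobId)) {
            throw new IllegalStateException("Cannot upgrade job [" + jobId + "] snapshot [" + snapshotId
                + "] as there is currently a snapshot for this job being upgraded");
        }
    }

    /** Must be called when the task ends; if it never is, every retry is rejected. */
    void finish(String jobId) {
        inFlight.remove(jobId);
    }

    public static void main(String[] args) {
        UpgradeGuard guard = new UpgradeGuard();
        guard.start("ml-snapshots-upgrade-job", "1613739961");
        // The first task did its work but never called finish(), so the retry is rejected.
        try {
            guard.start("ml-snapshots-upgrade-job", "1613739961");
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this sketch the retry only succeeds once `finish` has run for the first attempt, which matches the symptom: the work completed, but the task was never marked as ended.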

BigPandaToo · 2021-04-23T14:48:28Z

Looks like the same issue in 7.12: https://gradle-enterprise.elastic.co/s/oi3b5ucjmpnw4

cbuescher · 2021-04-28T14:29:25Z

Another one today on 7.13: https://gradle-enterprise.elastic.co/s/7n2o6ldfztrwk

danhermann · 2021-09-24T12:38:11Z

Another instance in 7.x: https://gradle-enterprise.elastic.co/s/yko2vg2p5pkko

ywangd · 2021-11-04T04:35:31Z

Another one on 7.15 https://gradle-enterprise.elastic.co/s/6ggnqk4wcnox6/tests/:x-pack:qa:rolling-upgrade:v7.10.2%23upgradedClusterTest/org.elasticsearch.upgrades.MlJobSnapshotUpgradeIT/testSnapshotUpgrader?top-execution=1

benwtrent · 2021-11-04T11:50:19Z

OK, I have confirmed @droberts195's theory:

[2021-11-04T03:40:42,127][TRACE][o.e.x.m.j.p.JobResultsPersister] [v7.10.2-2] [ml-snapshots-upgrade-job] ES API CALL: to index .ml-anomalies-.write-ml-snapshots-upgrade-job with ID [ml-snapshots-upgrade-job_model_snapshot_1635997051]
[2021-11-04T03:40:42,140][TRACE][o.e.x.m.p.IndexingStateProcessor] [v7.10.2-2] [ml-snapshots-upgrade-job] Persisting job state document: index [.ml-state-000001], length [14017]
[2021-11-04T03:42:12,302][INFO ][o.e.c.c.Coordinator      ] [v7.10.2-2] master node [{v7.10.2-0}{3tEugoPCR5u5aqWpQSQtKg}{jwTqX71yTUS-ogyPvTVPrw}{127.0.0.1}{127.0.0.1:41235}{cdfhilmrstw}{ml.machine_memory=101252943872, upgraded=true, xpack.installed=true, t

NOTE the time of the state document being persisted was [2021-11-04T03:40:42,140]

The API call was attempted again in this most recent test run at [2021-11-04T00:42:11,848] (about a minute and a half after, basically right at 90s).

Logs from the master node show the task starting at 03:40:41 (which lines up with the assigned node's logs):

[2021-11-04T03:40:41,602][INFO ][o.e.x.m.a.TransportUpgradeJobModelSnapshotAction] [v7.10.2-0] [ml-snapshots-upgrade-job] [1635997051] sending start upgrade request
[2021-11-04T03:40:41,603][DEBUG][o.e.x.m.j.JobNodeSelector] [v7.10.2-0] selected node [{v7.10.2-2}{6YADyR23Q_qxiaOw9Izv2w}{5Bo4iGtYQWObWhcWzY1feA}{127.0.0.1}{127.0.0.1:41481}{cdfhilmrstw}{ml.machine_memory=101252943872, upgraded=true, xpack.installed=true, transform.node=true, testattr=test, ml.max_open_jobs=512, ml.max_jvm_size=518979584}] for job [ml-snapshots-upgrade-job]
[2021-11-04T03:42:12.252616337Z] [BUILD] Stopping node

There are no other logs that indicate any weird behavior.

This leaves the following options:

The process isn't finishing once the snapshot was stored
The listener isn't being notified once the process is finished

I will go through with a fine-toothed comb to see what turns up.

The task close handler was never called. It logs "finished upgrading snapshot" when it's called, but that log message isn't anywhere.

benwtrent · 2021-11-04T12:55:10Z

Looking at the logs and the code, the native process received the request to write the new state, and the new state was persisted.

So, we made it to this part in the task:

https://github.com/elastic/elasticsearch/blob/master/x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/job/process/autodetect/JobModelSnapshotUpgrader.java#L278-L289

State is persisted, so the callback should have fired on the utility thread.

It may be we are tying up an executor and that causes an eventual timeout + failure that isn't logged. Or, waiting for the job to finish is actually taking that long (which it shouldn't).

benwtrent · 2021-11-04T13:59:22Z

If I change the lines:

(aVoid, e) -> threadPool.executor(UTILITY_THREAD_POOL_NAME).execute(() -> shutdown(e))

to

(aVoid, e) -> shutdown(e)

It reproduces reliably.

I do think there is some thread lock somewhere that causes the shutdown to fail. But we always push out to the utility thread pool. I am not sure where it's occurring.

It is possible that the model snapshot upgrader task is never fully shut down. This could occur in the following two scenarios:

- Setting the task state to failed itself fails (we should kill the task in that situation)
- The executor doesn't fire in time to initiate the shutdown before we shut down the executor. This is caused by a race condition between scheduling the shutdown and closing the executor.

To the user, the failure looks like the REST request NEVER returning. In tests, it looks like:

```
13:10:20 ElasticsearchStatusException[Elasticsearch exception [type=status_exception, reason=Cannot upgrade job [ml-snapshots-upgrade-job] snapshot [1613739961] as there is currently a snapshot for this job being upgraded]]
```

This happens because the request is spuriously retried and thus fails, as the previous task never shut down.

closes #69276
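The second scenario, where the executor is closed before the scheduled shutdown gets to run, can be reproduced with plain `java.util.concurrent`. The sketch below is an assumption-laden illustration: `ShutdownRaceSketch`, `scheduleCleanup`, and the inline fallback are hypothetical and are not taken from the fix in #80347, and Elasticsearch's own thread pools may handle rejection differently. To make the outcome deterministic, it closes the executor first and then schedules the cleanup, which is the losing side of the race.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.atomic.AtomicBoolean;

final class ShutdownRaceSketch {
    /** Tries to run cleanup on the executor; optionally runs it inline if the executor rejects it. */
    static void scheduleCleanup(ExecutorService executor, Runnable cleanup, boolean runInlineOnRejection) {
        try {
            executor.execute(cleanup);
        } catch (RejectedExecutionException e) {
            if (runInlineOnRejection) {
                cleanup.run();
            }
            // Without the fallback the cleanup is lost, and the task appears to run forever.
        }
    }

    /** Simulates losing the race: the executor is closed before the cleanup is scheduled. */
    static boolean cleanupRanAfterExecutorClosed(boolean runInlineOnRejection) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        AtomicBoolean cleanedUp = new AtomicBoolean(false);
        executor.shutdown();
        scheduleCleanup(executor, () -> cleanedUp.set(true), runInlineOnRejection);
        return cleanedUp.get();
    }

    public static void main(String[] args) {
        System.out.println("cleanup ran without fallback: " + cleanupRanAfterExecutorClosed(false));
        System.out.println("cleanup ran with fallback: " + cleanupRanAfterExecutorClosed(true));
    }
}
```

Without a fallback the cleanup never runs once the executor has been closed, which is one way a task could persist its state successfully and still never be marked as finished.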


pugnascotia added the >test-failure (Triaged test failures from CI), :ml (Machine learning), and v7.12.1 labels Feb 19, 2021

elasticmachine added the Team:ML (Meta label for the ML team) label Feb 19, 2021

droberts195 mentioned this issue Feb 19, 2021

[ML] Possible race condition between multiple memoryTracker.isRecentlyRefreshed() calls #69289

Closed

gwbrown added v7.12.2 and removed v7.12.1 labels Apr 14, 2021

joegallo removed the v7.12.2 label Jun 30, 2021

benwtrent self-assigned this Nov 4, 2021

benwtrent mentioned this issue Nov 4, 2021

[ML] fix bug in model snapshot upgrader where task never exits #80347

Merged

benwtrent closed this as completed in #80347 Nov 4, 2021
