[CI] MlJobSnapshotUpgradeIT.testSnapshotUpgrader #69276
Comments
Pinging @elastic/ml-core (Team:ML)
The exception occurred at 13:10:20:
But the original upgrade was started at 13:09:20:
So it looks like:
Looks like the same issue in 7.12: https://gradle-enterprise.elastic.co/s/oi3b5ucjmpnw4
Another one today on 7.13: https://gradle-enterprise.elastic.co/s/7n2o6ldfztrwk
Another instance in 7.x: https://gradle-enterprise.elastic.co/s/yko2vg2p5pkko
OK, I have confirmed robert's theory:
NOTE the time of the state document being persisted was API call was attempted again in this most recent test run at Logs from the master node show the task starting at the 41 minute mark (lines up with the assigned node)
There are no other logs that indicate any weird behavior. This leaves the following options:
I will go through with a fine-toothed comb to see what turns up. The task close handler was never called. It logs
Looking at the logs and the code: the native process received the request to write the new state, and the new state was persisted. So, we made it to this part in the task: State is persisted, so the callback should have fired on the utility thread. It may be that we are tying up an executor, which causes an eventual timeout and failure that isn't logged. Or waiting for the job to finish really is taking that long (which it shouldn't).
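To make the "tying up an executor" hypothesis concrete, here is a minimal sketch in plain `java.util.concurrent`. It is not the Elasticsearch code: the class name `TiedUpExecutorSketch`, the method `callbackFiresPromptly`, and the pool sizes are all hypothetical. It only shows how a callback queued on a pool can be starved when the pool's threads are busy waiting for that same callback.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; names and sizes are assumptions, not Elasticsearch APIs.
public class TiedUpExecutorSketch {

    // Returns true if a callback queued on the pool runs within waitMillis / 2.
    static boolean callbackFiresPromptly(int poolSize, long waitMillis) {
        ExecutorService utility = Executors.newFixedThreadPool(poolSize);
        CountDownLatch callbackFired = new CountDownLatch(1);
        try {
            utility.execute(() -> {
                // Queue the callback on the same pool...
                utility.execute(callbackFired::countDown);
                // ...then occupy this worker thread while waiting for it.
                awaitQuietly(callbackFired, waitMillis);
            });
            // The caller gives up earlier than the blocked worker does.
            return awaitQuietly(callbackFired, waitMillis / 2);
        } finally {
            utility.shutdownNow();
        }
    }

    private static boolean awaitQuietly(CountDownLatch latch, long millis) {
        try {
            return latch.await(millis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("pool of 1: " + callbackFiresPromptly(1, 1000));
        System.out.println("pool of 2: " + callbackFiresPromptly(2, 1000));
    }
}
```

With a single worker the callback cannot run until the blocked task gives up, so the caller times out first. With a second worker it can run right away. Whether anything like this actually happens in the upgrader task is exactly the open question in this comment.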
If I change the lines:
to
It reproduces reliably. I do think there is some thread lock somewhere that causes the shutdown to fail. But we always push out to the utility thread pool. I am not sure where it's occurring.
It is possible that the model snapshot upgrader task is never fully shut down. This could occur in the following two scenarios:

- Setting the task state to failed itself fails (we should kill the task in that situation).
- The executor doesn't fire in time to initiate the shutdown before we shut down the executor. This is caused by a race condition between scheduling the shutdown and closing the executor.

To the user, the failure looks like the REST request NEVER returning. In tests, it looks like:

```
13:10:20 ElasticsearchStatusException[Elasticsearch exception [type=status_exception, reason=Cannot upgrade job [ml-snapshots-upgrade-job] snapshot [1613739961] as there is currently a snapshot for this job being upgraded]]
```

The request is spuriously retried and thus fails, because the previous task never shut down.

closes #69276
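The race between scheduling the shutdown and closing the executor can be sketched with plain `java.util.concurrent`. This is an illustration under assumptions, not the change that was merged: `ShutdownRaceSketch`, both method names, and the "run the handler inline on rejection" guard are hypothetical. It relies on the JDK default policy, under which a `ThreadPoolExecutor` that has been shut down rejects new work with `RejectedExecutionException`.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical illustration only; not the actual upgrader task code.
public class ShutdownRaceSketch {

    // The executor is closed before the close handler is scheduled, and the
    // rejection is swallowed: the task is never marked as closed.
    static boolean naiveCloseAfterExecutorShutdown() {
        ExecutorService executor = Executors.newFixedThreadPool(1);
        executor.shutdown();
        AtomicBoolean taskClosed = new AtomicBoolean(false);
        try {
            executor.execute(() -> taskClosed.set(true));
        } catch (RejectedExecutionException e) {
            // Handler silently dropped; nothing else will ever close the task.
        }
        return taskClosed.get();
    }

    // Same ordering, but a rejected handler is run on the calling thread so
    // the task is still closed. One possible guard, assumed for illustration.
    static boolean guardedCloseAfterExecutorShutdown() {
        ExecutorService executor = Executors.newFixedThreadPool(1);
        executor.shutdown();
        AtomicBoolean taskClosed = new AtomicBoolean(false);
        Runnable closeHandler = () -> taskClosed.set(true);
        try {
            executor.execute(closeHandler);
        } catch (RejectedExecutionException e) {
            closeHandler.run();
        }
        return taskClosed.get();
    }

    public static void main(String[] args) {
        System.out.println("naive close ran: " + naiveCloseAfterExecutorShutdown());
        System.out.println("guarded close ran: " + guardedCloseAfterExecutorShutdown());
    }
}
```

In the naive variant the task stays registered as running, which matches the symptom in the test: the retried request is refused because an upgrade for the same snapshot still appears to be in progress.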
Build scan:
https://gradle-enterprise.elastic.co/s/eudgfems3awkq
Repro line:
Reproduces locally?:
No
Applicable branches:
7.x
Failure history:
https://gradle-enterprise.elastic.co/scans/tests?search.relativeStartTime=P7D&search.timeZoneId=Europe/London&tests.container=org.elasticsearch.upgrades.MlJobSnapshotUpgradeIT&tests.sortField=FAILED&tests.test=testSnapshotUpgrader&tests.unstableOnly=true
Failure excerpt: