Batch job reschedule attempts sometimes not being respected #20462
Digging into this a bit more. I'm not familiar with the codebase, but here are the ID and RescheduleTracker for each allocation from one of the affected jobs:
{
"ID": "b9fd3bb7-7b8b-28cd-a848-717fc995893c",
"RescheduleTracker": null
}
{
"ID": "b550b588-fcf3-8431-e00d-c1271e307dd0",
"RescheduleTracker": {
"Events": [
{
"RescheduleTime": 1713540616182219192,
"PrevAllocID": "b9fd3bb7-7b8b-28cd-a848-717fc995893c",
"PrevNodeID": "2d4576df-c1f6-8b8a-a0da-ca05123cf849",
"Delay": 5000000000
}
]
}
}
{
"ID": "52a123ff-2e23-4860-e802-3c314e2b3a6a",
"RescheduleTracker": null
}
{
"ID": "46eb89fa-ef85-19cc-b79a-c60edf63ba01",
"RescheduleTracker": {
"Events": [
{
"RescheduleTime": 1713540907034600526,
"PrevAllocID": "52a123ff-2e23-4860-e802-3c314e2b3a6a",
"PrevNodeID": "7349b23c-b638-15d2-ec50-f9ca5e4103f3",
"Delay": 5000000000
}
]
}
}
{
"ID": "51147d13-cef8-a522-091e-8ede42887a17",
"RescheduleTracker": null
}
{
"ID": "eb529550-4ec9-eb9b-24da-f3892da18b81",
"RescheduleTracker": null
}
{
"ID": "1cd8ae47-39d8-d5e1-7b34-a42678fcee66",
"RescheduleTracker": {
"Events": [
{
"RescheduleTime": 1713541254356376383,
"PrevAllocID": "eb529550-4ec9-eb9b-24da-f3892da18b81",
"PrevNodeID": "5b244e29-d814-84dd-d83c-a235ada79a3e",
"Delay": 5000000000
}
]
}
}
{
"ID": "fa019977-b038-a8e8-6b3b-837a4bf04661",
"RescheduleTracker": null
}
{
"ID": "0505a317-7470-06cf-dfba-9ecef6e92e52",
"RescheduleTracker": {
"Events": [
{
"RescheduleTime": 1713541591399485131,
"PrevAllocID": "fa019977-b038-a8e8-6b3b-837a4bf04661",
"PrevNodeID": "3ebe29d2-5c40-643c-6ddb-345a01d315c5",
"Delay": 5000000000
}
]
}
}
{
"ID": "92022347-e05a-c3b2-877f-ba3458fb00fc",
"RescheduleTracker": {
"Events": [
{
"RescheduleTime": 1713541591399485131,
"PrevAllocID": "fa019977-b038-a8e8-6b3b-837a4bf04661",
"PrevNodeID": "3ebe29d2-5c40-643c-6ddb-345a01d315c5",
"Delay": 5000000000
},
{
"RescheduleTime": 1713541647009844424,
"PrevAllocID": "0505a317-7470-06cf-dfba-9ecef6e92e52",
"PrevNodeID": "214d3a1c-3b69-8d67-c707-812bf8a943c6",
"Delay": 5000000000
}
]
}
}
It seems like normally each rescheduled allocation records the previous attempts in its RescheduleTracker, but several of these new allocations have a null RescheduleTracker, as if the chain keeps being restarted from scratch.
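For reference, a minimal sketch of how this ID/RescheduleTracker view can be pulled for any job with the official Go API client, assuming the `github.com/hashicorp/nomad/api` package and that the allocation list stub exposes `RescheduleTracker` the way the JSON above suggests:

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"os"

	"github.com/hashicorp/nomad/api"
)

func main() {
	// NOMAD_ADDR and NOMAD_TOKEN are read from the environment.
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	jobID := os.Args[1]

	// The second argument asks for all allocations, including stopped ones.
	allocs, _, err := client.Jobs().Allocations(jobID, true, nil)
	if err != nil {
		log.Fatal(err)
	}

	for _, a := range allocs {
		out, err := json.MarshalIndent(map[string]interface{}{
			"ID":                a.ID,
			"RescheduleTracker": a.RescheduleTracker,
		}, "", "  ")
		if err != nil {
			log.Fatal(err)
		}
		fmt.Println(string(out))
	}
}
```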
Hi @sagebind! Any chance that these allocations failed a placement on reschedule, and then got placed? There's a known issue around that, which I have a test for in #12319 but we never got back to figuring out how to actually fix it. You should be able to verify this by looking at the evaluation history for the job.
@tgross Here's the evaluation history for a couple of jobs that placed 7 allocations even though the `attempts` limit should have stopped them sooner: 5df4fd90-d0b3-46e1-992d-45f61b08c537-allocations.json. The only evaluation triggers I see are:
Just a heads up that I'm digging into this. The cause of the problem seems straightforward and tied up in the logic we use for marking allocations for reschedule, but the solution is going to take a bit of figuring out. I'll update here as I know more.
While working on #20462, I discovered that some of the scheduler tests for disconnected clients were making long blocking queries. The tests used `testutil.WaitForResult` to wait for an evaluation to be written to the state store. The evaluation was never written, but the tests were not correctly returning an error for an empty query. This resulted in the tests blocking for 5s and then continuing anyway. In practice, the evaluation is never written to the state store as part of the test harness `Process` method, so this test assertion was meaningless. Remove the broken assertion from the two top-level tests that used it, and upgrade these tests to use `shoenig/test` in the process. This will save us ~50s per test run.
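A minimal sketch of the corrected pattern, assuming the scheduler package's test `Harness` (with its `State` store) and Nomad's `testutil.WaitForResult` helper; the key point from the description above is that an empty query result has to be returned as an error so the wait can actually fail rather than silently time out:

```go
package scheduler

import (
	"fmt"
	"testing"

	"github.com/hashicorp/nomad/testutil"
)

// waitForEval blocks until the evaluation shows up in the test harness's
// state store, failing the test if it never does.
func waitForEval(t *testing.T, h *Harness, evalID string) {
	t.Helper()
	testutil.WaitForResult(func() (bool, error) {
		eval, err := h.State.EvalByID(nil, evalID)
		if err != nil {
			return false, err
		}
		// Returning an error here is what the broken tests were missing:
		// without it, an empty result just spins until the timeout.
		if eval == nil {
			return false, fmt.Errorf("eval %q not yet in state store", evalID)
		}
		return true, nil
	}, func(err error) {
		t.Fatalf("timed out waiting for eval: %v", err)
	})
}
```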
…d blocking into release/1.6.x (#20616): backport of the test fix described above. Co-authored-by: Tim Gross <[email protected]>
…d blocking into release/1.7.x (#20617): backport of the test fix described above. Co-authored-by: Tim Gross <[email protected]>
While working on #20462 / #12319 I found that some of our scheduler tests around down nodes or disconnected clients were enforcing invariants that were unclear. This changeset pulls out some minor refactorings so that the bug fix PR is easier to review. This includes:
* Migrating a few tests from `testify` to `shoenig/test` that I'm going to touch in #12319 anyway.
* Adding test names to the node down test.
* Updating the disconnected client test so that we always re-process the pending/blocked eval it creates; this eliminates 2 redundant sub-tests.
* Updating the disconnected client test assertions so that they're explicit in the test setup rather than implied by whether we re-process the pending/blocked eval.

Ref: #20462
Ref: #12319
We are seeing something tangentially related to this. Rescheduling is disabled, but occasionally allocations are getting rescheduled after failing or even completing. No rhyme or reason identified yet for why this is occurring, but I'm wondering if we're dealing with the same issue identified above.
Just wanted to drop an update here. I've got a working solution to the problem described in #12147 and #17072, but @sagebind's report is slightly different because there's no failed placement visible. I strongly suspect the solution there will also fix this problem because of how we attach the reschedule tracker, but until I can reproduce this problem I don't feel confident in closing this out with it. One of the issues I'm running into with the available evidence is that some of the evaluations in the "chain" of evaluations are missing. That shouldn't matter here, as all the evaluations that are missing should be the first eval after the allocation failure, which then creates a new eval to be run after the reschedule delay.
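A rough sketch of walking that evaluation chain with the Go API client (assuming `github.com/hashicorp/nomad/api`; the `PreviousEval`, `NextEval`, `BlockedEval`, and `WaitUntil` fields are what link the evals together):

```go
package main

import (
	"fmt"
	"log"
	"os"

	"github.com/hashicorp/nomad/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Job ID is passed as the first argument.
	evals, _, err := client.Jobs().Evaluations(os.Args[1], nil)
	if err != nil {
		log.Fatal(err)
	}

	for _, e := range evals {
		fmt.Printf("eval=%s status=%s triggered-by=%s prev=%s next=%s blocked=%s wait-until=%v\n",
			e.ID, e.Status, e.TriggeredBy, e.PreviousEval, e.NextEval, e.BlockedEval, e.WaitUntil)
	}
}
```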
I'm still working on trying to figure out the code path that can reproduce this. @gibbonsjohnm wrote:

> Rescheduling is disabled, but occasionally allocations are getting rescheduled after failing or even completing.
It might be related or not. Just by way of clarification, you mean you've got a reschedule block like this (as documented), right?:
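For reference, the documented block being asked about turns rescheduling off with `attempts = 0` and `unlimited = false`. Expressed through the Go API structs instead of HCL (a sketch, assuming the `github.com/hashicorp/nomad/api` package), that policy looks roughly like this:

```go
package main

import (
	"fmt"

	"github.com/hashicorp/nomad/api"
)

func main() {
	attempts := 0
	unlimited := false

	// Go equivalent of the documented HCL for disabling rescheduling:
	//
	//   reschedule {
	//     attempts  = 0
	//     unlimited = false
	//   }
	tg := api.NewTaskGroup("work", 1)
	tg.ReschedulePolicy = &api.ReschedulePolicy{
		Attempts:  &attempts,
		Unlimited: &unlimited,
	}

	fmt.Printf("attempts=%d unlimited=%t\n",
		*tg.ReschedulePolicy.Attempts, *tg.ReschedulePolicy.Unlimited)
}
```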
If so, it would be helpful if you could provide the allocations and evaluations list like @sagebind did above. Debug-level server log snippets from when the evaluations were processed would also be useful.
I'm at the end of my week here, but wanted to leave one last update. It looks like in e8efe2d we introduced a code path where "untainted" allocs that are disconnecting should pass through the reschedule check. But in practice, untainted allocs that aren't disconnecting also pass through the reschedule check now. This could potentially cause a spurious reschedule, but I still need to figure out the exact circumstances where that could happen. The fix for that would be:

```diff
diff --git a/scheduler/reconcile_util.go b/scheduler/reconcile_util.go
index 09cc0e42f5..661db532ec 100644
--- a/scheduler/reconcile_util.go
+++ b/scheduler/reconcile_util.go
@@ -414,6 +414,7 @@ func (a allocSet) filterByRescheduleable(isBatch, isDisconnecting bool, now time
 		isUntainted, ignore := shouldFilter(alloc, isBatch)
 		if isUntainted && !isDisconnecting {
 			untainted[alloc.ID] = alloc
+			continue // TODO: right?
 		}
 		if ignore {
```

But right now that doesn't cause any tests to fail, which at least suggests test coverage may be why we didn't see this earlier. The only way that explains the failed allocs not getting their reschedule tracker is if they don't get marked reschedulable in that code path and just get dropped into the "untainted" list. That would cause a later stage of the reconciler to determine we were short healthy allocs and backfill them. I'll pick this all back up on Monday.
Ok folks, I've got #12319 marked as ready for review to cover the similar-but-not-quite-identical case, and that should ship in the next regular version of Nomad. For the remaining cases described in this issue, I have a high degree of confidence at this point that we're not going anywhere near the reschedule logic. Instead, the scheduler is considering the job to be "underprovisioned", so it's creating allocations that it doesn't consider to be replacements for existing allocations at all. That is, the scheduler isn't saying "I have a failed allocation that needs replacing"; it thinks it simply needs more allocations. Unfortunately I don't have a reproduction for that yet. What would help me at this point, if you're experiencing allocations being replaced without a reschedule tracker and without there having been a placement failure during rescheduling, or completed allocations getting replaced, is to post the allocations and evaluations list for the affected job (as above), along with debug-level server log snippets from around when those evaluations were processed.
@sagebind I did a little digging on the OOM report exit codes, and that looks like it's a case of #13119, which we're aware of and are waiting for a fix from upstream.
If you are reading this to report you've seen similar issues and have not followed along with the above: The key factor in this issue is that the replacement is happening without an apparent attempt to reschedule. If you have seen placement failures during rescheduling, then you're seeing #17072 / #12147, which we already have a patch for.
We do have that reschedule block, yes. If the issue happens again, I will provide the data requested. For what it's worth, these are batch jobs.
Nomad version

Output from `nomad version`:

Operating system and Environment details

Output from `uname -a`:

Issue
It seems that the `attempts` value for `reschedule` set on a job is sometimes not being respected. This job configuration was known to work reliably when we were running Nomad 1.4 (we recently upgraded to catch up with the latest version). We run a large number of small, discrete operations as batch jobs, and we use the reschedule policy to allow Nomad to automatically retry an operation with a new allocation on a different node in case the underlying node had an issue, but we have `attempts` set to `2`.

Normally, this does what we expect -- if a task exits with a nonzero exit code or gets `OOM Killed`, the allocation fails (because `restart.attempts` is set to zero and mode is `"fail"`), and up to 2 additional allocations may be created to attempt the task again. However, since upgrading to 1.7, we've noticed that sometimes this limit is not enforced, and Nomad behaves as if `unlimited` were set to `true`, which it is not. This results in some failing tasks generating an infinite number of new allocations over time, filling up our clusters with wasted work and causing quite a few problems for us. We've had jobs with `attempts` set to `2` somehow end up with hundreds of failed allocations. Here's a screenshot of the dashboard for a job that already had 63 before we found and purged it manually.

So far, I haven't found a rhyme or reason as to why this behavior only sometimes happens, but over time the percentage of our cluster filled with these "zombie" jobs grows until most of our capacity is taken up by them, leaving no room for new work.
My current theory is that sometimes, when the task gets OOM Killed with the Docker driver, the status code is seen as `0` and thus isn't incrementing some sort of counter somewhere that keeps track of the number of failed allocations so far. Normally, the event message reports the OOM kill with a nonzero exit code, but for these "zombie" jobs that reschedule infinitely, the event message seems to show the task exiting with code `0`.
Below are the allocations and events for the example `zombie` job in the reproduction steps section, which works correctly and stops after `attempts` reschedules (the JSON is too big for a GitHub comment): zombie-allocations.json

Now below is the response of allocations from a buggy job in production that spawned more allocations than allowed by its `attempts` limit: buggy-allocations.json
Reproduction steps
I haven't been able to consistently reproduce this issue yet, but the general idea would be to use a job like this one:
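A rough stand-in for such a job, built with the Go API client the reporter mentions using (the job ID, Docker image, and deliberately failing command are invented for illustration), with restarts disabled and rescheduling capped at 2 attempts as described above:

```go
package main

import (
	"log"
	"time"

	"github.com/hashicorp/nomad/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	attempts := 2
	unlimited := false
	delay := 5 * time.Second
	delayFn := "constant"

	restartAttempts := 0
	restartMode := "fail"

	// A task that always fails, to exercise the reschedule path.
	task := api.NewTask("fail-once", "docker")
	task.SetConfig("image", "busybox:latest")
	task.SetConfig("command", "sh")
	task.SetConfig("args", []string{"-c", "exit 1"})

	tg := api.NewTaskGroup("work", 1)
	tg.RestartPolicy = &api.RestartPolicy{
		Attempts: &restartAttempts,
		Mode:     &restartMode,
	}
	tg.ReschedulePolicy = &api.ReschedulePolicy{
		Attempts:      &attempts,
		Unlimited:     &unlimited,
		Delay:         &delay,
		DelayFunction: &delayFn,
	}
	tg.AddTask(task)

	job := api.NewBatchJob("zombie", "zombie", "global", 50)
	job.AddTaskGroup(tg)

	// Expectation: one original allocation plus at most 2 reschedules.
	if _, _, err := client.Jobs().Register(job, nil); err != nil {
		log.Fatal(err)
	}
}
```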
Expected Result

The job should be rescheduled up to `attempts` times, and then stay in a failed state after that.

Actual Result

The job is rescheduled without any limit, creating far more than `attempts` allocations over time.

Job file (if appropriate)
We use the Nomad API programmatically to generate these jobs so we don't have an HCL for such jobs, but here is the job definition of an example job that experienced this problem according to the Nomad API (with some sensitive bits redacted).
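For completeness, a sketch of pulling a job definition back out in that form via the Go API client (assuming `github.com/hashicorp/nomad/api`; the `zombie` job ID from the reproduction sketch above is used as a placeholder):

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"

	"github.com/hashicorp/nomad/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Fetch the job definition as the Nomad API stores it.
	job, _, err := client.Jobs().Info("zombie", nil)
	if err != nil {
		log.Fatal(err)
	}

	out, err := json.MarshalIndent(job, "", "  ")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(out))
}
```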
Nomad Server logs (if appropriate)
Nomad Client logs (if appropriate)