Reschedule attempts limit is not respected when there are placement failures #12147
Comments
Hey @avandierast, thanks for bringing this to our attention! We'll look into this and let you know what we find 👍
Hi @avandierast! I just tested this myself and wasn't able to reproduce any unexpected behavior. There are two things that might be at play here:
job "example" {
parameterized {
}
datacenters = ["dc1"]
type = "batch"
group "example" {
reschedule {
attempts = 1
interval = "24h"
unlimited = false
}
restart {
attempts = 0
mode = "fail"
}
task "example" {
driver = "docker"
config {
image = "alpine"
entrypoint = ["/bin/sleep", "10", "&&", "false"]
}
resources {
memory = 10000
}
}
}
}
I'll end up with the following results:
This evaluation created a "blocked eval".
Then I get the following dispatched job:
So there's exactly one rescheduled allocation, and the allocation status for that final alloc looks like the following. Note the "Reschedule Attempts" line.
My evaluations list looks like the following:
Hello @tgross, thanks for your help analysing this problem :) The screenshot I've pasted at the end of my first post is the equivalent of
If this is correct, we can see in my screenshot that there are 3 reschedules, but my job is configured to limit them to one. And thus, the underlying Docker container of the task is run 3 times. We see this behaviour when resources are scarce and the evaluation after a reschedule fails. If the placements are successful two times in a row, then the job stops correctly after two failures. If my example with 4 simultaneous batch jobs is not enough to produce a failed evaluation after a reschedule, I think that if you run more than 4, maybe 10 or even more, it should trigger the problem.
Pretty close! Any time an allocation fails and it's eligible for a reschedule, an evaluation is created, which goes into the scheduler. That means that a single allocation reschedule attempt could have 3 (or more!) evaluations, and they can transition between different states. The timeline might look like:
That's 3 evaluations for a single reschedule check. It isn't until the new allocation is created and placed that we treat it as one "attempt" for the purposes of the reschedule attempts limit.
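To make that attempt accounting concrete, here is a minimal Go sketch of how a policy like the reschedule block above (attempts = 1, interval = "24h") might be checked. The `reschedulePolicy` type and `shouldReschedule` function are assumptions invented for this example, not Nomad's actual scheduler code; the point is only that placed replacement allocations count against the limit, while the intermediate evaluations do not.

```go
package main

import (
	"fmt"
	"time"
)

// reschedulePolicy mirrors the reschedule block from the jobspec in this issue.
// These names are illustrative and do not match Nomad's internal types.
type reschedulePolicy struct {
	Attempts  int
	Interval  time.Duration
	Unlimited bool
}

// shouldReschedule counts only reschedule events for replacement allocations
// that were actually placed; blocked or failed-placement evaluations in
// between do not add to the count.
func shouldReschedule(p reschedulePolicy, placedReschedules []time.Time, now time.Time) bool {
	if p.Unlimited {
		return true
	}
	inWindow := 0
	for _, t := range placedReschedules {
		if now.Sub(t) < p.Interval {
			inWindow++
		}
	}
	return inWindow < p.Attempts
}

func main() {
	p := reschedulePolicy{Attempts: 1, Interval: 24 * time.Hour}
	now := time.Now()
	fmt.Println(shouldReschedule(p, nil, now))                              // true: no attempts used yet
	fmt.Println(shouldReschedule(p, []time.Time{now.Add(-time.Hour)}, now)) // false: the single attempt was used an hour ago
}
```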
Thanks for the detailed explanation :) Yes, I see there are 3 evaluations in the list for one real attempt. I didn't talk about the queued-allocs since it doesn't give much more information for my problem once placement-failure=True. If you look at my eval list in the first post, there are 10 evaluations: 3x3 + 1 -> 3 evaluations per attempt where there was a failing placement, plus the last failing attempt. If this eval list isn't enough to show that there are more than 2 attempts, I can make another example where the container writes into a file and we can see that there are too many attempts. The nomad configuration file for the agent:
The modified job file to have a container that writes to a file on the host:
My list of jobs:
And after some time with the 4 dispatched jobs, the file on the host:
And I can let it run for a long time; it will continue to make attempts and write job_ids to the file...
It might be, but it's hard to map the screenshot to jobs without having the list of allocations for the job as well. So the example you've given is very helpful.
Ok, that's very interesting; I see you've got 7 allocation runs per job there in the output. To try to replicate, I took your jobspec and modified it in three small ways:
job "example" {
parameterized {
}
datacenters = ["dc1"]
type = "batch"
group "example" {
reschedule {
attempts = 1
interval = "24h"
unlimited = false
}
restart {
attempts = 0
mode = "fail"
}
volume "host_files" {
type = "host"
read_only = false
source = "shared_data"
}
task "example" {
driver = "docker"
volume_mount {
volume = "host_files"
destination = "/host_files"
read_only = false
}
config {
image = "alpine"
entrypoint = ["/bin/sh", "-c", "echo \"job ${NOMAD_JOB_ID} - alloc ${NOMAD_ALLOC_ID}\" >> /host_files/data && sleep 10 && false"]
}
resources {
memory = 128
}
}
}
}
I register that job and dispatch it 4 times. I end up with the expected list of job status and evaluations:
And the contents of the data file are exactly what I'd expect. 2 allocations run for each job:
I've double-checked this with both Nomad 1.2.6 and the current development build. But it sounds like you're seeing this problem consistently at your end too, which suggests it's not some weird concurrency problem. The job never ends up getting marked as dead?
I have also tried today on a MacBook Pro with the Nomad binary and it looped as expected 😅 I've used your job entrypoint. I've dispatched one job alone and it worked well. Then I've dispatched 4 at the same time and got back the infinite loop of attempts: they are never dead.
nomad status
We can see that it worked well when the job was dispatched once, but not when there were 4. Result:
I've attached the Nomad logs in case you can find anything useful to understand what is happening. I'm sorry, I closed the Nomad agent and forgot to call
Ah, yes, that's what I've been missing! I was finally able to reproduce by setting
I went through the logs you provided with a fine-toothed comb and there's enough interleaving to make it hard to see where the issue is. Fortunately, I was able to reproduce with just two dispatches. I've broken that down by hand into a detailed walk through the life of the evaluations, which I've left at the bottom of this comment. The tl;dr of all that is that when we process an eval for a rescheduleable alloc we create a "follow-up" eval. If that eval ends up getting blocked, it looks like we're losing track of reschedules at that point. In any case, I suspect that the reason this is getting into a loop is that the two (or more) dispatched jobs just end up interleaving evaluations, so they keep knocking each other into the bad state (there's a rough sketch of this hand-off after the walkthrough below).
(new job dispatch sent) Evals in heap: …
(new job dispatch sent) Evals in heap: …
(and so on)
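To illustrate the hand-off that seems to get lost here, below is a rough Go sketch. The `alloc` and `rescheduleEvent` types are stand-ins invented for this example (not Nomad's real structs); it just contrasts the normal `(stop 1) (place 1)` plan, where the replacement inherits the failed alloc's reschedule history, with the blocked-eval path, where the later placement starts with no history and so never hits the attempts limit.

```go
package main

import "fmt"

// Illustrative stand-ins for the scheduler's allocation and reschedule
// tracking structures; not Nomad's real types.
type rescheduleEvent struct{ PrevAllocID string }

type alloc struct {
	ID      string
	History []rescheduleEvent
}

// replaceAlloc models the normal plan: the replacement carries the failed
// alloc's history forward, plus one new event, so attempts keep accumulating.
func replaceAlloc(failed alloc, newID string) alloc {
	history := append([]rescheduleEvent{}, failed.History...)
	history = append(history, rescheduleEvent{PrevAllocID: failed.ID})
	return alloc{ID: newID, History: history}
}

// freshAlloc models the buggy path: the failed alloc was already stopped by a
// partial plan, so the placement created from the blocked eval is treated as a
// brand-new alloc with an empty history.
func freshAlloc(newID string) alloc {
	return alloc{ID: newID}
}

func main() {
	failed := alloc{ID: "alloc-1"}
	fmt.Println(len(replaceAlloc(failed, "alloc-2").History)) // 1: the attempt is counted
	fmt.Println(len(freshAlloc("alloc-2").History))           // 0: the attempt is lost
}
```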
@avandierast I think I've got enough here to try to write a unit test.
Ok, I've got a failing unit test in #12319. While working on that I realized this bug is likely to impact a feature that my colleague @DerekStrickland is currently working on, so I'm going to pair up with him to work on a solution starting early next week.
Hello @tgross. Happy to hear that you were able to reproduce :)
I think I may have found a similar situation that leads to the opposite result. In my case it is also a batch job (meant to be run daily). The job "hangs" at the templating step (in my case, because it can't find a Vault-based secret which is not there) and never retries. I mean:
Could this behaviour be another aspect of this issue? The 'restart' stanza from my job follows (while testing this I set the job to run every 5 minutes). The current restart mode is "fail", but I also tested with 'mode=delay' with the same results:
periodic {
cron = "*/5 * * * *"
prohibit_overlap = "true"
}
(...)
restart {
interval = "5m"
attempts = 3
delay = "30s"
mode = "fail"
}
Hi @next-jesusmanuelnavarro! @avandierast the bug didn't turn out to be related to Derek's work, which just got merged. But we had an internal discussion about the bug and I think there are a couple of approaches we can try, so I'll try to circle back to this issue shortly.
Quick update on this. I've got a working patch at #12319, but there's a lingering issue described in #20462 (comment) which may be interrelated and I want to make sure I've done my due diligence here before trying to close this out.
When an allocation fails it triggers an evaluation. The evaluation is processed and the scheduler sees it needs to reschedule, which triggers a follow-up eval. The follow-up eval creates a plan to `(stop 1) (place 1)`. The replacement alloc has a `RescheduleTracker` (or gets its `RescheduleTracker` updated). But in the case where the follow-up eval can't place all allocs (there aren't enough resources), it can create a partial plan to `(stop 1) (place 0)`. It then creates a blocked eval. The plan applier stops the failed alloc. Then when the blocked eval is processed, the job is missing an allocation, so the scheduler creates a new allocation. This allocation is _not_ a replacement from the perspective of the scheduler, so it's not handed off a `RescheduleTracker`. This changeset fixes this by annotating the reschedule tracker whenever the scheduler can't place a replacement allocation. We check this annotation for allocations that have the `stop` desired status when filtering out allocations to pass to the reschedule tracker. I've also included tests that cover this case and expand coverage of the relevant area of the code. Fixes: #12147 Fixes: #17072
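As a rough sketch of the approach described in that changeset, here is one way the filtering could look. The names (`RescheduleAnnotated`, `allocsToTrack`, the status strings) are made up for illustration and do not match Nomad's actual identifiers; the idea is simply that a stopped allocation whose replacement was never placed stays in the set that feeds the replacement's reschedule tracker.

```go
package main

import "fmt"

// Illustrative stand-in for an allocation; not Nomad's real struct.
type alloc struct {
	ID                  string
	DesiredStatus       string // "run" or "stop"
	ClientStatus        string // e.g. "failed"
	RescheduleAnnotated bool   // hypothetical flag: stopped by a partial plan, replacement never placed
}

// allocsToTrack picks the allocations whose history should feed the
// replacement's reschedule tracker.
func allocsToTrack(allocs []alloc) []alloc {
	var out []alloc
	for _, a := range allocs {
		// Stopped allocs are normally skipped: their history was already
		// carried onto a placed replacement. The annotation marks the
		// partial-plan case where no replacement was ever placed.
		if a.DesiredStatus == "stop" && !a.RescheduleAnnotated {
			continue
		}
		if a.ClientStatus == "failed" {
			out = append(out, a)
		}
	}
	return out
}

func main() {
	allocs := []alloc{
		{ID: "a1", DesiredStatus: "stop", ClientStatus: "failed"},                            // replaced by a full plan
		{ID: "a2", DesiredStatus: "stop", ClientStatus: "failed", RescheduleAnnotated: true}, // stopped by a partial plan
	}
	for _, a := range allocsToTrack(allocs) {
		fmt.Println(a.ID) // prints only "a2": its failed attempt still counts
	}
}
```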
This issue should be resolved in #12319
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Nomad version
Operating system and Environment details
Fedora 35 with nomad agent --dev
Also reproduced on Ubuntu 20.04 with the Nomad server and client on two separate VMs.
Issue
When there is a placement failure of an allocation after a reschedule, the task group's reschedule attempts limit is not respected at all.
We have this problem with parameterized batch jobs; we don't know if it's the same with other job types.
Reproduction
For example, with the following job definition (on a computer with less than 20GB of RAM, so that the node is exhausted by the 10GB resource reservation):
Expected Result
The tasks are rescheduled at most the number of attempts defined and then die, regardless of whether there were placement failures between the attempts.
For our example, we expect the container to be run only twice per dispatch.
Actual Result
The tasks are rescheduled many times before dying, and sometimes they never die.
We can see the problem easily in the Nomad UI, in the job's evaluations tab.
The job has a reschedule limit of 1 attempt per 24h, but the dispatched job has 4 failures (alloc-failure evaluations with placement failure set to False):