Nomad server reports 0 allocations placed, but is wrong #9054
Hi @benbuzbee!
The last bit about the previous job name may be the clue here. It might be helpful to know what you mean by "randomly generated data inside of it" to figure out why Nomad isn't seeing this as an update to the previous version of the job. The List Job Versions API may have more detail for you.
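For reference, here's a minimal sketch of what pulling those versions could look like with the official Go API client (`github.com/hashicorp/nomad/api`). The job name and a reachable local agent are assumptions for illustration, not details from this thread:

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/nomad/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// List Job Versions with diffs enabled shows what (if anything) changed
	// between the submitted job and the versions the scheduler already knows.
	versions, diffs, _, err := client.Jobs().Versions("example-batch", true, nil) // hypothetical job name
	if err != nil {
		log.Fatal(err)
	}
	for i, v := range versions {
		fmt.Printf("version %d (JobModifyIndex %d)\n", *v.Version, *v.JobModifyIndex)
		if i < len(diffs) && diffs[i] != nil {
			fmt.Printf("  diff type vs previous: %s\n", diffs[i].Type)
		}
	}
}
```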
It's not being scheduled because Nomad doesn't see what's been registered as an update. Trying to say why it doesn't see an update would be essentially listing every field in the job and noting it hasn't changed. (You can imagine the log entries: "job name hasn't changed, template hasn't changed, taskgroup[0].task[0].resource.cpu hasn't changed, artifact hasn't changed", etc, etc.)
Are these logs getting post-processed by a remote log collector? (Like an ELK stack or the like.) If not, can you share your exact …
I believe you want List Allocations For Evaluation, which maps from eval to allocation(s), or Read Allocation, which includes the `EvalID` of the evaluation that created the allocation.
You may be able to determine more information by checking List Job Versions to get the definition of the previous version of the job (as understood by the scheduler) vs the one you have.
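As a rough illustration of those two lookups, here is a hedged sketch using the Go API client; the eval ID is a placeholder I made up, and the alloc ID is the one from the report below:

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/nomad/api"
)

func main() {
	evalID := "89cf..." // hypothetical eval ID for illustration
	allocID := "e8699e91-91e6-e610-6d8b-0f18ba4996dd"

	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Eval -> allocations: List Allocations For Evaluation.
	allocs, _, err := client.Evaluations().Allocations(evalID, nil)
	if err != nil {
		log.Fatal(err)
	}
	for _, a := range allocs {
		fmt.Printf("eval %s placed alloc %s on node %s\n", evalID, a.ID, a.NodeID)
	}

	// Allocation -> eval: Read Allocation includes the EvalID that created it.
	alloc, _, err := client.Allocations().Info(allocID, nil)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("alloc %s was created by eval %s\n", alloc.ID, alloc.EvalID)
}
```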
Thanks Tim for the very detailed response.

Random data: we generate a new cert with a random GUID in the SAN every time we schedule a job.

Would it make sense for the API to return something that just indicates "Status: Evaluation resulted in no updates because the job has not changed"? Maybe you are saying that is implied because there are no FailedTGAllocs in the response?

The logs are processed by logstash :( We can ignore that problem for now since I guess it's a bit of a tangent; if it is interesting to you I can try to get better data for this specifically.

As far as this:
That is hard for this case since the event is past and was caught in the logs - Nomad no longer seems to remember the evaluation. I was hoping it would make sense to add to the nomad-client logs the evaluation that created an allocation, for easier client <> server correlation.

On the "drain" theory I mentioned: I will try to repro it and get better data, but for this occurrence I can't easily.

Finally, the way we use the API is that we purge the batch job before we reschedule it. I know we're doing things a little oddly. This is because we're using the API for lots of dynamic scheduling - I could give you more details on a call sometime if you are interested, but it means it's hard to look at past versions.

Ok, summary for now:
Right, so you'll poll on Read Evaluation until the status is `complete`.
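Roughly, that polling loop could look like the following sketch (Go API client; the eval ID is a placeholder, and the set of terminal status strings is my assumption about where to stop):

```go
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/hashicorp/nomad/api"
)

// pollEval polls Read Evaluation until the eval reaches a terminal status.
func pollEval(client *api.Client, evalID string) (*api.Evaluation, error) {
	for {
		eval, _, err := client.Evaluations().Info(evalID, nil)
		if err != nil {
			return nil, err
		}
		switch eval.Status {
		case "complete", "failed", "canceled":
			return eval, nil
		}
		time.Sleep(500 * time.Millisecond)
	}
}

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}
	eval, err := pollEval(client, "89cf...") // hypothetical eval ID
	if err != nil {
		log.Fatal(err)
	}
	// An empty FailedTGAllocs together with zero placed allocations usually means
	// the scheduler saw nothing to do, rather than a placement failure.
	fmt.Printf("status=%s failedTGAllocs=%d\n", eval.Status, len(eval.FailedTGAllocs))
}
```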
Sure! The starting point for you is the …

For detecting whether there are any changes that need to be made, the eval passes into the …

Once we get the results back from …

(Just a heads up that this is one of the more computationally expensive areas of Nomad, so we try to keep IO to a minimum. So if you add logging that isn't just a temporary hack, do some benchmarking to make sure it's not going to drag scheduling performance down.)

If you want to dig into the scheduler workflow further, I'd recommend my colleague @schmichael's excellent deep-dive on Nomad Under the Hood.
It'd be fine to do, but it's not obvious to me where it would be best to hook that in. Immediately after its creation? Or do we wait until after the allocation has been placed? But literally anywhere an allocation is getting logged, you could log its `EvalID`.

You may also be interested in keeping an eye on the Event Stream feature my colleague @drewbailey is working on in #9013.
Edit: Hi @tgross, I spammed you a bit here as my mind changed. If you want to ignore all the spam, I think I finally found a real repro of a real bug in my last comment here: #9054 (comment)

Update so far: I haven't been able to repro this locally, but I can repro it on our servers by toggling drain.
Our code sees the node leaving the drain state, tries to schedule a job on it immediately, and is told that there are no allocations for the resulting evaluation (details included for timestamps):
and so we fail. However, in spite of this, alloc …
Oh but what is that eval ID?
Triggered by node update? What about the one that failed?
Triggered by job-register - the timestamps are the same, so let's go to the server logs for better details.
I am not sure what to make of this so far, but it seems to me that the server creates 2 evals - one from our API call (89cf), which it decides doesn't get any allocation, and another magic one (d4a85), which it creates itself, and that is the one that gets the allocations.
Is this the race? I'm just guessing, no science here: The eval broker sorts them according to priority
This line: … will evaluate to false, because jobId != jobId is false, so it will sort them by CreateIndex.

The smaller CreateIndex is first in the min heap, so the winner is …

What I can't figure out is how the job can exist for the eval …
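To make the ordering argument concrete, here's a toy sketch (not Nomad's actual broker code) of the comparison rule as described above: priority only breaks ties across different jobs, and for the same job the smaller CreateIndex wins. The IDs, priorities, and index values are made up:

```go
package main

import (
	"container/heap"
	"fmt"
)

// pendingEval is a toy stand-in for a queued evaluation.
type pendingEval struct {
	ID          string
	JobID       string
	Priority    int
	CreateIndex uint64
}

// evalQueue orders evals per the line referenced above: priority only matters
// when the evals are for different jobs; for the same job, ordering falls back
// to CreateIndex (older first).
type evalQueue []*pendingEval

func (q evalQueue) Len() int { return len(q) }
func (q evalQueue) Less(i, j int) bool {
	if q[i].JobID != q[j].JobID && q[i].Priority != q[j].Priority {
		return q[i].Priority > q[j].Priority
	}
	return q[i].CreateIndex < q[j].CreateIndex
}
func (q evalQueue) Swap(i, j int)       { q[i], q[j] = q[j], q[i] }
func (q *evalQueue) Push(x interface{}) { *q = append(*q, x.(*pendingEval)) }
func (q *evalQueue) Pop() interface{} {
	old := *q
	n := len(old)
	e := old[n-1]
	*q = old[:n-1]
	return e
}

func main() {
	q := &evalQueue{}
	heap.Init(q)
	// The node-update eval was written to raft first, so it has the smaller
	// CreateIndex and is dequeued ahead of the job-register eval.
	heap.Push(q, &pendingEval{ID: "89cf... (job-register)", JobID: "example", Priority: 50, CreateIndex: 101})
	heap.Push(q, &pendingEval{ID: "d4a85... (node-update)", JobID: "example", Priority: 50, CreateIndex: 100})
	fmt.Println("dequeued first:", heap.Pop(q).(*pendingEval).ID)
}
```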
Digging further, I realized the code has a version check that makes job registration and eval creation atomic! I can repro the problem 100% with this diff:
So I'm guessing we have a bug on our end where we fall into the …
Victory! I repro'd it with 2 sleeps; here is the diff with explanation. I think this is a real bug?

How to repro
Eventually the eval returned from the job registration endpoint will contain no allocs even though it did result in allocs
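Here is a hedged sketch of that check from the API-consumer side (Go API client; the job spec, driver, and names are placeholders I made up, and eval polling is trimmed for brevity):

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/nomad/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}
	jobs := client.Jobs()

	// Build a throwaway batch job; name, driver, and command are placeholders.
	job := api.NewBatchJob("example-batch", "example-batch", "global", 50)
	tg := api.NewTaskGroup("tg", 1)
	tg.AddTask(api.NewTask("task", "raw_exec").SetConfig("command", "/bin/true"))
	job.AddTaskGroup(tg)

	// Purge the previous copy, then re-register, mirroring how we drive the API.
	if _, _, err := jobs.Deregister("example-batch", true, nil); err != nil {
		log.Fatal(err)
	}
	resp, _, err := jobs.Register(job, nil)
	if err != nil {
		log.Fatal(err)
	}
	// (A real check would poll resp.EvalID to a terminal status first; omitted here.)

	// Allocations the registration eval claims to have produced...
	evalAllocs, _, err := client.Evaluations().Allocations(resp.EvalID, nil)
	if err != nil {
		log.Fatal(err)
	}
	// ...versus allocations that actually exist for the job. In the racy case the
	// first count stays 0 while the second does not, because a node-update eval
	// for the same job name did the placement.
	jobAllocs, _, err := jobs.Allocations("example-batch", false, nil)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("eval %s reports %d allocs; job has %d allocs\n",
		resp.EvalID, len(evalAllocs), len(jobAllocs))
}
```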
Analysis
We force this timing by delaying the dequeue by 2 seconds and delaying the job enqueue by 500ms.

Suggestions

This seems buggy to me, specifically the node scheduling an eval that just references the job by name and not caring that the job may have since been purged completely and replaced. LMK if you disagree. I think the cleanest solution is one of:
Thanks so much for this thorough investigation @benbuzbee! I walked thru your analysis, wrote most of a reply explaining why I didn't think it was quite right, double-checked the code again, and then realized you're totally right. 😀 Nice work here.

This is interesting because it demonstrates a crack in the atomicity of registering+eval that we introduced in 0.12.1 in #8435: just because we wrote the eval to raft atomically with the job registration doesn't guarantee that the eval isn't going to lose a race against another eval that processes the same job.
At first glance that seems right, but it turns out we don't include the job Version intentionally because we need the eval to be able to tell us to clean up old versions of the job.
This is definitely what I'm thinking as well, and I'm surprised this isn't checking whether the allocation is terminal before creating the eval, but I'll need to spend some time trying to figure out all the edge cases here. It might take a little bit for me to circle back to that.

And although I don't think it helps in this case, make sure you're using the …
We can work around it by adding a random GUID to our job name so it doesn't match the job name the node finds when it wakes up.

Appreciate your patience with this, and feel free to let me know if you encounter any bumps when you get back to it - thanks!
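For what it's worth, a tiny sketch of that workaround; the base name and suffix length are arbitrary choices, not anything prescribed by Nomad:

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// uniqueJobID appends a random suffix to the base job name so a node-update
// eval for the old name can no longer race with the freshly registered job.
func uniqueJobID(base string) string {
	b := make([]byte, 8)
	if _, err := rand.Read(b); err != nil {
		panic(err)
	}
	return fmt.Sprintf("%s-%s", base, hex.EncodeToString(b))
}

func main() {
	fmt.Println(uniqueJobID("example-batch")) // e.g. example-batch-1a2b3c4d5e6f7081
}
```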
Hi @benbuzbee. Wow - lots of good and subtle details you analyzed here; sorry for taking so long to follow up. I will need to chew on this information for a while, but I wanted to share my initial thoughts (read: gut feelings) on where I am and get your pushback. First I want to clarify the following:
Is A2 running J (Version 2) in this case? If so, I am not sure where the bug lies, and it might be that we provide a faulty mental model of evals.

Job evals are the mechanism that triggers a scheduler to reconcile desired vs actual allocations for a job and decide whether Nomad should perform any follow-up steps (e.g. starting new allocations, shutting down, etc). For the most part, the scheduler does not factor in the cause of an eval (e.g. job register vs deregister vs manual force), and it does a "full" review of the state to determine next actions.

I suspect you can also trigger this condition with the job eval command with the modification here. If you force-eval the job, then immediately update it, the scheduler may place the job allocation as a result of the manual force eval rather than the job-register eval as well.

The atomicity we aimed for with job registration+eval in #8435 is about avoiding the scenario where a job registration update succeeds but eval enqueueing fails, resulting in nothing being scheduled. As @tgross noted, it does not preclude the job registration placement happening due to other interleaved evals (e.g. due to drain, the force-eval API, or a failed alloc requiring rescheduling).

Interleaved evals can cause other interesting UX. If a client submits multiple updates to a job in rapid succession (e.g. Versions 2, 3, and 4), it is possible for the V2 eval to actually be placing V4 allocations, with the V3/V4 evals as no-op evals. If multiple users use the CLI to update the same job, the users may get confusing eval info depending on the ordering of Raft commit messages and broker dequeue activity!

At this point, I want to preserve this scheduler behavior - to be honest, mostly because the change is risky, as you point out regarding the first suggestion. Though clearly the UX is non-ideal, and I'd like to explore how much we can improve it while minimizing scheduler changes. I'll chew on this with teammates and will appreciate feedback.
Nomad version
0.12.1
Operating system and Environment details
Linux kernel 5.4.70; Debian
Issue
We are using the API to schedule a batch job. This batch job is new - Nomad has never seen it before in its current state - I know this because it has randomly generated data inside of it - though there may have been a previous job with the same name.
The API returns the following Evaluation structure indicating "no allocations were placed"
The nomad server DEBUG logs agree, saying it ignored 2 allocs and placed 0
(These logs are confusing: some have timestamps, some don't, and there seem to be 2 very similar ones for a few lines.)
Question 1: Why don't the logs or the API tell me WHY it wasn't placed? The current information is not very useful.
HOWEVER, I could confirm in the Nomad client logs that it WAS placed, through some trickery. There is no proof, but you will have to trust me that I can tell, by piecing things together, that this run was allocation
e8699e91-91e6-e610-6d8b-0f18ba4996dd
Question 2: Is there some way to correlate the eval with an allocation in the server or client logs? I cannot find any.
It actually runs fine for a few minutes before it stops for some reason, but I don't care about that right now.
Main bug: What happened? Why does the server think no alloc was placed, when in fact I can see that one was?
Possibly related, but maybe not
The node exited drain mode less than a second before this happened