Nomad tries to call RecoverTask on an allocation it has GC'd minutes ago; restarts tasks it shouldn't #10901
Comments
I'm afraid I haven't been able to reproduce this issue or discern the triggering condition so far. Can you try to reproduce it again and send a tarball of the data dirs both when the client/server is stopped the first time and after the alloc gets restarted? I'll continue inspecting the code for clues as well.
Here's a tar file of a repro. I don't know if this is the full repro of RecoverTask being called, but the shortest way to reproduce something that I think is broken with that tar is:
This uses the latest Nomad release (also packaged in the tar).
Thank you @benbuzbee for the details. I got to reproduce it with a Linux single server/client cluster, and I plan to have a fix soon! The problem is a variant of #5911 that disproportionately affects single server/client clusters; I'd be curious to know if it happened in production for you.

The issue is that the client pulled an outdated/stale state view while the server was applying raft logs. The client fetches its allocations with AllowStale=true, accepting the ever-present delay in propagating info and assuming that the index increases monotonically. However, if the server is just starting up and replaying raft logs, the client may see an outdated view and start tasks that should not start. That also explains why the client "forgot" that it had killed the task.

In the latest repro logs, I noticed the following relevant log lines. Note particularly the leader election event and the client's "updated allocations" log lines and their indexes.
From the logs in my original post we at least saw RecoverTask being called for a previously Destroy'd task, and one that has restarts set to 0 to boot. I'm not very certain it's the exact same root cause, but I'm happy with the plan of merging your fix for this and then re-evaluating whether I still see it :) Appreciate the quick and deep look! Excited to get these new improvements into our cluster!
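To illustrate the stale-read mechanism described above: the sketch below is not Nomad's client code, just a minimal Go example (with hypothetical stand-in types) of how a consumer doing stale reads can remember the highest index it has applied and ignore responses whose index regresses, which is the kind of backwards view a server still replaying raft logs can produce.

```go
package main

import "log"

// AllocResponse is a hypothetical stand-in for a stale-read response that
// carries the raft index it was generated at.
type AllocResponse struct {
	Index  uint64
	Allocs []string
}

// StaleReader remembers the newest index it has applied so an outdated
// response cannot roll its view of the allocations backwards.
type StaleReader struct {
	lastIndex uint64
}

// Apply returns true if the response was applied, false if it was ignored
// because its index is older than one we already processed.
func (r *StaleReader) Apply(resp AllocResponse) bool {
	if resp.Index < r.lastIndex {
		log.Printf("ignoring stale response: index %d < last seen %d", resp.Index, r.lastIndex)
		return false
	}
	r.lastIndex = resp.Index
	// ...act on resp.Allocs here...
	return true
}

func main() {
	r := &StaleReader{}
	r.Apply(AllocResponse{Index: 100, Allocs: []string{"alloc-a"}}) // applied
	r.Apply(AllocResponse{Index: 42, Allocs: []string{"alloc-a"}})  // ignored: older view
}
```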
This RecoverTask invocation is not surprising to me. The client keeps local state info for an alloc until the server has GC'd it. RecoverTask aims to load local state from the data dir and reattach to any existing running processes; and while arguably we should handle dead tasks a bit differently, we haven't seen the need for that optimization/special-casing yet. Let us know how your experience goes!
OK, so I understand from that that it's expected RecoverTask gets called and should just return an error since the task is dead. That's not necessarily a problem, but it makes it hard for our custom driver to tell the difference between a real error in RecoverTask and an inevitable error because Nomad tried to Recover something it already asked us to Destroy. I'm not sure of the best way to filter that noise out of error monitoring.
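On filtering that noise: the following is a minimal, hedged sketch (simplified types, not the real Nomad plugins/drivers interface) of one way a custom driver could log a recover request for an already-destroyed task as a benign event while still returning an error to Nomad.

```go
package main

import (
	"fmt"
	"log"
)

// taskHandle is a simplified stand-in for the handle a driver persists per
// task; the real driver plugin API passes a richer handle type.
type taskHandle struct {
	TaskID string
}

// driver is a toy custom driver keeping in-memory state for tasks it still
// knows about.
type driver struct {
	tasks map[string]*taskHandle
}

// RecoverTask reattaches if the driver still has state for the task, and
// treats an unknown (already destroyed) task as expected noise: it logs at a
// low severity and returns an error Nomad can handle, rather than something
// that should page anyone.
func (d *driver) RecoverTask(h *taskHandle) error {
	if _, ok := d.tasks[h.TaskID]; !ok {
		log.Printf("recover requested for unknown task %s; likely already destroyed", h.TaskID)
		return fmt.Errorf("task %s not found", h.TaskID)
	}
	// ...reattach to the running process here...
	return nil
}

func main() {
	d := &driver{tasks: map[string]*taskHandle{}}
	_ = d.RecoverTask(&taskHandle{TaskID: "190deedc"}) // unknown task: logged as noise, not alerted
}
```

The design choice here is simply to key the alerting decision on whether the driver still has local state for the task, so monitoring can treat the expected "not found" case differently from a genuine recovery failure.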
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Nomad version
1.0.4
Issue
We have a custom task driver and noticed the following behavior from our logs. All of these logs are annotated with alloc_id=190deedc-2a2e-9687-f70c-d8fbfead9da2. Our logs are marked with a + and Nomad's own logs are marked with a ~.

As you can see, Nomad stops the task at 20:29:25 and marks it for GC, but still tries to recover it when the agent starts up at 20:45:35, and marks it for GC again.
It seems like strange behavior that it tries to recover something it has stopped and destroyed.
I also think this problem means a job with restart/reschedule set to 0 gets restarted anyway; see the repro for an example.
Reproduction steps
Run nomad job init, but set restarts = 0 and reschedules { attempts = 0, unlimited = false }.

Note from the logs:
The job is running even though it failed already and has 0/0 set as its policies:
Apparently it has forgotten about the times it was killed.
It force-GC'd the old one for some reason, then restarted it?
You can't see this from Nomad's logs alone, but I can see from our custom driver logs that before all of this the second Nomad instance called RecoverTask.
Expected Result
Actual Result
Attachments
first-client-run.log
second-client-run.log
example.nomad.txt