Nomad agent process is not exiting on fatal errors #10486
Comments
Hi @K-dot! A couple things that come to mind from your issue. Once Nomad has crashed, it's the responsibility of the service manager to restart it; there's not much Nomad can do at that point. Is the service in a crash-loop where it's restarting over and over again? I'd be particularly interested to see if you're seeing the "funlock" errors that we're seeing in #10086. If that's the case, unfortunately I don't have a good answer for you at the moment other than to let you know we're investigating still.
Are these the Nomad agent process, or might they be the log shipper process (which is the same binary)?
Hi @tgross. We have a couple of "funlock" errors but I don't think they actually cause this problem because the problem happens more often.
Sure, the service manager is configured to restart the service on crash but the problem is that there's an active Nomad process after the crash and the service manager sees the service as running because of that. So my assumption is maybe the Nomad process doesn't exit in such cases.
How can I differentiate them? I see only
@K-dot Sorry for the long delay. I'm investigating this issue further now. The following will be immensely helpful for debugging the issue for the next time it happens:
Thank you so much for your patience.
I'm doing some other Windows investigations and it looks like this issue has been waiting on more information for a long time. Going to close this out as unresolved.
Nomad version
Nomad v1.0.1 (c9c68aa)
Operating system and Environment details
Servers - ubuntu-focal-20.04-amd64-server-* in AWS
Clients - Windows_Server-2019-English-Full-ContainersLatest-* in AWS
Issue
We are running Nomad clients on Windows using NSSM. Our Nomad jobs consume a large amount of memory, which causes the Nomad agents to crash with a "fatal error: out of memory" message. After a crash the service should be restarted, but sometimes that doesn't happen. There are no new logs coming from Nomad, the client is reported as down, and the metrics endpoint is not working, but the Windows service reports itself as Running. In such cases there is still a Nomad process running (sometimes even two processes).
The manual fix is to kill the process(es) and restart the Nomad agent service. I think the cause may be that the agent sometimes does not exit in situations it can't recover from (like OOM, in this case).
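To make the stuck state described above easier to detect automatically, here is a minimal sketch of a watchdog check. It is not something from this thread or from Nomad itself: the function names, the action labels, and the decision rules are all hypothetical, and it assumes the agent's HTTP API is reachable without TLS on the default port 4646 at /v1/agent/health. How the service state and process count are gathered (for example from the service manager and the process list) is left out.

```python
import urllib.request


def agent_is_responsive(url="http://127.0.0.1:4646/v1/agent/health",
                        timeout=5.0, opener=urllib.request.urlopen):
    """Return True if the agent's HTTP API answers with a 2xx status.

    The URL is an assumption (default port, no TLS); adjust for your setup.
    `opener` is injectable so the check can be tested without a live agent.
    """
    try:
        with opener(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:
        # Covers connection refused, timeouts, and urllib's URLError/HTTPError,
        # which are all subclasses of OSError.
        return False


def decide_action(service_state, process_count, responsive):
    """Hypothetical decision rule for the symptoms described in this issue."""
    if responsive:
        return "ok"
    if process_count > 0:
        # Service may still report Running, but the agent does not answer:
        # mirror the manual fix (kill leftover processes, restart service).
        return "kill_and_restart"
    # No process left and no response: a plain service restart is enough.
    return "restart_service"
```

A scheduled task could call `agent_is_responsive()` every minute or so and act on `decide_action(...)` only after several consecutive failures, so that a briefly busy agent is not killed by mistake.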
Reproduction steps
The issue is really hard for us to reproduce. We have ~3000 nodes in Nomad and the issue happens only 3-4 times per week. I tried to reproduce it by filling up memory and causing a "fatal error: out of memory", but in that case the agent restarted successfully.
Expected Result
Nomad agent always exits on unrecoverable errors
Actual Result
There are running Nomad processes after unrecoverable errors