Nomad agent process is not exiting on fatal errors #10486

Closed
rubenharutyunov opened this issue Apr 30, 2021 · 4 comments

Comments

@rubenharutyunov

Nomad version

Nomad v1.0.1 (c9c68aa)

Operating system and Environment details

Servers - ubuntu-focal-20.04-amd64-server-* in AWS
Clients - Windows_Server-2019-English-Full-ContainersLatest-* in AWS

Issue

We run Nomad clients on Windows using NSSM. Our Nomad jobs consume a large amount of memory, which causes the Nomad agents to crash with a "fatal error: out of memory" message. After a crash the service should be restarted, but sometimes that doesn't happen: no new logs come from Nomad, the client is reported as down, and the metrics endpoint does not respond, yet the Windows service reports itself as Running. In such cases there's a Nomad process running (sometimes even two processes).
The manual fix is to kill the process(es) and restart the Nomad agent service. I suspect the cause is that the agent sometimes does not exit in situations it can't recover from (such as OOM, in this case).
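
The stuck state described above (service reported as Running while the agent's HTTP API is dead) could be detected by probing the API alongside the service status. The following is a minimal Python sketch, not part of Nomad or NSSM. It assumes the agent listens on the default HTTP address 127.0.0.1:4646 and exposes the /v1/agent/health endpoint; the function names are hypothetical, and the service state string is assumed to come from whatever the service manager reports.

```python
import http.client
import urllib.error
import urllib.request


def agent_is_responsive(base_url="http://127.0.0.1:4646", timeout=5.0):
    """Return True if the agent's HTTP API answers the health request with a 2xx status.

    Assumption: the agent uses the default HTTP address and serves /v1/agent/health.
    """
    try:
        with urllib.request.urlopen(base_url + "/v1/agent/health", timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, malformed response, or a non-2xx status
        # (urllib raises HTTPError, a subclass of URLError, for 4xx/5xx).
        return False


def needs_manual_recovery(service_state, responsive):
    """True for the state in this issue: service says Running, agent does not answer."""
    return service_state.strip().lower() == "running" and not responsive
```

A scheduled task could call needs_manual_recovery("Running", agent_is_responsive()) and, when it returns True, kill the leftover process(es) and restart the service, which is the manual fix described above.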

Reproduction steps

The issue is really hard for us to reproduce. We have ~3000 nodes in Nomad and it happens only 3-4 times per week. I tried to reproduce it by exhausting memory to trigger "fatal error: out of memory", but in that case the agent restarted successfully.

Expected Result

Nomad agent always exits on unrecoverable errors

Actual Result

There are running Nomad processes after unrecoverable errors

@tgross
Member

tgross commented Apr 30, 2021

Hi @K-dot! A couple things that come to mind from your issue.

Once Nomad has crashed, it's the responsibility of the service manager to restart it; there's not much Nomad can do at that point. Is the service in a crash-loop where it's restarting over and over again? I'd be particularly interested to see if you're seeing the "funlock" errors that we're seeing in #10086. If that's the case, unfortunately I don't have a good answer for you at the moment, other than to let you know we're still investigating.

> In such cases there's a Nomad process running (sometimes even two processes).

Are these the Nomad agent process, or might they be the log shipper process (which is the same binary)?

@rubenharutyunov
Author

rubenharutyunov commented May 5, 2021

Hi @tgross. We do see a couple of "funlock" errors, but I don't think they cause this problem, because this problem happens more often than they do.

> Once Nomad has crashed, it's the responsibility of the service manager to restart it; there's not much Nomad can do at that point. Is the service in a crash-loop where it's restarting over and over again?

Sure, the service manager is configured to restart the service on crash, but the problem is that a Nomad process is still active after the crash, so the service manager sees the service as running. My assumption is that the Nomad process doesn't exit in such cases.

> Are these the Nomad agent process, or might they be the log shipper process (which is the same binary)?

How can I differentiate them? I see only nomad.exe. If the log shipper stays alive after the main process has crashed, that could be the cause of this problem.
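
One possible way to tell the processes apart is by their command lines rather than by image name, since both run from the same binary. On Windows the command lines can be listed with, for example, PowerShell's Get-CimInstance Win32_Process. The sketch below classifies a process from its argument list. It rests on an assumption that should be checked against the installed Nomad version: that the agent is started with the agent subcommand and that helper processes are started with subcommands such as logmon, executor, or docker_logger. The function and constant names are hypothetical.

```python
import os

# Assumption: helper processes are launched as subcommands of the same binary.
# Verify these names against the command lines seen on an actual client.
HELPER_SUBCOMMANDS = {"logmon", "executor", "docker_logger"}


def classify_nomad_process(argv):
    """Classify one process from its command line, given as a list of arguments."""
    if not argv:
        return "unknown"
    # Normalize Windows separators so basename works on any platform.
    exe = os.path.basename(argv[0].replace("\\", "/")).lower()
    if exe not in ("nomad", "nomad.exe"):
        return "other"
    for arg in argv[1:]:
        if arg.startswith("-"):
            continue  # skip any flags that precede the subcommand
        if arg == "agent":
            return "agent"
        if arg in HELPER_SUBCOMMANDS:
            return "helper:" + arg
        break  # first non-flag argument is some other subcommand
    return "unknown"
```

If every leftover nomad.exe on a stuck node classifies as a helper and none as the agent, that would support the idea that a helper outliving the crashed agent is what keeps the service looking alive.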

@notnoop notnoop self-assigned this Jun 2, 2021
@notnoop
Contributor

notnoop commented Jun 2, 2021

@K-dot Sorry for the long delay. I'm investigating this issue further now. The following would be immensely helpful for debugging the next time it happens:

  • The full logs of the agent after the failed restart. To make sure you get all the logs, I'd suggest running the nomad agent CLI command directly, e.g. nomad agent -config c:\...\config.hcl, and capturing the output.

    • In particular, I'd expect to see "Error starting agent:" lines that will provide some leads on the reason for the crash loops.
  • The "corrupted" state.db file from the client data-dir directory, if it doesn't contain super sensitive info. You can send it to [email protected] or our Enterprise support.
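
Capturing the agent output as suggested in the first item could be scripted roughly as follows. This is a minimal Python sketch with hypothetical helper names: it runs a command, writes combined stdout and stderr to a log file while still echoing it to the console, and then pulls out lines containing "Error starting agent". The command and log path are supplied by the caller; nothing here is specific to Nomad beyond the marker string.

```python
import subprocess
import sys

ERROR_MARKER = "Error starting agent"


def run_and_capture(cmd, log_path):
    """Run cmd, write its combined stdout/stderr to log_path, and return the exit code."""
    with open(log_path, "w", encoding="utf-8", errors="replace") as log:
        proc = subprocess.Popen(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr so nothing is lost
            text=True,
            errors="replace",
        )
        for line in proc.stdout:
            sys.stdout.write(line)  # keep showing output on the console
            log.write(line)
            log.flush()  # keep the file current even if the process hangs later
        return proc.wait()


def find_start_errors(log_path):
    """Return the captured lines that contain the agent start-up error marker."""
    with open(log_path, encoding="utf-8", errors="replace") as log:
        return [line.rstrip("\n") for line in log if ERROR_MARKER in line]
```

For example, run_and_capture(["nomad", "agent", "-config", config_path], "agent.log"), where config_path is the path to your own config file, followed by find_start_errors("agent.log").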

Thank you so much for your patience.

@tgross
Member

tgross commented Oct 18, 2024

I'm doing some other Windows investigations and it looks like this issue has been waiting on more information for a long time. Going to close this out as unresolved.

@tgross tgross closed this as not planned Oct 18, 2024
@github-project-automation github-project-automation bot moved this from Needs Roadmapping to Done in Nomad - Community Issues Triage Oct 18, 2024