Nomad agent process is not exiting on fatal errors #10486

rubenharutyunov · 2021-04-30T16:25:38Z

Nomad version

Nomad v1.0.1 (c9c68aa)

Operating system and Environment details

Servers - ubuntu-focal-20.04-amd64-server-* in AWS
Clients - Windows_Server-2019-English-Full-ContainersLatest-* in AWS

Issue

We run Nomad clients on Windows under NSSM. Our Nomad jobs consume a large amount of memory, which causes Nomad agents to crash with a fatal error: out of memory message. After a crash the service should be restarted, but sometimes that doesn't happen. No new logs come from Nomad, the client is reported as down, and the metrics endpoint stops responding, yet the Windows service reports itself as Running. In such cases a Nomad process is still running (sometimes even two processes).
The manual fix is to kill the process(es) and restart the Nomad agent service. I think the cause may be that the agent sometimes does not exit in situations it can't recover from (like OOM, in this case).
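
The stuck state described above (service reported as Running while the agent no longer answers) can be detected from outside the process. What follows is a minimal sketch, not a Nomad or NSSM feature. It assumes the agent's HTTP API listens on the default http://127.0.0.1:4646 and that /v1/agent/health can be queried without a token; the function names and the decision rule are hypothetical.

```python
import http.client
import urllib.error
import urllib.request

# Assumption: default agent HTTP address; adjust for TLS or a non-default port.
HEALTH_URL = "http://127.0.0.1:4646/v1/agent/health"


def agent_responds(url: str = HEALTH_URL, timeout: float = 5.0) -> bool:
    """Return True if the agent answers the health request with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, or a non-2xx reply all count as "silent".
        return False


def needs_manual_recovery(service_state: str, responds: bool) -> bool:
    """Flag the state from this report: service says Running, agent is silent."""
    return service_state.strip().lower() == "running" and not responds
```

A scheduled task could call needs_manual_recovery with the state reported by the service manager and the result of agent_responds(), and only then kill the leftover process(es) and restart the service.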

Reproduction steps

The issue is really hard for us to reproduce. We have ~3000 nodes in Nomad and it happens only 3-4 times per week. I tried to reproduce it by filling memory until the agent hit fatal error: out of memory, but the agent then restarted successfully.

Expected Result

Nomad agent always exits on unrecoverable errors

Actual Result

There are running Nomad processes after unrecoverable errors


tgross · 2021-04-30T18:15:29Z

Hi @K-dot! A couple things that come to mind from your issue.

Once Nomad has crashed, it's the responsibility of the service manager to restart it; there's not much Nomad can do at that point. Is the service in a crash-loop where it's restarting over and over again? I'd be particularly interested to see if you're seeing the "funlock" errors that we're seeing in #10086. If that's the case, unfortunately I don't have a good answer for you at the moment other than to let you know we're investigating still.

> In such cases there's a Nomad process running (sometimes even two processes).

Are these the Nomad agent process, or might they be the log shipper process (which is the same binary)?

rubenharutyunov · 2021-05-05T18:42:07Z

Hi @tgross. We have a couple of "funlock" errors, but I don't think they actually cause this problem, because this problem happens more often than they do.

> Once Nomad has crashed, it's the responsibility of the service manager to restart it; there's not much Nomad can do at that point. Is the service in a crash-loop where it's restarting over and over again?

Sure, the service manager is configured to restart the service on crash, but the problem is that a Nomad process is still active after the crash, so the service manager sees the service as running. My assumption is that the Nomad process doesn't exit in such cases.

> Are these the Nomad agent process, or might they be the log shipper process (which is the same binary)?

How can I differentiate them? I see only nomad.exe. If the log shipper stays alive after the main process has crashed, that could be the reason for this problem.
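
Since both kinds of process show up as nomad.exe, one way to tell them apart is by their command lines rather than by image name. The sketch below rests on an assumption not confirmed in this thread: that the agent is started as nomad.exe agent ... and that helper processes are started with a subcommand such as logmon, executor, or docker_logger. The function name and labels are hypothetical. On Windows the command lines can be listed with, for example, PowerShell's Get-CimInstance Win32_Process.

```python
import shlex

# Assumption: helper processes are launched as "<nomad binary> <subcommand>".
HELPER_SUBCOMMANDS = {"logmon", "executor", "docker_logger"}


def classify_nomad_process(command_line: str) -> str:
    """Classify a nomad.exe command line as 'agent', 'helper', or 'other'."""
    # posix=False keeps Windows backslashes intact and leaves a quoted
    # executable path as a single token.
    args = shlex.split(command_line, posix=False)
    if len(args) < 2:
        return "other"
    subcommand = args[1].strip('"').lower()
    if subcommand == "agent":
        return "agent"
    if subcommand in HELPER_SUBCOMMANDS:
        return "helper"
    return "other"
```

If a leftover process classifies as "helper" while no "agent" process exists, that would support the idea that a log shipper outlived the crashed agent.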

notnoop · 2021-06-02T12:38:27Z

@K-dot Sorry for the long delay. I'm investigating this issue further now. The following will be immensely helpful for debugging the issue the next time it happens:

- The full logs of the agent after a failed restart. To ensure you get all the logs, I'd suggest running the nomad agent CLI command directly, e.g. nomad agent -config c:\...\config.hcl, and capturing the output.
  - In particular, I'd expect to see Error starting agent: lines that will provide some leads on the reason for the crash loops.
- The "corrupted" state.db file from the client data-dir directory, if it doesn't contain super sensitive info. You can send it to [email protected] or our Enterprise support.
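
Once the agent output has been captured to a file, the lines of interest can be pulled out mechanically. A minimal sketch, assuming the output was redirected to a plain-text log file; the helper name and the file handling are hypothetical, and only the "Error starting agent" marker comes from the request above.

```python
from typing import Iterable, List

# Marker text taken from the request above.
MARKER = "Error starting agent"


def find_start_errors(lines: Iterable[str]) -> List[str]:
    """Return every captured line that contains the startup-error marker."""
    return [line.rstrip("\r\n") for line in lines if MARKER in line]
```

For example, open the captured file with open(path, encoding="utf-8", errors="replace") and pass the file object to find_start_errors; an empty result means the agent never logged a startup error in that capture.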

Thank you so much for your patience.

tgross · 2024-10-18T19:24:19Z

I'm doing some other Windows investigations and it looks like this issue has been waiting on more information for a long time. Going to close this out as unresolved.

rubenharutyunov added the type/bug label Apr 30, 2021

tgross added the theme/platform-windows label Apr 30, 2021

tgross self-assigned this Apr 30, 2021

tgross added the stage/waiting-reply label Apr 30, 2021

tgross removed the stage/waiting-reply label May 20, 2021

notnoop self-assigned this Jun 2, 2021

tgross removed their assignment Jun 3, 2021

tgross added the stage/waiting-reply label Jun 3, 2021

tgross unassigned notnoop Nov 8, 2021

tgross added this to Nomad - Community Issues Triage Jun 24, 2024

tgross moved this to Needs Roadmapping in Nomad - Community Issues Triage Jun 24, 2024

tgross closed this as not planned Oct 18, 2024

github-project-automation bot moved this from Needs Roadmapping to Done in Nomad - Community Issues Triage Oct 18, 2024
