Nomad allows two copies of itself on one machine #5942

Closed
the-maldridge opened this issue Jul 8, 2019 · 7 comments

@the-maldridge

Nomad version

$ nomad version
Nomad v0.9.3

Operating system and Environment details

Alpine Linux v3.9.4 x86_64, virtualized on AWS EC2.

Nomad is running under supervise-daemon with the following openrc service file:

#!/sbin/openrc-run

description="Nomad job scheduler"
supervisor=supervise-daemon
command=/usr/local/bin/nomad
command_args_foreground="agent -config /etc/nomad.hcl -config /etc/nomad.d/ &>/var/log/nomad.log"
pidfile="/run/${RC_SVCNAME}.pid"

depend() {
    need net
}

Issue

Two copies of Nomad can be running at the same time:

$ pstree
init-+-acpid
     |-chronyd
     |-dockerd---containerd
     |-nomad---134*[nomad]
     |-sshd---sshd---sshd---sh---pstree
     |-supervise-daemo---consul
     |-supervise-daemo---nomad
     |-syslogd
     `-udhcpc

While most of those 134 spare processes are logmons which I'm pretty sure have leaked, there is also a full copy of Nomad that has been re-parented directly to init. This caused a very hard-to-diagnose bug, which we first caught because a task was being signaled in a rapid loop claiming that its template had changed. That eventually led us to tasks on a single machine that appeared to be flapping even though Docker had no restarts logged, and at that point we realized there were two complete copies of Nomad running on the machine, both sending conflicting data back to Consul.

This shouldn't have been possible in the first place, since the first copy should have held the lock on the BoltDB state file, so I'm really not sure how the second Nomad was running. I'd suggest adding a check very early in initialization that takes a lock to ensure there is exactly one Nomad agent running on the machine.
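
To illustrate the suggestion, here is a minimal sketch (not Nomad's actual code) of what such an early-startup check could look like in Go, assuming a hypothetical agent.lock file inside the data directory and an exclusive, non-blocking flock(2):

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

// acquireAgentLock takes an exclusive, non-blocking lock on a file in the
// data directory. If a second agent starts against the same data directory,
// the flock call fails immediately instead of letting two full agents run
// side by side.
func acquireAgentLock(dataDir string) (*os.File, error) {
	path := filepath.Join(dataDir, "agent.lock") // hypothetical lock file name
	f, err := os.OpenFile(path, os.O_CREATE|os.O_RDWR, 0o600)
	if err != nil {
		return nil, err
	}
	if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX|syscall.LOCK_NB); err != nil {
		f.Close()
		return nil, fmt.Errorf("another nomad agent appears to hold %s: %w", path, err)
	}
	// The lock is released when the process exits, so keep f open for the
	// lifetime of the agent.
	return f, nil
}

func main() {
	lock, err := acquireAgentLock("/var/lib/nomad") // assumed data_dir
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer lock.Close()
	// ... continue normal agent startup ...
}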

Reproduction steps

I'm not sure I could reproduce this if I tried, but I am aware of this happening in my test cluster several times, and now that I've seen it in prod I'm thoroughly spooked.

@preetapan
Contributor

@the-maldridge were both Nomads using the same config file? I did a quick test and the second agent fails immediately because of port conflicts; if the Nomad agent can't bind to the configured ports, it fails fast. So I am curious about the config files used by the two copies of Nomad you saw.

We don't prevent two different Nomad agents from running on the same host if they use different data directories and different config files. We use configs like that for local testing, though it's not a recommended thing to do in production.
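
For reference, a rough sketch of the kind of local-testing setup described above. The file names, paths, and port numbers are illustrative, and only the fields that have to differ between the two agents are shown:

# agent-a.hcl
data_dir = "/tmp/nomad-a"
ports {
  http = 4646
  rpc  = 4647
  serf = 4648
}

# agent-b.hcl
data_dir = "/tmp/nomad-b"
ports {
  http = 5646
  rpc  = 5647
  serf = 5648
}

# started separately, e.g.:
#   nomad agent -config agent-a.hcl
#   nomad agent -config agent-b.hcl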

@the-maldridge
Author

There is only one set of configs on the machine, so I can only assume the second agent was flapping and never fully initializing. The problem appears to be that it initialized enough to talk to Consul, which caused tasks to flap while it restarted. As a side effect, every time it started it would spin off phantom logmons that never die.

@preetapan
Contributor

@the-maldridge Thanks for clarifying, we'll triage this into our short-term roadmap. I was able to reproduce this by attempting to run duplicated client agents.

@the-maldridge
Author

@preetapan for my curiosity, do you also see phantom logmons? I suspect that those are related to the bug with not-fully-gc'd allocs (#5945).

@preetapan
Contributor

I built from master and did not see phantom logmons, so I think PR #5890 fixed it.

@tgross added the stage/needs-verification (Issue needs verifying it still exists) label and removed the stage/needs-investigation label on Jan 26, 2021
@the-maldridge
Author

I think this has been long resolved. I haven't seen it on any even remotely modern version. Closing for now.

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 16, 2022