Nomad allows two copies of itself on one machine #5942
Comments
@the-maldridge were both Nomads using the same config file? I did a quick test and the second agent fails immediately because of port conflicts. If the Nomad agent can't bind to the configured port, it fails fast. So I am curious about the config files used by the two copies of Nomad you saw. We don't prevent two different Nomad agents from running on the same host if they use different data directories/different config files. We use configs like that for local testing, though it's not a recommended thing to do in production.
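The "different data directories/different config files" setup mentioned here might look like the following. This is a hypothetical sketch for local testing only; the `data_dir` and port values are made up, and this is not a recommended production layout:

```hcl
# second-agent.hcl -- config for a second local test agent.
# A distinct data_dir and distinct ports avoid the fail-fast bind
# conflict that normally stops a duplicate agent.
data_dir = "/tmp/nomad-agent2"

ports {
  http = 5656
  rpc  = 5657
  serf = 5658
}
```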
There is only one set of configs on the machine, so I can only assume that the task was flapping and not fully initializing. The problem appears to be that it initialized enough to talk to Consul, which caused tasks to flap while it restarted. As a side effect, every time it started it would spin off phantom logmons that never die.
@the-maldridge Thanks for clarifying, we'll triage this into our short term roadmap. I was able to reproduce by attempting to run duplicated client agents.
@preetapan for my curiosity, do you also see phantom logmons? I suspect that those are related to the bug with not-fully-gc'd allocs (#5945). |
I built from master and did not see phantom logmons so I think PR #5890 fixed it. |
I think this has been long resolved. I haven't seen it on any even remotely modern version. Closing for now. |
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. |
Nomad version
Operating system and Environment details
Alpine Linux v3.9.4 x86_64 Virtualized on AWS EC2.
Nomad is running under supervise-daemon with the following openrc service file:
Issue
Two copies of Nomad can be running at the same time:
While most of those 134 spare processes are logmons which I'm pretty sure have leaked, there is also a full copy of Nomad running that's been re-parented directly to init. This caused a very hard-to-diagnose bug, which we caught when we saw a task being signaled in a rapid loop because its template kept changing. This eventually led to finding tasks on a single machine that seemed to be flapping, even though Docker had no restarts logged. That was when we realized that there were two complete copies of Nomad running on the machine, both sending conflicting data back to Consul.
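The two symptoms described above, leaked logmons and a Nomad re-parented to init, can be spotted from the process table. A rough check, assuming the leaked processes still have "logmon" in their command line (an assumption about how they appear in `ps`, not something confirmed in this issue):

```shell
# Count processes with "logmon" in their command line. The bracket trick
# ([l]ogmon) keeps the awk command itself from matching its own args.
ps -eo pid,ppid,args | awk '/[l]ogmon/ {n++} END {print "logmon processes:", n+0}'

# List any nomad process whose parent is init (PPID 1), i.e. one that
# escaped its supervisor.
ps -eo pid,ppid,args | awk '$2 == 1 && /[n]omad/ {print "re-parented:", $0}'
```

On a healthy node the first command should report 0 and the second should print nothing.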
This shouldn't have been possible in the first place, since the boltdb should have been locked by the first copy, so I'm really not sure how the second Nomad was running. I'd suggest adding a check very early in initialization that takes a lock, to ensure that exactly one Nomad is running on the machine.
Reproduction steps
I'm not sure I could reproduce this if I tried, but I am aware of this happening in my test cluster several times, and now that I've seen it in prod I'm thoroughly spooked.