Nomad client deregistering from consul #525
Comments
Our expectation is that in a production cluster the nomad server and client don't run on the same node, so I suspect some weirdness is caused by that. (We should still fix this, though.) I'm curious whether this problem persists if you run them on separate nodes. I recall from reviewing this code that Nomad should only deregister services that it is tracking (that it has started), so if there are host-level services also registered with consul they should be left alone. If you are in AWS or another virtualized network environment you can use floating IPs for your nomad servers. Also, you only need to join the nodes to the cluster once and they will check in and get the latest list of servers periodically, so IIRC they only need one valid IP once ever, and provided your cluster stays healthy they will maintain an updated list of servers over time.
Thanks @cbednarski. It would seem that it's deregistering more than nomad-tracked services. In production, I would imagine it could be common for nomad servers to sit idle. This means 3+ m4.large AWS instances are sitting idle, burning holes in your pocket. By colocating nomad clients on nomad server nodes, it becomes cost-effective to run small clusters.
The current design assumes that the Nomad Client is the sole user of the local Consul Agent. As @cbednarski mentioned we expect you would not run both the server and client on the same node. To support that would require a re-architecture of how Consul registration takes place.
So as @dadgar said, in the current design the Nomad Client registers/de-registers the running services with Consul on a node. The reason we de-register everything and register only the processes we know are running is that we don't want any zombie services left registered that are no longer running. Say, for example, we register a Redis container with Consul, the client crashes and restarts, and meanwhile the Redis container dies: we would be left with a zombie service unless we de-register everything that isn't running anymore on the node. I would suggest running the Nomad client on a separate node.
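For illustration, here is a minimal sketch of that wipe-and-reregister sync pattern using the hashicorp/consul/api Go client. It is an approximation of the approach described above, not Nomad's actual code; the function and variable names are made up.

```go
package consulsync

import "github.com/hashicorp/consul/api"

// syncServices de-registers every service on the local Consul agent that the
// client is not currently tracking, then (re)registers the tracked ones.
// Note the side effect discussed in this issue: anything registered outside
// the client -- such as a host-level "nomad" service definition -- is treated
// as a zombie and removed.
func syncServices(agent *api.Agent, tracked map[string]*api.AgentServiceRegistration) error {
	registered, err := agent.Services()
	if err != nil {
		return err
	}
	for id := range registered {
		if id == "consul" {
			continue // never touch the agent's own service entry
		}
		if _, ok := tracked[id]; !ok {
			if err := agent.ServiceDeregister(id); err != nil {
				return err
			}
		}
	}
	for _, reg := range tracked {
		if err := agent.ServiceRegister(reg); err != nil {
			return err
		}
	}
	return nil
}
```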
@BSick7 We discussed this, and we will change the current implementation to de-register only services that the Nomad client doesn't know about and that are tagged or have an ID prefixed according to a convention.
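A sketch of what that convention-based filter could look like; the prefix below is a made-up placeholder, not the tag or ID scheme Nomad actually adopted.

```go
package consulsync

import "strings"

// nomadServicePrefix is a hypothetical naming convention used only for this
// example; the real convention may be a tag or a different ID prefix.
const nomadServicePrefix = "nomad-registered-"

// shouldDeregister reports whether a Consul service ID should be reaped:
// it must look Nomad-owned (by prefix) and not map to a running task.
// Services registered by other tooling on the node are left untouched.
func shouldDeregister(id string, running map[string]bool) bool {
	return strings.HasPrefix(id, nomadServicePrefix) && !running[id]
}
```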
In a smaller cluster you could possibly run Nomad and Consul server nodes on the same machines, but there's currently no way to account for the resource utilization here, so Nomad can't effectively schedule workloads around this. As workload size grows, both nomad and consul become fairly RAM hungry since they keep state in memory, and both are subject to varying CPU load depending on scheduling events, outages, partitions, recovery, etc. For stability and manageability, the servers for both of these should have dedicated nodes. I think at the scale where it makes sense to operate a scheduler, 3 dedicated nodes for Nomad should not be a significant cost. For instance, if you run 3x m4.large and 15x c3.8xlarge or c4.8xlarge workers, Nomad is approximately 1.5% of your infrastructure spend. The consolidation benefits here will easily save you more than 1.5%, so you end up spending less overall.
I could easily see zombie services getting out of control with little visibility. If it's really detrimental to run nomad in both server mode and client mode, perhaps the following configuration could emit a warning:

```hcl
server {
  enabled = true
}

client {
  enabled = true
}
```

Once I configured the following, nomad worked brilliantly, revealing that node in both the server and client listings:

```hcl
client {
  servers = ["nomad.<subdomain>:4647"]
}
```
Agreed, that makes sense.
@BSick7 I take it you used non-consul DNS for that? Or did I miss a step?
Related to #510
@BSick7 and I are using Consul DNS. This is a temporary approach for us until Nomad integrates with Atlas/Scada for discovery.
@poll0rz Yes this is in-flight.
@steve-jansen Thanks for the explanation. After Brad said he got it working I thought maybe he had changed something in his config but I wasn't sure of the details.
@cbednarski I did change the config a little to use our Route 53 DNS instead of Consul DNS.
As a workaround, until this is fixed, I found that you can manually register the Nomad service under different node names (but with the correct IP addresses of your Nomad servers). This has the benefit of retaining Consul monitoring and DNS resolution. The downside is that you'll have additional "duplicate" nodes in your Consul node listing. Example:
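A hedged sketch of what such a manual registration could look like via Consul's catalog API, using the Go client; the node names, addresses, and port below are placeholders standing in for your actual Nomad servers.

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Register the Nomad RPC endpoint under a distinct node name so that the
	// client's periodic sync on the real node doesn't remove it. The node
	// names are arbitrary; the addresses must point at actual Nomad servers.
	servers := map[string]string{
		"nomad-ext-1": "10.0.1.10",
		"nomad-ext-2": "10.0.1.11",
		"nomad-ext-3": "10.0.1.12",
	}
	for node, addr := range servers {
		reg := &api.CatalogRegistration{
			Node:    node,
			Address: addr,
			Service: &api.AgentService{
				ID:      "nomad",
				Service: "nomad",
				Port:    4647,
			},
		}
		if _, err := client.Catalog().Register(reg, nil); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("registered nomad on %s (%s)\n", node, addr)
	}
}
```

Because these node names don't belong to any running agent, Consul's anti-entropy sync doesn't reap the entries, which is presumably why the extra "duplicate" nodes show up in the node listing.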
@BSick7 We have a fix for this in master, and it will be released soon with 0.2.2.
Great news! Thanks @diptanu |
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. |
We are running consul server + nomad (running in server and client mode) on the same boxes that we call "managers". We also create "workers" that run consul client + nomad client.
We place a service registration in the consul config directory so that `nomad.service.consul` resolves to the nomad servers. This allows us to configure nomad clients with `servers = ["nomad.service.consul:4647"]`.
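For illustration, a minimal sketch of that registration expressed through the Consul Go agent API; a JSON service definition dropped in the config directory would carry the same name and port fields, and the values below are assumptions about this setup rather than the original file.

```go
package main

import (
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	// Talk to the local Consul agent on the manager node.
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Register a "nomad" service so nomad.service.consul resolves to this
	// node; 4647 is Nomad's default RPC port.
	reg := &api.AgentServiceRegistration{
		Name: "nomad",
		Port: 4647,
	}
	if err := client.Agent().ServiceRegister(reg); err != nil {
		log.Fatal(err)
	}
}
```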
Consul boots up, registers nomad, then deregisters nomad. After disabling nomad client mode and restarting consul and nomad, nomad remained in the consul service registration. Since `nomad.service.consul` doesn't resolve, the worker nodes are never able to connect to the nomad cluster.

I believe I traced the culprit to https://github.com/hashicorp/nomad/blob/master/client/task_runner.go#L239-L240. This seems to be very intentional, yet makes little sense.
Could you add some documentation about this and better strategies for joining nomad clients?