Unable to update list of servers after replacing servers #1590
Comments
PS: Of course, we normally do rolling updates of both our server and client clusters. Having to replace the entire server cluster is still a scenario I'd like to handle (by decoupling both clusters as much as possible).
So it was actually using the updated list; the problem was that the client was not re-registering itself, since the normal path is to register once and then just heartbeat. So when the new servers came up, they were rejecting its heartbeats.
I'm not sure about the internals or what is going wrong. In the logs, I can't see that the new servers are contacted at all. What I can say is that we need to restart the client agent and give it the new server list for it to register successfully. Updating the list via
For the time being, we managed to decouple deployment of Nomad clients from servers by using a watchdog unit that periodically checks whether there's a valid server among the list reported by. I still think that
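A minimal sketch of the check such a watchdog could perform. Everything here is an assumption, not from the issue: the agent address `127.0.0.1:4646`, the hypothetical site-specific `discover-servers` command, and the exact `nomad client-config` argument form (check `nomad client-config -h` for your version):

```shell
# servers_differ LIST_A LIST_B
# Exit 0 if the two whitespace-separated server lists differ after sorting,
# non-zero if they contain the same addresses.
servers_differ() {
  a=$(printf '%s\n' $1 | sort)
  b=$(printf '%s\n' $2 | sort)
  [ "$a" != "$b" ]
}

# Sketch of the loop body the watchdog would run (left commented out,
# since it depends on a live agent and a site-specific discovery step):
#
# known=$(curl -s http://127.0.0.1:4646/v1/agent/servers | tr -d '[]"' | tr ',' ' ')
# fresh=$(discover-servers)                  # hypothetical discovery command
# if servers_differ "$known" "$fresh"; then
#   nomad client-config -update-servers ...  # push the fresh list
# fi
```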
Are you saying this after the PR I opened?
Ah! Totally missed that one. Thanks. I can run some tests with the PR on our cluster. Just need to add a way to roll out custom builds. Is this ready for testing? (We're not going to install non-released Nomad builds in production, so the watchdog workaround will still be required for some time.)
Yeah, it is ready! This will fix the case of having to restart the client if. In this way, some additional work needs to be done to make the update set
@dadgar I'm happy to report that your fix actually works for us. The test scenario:
After the heartbeat, the clients successfully re-registered with the servers and showed up in. With this fix in hand, we're able to use a systemd timer that periodically pushes discovered servers via. Thanks!
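The timer setup could look roughly like this. Only the "systemd timer that periodically pushes discovered servers" idea comes from the comment; the unit names, paths, interval, and the `nomad-server-sync` script are all hypothetical:

```ini
# /etc/systemd/system/nomad-server-sync.service (hypothetical name)
[Unit]
Description=Push discovered Nomad servers to the local client agent

[Service]
Type=oneshot
# Site-specific script that discovers servers and PUTs them to /v1/agent/servers
ExecStart=/usr/local/bin/nomad-server-sync

# /etc/systemd/system/nomad-server-sync.timer
[Unit]
Description=Periodically refresh the Nomad client's server list

[Timer]
OnBootSec=1min
OnUnitActiveSec=5min

[Install]
WantedBy=timers.target
```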
Nomad version
v0.4.0
Operating system and Environment details
Issue
Nomad clients are unable to register with servers configured via nomad client-config -update-servers after replacing all servers.

Reproduction steps
I have a working Nomad cluster setup consisting of 3 clients and these 3 server nodes:
Afterwards, I terminate the 3 server nodes and recreate them from scratch:
However, now I cannot get the clients to register with the new servers, even after running nomad client-config -update-servers. In fact, the agent still tries to contact the old/dead server nodes:

Nomad Client logs
From what I can see in the client logs, the agent still tries to connect to the old/dead cluster leader:
It appears that the agent does not even attempt to connect to all servers returned by nomad client-config -servers.

Background
We want our infrastructure to be self-healing. While Nomad provides retry_join on the server side, there's no such thing for clients. I know that servers will push the current list of healthy servers to clients. However, this does not work if all server nodes are replaced at once, or if the client nodes are bootstrapped before any server. That's why we want to periodically push discovered servers via the /v1/agent/servers endpoint on clients.

/cc @denderello
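For illustration, a sketch of how such a push could be issued over HTTP. The endpoint path /v1/agent/servers comes from the issue itself; the base address/port and the repeated `address` query parameter are assumptions about the agent API, and the addresses are placeholders:

```shell
# build_update_url BASE ADDR...
# Builds the URL for a PUT to the agent's server-update endpoint,
# appending one address= query parameter per server.
build_update_url() {
  base="$1"; shift
  url="${base}/v1/agent/servers"
  sep="?"
  for addr in "$@"; do
    url="${url}${sep}address=${addr}"
    sep="&"
  done
  printf '%s\n' "$url"
}

# Example push against a local client agent (commented out; needs a live agent):
# curl -X PUT "$(build_update_url http://127.0.0.1:4646 10.0.0.4:4647 10.0.0.5:4647)"
#
# Read back the list the agent currently knows about:
# curl -s http://127.0.0.1:4646/v1/agent/servers
```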