Unable to update list of servers after replacing servers #1590

Closed
mlafeldt opened this issue Aug 15, 2016 · 9 comments

Comments

@mlafeldt
Contributor

mlafeldt commented Aug 15, 2016

Nomad version

v0.4.0

Operating system and Environment details

  • CoreOS stable 1068.8.0
  • AWS

Issue

Nomad clients are unable to register with servers configured via nomad client-config -update-servers after replacing all servers.

Reproduction steps

I have a working Nomad cluster setup consisting of 3 clients and these 3 server nodes:

$ nomad server-members
Name                                            Address    Port  Status  Leader  Protocol  Build  Datacenter  Region
ip-10-8-3-95.eu-west-1.compute.internal.global  10.8.3.95  4648  alive   true    2         0.4.0  eu-west-1   global
ip-10-8-3-96.eu-west-1.compute.internal.global  10.8.3.96  4648  alive   false   2         0.4.0  eu-west-1   global
ip-10-8-4-30.eu-west-1.compute.internal.global  10.8.4.30  4648  alive   false   2         0.4.0  eu-west-1   global

Afterwards, I terminate the 3 server nodes and recreate them from scratch:

$ nomad server-members
Name                                             Address     Port  Status  Leader  Protocol  Build  Datacenter  Region
ip-10-8-3-123.eu-west-1.compute.internal.global  10.8.3.123  4648  alive   true    2         0.4.0  eu-west-1   global
ip-10-8-4-93.eu-west-1.compute.internal.global   10.8.4.93   4648  alive   false   2         0.4.0  eu-west-1   global
ip-10-8-4-94.eu-west-1.compute.internal.global   10.8.4.94   4648  alive   false   2         0.4.0  eu-west-1   global

However, now I cannot get the clients to register with the new servers, even after running nomad client-config -update-servers. In fact, the agent still tries to contact the old/dead server nodes:

$ nomad client-config -update-servers 10.8.3.123:4647 10.8.4.93:4647 10.8.4.94:4647
Updated server list
$ nomad client-config -servers
10.8.3.123:4647
10.8.4.30:4647
10.8.3.96:4647
10.8.4.93:4647
10.8.3.95:4647
10.8.4.94:4647

Nomad Client logs

From what I can see in the client logs, the agent still tries to connect to the old/dead cluster leader:

Aug 15 14:43:38 ip-10-8-3-165.eu-west-1.compute.internal nomad-client:      2016/08/15 12:43:37.889545 [DEBUG] client.rpcproxy: pinging server "10.8.3.95:4647 (tcp:10.8.3.95:4647)" failed: failed to get conn: dial tcp 10.8.3.95:4647: getsockopt: no route to host
Aug 15 14:43:41 ip-10-8-3-165.eu-west-1.compute.internal nomad-client:      2016/08/15 12:43:40.895560 [DEBUG] client.rpcproxy: pinging server "10.8.3.95:4647 (tcp:10.8.3.95:4647)" failed: failed to get conn: dial tcp 10.8.3.95:4647: getsockopt: no route to host
Aug 15 14:43:44 ip-10-8-3-165.eu-west-1.compute.internal nomad-client:      2016/08/15 12:43:43.901578 [DEBUG] client.rpcproxy: pinging server "10.8.3.95:4647 (tcp:10.8.3.95:4647)" failed: failed to get conn: dial tcp 10.8.3.95:4647: getsockopt: no route to host
Aug 15 14:43:47 ip-10-8-3-165.eu-west-1.compute.internal nomad-client:      2016/08/15 12:43:46.907597 [DEBUG] client.rpcproxy: pinging server "10.8.3.95:4647 (tcp:10.8.3.95:4647)" failed: failed to get conn: dial tcp 10.8.3.95:4647: getsockopt: no route to host
Aug 15 14:43:50 ip-10-8-3-165.eu-west-1.compute.internal nomad-client:      2016/08/15 12:43:49.913596 [DEBUG] client.rpcproxy: pinging server "10.8.3.95:4647 (tcp:10.8.3.95:4647)" failed: failed to get conn: dial tcp 10.8.3.95:4647: getsockopt: no route to host
Aug 15 14:43:53 ip-10-8-3-165.eu-west-1.compute.internal nomad-client:      2016/08/15 12:43:52.919560 [DEBUG] client.rpcproxy: pinging server "10.8.3.95:4647 (tcp:10.8.3.95:4647)" failed: failed to get conn: dial tcp 10.8.3.95:4647: getsockopt: no route to host
Aug 15 14:43:53 ip-10-8-3-165.eu-west-1.compute.internal nomad-client:      2016/08/15 12:43:52.919592 [DEBUG] client.rpcproxy: No healthy servers during rebalance, aborting

It appears that the agent does not even attempt to connect to all servers returned by nomad client-config -servers.

Background

We want our infrastructure to be self-healing. While Nomad provides retry_join on the server side, there's no such thing for clients. I know that servers will push the current list of healthy servers to clients. However, this does not work if all server nodes are replaced at once or if the client nodes are bootstrapped before any server. That's why we want to periodically push discovered servers via the /v1/agent/servers endpoint on clients.
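
A minimal sketch of such a push against the /v1/agent/servers endpoint. It assumes the client agent's HTTP API listens on 127.0.0.1:4646 and that the endpoint accepts a PUT with one `address` query parameter per server; `NOMAD_HTTP`, `servers_url`, and `push_servers` are names made up for this example.

```shell
#!/usr/bin/env bash
# Assumption: the local client agent serves its HTTP API here.
NOMAD_HTTP="${NOMAD_HTTP:-http://127.0.0.1:4646}"

# servers_url HOST:PORT... -> prints the request URL with one
# address= query parameter per server.
servers_url() {
  local query="" addr
  for addr in "$@"; do
    query="${query:+$query&}address=${addr}"
  done
  printf '%s/v1/agent/servers?%s\n' "$NOMAD_HTTP" "$query"
}

# push_servers HOST:PORT... -> sends the list to the local client agent.
# Assumption: the endpoint accepts PUT; check the API docs for your version.
push_servers() {
  curl --silent --show-error --fail -X PUT "$(servers_url "$@")"
}
```

With the functions loaded, `push_servers 10.8.3.123:4647 10.8.4.93:4647 10.8.4.94:4647` would be the HTTP counterpart of the `nomad client-config -update-servers` call shown above.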

/cc @denderello

@mlafeldt mlafeldt changed the title from "nomad client-config -update-servers fails after replacing servers" to "Unable to update list of servers after replacing servers" Aug 15, 2016
@mlafeldt
Contributor Author

PS: Of course, we normally do rolling updates of both our server and client clusters. Having to replace the entire server cluster is still a scenario I'd like to handle (by decoupling both clusters as much as possible).

@dadgar
Contributor

dadgar commented Aug 16, 2016

So it was actually using the updated list. The problem was that the client was not re-registering itself, since the normal path is to register once and then just heartbeat. So when the new servers came up, they were rejecting its heartbeats.

@mlafeldt
Contributor Author

I'm not sure about the internals and what is going wrong. In the logs, I can't see that the new servers are contacted at all. What I can say is that we need to restart the client agent and give it the new server list for it to register successfully. Updating the list via nomad client-config -update-servers does not work here.

@mlafeldt
Contributor Author

For the time being, we managed to decouple deployment of Nomad clients from servers by using a watchdog unit that periodically checks whether there's a valid server among the list reported by nomad client-config -servers. If not, we update the client configuration file and restart the agent.

I still think that nomad client-config -update-servers should support this use case, so that people aren't forced to use Consul.
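
The check at the heart of such a watchdog could look roughly like the sketch below. It is not the actual unit; it assumes `nomad client-config -servers` prints one HOST:PORT per line (as in the output above), that `timeout` is installed, and that bash has /dev/tcp support. `can_reach`, `any_server_reachable`, and `PROBE` are hypothetical names.

```shell
#!/usr/bin/env bash
# can_reach HOST:PORT -> succeeds if a TCP connection opens within 2 seconds.
# Assumption: bash was built with /dev/tcp support.
can_reach() {
  local host="${1%:*}" port="${1##*:}"
  timeout 2 bash -c "exec 3<>/dev/tcp/${host}/${port}" </dev/null 2>/dev/null
}

# any_server_reachable -> reads one HOST:PORT per line on stdin and
# succeeds as soon as one of them answers. PROBE can name another
# check function; it defaults to can_reach.
any_server_reachable() {
  local addr
  while read -r addr; do
    [ -n "$addr" ] || continue
    if "${PROBE:-can_reach}" "$addr"; then
      return 0
    fi
  done
  return 1
}

# Hypothetical watchdog body (the recovery step depends on your setup):
#   if ! nomad client-config -servers | any_server_reachable; then
#     : # rewrite the client configuration file and restart the agent
#   fi
```

A TCP connect only shows that something listens on the port. It does not prove the node is a healthy Nomad server, so treat this as a coarse liveness test.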

@dadgar
Contributor

dadgar commented Aug 16, 2016

Are you saying this after the PR I opened?

@mlafeldt
Contributor Author

mlafeldt commented Aug 16, 2016

Ah! Totally missed that one. Thanks.

I can run some tests with the PR on our cluster. Just need to add a way to roll out custom builds. Is this ready for testing?

(We're not going to install non-released Nomad builds in production, so the watchdog workaround will still be required for some time.)

@dadgar
Contributor

dadgar commented Aug 16, 2016

Yeah, it is ready! This will fix the case of having to restart the client if all the servers are rolled. But if the client gets a heartbeat, the set of servers in it will override what was set via the CLI.

So some additional work still needs to be done to make the update set the list of servers rather than append to it.
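
Until the update replaces the list instead of appending to it, a quick way to see the leftovers is to compare the client's current list with the one that was pushed. This is only a sketch on the operator side, not Nomad code; `stale_servers` is a hypothetical helper and it assumes `nomad client-config -servers` prints one HOST:PORT per line.

```shell
#!/usr/bin/env bash
# stale_servers WANTED... -> reads the client's current server list on
# stdin (one HOST:PORT per line) and prints every entry not in WANTED.
stale_servers() {
  local current want found
  while read -r current; do
    [ -n "$current" ] || continue
    found=0
    for want in "$@"; do
      if [ "$current" = "$want" ]; then
        found=1
      fi
    done
    if [ "$found" -eq 0 ]; then
      printf '%s\n' "$current"
    fi
  done
}
```

For the cluster in this issue, `nomad client-config -servers | stale_servers 10.8.3.123:4647 10.8.4.93:4647 10.8.4.94:4647` would list the three old servers (10.8.4.30, 10.8.3.96, 10.8.3.95) that are still in the client's list.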

@mlafeldt
Contributor Author

@dadgar I'm happy to report that your fix actually works for us.

The test scenario:

  • Have a working server cluster
  • Bootstrap a client cluster with custom Nomad version
  • Re-create server cluster from scratch
  • Run nomad client-config -update-servers to tell clients about new nodes

After the heartbeat, the clients successfully re-registered with the servers and showed up in nomad node-status as well.

With this fix in hand, we're able to use a systemd timer that periodically pushes discovered servers via nomad client-config -update-servers as a fallback mechanism to the initial discovery on boot-up.
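
The script run by such a timer might be sketched as follows. The discovery step is a placeholder: `discover_servers`, the `SERVERS_FILE` path, `NOMAD_BIN`, and `push_discovered` are all assumptions for illustration, and only the `nomad client-config -update-servers` call comes from this thread.

```shell
#!/usr/bin/env bash
# Hypothetical discovery step: prints one HOST:PORT per line.
# Replace it with your own lookup (cloud API, DNS, ...); the path is made up.
discover_servers() {
  cat "${SERVERS_FILE:-/etc/nomad/servers.list}"
}

# push_discovered -> hands the discovered servers to the local client.
# It refuses to push an empty list so a failed lookup changes nothing.
push_discovered() {
  local addr
  set --                      # collect addresses in the positional parameters
  while read -r addr; do
    [ -n "$addr" ] || continue
    set -- "$@" "$addr"
  done <<EOF
$(discover_servers)
EOF
  if [ "$#" -eq 0 ]; then
    echo "no servers discovered; leaving the client's list untouched" >&2
    return 1
  fi
  "${NOMAD_BIN:-nomad}" client-config -update-servers "$@"
}
```

Skipping the push when discovery returns nothing is a deliberate choice here: a transient lookup failure should not leave the client with a worse list than it already has.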

Thanks!

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 20, 2022