Seems that after a leader failure, some agents may forget about all servers except the leader, which leads these nodes to miss their heartbeats #4666
Comments
Can you share the server logs for
Here are the only logs from
and absolute silence at the moment of the crash. Maybe the logs from that server didn't reach the collector, but this is all we have; AWS forcibly terminated that server and its volume was lost. Our client config is split across separate files: client.hcl
acl.hcl
vault.hcl
consul.hcl
advertise.hcl
replicator.hcl
Also, I remember that the same situation happened on another cluster at the moment we began debugging the Nomad leader server with the delve debugger (so it hung for a long time), but we did not try to reproduce the issue that way.
@dadgar this situation can be reproduced if you hang a Nomad server with a debugger; in our case this is 100% reproducible. For example, on a test Vagrant stand we hang one of the Nomad servers (we have separate VMs for servers and clients), and we see in the client logs:
As you can see,
3 nodes lost their TTL (on the test stand we have only 4 nodes).
I think that the reason for this is a lock held by
@dadgar The problem was more complex than simply removing the lock. The key ideas are:
After these changes we can't reproduce this issue on the test stand. But this PR solves only one part of the problem initially described, and doesn't explain why there was only one server in the client's communications; we added additional logging to investigate. Also, we found that autorebalancing of Nomad servers on the client is never launched - the method responsible for it is never invoked.
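The PR itself is not quoted here, so the following is only a rough Go sketch of the two ideas these comments describe: reshuffle the client's server list only when the incoming set actually differs from the known one, and make sure a periodic rebalance really runs. All type and function names (`serverList`, `setServers`, `rebalance`) are invented for illustration; this is not Nomad's actual code.

```go
package main

import (
	"fmt"
	"math/rand"
	"sort"
	"sync"
	"time"
)

// serverList is a toy stand-in for the client's view of known RPC servers.
type serverList struct {
	mu    sync.Mutex
	addrs []string
}

// setServers replaces the known set, but keeps the current order (and thus
// the currently preferred server) when the incoming set is identical.
func (s *serverList) setServers(incoming []string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if sameSet(s.addrs, incoming) {
		return // nothing changed: do not reshuffle
	}
	s.addrs = append([]string(nil), incoming...)
	rand.Shuffle(len(s.addrs), func(i, j int) { s.addrs[i], s.addrs[j] = s.addrs[j], s.addrs[i] })
}

// rebalance is the periodic reshuffle that, per the comment above, was
// never actually being launched on the client.
func (s *serverList) rebalance(interval time.Duration, stop <-chan struct{}) {
	t := time.NewTicker(interval)
	defer t.Stop()
	for {
		select {
		case <-t.C:
			s.mu.Lock()
			rand.Shuffle(len(s.addrs), func(i, j int) { s.addrs[i], s.addrs[j] = s.addrs[j], s.addrs[i] })
			s.mu.Unlock()
		case <-stop:
			return
		}
	}
}

// sameSet reports whether two address lists contain the same members,
// ignoring order.
func sameSet(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	as, bs := append([]string(nil), a...), append([]string(nil), b...)
	sort.Strings(as)
	sort.Strings(bs)
	for i := range as {
		if as[i] != bs[i] {
			return false
		}
	}
	return true
}

func main() {
	s := &serverList{}
	s.setServers([]string{"10.0.0.1:4647", "10.0.0.2:4647", "10.0.0.3:4647"})
	stop := make(chan struct{})
	go s.rebalance(30*time.Second, stop)
	// Same membership in a different order: no reshuffle happens.
	s.setServers([]string{"10.0.0.3:4647", "10.0.0.1:4647", "10.0.0.2:4647"})
	s.mu.Lock()
	fmt.Println(s.addrs)
	s.mu.Unlock()
	close(stop)
}
```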
It seems that we also have this issue. It probably started after we upgraded Nomad to v0.8.4. Because of it, tasks on random clients keep restarting whenever one of the servers is shut down.
@AlexanderZagaevskiy as a workaround you may increase
@tantra35 Our cluster is very demanding on fast fault detection, so that workaround did not suit us. Nevertheless, a close look at the issue and @tantra35's PR allowed us to fix it.
As a temporary solution, we decreased the keep-alive interval for yamux sessions, made the server list shuffle only when the incoming list differs from the already known one, and turned autorebalancing of Nomad servers on.
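The comment doesn't give concrete numbers, so below is only a hedged Go sketch of what shortening the yamux keep-alive might look like; the 5s/3s values and the `dialMultiplexed` helper are placeholders, not the settings the commenter actually used.

```go
package main

import (
	"log"
	"net"
	"time"

	"github.com/hashicorp/yamux"
)

// dialMultiplexed opens a TCP connection and wraps it in a yamux session
// with a tighter keep-alive than the library default (30s), so a hung peer
// is detected faster.
func dialMultiplexed(addr string) (*yamux.Session, error) {
	conn, err := net.Dial("tcp", addr)
	if err != nil {
		return nil, err
	}
	cfg := yamux.DefaultConfig()
	cfg.EnableKeepAlive = true
	cfg.KeepAliveInterval = 5 * time.Second      // placeholder: library default is 30s
	cfg.ConnectionWriteTimeout = 3 * time.Second // placeholder: fail writes to a dead peer sooner
	return yamux.Client(conn, cfg)
}

func main() {
	session, err := dialMultiplexed("127.0.0.1:4647")
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()
}
```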
@AlexanderZagaevskiy I agree with your timing settings for yamux, but why do you have non-responsive servers so often? Do you have any telemetry?
Our QA engineers were just checking the fault-tolerance feature and turned servers off or unplugged the network on purpose. It was one of the test cases.
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Nomad version
Nomad v0.8.4 (dbee1d7)
Issue
We see this situation from time to time, when one of the Nomad servers hangs due to a hardware failure or network connectivity problems. The last time, the following situation occurred because of an AWS failure: one of the Nomad servers hung, and that server happened to be the leader:
From these log lines we conclude that the node with IP
172.29.20.70
was the leader. While AWS autoscaling shut down the failed node and brought up a new one (this process took about 5-7 minutes), the remaining two servers elected a new leader, but 2 Nomad agents forgot about the remaining servers and kept communicating only with the failed leader. Only after 15-20 seconds did those agents see the server list and begin sending heartbeats to the proper RPC server, but that time was enough for the Nomad server (the new leader) to conclude that these agents had missed their heartbeat TTL and to begin reallocating all allocations placed on those agents. The following logs demonstrate the situation:

As you can see, no server rotation was made after each failed RPC (so we conclude that the agent forgot about all servers except
172.29.20.70
), as should happen according to the code at https://github.com/hashicorp/nomad/blob/master/client/rpc.go#L76. For now we think that a potential reason could be the wholesale replacement of the server list in the following code: https://github.com/hashicorp/nomad/blob/master/client/client.go#L1509
Our hypothesis is this: before hanging completely, the leader forgot about the other servers and handed this info to some Nomad clients (those agents were unlucky and made their RPC requests to the faulty server); after that, the server hung completely and stopped processing RPC requests, so the unlucky clients had no chance to communicate with the healthy servers. But it is strange that Consul discovery doesn't help, and it is also strange that the distance between the following error records
and
is more than 15 seconds, while in theory this distance should be very small.