Nomad reallocated all jobs when one server lost connection in cluster #3840
Comments
Yikes! Sounds like AWS's network had a lot of issues. This behavior is to be expected during severe network issues: whenever client nodes are unable to heartbeat to a quorum of servers for a period of time, the servers will consider those nodes down and reschedule their allocations elsewhere. There are a couple of things you can do to try to prevent service outages during network issues like this:
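For example, one mitigation referenced later in this thread is raising the servers' heartbeat grace period so that brief connectivity blips don't mark nodes as down. A minimal sketch of a server agent config stanza, with the `1m` value chosen purely for illustration:

```hcl
# Server agent config sketch (illustrative values, not a recommendation).
server {
  enabled          = true
  bootstrap_expect = 3

  # How long a client may miss heartbeats before it is considered down
  # and its allocations are rescheduled. The default is 10s.
  heartbeat_grace = "1m"
}
```

The trade-off is that genuinely failed nodes take correspondingly longer to be detected, so their allocations are also replaced more slowly.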
I hope this helps, but please reopen if you think there's an issue I'm missing. Thank you for including extremely helpful logs!
@schmichael All allocations were placed in different AZs, since there was one client node in each AZ (by Auto Scaling group rule), and since I'm using the distinct_hosts constraint this leads to only one allocation per AZ. Since I'd want my services to come back up in the event of losing an entire AZ, that wouldn't happen if I set a constraint on AZ. When I've played around with Nomad and killed a quorum of servers, destroying the cluster, the clients have always continued to run their allocations until I killed them manually via Docker. But now Nomad killed all allocations on one client node, and 5 minutes later all allocations on the other 2 client nodes (66% of cluster capacity), causing downtime; we haven't scaled it to survive the loss of 2 AZs. It doesn't feel like something it should do. I cannot find anything in the logs indicating that the client-server connection failed. Is that something that is logged?
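For readers unfamiliar with the constraint being described, a minimal job-spec sketch of `distinct_hosts` (the job, group, task, and image names here are hypothetical):

```hcl
job "web" {
  datacenters = ["eu-west-1"]

  group "app" {
    count = 3

    # Force each of the 3 allocations onto a different client node;
    # with one client per AZ this also means one allocation per AZ.
    constraint {
      operator = "distinct_hosts"
      value    = "true"
    }

    task "app" {
      driver = "docker"
      config {
        image = "example/app:latest"
      }
    }
  }
}
```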
Great question! We need to document these logs or something. This line indicates leadership was lost:
These lines indicate the cluster trying to elect a new leader, having a bit of a hard time, but ultimately succeeding:
This line indicates a client node was considered down after missing heartbeats:
So
Sounds good!
Hm. Do you have the IPs for each of the servers you posted? That would make it easier to read the logs and understand which server was the leader at each point and what their view of the cluster state was. Client logs may also be useful, especially if you have them.

My best guess from the logs pasted is that, because the network issues were intermittent, they caused the worst possible conditions for the cluster: nodes would be lost, and the servers could not maintain a stable connection to a quorum long enough to reschedule the lost allocations. Do you have any idea of the scope of this network issue? Did Amazon post an update? Do you have any other services to correlate errors against?

It's definitely possible that Nomad didn't behave optimally, but I'm afraid I can't determine that from the logs presented. Raising that heartbeat grace setting may avoid this issue in the future by simply not treating nodes as down so quickly.
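If more detailed client-side logs would help with a future incident, the agent's log level can be raised. A minimal client agent config sketch, assuming an HCL config file is already in use (the data_dir path and server addresses are illustrative only):

```hcl
# Client agent config sketch; log_level defaults to "INFO".
log_level = "DEBUG"
data_dir  = "/opt/nomad/data"

client {
  enabled = true
  servers = ["10.0.1.10:4647", "10.0.2.10:4647", "10.0.3.10:4647"]
}
```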
Hi. Thank you for your detailed response.
No, this is the strange thing: nothing else was affected, not our testing/staging environments running in the same AZs, and containers contacting DBs and each other didn't signal anything. No info about a disturbance from AWS. Only Nomad deciding to kill off (almost) everything.
ec2-34-243-167-122.eu-west-1.compute.amazonaws.com-consul-stdout.log
Oh, that is disturbing. Thanks for taking the time to post detailed logs! We'll try to dig in and see what happened.
I'm facing similar problems now. During an outage and leadership loss I'm getting reallocation of all services. I'm using 2 DCs in Germany + 1 in France. This is probably a network issue, but I don't think we should get full reallocations.
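As an aside for geo-distributed servers like this (higher inter-DC latency between Germany and France), newer Nomad versions expose a `raft_multiplier` setting in the server stanza that loosens Raft's leader-election timing; its availability on older versions and whether it would help here are assumptions, and the value below is illustrative only:

```hcl
# Server agent config sketch for WAN-separated servers (illustrative).
server {
  enabled          = true
  bootstrap_expect = 3

  # Scales Raft timing parameters; higher values tolerate more latency
  # and jitter at the cost of slower failure detection. Default is 1.
  raft_multiplier = 5
}
```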
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If filing a bug please include the following:
Nomad version
Nomad v0.7.1 (0b295d3)
Operating system and Environment details
"Ubuntu 16.04.2 LTS"
Issue
Running a cluster with 3 Nomad servers and 3 clients in AWS eu-west-1, one per AZ.
According to the logs, one server seems to have lost connection to the other two, and this caused the cluster to reallocate all the tasks: first everything on one worker node, and then everything on the other two worker nodes 5 minutes later, causing some downtime for allocations that were placed only on those two nodes.
No nodes were terminated; it just seems like it was a network hiccup.
Reproduction steps
Nope
Nomad Server logs (if appropriate)
server1.eu-west-1.compute.amazonaws.com.nomad_logs.txt
server3.eu-west-1.compute.amazonaws.com.nomad_logs.txt
server2.eu-west-1.compute.amazonaws.com.nomad_logs.txt
Nomad Client logs (if appropriate)
Nothing interesting, just lots of:
client.gc: marking allocation c529a7e1-e5e9-2d6c-20de-405e9f10ce6a for GC