Nomad shows node as "ready" even though it is down (and nomad knows it's down) #3458

Closed
dansteen opened this issue Oct 27, 2017 · 12 comments

dansteen commented Oct 27, 2017

Nomad version

Nomad v0.6.3

Operating system and Environment details

debian

Issue

I have a number of Nomad client boxes that were shut down. Nomad still shows those nodes as being in the "ready" state when I run nomad node-status. However, when I list the allocations on one of those nodes, they show as lost, with the message "alloc is lost since its node is down". Clearly Nomad knows the nodes are down, but it is not registering them as such. It is also trying to place new allocations on those nodes, which causes all sorts of problems (I have put them in drain mode so that things keep working).

Note that this is very similar to #3072. However, in that case we tracked the issue down to the node data being too large. That does not seem to be the cause here.

Here are the details:

When I do a nomad node-status I get:

error fetching node stats (HINT: ensure Client.Advertise.HTTP is set): Get https://internal-cluster-0df.stag.awse.traitify.com:4646/v1/client/stats?region=global: dial tcp 10.1.200.206:4646: i/o timeout
ID      = d6988f10
Name    = internal-cluster-0df.stag.awse
Class   = <none>
DC      = awse
Drain   = true
Status  = ready
Drivers = exec

Allocated Resources
CPU         Memory       Disk        IOPS
0/2500 MHz  0 B/3.7 GiB  0 B/16 GiB  0/0

Allocation Resource Utilization
CPU         Memory
0/2500 MHz  0 B/3.7 GiB

error fetching node stats (HINT: ensure Client.Advertise.HTTP is set): actual resource usage not present

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created At
4b585ab6  d6988f10  app         24       stop     lost      10/25/17 14:48:28 EDT

Notice it shows a status of ready, but under allocations it shows lost. When I do an alloc-status on that allocation I get:

ID                  = 4b585ab6
Eval ID             = a52b271b
Name                = gambit-stag.app[1]
Node ID             = d6988f10
Job ID              = gambit-stag
Job Version         = 24
Client Status       = lost
Client Description  = <none>
Desired Status      = stop
Desired Description = alloc is lost since its node is down
Created At          = 10/25/17 14:48:28 EDT
Deployment ID       = 3b1dc798
Deployment Health   = healthy

Couldn't retrieve stats (HINT: ensure Client.Advertise.HTTP is set): Get https://internal-cluster-0df.stag.awse.traitify.com:4646/v1/client/allocation/4b585ab6-f984-c164-a852-2fd037b719c9/stats?region=global: dial tcp 10.1.200.206:4646: i/o timeout
...

Notice that it tells us "alloc is lost since its node is down".

Thanks!

dadgar commented Oct 27, 2017

@dansteen Could you provide all your server logs? Would you mind doing curl -XPUT http://127.0.0.1:4646/v1/system/gc before collecting the logs?
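
For reference, a minimal sketch of those two steps (the systemd unit name and output filename are assumptions; adjust for how your servers log):

# Force a server-side garbage collection so stale state is cleaned up first:
curl -XPUT http://127.0.0.1:4646/v1/system/gc
# Then capture each server's logs (assumes Nomad runs under systemd as "nomad"):
journalctl -u nomad > nomad-server.log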

dansteen commented Oct 27, 2017

Hi @dadgar thanks for the response!

I ran the gc and gathered some logs. We have 3 nomad servers in our cluster, so I have included logs from each one.

One other thing that I noticed is that we seem to have an extra server listed in nomad operator raft list-peers:

Node              ID                 Address            State     Voter
nomad-0ea.global  10.1.99.123:4647   10.1.99.123:4647   follower  true
nomad-02f.global  10.1.200.13:4647   10.1.200.13:4647   follower  true
nomad-0dc.global  10.1.200.16:4647   10.1.200.16:4647   leader    true
(unknown)         10.1.200.164:4647  10.1.200.164:4647  follower  true

compared to the output of nomad server-members:

Name              Address      Port  Status  Leader  Protocol  Build  Datacenter  Region
nomad-02f.global  10.1.200.13  4648  alive   false   2         0.6.3  awse        global
nomad-0dc.global  10.1.200.16  4648  alive   true    2         0.6.3  awse        global
nomad-0ea.global  10.1.99.123  4648  alive   false   2         0.6.3  awse        global

I don't know where that came from, or what box it is since it doesn't seem to actually exist at this point. Could that be related to the problem?

nomad1.log
nomad2.log
nomad3.log

dadgar commented Oct 27, 2017

@dansteen Ah, I should have asked you to raise the log level to debug first. Would you mind doing that, letting it run for ~20 minutes, running the GC again, and then sending those logs? You can bump the log level in the config and just SIGHUP the servers.
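
A minimal sketch of that change (the config path and the use of systemd are assumptions; adjust to your setup):

# In each server's config file (e.g. /etc/nomad.d/nomad.hcl), set:
#   log_level = "DEBUG"
# Then signal the running agents to reload their configuration:
sudo pkill -HUP nomad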

dadgar commented Oct 27, 2017

I would also run nomad operator raft remove-peer -peer-address=10.1.200.164:4647 to remove that down server from the Raft configuration.

dansteen commented Oct 27, 2017

@dadgar sorry about that, I should have known. Anyway, here are the fancy new logs:
nomad3.log
nomad2.log
nomad1.log

dadgar commented Oct 27, 2017

@dansteen Can you show me the metric nomad.heartbeat.active and the output of curl 127.0.0.1:4646/v1/nodes?

You can get the heartbeat metric from your metrics sink, or, if you aren't exporting metrics, send SIGUSR1 to the leader node and it will dump the metrics to stderr.
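
A rough sketch of those two checks (the jq filter is just for readability and is my own addition, not required):

# On the leader, dump the agent's in-memory metrics to stderr:
sudo kill -USR1 $(pidof nomad)
# List the nodes as the servers see them:
curl -s 127.0.0.1:4646/v1/nodes | jq '.[] | {ID, Name, Status}'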

dansteen commented Oct 29, 2017

Unfortunately (or fortunately?) it looks like nomad finally figured out that those nodes were down. Here is the output of curling the nodes:

curl_output.json.txt

Here are graphs of the nomad.heartbeat.active metric. I hadn't set disable_hostname = true yet, so the metric is divided across my three boxes. That's annoying, but it brings to light that my active Nomad server seems to be changing back and forth fairly frequently. Is that normal?

[screenshot: nomad.heartbeat.active graphs, 2017-10-28 11:01 PM]

Interestingly, it looks like Nomad finally figured out that a whole bunch of boxes were down (according to the graphs above, that may have happened around 20:30, if I'm understanding the information correctly). Either way, here is the output of nomad node-status late on Friday:

ID        DC    Name                            Class   Drain  Status
624984b6  awse  internal-cluster-0ba.stag.awse  <none>  false  ready
7afd5aec  awse  internal-cluster-035.stag.awse  <none>  false  ready
1171c72f  awse  woofound-test-kitchen           <none>  false  initializing
add73f90  awse  woofound-test-kitchen           <none>  false  initializing
2e753b69  awse  woofound-test-kitchen           <none>  false  ready
9cf10e88  awse  woofound-test-kitchen           <none>  false  ready
a424bb84  awse  woofound-test-kitchen           <none>  false  initializing
5dd8cfc9  awse  woofound-test-kitchen           <none>  false  ready
567a8ee7  awse  api-cluster-test-kitchen        <none>  false  ready
d6988f10  awse  internal-cluster-0df.stag.awse  <none>  true   ready
2f908c13  awse  internal-cluster-034.stag.awse  <none>  true   ready
e964e768  awse  utility-debian-jessie-ec2       <none>  false  ready
eeb98acf  awse  images-test-kitchen             <none>  false  ready
16913d8d  awse  admin-interface-test-kitchen    <none>  false  ready
310718d4  awse  deck-validator-test-kitchen     <none>  false  initializing
f838c291  awse  billing-test-kitchen            <none>  false  ready
8fdd3a35  awse  woofound-test-kitchen           <none>  false  ready
58034cd5  awse  api-test-kitchen                <none>  false  ready
643f7262  awse  orgs-test-kitchen               <none>  false  ready
b7aa8f0d  awse  fileserve-test-kitchen          <none>  false  initializing
f5e333f4  awse  api-cluster-test-kitchen        <none>  false  ready
27c5a00b  awse  utility-debian-jessie-ec2       <none>  false  ready
a1e121e1  awse  images-test-kitchen             <none>  false  down
0dab9782  awse  deck-validator-test-kitchen     <none>  false  ready
c27dfdd0  awse  admin-interface-test-kitchen    <none>  false  ready
d66284a6  awse  billing-test-kitchen            <none>  false  ready
9e52273b  awse  api-test-kitchen                <none>  false  ready
1bc9eee6  awse  orgs-test-kitchen               <none>  false  ready
f14a9929  awse  woofound-test-kitchen           <none>  false  ready
05072c00  awse  woofound-008.stag.awse          <none>  false  ready
a3eaf6ef  awse  api-cluster-046.stag.awse       <none>  false  ready
df0e8cf3  awse  api-cluster-051.stag.awse       <none>  false  ready
a778eec8  awse  api-cluster-087.stag.awse       <none>  false  ready
31ff8691  awse  api-cluster-0a4.stag.awse       <none>  false  ready
9c7726f2  awse  api-051.stag.awse               <none>  false  ready
1fe66cb1  awse  api-047.stag.awse               <none>  false  ready
392d0b8f  awse  deck-validator-085.stag.awse    <none>  false  ready
1007da5a  awse  deck-validator-0d1.stag.awse    <none>  false  ready
27aa5e74  awse  woofound-0e5.stag.awse          <none>  false  ready
356871a4  awse  fileserve-09a.stag.awse         <none>  false  ready
3fc8b7eb  awse  fileserve-044.stag.awse         <none>  false  ready
0c01f3c6  awse  core-0fa.stag.awse              <none>  false  ready
7e7d2d78  awse  orgs-0fc.stag.awse              <none>  false  ready
5cd6aca6  awse  billing-087.stag.awse           <none>  false  ready
48a53867  awse  images-02d.stag.awse            <none>  false  ready
8b9c4d87  awse  orgs-0c8.stag.awse              <none>  false  ready
c1c19ed2  awse  images-0d1.stag.awse            <none>  false  ready
5ad2de90  awse  billing-0f2.stag.awse           <none>  false  ready
421e8917  awse  utility-02f.stag.awse           <none>  false  ready
d295309f  awse  utility-06c.stag.awse           <none>  false  ready
0c913dd4  awse  admin-interface-005.stag.awse   <none>  false  ready
49d1ec09  awse  admin-interface-055.stag.awse   <none>  false  ready
46ece8ff  awse  utility-004.stag.awse           <none>  false  ready
ec253e5b  awse  core-0af.stag.awse              <none>  false  ready

And here is the output just now ~24 hours later:

ID        DC    Name                            Class   Drain  Status
624984b6  awse  internal-cluster-0ba.stag.awse  <none>  false  ready
7afd5aec  awse  internal-cluster-035.stag.awse  <none>  false  ready
1171c72f  awse  woofound-test-kitchen           <none>  false  down
add73f90  awse  woofound-test-kitchen           <none>  false  down
2e753b69  awse  woofound-test-kitchen           <none>  false  down
9cf10e88  awse  woofound-test-kitchen           <none>  false  down
a424bb84  awse  woofound-test-kitchen           <none>  false  down
5dd8cfc9  awse  woofound-test-kitchen           <none>  false  ready
567a8ee7  awse  api-cluster-test-kitchen        <none>  false  down
d6988f10  awse  internal-cluster-0df.stag.awse  <none>  true   down
2f908c13  awse  internal-cluster-034.stag.awse  <none>  true   down
e964e768  awse  utility-debian-jessie-ec2       <none>  false  down
eeb98acf  awse  images-test-kitchen             <none>  false  down
16913d8d  awse  admin-interface-test-kitchen    <none>  false  down
310718d4  awse  deck-validator-test-kitchen     <none>  false  down
f838c291  awse  billing-test-kitchen            <none>  false  down
8fdd3a35  awse  woofound-test-kitchen           <none>  false  down
58034cd5  awse  api-test-kitchen                <none>  false  down
643f7262  awse  orgs-test-kitchen               <none>  false  down
b7aa8f0d  awse  fileserve-test-kitchen          <none>  false  down
f5e333f4  awse  api-cluster-test-kitchen        <none>  false  down
27c5a00b  awse  utility-debian-jessie-ec2       <none>  false  down
0dab9782  awse  deck-validator-test-kitchen     <none>  false  down
c27dfdd0  awse  admin-interface-test-kitchen    <none>  false  down
d66284a6  awse  billing-test-kitchen            <none>  false  down
9e52273b  awse  api-test-kitchen                <none>  false  down
1bc9eee6  awse  orgs-test-kitchen               <none>  false  down
f14a9929  awse  woofound-test-kitchen           <none>  false  down
05072c00  awse  woofound-008.stag.awse          <none>  false  ready
a3eaf6ef  awse  api-cluster-046.stag.awse       <none>  false  ready
df0e8cf3  awse  api-cluster-051.stag.awse       <none>  false  ready
a778eec8  awse  api-cluster-087.stag.awse       <none>  false  ready
31ff8691  awse  api-cluster-0a4.stag.awse       <none>  false  ready
9c7726f2  awse  api-051.stag.awse               <none>  false  ready
1fe66cb1  awse  api-047.stag.awse               <none>  false  ready
392d0b8f  awse  deck-validator-085.stag.awse    <none>  false  ready
1007da5a  awse  deck-validator-0d1.stag.awse    <none>  false  ready
27aa5e74  awse  woofound-0e5.stag.awse          <none>  false  ready
356871a4  awse  fileserve-09a.stag.awse         <none>  false  ready
3fc8b7eb  awse  fileserve-044.stag.awse         <none>  false  ready
0c01f3c6  awse  core-0fa.stag.awse              <none>  false  ready
7e7d2d78  awse  orgs-0fc.stag.awse              <none>  false  ready
5cd6aca6  awse  billing-087.stag.awse           <none>  false  ready
48a53867  awse  images-02d.stag.awse            <none>  false  ready
8b9c4d87  awse  orgs-0c8.stag.awse              <none>  false  ready
c1c19ed2  awse  images-0d1.stag.awse            <none>  false  ready
5ad2de90  awse  billing-0f2.stag.awse           <none>  false  ready
421e8917  awse  utility-02f.stag.awse           <none>  false  ready
d295309f  awse  utility-06c.stag.awse           <none>  false  ready
0c913dd4  awse  admin-interface-005.stag.awse   <none>  false  ready
49d1ec09  awse  admin-interface-055.stag.awse   <none>  false  ready
46ece8ff  awse  utility-004.stag.awse           <none>  false  ready
ec253e5b  awse  core-0af.stag.awse              <none>  false  ready

All the boxes that show up as down in the more recent output have been down for days, far more than 24 hours, even the ones that showed up as initializing or ready in the older output.

dadgar commented Oct 30, 2017

@dansteen Ah! Your leadership flapping explains this issue. You are losing leadership roughly every minute:

7772:Oct 27 17:28:29 nomad-02f nomad[24645]:     2017/10/27 17:28:29.077011 [INFO] nomad: cluster leadership lost
8131:Oct 27 17:29:34 nomad-02f nomad[24645]:     2017/10/27 17:29:34.796362 [INFO] nomad: cluster leadership lost
8483:Oct 27 17:31:08 nomad-02f nomad[24645]:     2017/10/27 17:31:08.749761 [INFO] nomad: cluster leadership lost
8795:Oct 27 17:32:12 nomad-02f nomad[24645]:     2017/10/27 17:32:12.135644 [INFO] nomad: cluster leadership lost
9783:Oct 27 17:36:50 nomad-02f nomad[24645]:     2017/10/27 17:36:50.811047 [INFO] nomad: cluster leadership lost
10149:Oct 27 17:37:55 nomad-02f nomad[24645]:     2017/10/27 17:37:55.877377 [INFO] nomad: cluster leadership lost

Only the leader marks a node as down based on its heartbeat, and when there is a leadership transition, a 5-minute grace period is applied before any node is marked down. This is done to ensure that all the clients in the cluster learn about the new leader so they don't get marked as down needlessly.

Since leadership never lasted the 5 minutes plus the heartbeat timeout, those nodes didn't get marked as down. Based on the metrics you showed, as soon as there was a leader for a longer stretch of time, the nodes were marked as down.

Looking at your logs, you are seeing contact times of around 750ms between servers. You need to reduce that significantly: https://www.nomadproject.io/guides/cluster/requirements.html#network-topology

I am going to close this issue now that we know what is happening. I will be looking into potentially changing the grace period based on cluster size since 5m may be a bit too conservative on smaller clusters.

dadgar closed this as completed Oct 30, 2017

dansteen commented Oct 30, 2017

Nice @dadgar! I can see how that would be an issue. However, I'm not sure we are quite out of the woods yet. Where in the logs do you see that we have >750ms? From what I can see we are sub-1ms:

64 bytes from nomad-0dc.stag.awse.node.awse.consul (10.1.200.16): icmp_seq=1 ttl=64 time=0.503 ms
64 bytes from nomad-0dc.stag.awse.node.awse.consul (10.1.200.16): icmp_seq=2 ttl=64 time=0.490 ms
64 bytes from nomad-0dc.stag.awse.node.awse.consul (10.1.200.16): icmp_seq=3 ttl=64 time=0.443 ms
64 bytes from nomad-0dc.stag.awse.node.awse.consul (10.1.200.16): icmp_seq=4 ttl=64 time=0.499 ms
64 bytes from nomad-0dc.stag.awse.node.awse.consul (10.1.200.16): icmp_seq=5 ttl=64 time=0.502 ms
64 bytes from nomad-0dc.stag.awse.node.awse.consul (10.1.200.16): icmp_seq=6 ttl=64 time=0.456 ms
64 bytes from nomad-0dc.stag.awse.node.awse.consul (10.1.200.16): icmp_seq=7 ttl=64 time=0.503 ms
64 bytes from nomad-0dc.stag.awse.node.awse.consul (10.1.200.16): icmp_seq=8 ttl=64 time=0.431 ms

As a note, these servers are all in different Availability Zones within the same AWS region.

dadgar commented Oct 30, 2017

@dansteen Hmm, usually AZs have good network latency between them. It may have been an ephemeral issue. You can see the results of grepping for contact.*in here: https://gist.github.com/dadgar/623840e44ed09c741264a7bbad391310
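
For reference, that grep was presumably something along the lines of (the filenames are from the logs attached above):

grep -n "contact.*in" nomad*.log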

You should monitor the metric nomad.raft.leader.lastContact. For more telemetry details check out: https://www.nomadproject.io/docs/agent/telemetry.html
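
A minimal telemetry stanza sketch for shipping these metrics to a sink, assuming a local statsd agent (the address is a placeholder); disable_hostname keeps each metric from being split per server, as noted earlier in the thread:

telemetry {
  statsd_address      = "127.0.0.1:8125"
  disable_hostname    = true
  collection_interval = "1s"
}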

dansteen commented Oct 30, 2017

Thanks @dadgar, here is a graph of nomad.raft.leader.lastContact:

[screenshot: nomad.raft.leader.lastContact graph, 2017-10-30 4:57 PM]

I'm not sure I believe that we had latency of 15 seconds between the two boxes (our graphs go back 30 days, and it seems to have been an issue all that time), but it certainly illustrates your point and matches the timeframe where the issues stopped. If we see this again I'll start debugging from there. Thanks for all your help with this.

github-actions bot commented Dec 6, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Dec 6, 2022