Nomad shows node as "ready" even though it is down (and nomad knows it's down) #3458
@dansteen Could you provide all your server logs? Would you mind running a garbage collection first?
Hi @dadgar, thanks for the response! I ran the GC and gathered some logs. We have 3 Nomad servers in our cluster, so I have included logs from each one. One other thing I noticed is that we seem to have an extra server listed in one of the server listings compared to the output of the other. I don't know where that came from, or what box it is, since it doesn't seem to actually exist at this point. Could that be related to the problem?
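It is not clear from the thread which two listings were being compared, but one plausible pair is the gossip view versus the Raft peer set; a minimal sketch of that comparison:

```sh
# Gossip view of the server set, as reported by Serf:
nomad server-members

# Raft peer set as the servers see it (if the operator subcommand is
# available in your Nomad version); a stale entry here that is missing
# from server-members would explain an "extra" server.
nomad operator raft list-peers
```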
@dansteen Ah, I should have asked you to bump the log level to debug first. Would you mind doing that, letting it run like that for ~20 minutes, running a GC, and then sending those logs? You can bump the log level in the config and just SIGHUP the servers.
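A minimal sketch of that workflow, assuming the config lives at `/etc/nomad.d/server.hcl` and the HTTP API is on `localhost:4646` (both are placeholders, not taken from the thread):

```sh
# Bump the agent log level to DEBUG in the server config:
cat >> /etc/nomad.d/server.hcl <<'EOF'
log_level = "DEBUG"
EOF

# Per the comment above, a SIGHUP makes the running servers pick up the change:
kill -HUP "$(pidof nomad)"

# Force a garbage collection before collecting the logs:
curl -X PUT http://localhost:4646/v1/system/gc
```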
I would also do a …
@dadgar sorry about that, I should have known. Anyway, here are the fancy new logs:
@dansteen Can you show me the heartbeat metric? You can get it either from your metrics sink or, if you aren't exporting metrics, by sending SIGUSR1 to the leader node, which will dump the metrics to STDERR.
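A sketch of that SIGUSR1 dump, assuming the agent runs under systemd as a unit named `nomad` (an assumption, not stated in the thread):

```sh
# Ask the leader's Nomad process to dump its in-memory metrics to STDERR:
kill -USR1 "$(pidof nomad)"

# With systemd, the agent's STDERR lands in the journal, so the dump can be read back:
journalctl -u nomad --since "1 minute ago" | grep -i heartbeat
```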
Unfortunately (or fortunately?) it looks like Nomad finally figured out that those nodes were down. Here is the output of curling the nodes, and here are graphs of the heartbeat metric. Interestingly, it looks like Nomad finally figured out that a whole bunch of boxes were down (according to the graphs above, that may have happened around 20:30, if I'm understanding the information correctly). Either way, here is the output at the time.
And here is the output just now, ~24 hours later:
All those boxes that showed up as ready are now shown as down.
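For reference, a sketch of pulling that node state from the HTTP API (the address is a placeholder and `<node-id>` is hypothetical):

```sh
# List all registered nodes and their status ("ready", "down", etc.):
curl -s http://localhost:4646/v1/nodes | python -m json.tool

# Drill into a single node for its full fingerprint and status:
curl -s http://localhost:4646/v1/node/<node-id> | python -m json.tool
```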
@dansteen Ah! Your leadership flapping explains this issue. You are losing leadership roughly every minute:
Only the leader will mark a node as down based on its heartbeat, and when there is a leadership transition, a 5 minute grace period is applied before marking any node as down. This is done to ensure that all the clients in the cluster learn about the new leader so they don't get marked as down needlessly. Since no leader held leadership for the full 5 minutes plus the heartbeat timeout, those nodes never got marked as down. Based on the metrics you showed, as soon as a leader stayed up for a longer stretch, the nodes were marked as down.

Looking at your logs, you are seeing contact times of around ~750ms between servers. You need to reduce that significantly: https://www.nomadproject.io/guides/cluster/requirements.html#network-topology

I am going to close this issue now that we know what is happening. I will be looking into potentially changing the grace period based on cluster size, since 5m may be a bit too conservative on smaller clusters.
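A quick sketch of how this kind of flapping can be spotted, assuming the standard leadership log lines and a systemd unit named `nomad` (both assumptions):

```sh
# Poll which server currently holds leadership; the answer changing every
# minute or two between polls is the flapping described above.
watch -n 10 'curl -s http://localhost:4646/v1/status/leader'

# Leadership transitions also show up in the server logs:
journalctl -u nomad | grep -E 'cluster leadership (acquired|lost)'
```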
Nice @dadgar! I can see how that would be an issue. However, I'm not sure we are quite out of the woods yet. Where in the logs do you see that we have >750ms? From what I can see we are sub 1ms:
As a note, these servers are all in different Availability Zones in the same region in AWS.
@dansteen Hmm, AZs usually have good network latency between them, so it may have been an ephemeral issue. You can see the relevant entries by grepping the logs for them. You should keep monitoring that metric going forward.
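For ongoing monitoring, a sketch of exporting the server metrics to a sink via the agent's `telemetry` block (the statsd address and config path are placeholders):

```sh
# Ship Nomad agent metrics to a local statsd agent so heartbeat and RPC
# timings can be graphed instead of dumped via SIGUSR1.
cat >> /etc/nomad.d/server.hcl <<'EOF'
telemetry {
  statsd_address       = "127.0.0.1:8125"
  publish_node_metrics = true
}
EOF
```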
Thanks @dadgar, here is a graph of that metric. I'm not sure I believe that we had latency of 15 seconds between the two boxes (our graphs go back 30 days, and it seems to have been an issue all that time), but it certainly illustrates your point and matches the timeframe where the issues stopped. If we see this again I'll start debugging from there. Thanks for all your help with this.
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Nomad version
Nomad v0.6.3
Operating system and Environment details
Debian
Issue
I have a number of Nomad client boxes that were shut down. However, Nomad still shows each node as being in the "ready" state when I do a `nomad node-status`. When I list the allocations on the node, they show as `lost` with the message `alloc is lost since its node is down`. Clearly Nomad knows the node is down, but it is not registering it as such. It is also trying to create new allocations on those nodes, which causes all sorts of problems (I have put them in `drain` state so that things can keep working).

Note that this is very similar to #3072. However, in that case we tracked the issue to node data being too large. That does not seem to be the cause here.
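A sketch of the drain workaround mentioned above, using the pre-0.8 `node-drain` command form (`<node-id>` is a placeholder):

```sh
# Mark the misbehaving node as draining so the scheduler stops placing
# new allocations on it:
nomad node-drain -enable <node-id>

# Confirm the node now shows Drain = true in its detailed status:
nomad node-status <node-id>
```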
Here are the details:
When I do a `nomad node-status`, the output shows a `status` of `ready`, but under allocations it shows `lost`. When I do an `alloc-status` on that allocation, it tells us `alloc is lost since its node is down`.
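For reference, a sketch of the two checks described above (`<alloc-id>` is a placeholder):

```sh
# Node summary: the Status column here is what was still reporting "ready".
nomad node-status

# Allocation detail: the description/events here are where the
# "alloc is lost since its node is down" message appears.
nomad alloc-status <alloc-id>
```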
Thanks!