Reserved port ranges can result in nodes being many MBs #3072
Comments
Hey @dansteen, do you have reproduction steps? Did you do anything of note? I have never seen this before, and that code hasn't changed in a very long time. If you have server logs in debug mode, could you share them (grep for the "TTL expired" message for that node: https://github.com/hashicorp/nomad/blob/master/nomad/heartbeat.go#L92)?
Hi @dadgar! It happened for any box that was removed for a while. Then I did a full shutdown and restart of the nomad server cluster, and it is no longer happening for new removals (though it is still an issue for boxes that were removed prior to the restart). I have put my cluster into DEBUG mode, and it is definitely throwing those error lines. Here is the one for the node mentioned above:
I don't see a whole lot of additional information around that, though. Is there something specific that you would like to see? I've even tried to completely stop the nomad job that is running on that node (that would be bad in production, but we are not running nomad in production yet); it registers the stop request and changes the "desired" state to `stop`.

As I mentioned, this is our testing environment, so I can do any sort of destructive testing you think would be helpful. Obviously, I could stop all jobs, clear the data folder, and start nomad up again, but that would not really resolve the underlying issue, and I would not have that option if this happens again once we move to production. Thanks!
@dansteen So even when you see the message that the TTL expired, if you do a `nomad node-status` the node still shows as `ready`?
@dadgar yes correct.
Ok so the things that happened were:
Can you reproduce this, or has it only happened once?
Mostly. More holistically, from the very start of having issues it was:
Here is the log from step 5 above. There were many thousands of lines like this:
Note that I am still seeing log messages like this fairly frequently:
But I don't see the ERR messages anymore - although those went away after I updated consul and vault, and before I did a full restart of nomad. Sorry about not providing the full context earlier.
Thanks for the additional detail. It may have been that the nodes went down during the leader election and then nothing handled them!
Seems very likely, considering that leader elections were going on every couple of seconds. Is there any way to force nodes to the "down" state? In the meantime, I have put those nodes into the node-drain state so that my allocations don't fail, but I'd like to get rid of them if I can.
@dansteen There isn't really an easy way (no API). Can you also give me the node information for one of those nodes?

On leader election, all timers for healthy nodes are reset; if the heartbeat doesn't occur by the time the timer fires, the node is transitioned to down: https://github.com/hashicorp/nomad/blob/master/nomad/heartbeat.go#L15-L106
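For reference (these are not settings anyone changed in this thread), the server-side knobs around this heartbeat/TTL behaviour live in the agent's `server` stanza. A minimal sketch using the documented defaults:

```hcl
server {
  enabled = true

  # Extra grace allowed beyond a node's heartbeat TTL (to absorb network
  # delays and clock skew) before the node is transitioned to "down".
  heartbeat_grace = "10s"

  # Lower bound on the heartbeat TTL handed out to client nodes.
  min_heartbeat_ttl = "10s"

  # Target cap on heartbeats processed per second; with many clients the
  # handed-out TTLs are stretched to stay under this rate.
  max_heartbeats_per_second = 50.0
}
```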
@dadgar here is the node information:
@dadgar Ok, interesting bit of additional information. In some cases I am using reserved ports; however, I build my configs using chef, so by way of standardization I included the following in my nomad client config even in cases where I did not have any specific port reservations:
(I realize this is just allowing the default set of ports that nomad was allowing anyway, but it was just by way of standardization, and I figured it wouldn't make any difference.) While generating the node information for you in the response above, I realized that this actually resulted in ~53000 entries in the node's reserved-port data.

Once I removed that from the config, the node data size went back down, and my cluster cleaned itself up on its own - no more zombie nodes!

One final interesting thing is that it was the "live" nodes that had 6M of node data. The dead nodes that wouldn't go away all had the usual 5k (because they had been shut down as part of the same update that added all those reserved ports).

Anyway, this was an interesting lesson in how this all works, but my sense is that, given the way the syntax for port reservations works, this is the sort of thing other people will run into, and maybe some sort of warning might be in order? Thanks for your help!
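For context, the client stanza being described above has roughly the following shape. The exact ranges from the original config aren't preserved in this copy of the thread, so the values below are purely illustrative; the point is that a string of port ranges like this gets expanded into one entry per port in the node data:

```hcl
client {
  enabled = true

  reserved {
    # Illustrative ranges only. Every port covered by these ranges ended up
    # as its own entry in the node's reserved-port data, which is what
    # inflated the node payload to several MB.
    reserved_ports = "1-19999,32001-65535"
  }
}
```

Per the comments below, this expansion behaviour should be fixed as of Nomad 0.9.0.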
@dansteen Nice debugging. I am going to retitle the issue to track that problem. We need a way to be able to just pass ranges rather than actually creating an object per port.
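As a rough sanity check using the numbers above: ~6 MB of node data spread across ~53,000 reserved-port entries works out to a little over 100 bytes per serialized entry, so any reservation covering tens of thousands of individual ports pushes node payloads into the multi-megabyte range, whereas representing the ranges themselves would take only a handful of entries.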
@dadgar are there any updates on this issue?
@evandam This should be fixed in 0.9.0. Are you facing this issue in a recent release?
Gotcha, thanks! I'm not seeing it, but I saw this issue was still open and wanted to verify before running it in production 😄
@evandam Awesome! Thanks for the update. I am going to close this issue then; I think we just missed it when the linked PR got merged!
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Nomad version
Nomad v0.6.0
Operating system and Environment details
Archlinux
Issue
I have a number of nomad client boxes that were shut down. However, nomad still shows the node as being in the "ready" state when I do a `nomad node-status`. When I list the allocations on the node, some show as being `lost` and others show as being `pending`. When I do an `alloc-status` of one of the allocations in the `lost` state, it tells me `alloc is lost since its node is down`. However, when I do an `alloc-status` of one of the allocations in the `pending` state, it thinks the node is up.

Here are the details:

Notice that it thinks the node's `status` is `ready`. However, when I do an `alloc-status` on the allocation in the `lost` state, it knows the node is down:

But when I do an `alloc-status` on an allocation in the `pending` state, it still thinks the box is up:

Thanks!