Nodes won't garbage collect #3658
Would you be able to test Nomad 0.7.1-rc1 that was just released Monday? I believe #3445 fixes your issue.
0.7.1 doesn't seem to have helped @schmichael; I'm still in the same situation.
1: Most of them are dead:
2: Here's the logs from the 0.7.1-rc1 box:
3:
That job:
I managed to run
Then the allocs still can't GC:
We ended up bringing down all masters, wiping all raft/client.db, and starting from scratch. All GC'ing was broken across all nodes, and it looks like it has been broken since the 0.5 -> 0.6 upgrade.
@apenney Unfortunately the logs you gave are client logs, not server logs, so there really isn't enough information to track this down. I am going to close this until we get more information or you can provide repro steps. If it does happen again, following the steps I outlined in my response should provide the information we need.
Not sure if I can reopen this, but here are the server logs during a GC, as this issue is now recurring. Server3 is the master in the paste below.

Symptoms, just to recap: nodes fill up with allocs and can't GC. I have over ~3000 jobs when I should have about 20; most of those are "dead" batch jobs that never go away. On the servers I can see Nomad constantly trying to allocate and failing to get Vault tokens for jobs that are dead and shouldn't be doing anything.

Server1:
Server2:
Server3:
I only have debug enabled on the first two, sorry. That probably makes it easier to read, at least.
@dadgar direct ping since you closed it; you may be able to reopen it :)
I have a similar (though not identical) issue with nodes not collecting garbage correctly. I am using Nomad 0.7.1 / Consul 1.0.2 / CentOS 7.4+ / Docker CE.

My issue is that /var/lib/docker bloats up over time (I have image cleanup purposely set to 'false'). This fills up the disk, and Nomad reports the machine as "ready", but no allocations go to the machine because its disk capacity is exhausted. I don't have the logs with me right now, but it keeps emitting a message along the following lines:

So I have a node which is basically unusable but not indicated as such by Nomad. I know that /var/lib/docker bloating up is not really Nomad's problem, but I just wanted to put this out there.

Regards,
@apenney Thanks for the additional logs! In the future, please attach them as text files or link a gist to make browsing this issue easier; we'll almost always want to download them to search/grep/etc.

@shantanugadgil As far as we can tell right now, @apenney's issue is related to the servers GCing dead jobs. Your issue is specific to clients. Ensure you have not set
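For context, here is a minimal sketch of the client-side GC settings that comment is referring to in general terms, assuming the options documented around Nomad 0.6/0.7. The values shown are the documented defaults, included only for illustration; they are not anyone's actual configuration from this thread:

```hcl
client {
  enabled = true

  # Client-side garbage collection tuning (defaults as documented for this era).
  gc_interval              = "1m" # how often the client scans for terminal allocs to clean up
  gc_disk_usage_threshold  = 80   # disk usage (%) above which the client GCs allocs more aggressively
  gc_inode_usage_threshold = 70   # inode usage (%) threshold with the same effect
  gc_max_allocs            = 50   # max allocations kept on the node before GC is forced
  gc_parallel_destroys     = 2    # allocations destroyed concurrently during GC
}
```

Note that these only control how a client cleans up its own local allocation directories; they have no effect on the server-side job/eval/alloc GC that the rest of this thread is about.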
@apenney Hey, I am sorry, but I just can't reproduce this. I will reopen when there are clear reproduction steps. The reproduction steps will have to include the configuration and how to get to this state from a fresh cluster. I have even tested by creating a periodic job that fails in the same fashion as your attached allocations, but as soon as I hit the GC endpoint, the jobs and allocations get removed.
@apenney And if you do post logs, can you please post the Nomad server logs and not the client's logs. After you do the

This should be logged if you are at DEBUG log level, regardless of whether anything gets garbage collected. So if you don't see that, something is odd with your setup.
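For anyone following along, a minimal sketch of the agent configuration needed to see those server-side GC lines at all, assuming a standard single-file agent config; nothing here is specific to the reporter's setup:

```hcl
# Server agent config: GC activity is only logged at DEBUG level.
log_level = "DEBUG"

server {
  enabled          = true
  bootstrap_expect = 3 # illustrative three-server cluster
}
```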
@dadgar Thanks to the wonderful @jippi we've fixed this! On our server nodes we had:

Removing these two lines immediately fixed the issue, and Nomad started deleting the thousands of piled-up batch jobs. We don't really understand why, unless the GC was starved by a low num_schedulers that stopped it from ever running? I just wanted to let you know, and let the world know, in case someone else stumbles over this bug report.
@apenney Glad this was fixed for you. And you guessed right - see the docs for num_schedulers. By setting it to 2, the ability of the Nomad servers to run scheduler workers in parallel was severely reduced. Garbage collection is done as an internal scheduled job as well, so that explains why.
Had the exact same problem, running only 1 scheduler on 1 vCPU test nodes. After removing both lines mentioned above, GC started immediately. After putting both config lines back, GC still works. On 1 vCPU the default is still 1 scheduler, so it doesn't seem to be the

My test setup was updated in place / with rolling redeploys from 0.5.6 through almost every version up to 0.7.1. I haven't restarted all my nodes yet, but I'll report back here if I'm able to reproduce the issue after restarting all nodes again.
@dadgar @preetapan I was able to reproduce the problem again in my setup by performing a rolling re-deploy. Removing the

When the issue is active, server logs (in debug mode) show the
I think I found the issue: the default value for

Line 363 in b3ec943

But if a custom value is provided, the
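The option names in the last few comments were lost in the formatting above. Based on the thread context (and on the fix dadgar mentions landing in 0.8), the sketch below assumes the culprit is the server enabled_schedulers setting: its built-in default includes Nomad's internal core scheduler, which processes garbage-collection evaluations, while a hand-written list that omits it leaves no worker able to run GC. Treat this as the thread's diagnosis rather than official documentation:

```hcl
server {
  enabled        = true
  num_schedulers = 2

  # When enabled_schedulers is left unset, the default list includes the
  # internal "_core" scheduler that processes GC evaluations. A custom list
  # like this one (assumed here for illustration) omits it, so evaluations
  # created by the periodic server GC are never dequeued, and dead jobs,
  # evals, and allocations pile up.
  enabled_schedulers = ["service", "batch", "system"]

  # Possible workarounds on affected versions: remove the custom list
  # entirely, or (assuming "_core" is accepted as a value) list it explicitly:
  # enabled_schedulers = ["service", "batch", "system", "_core"]
}
```

If this reading is right, it also explains why simply removing the two custom server lines, as reported above, immediately restored GC.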
@groggemans Great find. This was definitely not intended, and we will get this fixed up for 0.8.
Nomad version
Nomad v0.7.0
Operating system and Environment details
Ubuntu 16.04
Consul 0.9.2/0.9.3
Issue
Nodes won't GC old jobs:
I see submit dates going back as far as September. I've enabled debug logging, and my client settings are:
I see zero log lines matching 'garbage', even though based on gc.go that is what I would expect to see.
Reproduction steps
This one is hard; I don't know why the GC gets skipped/ignored, so I'm not sure what to say about reproduction.
Nomad Server logs (if appropriate)
Nomad Client logs (if appropriate)
I deleted all the lines matching 'secret' and snipped out some company-name stuff. Hopefully this still shows the lack of garbage/GC activity, though.
Job file (if appropriate)