Nomad doesn't cleanup some allocs that must be garbage collected #4287
@tantra35 We did make some changes to the garbage collection logic in 0.8 to make sure that failed allocations that are not yet replaced are not GCed. To help us debug this issue, could you also share your job specification file, and the output of /v1/allocation/<allocation_id> of one of the allocs that did not get GCed? |
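A minimal sketch of how that endpoint can be queried (assuming a local agent on the default address and port; the allocation ID below is a placeholder, not one from this issue):

```sh
# Sketch: fetch the raw allocation record from a local Nomad agent.
# Substitute the full ID of one of the allocs that did not get GCed.
ALLOC_ID="00000000-0000-0000-0000-000000000000"
curl -s "http://127.0.0.1:4646/v1/allocation/${ALLOC_ID}" | python -m json.tool
```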
Here is our job file with the problem allocations:
And here is the result of the curl request:
|
It seems that in our case all allocations in the failed state are not subject to GC. |
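As an aside, a hedged sketch of how the failed allocations of a job can be listed from the HTTP API (assuming the default agent address, that `jq` is installed, and using `githubproxy-branches`, the job named in this issue, as the example):

```sh
# List all allocations of the job and keep only those whose client status is "failed".
curl -s http://127.0.0.1:4646/v1/job/githubproxy-branches/allocations \
  | jq -r '.[] | select(.ClientStatus == "failed") | "\(.ID)  \(.Name)"'
```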
@tantra35 just got back to investigating this after a week's break. I tried using a somewhat modified version of your job spec and so far I haven't been able to reproduce it. Could you also provide the exact steps you took to get Nomad into this state where the failed allocs don't GC? Please provide as much detail as possible, especially about the specific commands run before the GC. Also helpful to debug would be the output of |
@preetapan For now we have stopped the job with the failed, non-GCed allocations and launched a new one. Before launching we ran a GC to make sure everything was cleared out, then waited for the deployment to fully complete, and now the allocations look like this:
As you can see, we have 2 allocs in the failed state that have been alive for more than 3 days (here is a short output of
The eval status for them, in my opinion, shows nothing interesting:
and for the second one:
In our case we sometimes have a very unstable network and many connectivity problems between DCs. |
But we also have jobs with failed, non-GCed allocations in a more stable AWS network environment, for example:
And the verbose output:
With the following eval status of the failed allocation:
|
So we don't do anything special, just the standard job workflow. And it's strange that a manual GC doesn't help, and nothing interesting is present in the Nomad server leader logs (even though we run with DEBUG verbosity). |
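For reference, a sketch of how the manual GC is presumably being triggered (both forms below assume a reachable agent on the default address):

```sh
# Force a cluster-wide garbage collection of jobs, evaluations, allocations and nodes.
nomad system gc

# The same operation through the HTTP API.
curl -s -X PUT http://127.0.0.1:4646/v1/system/gc
```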
@preetapan Great work, of course we can try this binary. But I'm confused because in our case the job version wasn't changed.
|
@tantra35 that is expected if you ran a |
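One way to double-check whether the job version really stayed the same is the recorded version history (a sketch, using the job name from the original report as a placeholder):

```sh
# Show the recorded versions of the job; re-running an unchanged spec should not add a new one.
nomad job history githubproxy-branches
```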
@preetapan Hm, that sounds logical, but I'm quite sure that after the stop and re-launch there weren't any allocations in the failed state, only 8 healthy allocations. Also, if I understand everything correctly, I concluded that the allocation IDs hadn't changed, so I expected to find them in the first message of this issue, but they weren't there. Anyway, let's try it. |
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. |
Nomad version
Nomad v0.8.3 (c85483d)
Operating system and Environment details
Issue
For some of our jobs there are uncleaned allocations, and we can't remove them even with a forced GC.
For example, for the job
githubproxy-branches
As you can see, there are 8 allocations in the stopped state which should have been garbage collected, but they weren't (and sometimes their modify time gets updated, so below it shows only
1h49m ago
as the modify time, even though these allocations have been in the stopped state for more than 1 day and our GC interval is the default 4 hours). If we run GC manually,
absolutely nothing changes, and the stopped allocations are still present in
nomad status githubproxy-branches
. When we run nomad status for those allocations we get an error (for example for allocation d036c82e
). It is also strange that on a few Nomad agents which had stopped allocations, the allocs dir does not contain any allocations that should be GC'ed, so I conclude that GC did run on those nodes, but on the server side this was never registered.
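For completeness, a sketch of the inspection steps described above (assuming the Nomad 0.8 CLI; the allocation ID is the short ID quoted in the report):

```sh
# The job status still lists the stopped allocations that should have been GCed.
nomad status githubproxy-branches

# Asking for one of those allocations directly is what returns the error mentioned above.
nomad alloc status d036c82e
```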