Pending allocations #6461
Comments
hi @jorgemarey , there were a number of changes in 0.9.6 (released yesterday) around the client and allocations, but I don't have any good reason to believe that they will address the issue you're seeing. I'm going to look at your report and see if I can reproduce it, but in the meantime, if you are able to try out 0.9.6, I would be curious to see whether it also exhibits this behavior.
Hi @cgbaker , thanks for the reply. I'll change the nomad version and see if this happens again.
I don't see anything weird, but maybe you do. Maybe the allocation was killed but internally it wasn't removed from the GC heap, and an attempt to garbage collect it again produced a lock on the
@jorgemarey , yeah, that's sort of what I was theorizing as well. Let me know how 0.9.6 works out.
#5363 looks like the same issue, and it says it still reproduces in 0.10.0.
I too often notice this unexplained behavior where the cluster seems to be waiting for something, though I don't know what. In my experiments I have suspected the default backoff timings, which cause it to go into some sort of wait. To get over it quickly, I usually use nomad-helper with reevaluate-all. I know it is not a solution, but it is a workaround.
Hi @cgbaker . This happened to me again on v0.9.6. Here is some information about the node:
As you can see, the node is draining. I tried to move the allocations to another node, but they stayed running. I had to restart nomad to move the allocations. At the moment the first allocation entered the pending state, the logs again show:
That allocation doesn't appear in the list of allocations of the node (I guess it was garbage collected as the log says, but maybe not correctly).
Following are the other logs that I found about that allocation. When it started:
At another moment:
As you can see, there is another message here about garbage collecting this allocation, which I guess shouldn't happen. That's all I could find; I'll keep the logs anyway in case they help in the future.
Hi again,
Two goroutines were blocked while waiting for task killing to finish, thus reaching [...]. The lifecycle manager calls [...]. Inspecting the goroutines, I found 2 that started on task_runner and were blocking:
Looking at the code, I saw that when the lifecycle calls killCtxCancel it cancels the context, but the only posthook that does not use that context is logmon (there the context is ignored), leading to all the blocking explained earlier. The problem (besides the context being ignored) seems to be in the grpc library. I tracked the goroutine to this, but I don't really know if that's what's happening here. Maybe upgrading the grpc dependencies will fix this. I hope this helps to identify the problem completely.
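To make that failure mode concrete, here is a minimal Go sketch (not Nomad's actual hook interface; `stopHook`, `hungCall`, `runPostStop`, and the timings are invented for illustration) of why a post-stop hook that ignores the kill context can wedge the destroy path, while a context-aware one returns as soon as killCtxCancel fires:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// stopHook is a stand-in for a task runner post-stop hook; the real
// interface in Nomad differs, this only illustrates the failure mode.
type stopHook func(ctx context.Context) error

// hungCall stands in for the Stop call that never returns.
func hungCall() error {
	select {}
}

// blockingStop ignores the kill context entirely, like the logmon hook
// described above: if the underlying call never returns, neither does it.
func blockingStop(_ context.Context) error {
	return hungCall()
}

// contextAwareStop makes the same call but gives up once the kill
// context is cancelled, so the destroy path can make progress.
func contextAwareStop(ctx context.Context) error {
	done := make(chan error, 1)
	go func() { done <- hungCall() }()
	select {
	case err := <-done:
		return err
	case <-ctx.Done():
		return ctx.Err() // stop waiting once killCtx is cancelled
	}
}

func runPostStop(name string, hook stopHook) {
	killCtx, killCtxCancel := context.WithCancel(context.Background())
	// The lifecycle decides to kill the task: cancel the kill context.
	time.AfterFunc(100*time.Millisecond, killCtxCancel)

	errCh := make(chan error, 1)
	go func() { errCh <- hook(killCtx) }()

	select {
	case err := <-errCh:
		fmt.Printf("%s finished: %v\n", name, err)
	case <-time.After(time.Second):
		fmt.Printf("%s still blocked after 1s (destroy would hang)\n", name)
	}
}

func main() {
	runPostStop("contextAwareStop", contextAwareStop) // returns with context.Canceled
	runPostStop("blockingStop", blockingStop)         // never returns
}
```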
@jorgemarey That's some great sleuthing! I'll dig into it next week and follow up. Thank you very much!
Add an RPC timeout for logmon. In #6461 (comment), `logmonClient.Stop` locked up and indefinitely blocked the task runner destroy operation. This is an incremental improvement. We still need to follow up to understand how we got to that state, and the full impact of a locked-up Stop and its link to pending allocations on restart.
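The idea behind that fix is easy to picture. A minimal sketch (this is not the actual patch; `logmonStopper` is an invented interface standing in for the real gRPC-generated client):

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// logmonStopper is a hypothetical stand-in for the gRPC-generated logmon
// client; the real type and its Stop signature live in Nomad's codebase.
type logmonStopper interface {
	Stop(ctx context.Context) error
}

// stopWithTimeout gives the Stop call a deadline. gRPC client methods
// honour their context's deadline, so even a wedged server or connection
// can no longer block the task runner's destroy operation forever.
func stopWithTimeout(c logmonStopper, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return c.Stop(ctx)
}

// hungLogmon simulates the lock-up from this report: Stop only returns
// once the context is cancelled (here, when the deadline expires).
type hungLogmon struct{}

func (hungLogmon) Stop(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

func main() {
	err := stopWithTimeout(hungLogmon{}, 2*time.Second)
	fmt.Println(err) // "context deadline exceeded" after ~2s, instead of hanging
}
```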
Hey there! Since this issue hasn't had any activity in a while, we're going to automatically close it in 30 days. If you're still seeing this issue with the latest version of Nomad, please respond here and we'll keep this open and take another look at this. Thanks!
This issue will be auto-closed because there hasn't been any activity for a few months. Feel free to open a new one if you still experience this problem 👍
I think I hit this too now. Added
I am also seeing this in the log (searching for this error is what led me here):
There is a Vault integration on the nomad cluster, but not for this job. Attempting to trigger GC through the API does remove all the allocations, but it does not remove the job, and new attempts at running new revisions have no effect. When restarting the leader, there is a batch of logs coming from another node:
During the failing attempts there was no log output from the job, though I would expect it, so it looks like it could be related to the logging, as you mentioned @notnoop. Nomad version 0.11.3.
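For anyone else following along, triggering GC through the API as described above goes against the `/v1/system/gc` endpoint. A minimal sketch, assuming a local agent on the default HTTP port (the `nomad system gc` CLI command should be equivalent):

```go
package main

import (
	"fmt"
	"log"
	"net/http"
)

func main() {
	// Force a cluster-wide garbage collection, as mentioned above.
	// This assumes a local agent on the default HTTP address; if ACLs
	// are enabled, an X-Nomad-Token header would also be needed.
	req, err := http.NewRequest(http.MethodPut, "http://127.0.0.1:4646/v1/system/gc", nil)
	if err != nil {
		log.Fatal(err)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status) // expect 200 OK
}
```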
Nomad version
Nomad v0.9.3 (c5e8b66)
Operating system and Environment details
RHEL 7.5
Issue
At some point all allocations placed on a node remain in pending state.
Reproduction steps
I'm not able to reproduce this behaviour, but it has happened several times on different nodes in our cluster since we updated from 0.8.4 to 0.9.3.
I found out that this happens after the GC runs when allocations reach `gc_max_allocs`, but it doesn't happen every time.
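To see how close a node is to that limit while debugging, here is a rough sketch using the official Go API client (github.com/hashicorp/nomad/api). The node ID is passed as an argument, and note this counts what the servers still track for the node, which is only an approximation of the client's own bookkeeping:

```go
package main

import (
	"fmt"
	"log"
	"os"

	"github.com/hashicorp/nomad/api"
)

func main() {
	if len(os.Args) < 2 {
		log.Fatal("usage: allocs-per-node <node-id>")
	}

	// Count the allocations the servers still track for a node and
	// bucket them by client status, to see how close the client might
	// be to its gc_max_allocs limit.
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}
	allocs, _, err := client.Nodes().Allocations(os.Args[1], nil)
	if err != nil {
		log.Fatal(err)
	}

	byStatus := map[string]int{}
	for _, a := range allocs {
		byStatus[a.ClientStatus]++
	}
	fmt.Printf("total=%d by_status=%v\n", len(allocs), byStatus)
}
```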
Nomad Client logs (if appropriate)
As you can see in the logs, the allocation with ID `e6ee3260-a1c9-c283-ad17-d1cb6950774d` doesn't get garbage collected. That allocation shows the following (note that times in the log are UTC and my PC is GMT+2):
The last event is when I restarted the nomad service in the instance.
Seems like nomad is trying to garbage collect an already dead allocation.
As this has happened to us several times before, I enabled the profiler on the node to get some data. I saw the following (taken at Thu Oct 10 09:00:00 CEST 2019):
The time of the first one (595 minutes) matches when the first pending allocation was placed on the node (Oct 9 21:05:00). The other one tracks back to this moment:
Seems like the allocation `ec8a075b-7946-7fee-303f-a02fdf53ae1a` doesn't get correctly garbage collected either. It seems that when the node reaches `gc_max_allocs` and the `gc_parallel_destroys` are blocked, all allocations placed on that node remain in pending state until the nomad service is restarted. At least that is what I feel is happening here. Weirdly, I thought that pending allocations would be blocking here: https://github.com/hashicorp/nomad/blob/v0.9.3/client/gc.go#L180
But no goroutines appeared there, so they must be waiting someplace else...
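If that theory is right, the wedge is easy to model. Here is a deliberately simplified toy in Go (this is not Nomad's actual client/gc.go; `collector`, `makeRoom`, the sizes, and the timings are all invented) showing how destroys that never return exhaust the parallel-destroy slots and block whatever is waiting for room:

```go
package main

import (
	"fmt"
	"time"
)

// collector is a toy model of the theory above, not Nomad's gc.go:
// destroys run through a semaphore sized like gc_parallel_destroys, and
// if every slot is held by a destroy that never finishes (e.g. a hung
// logmon Stop), making room for new allocations blocks as well.
type collector struct {
	slots chan struct{} // capacity plays the role of gc_parallel_destroys
}

func newCollector(parallelDestroys int) *collector {
	return &collector{slots: make(chan struct{}, parallelDestroys)}
}

// destroy acquires a slot and releases it only when the destroy returns.
func (c *collector) destroy(id string, hang bool) {
	c.slots <- struct{}{}
	go func() {
		defer func() { <-c.slots }()
		if hang {
			select {} // destroy never returns, so the slot is never released
		}
		fmt.Println("destroyed", id)
	}()
}

// makeRoom is the gate a new allocation would wait on once the node is at
// gc_max_allocs: it needs a free destroy slot before anything can proceed.
func (c *collector) makeRoom() bool {
	select {
	case c.slots <- struct{}{}:
		<-c.slots
		return true
	case <-time.After(500 * time.Millisecond):
		return false // all slots are wedged; new allocs would sit in pending
	}
}

func main() {
	c := newCollector(2)       // pretend gc_parallel_destroys = 2
	c.destroy("alloc-1", true) // both destroys are stuck on a
	c.destroy("alloc-2", true) // blocked post-stop hook
	time.Sleep(50 * time.Millisecond)
	fmt.Println("room for new alloc:", c.makeRoom()) // prints false
}
```

That would also be consistent with nothing showing up at gc.go#L180: the goroutines would be parked inside the destroys themselves rather than at that wait.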