job still running though received kill signal #5395
Comments
@FlankMiao can you share a jobspec so we can try to reproduce it on our end? |
@preetapan job "artifact-agent" { group "artifact" {
} |
Also running into a similar issue;
See my comments on #5363 |
@FlankMiao I ran your job spec (but used my own executable for the artifact), and noticed that the container would get killed after it started, but not always. It seems to be related to the memory limit (40); after I increased it, things seemed to stabilize. Could you see if increasing memory in the |
@preetapan I have tried, but this still happens. |
Can you share all the logs from the client on and after the time you stopped the job? It's been hard to reproduce this (a couple of us on the team tried). |
@preetapan it's hard to reproduce, and it can happen with different jobs. This time it's the promtail job.

Here is the log from the nomad client:
2019-03-11T12:17:06.863+0800 [ERROR] client.driver_mgr.raw_exec: error receiving stream from Stats executor RPC, closing stream: alloc_id=6d3679d1-f40e-f2ce-080c-3829874ee7ac driver=raw_exec task_name=promtail error="rpc error: code = Unavailable desc = transport is closing"

Here is the log on the nomad server:
2019-03-11T12:17:06.785+0800 [ERROR] client.driver_mgr.raw_exec: error receiving stream from Stats executor RPC, closing stream: alloc_id=3228fc8d-dff6-7517-88f4-be81dbb47cdb driver=raw_exec task_name=promtail error="rpc error: code = Unavailable desc = transport is closing"

Here is the state from nomad-ui: the job with alloc id 08011041 doesn't exist in it.

When I click the start button, the nomad server logs:
2019/03/11 12:31:56.629972 [INFO] (runner) creating new runner (dry: false, once: false)

Now there are two running on the node: one is 08011xx, the other is the newly started one. I killed the process outside nomad with the kill command, and it still shows as running in nomad. |
See the separate issue I created; I've been able to reproduce this exact behaviour. NB: I created a separate issue to capture more precise behaviour of what I was observing (Nomad not observing OOM kills by the OOM killer, or SIGKILL in general). |
In our specific case we are getting bitten by the Kernel OOM Killer as our memory reservations (Using the In general as well; don't use |
We're seeing this as well (when jobs are OOM killed)
So it seems like nomad does have some info about the task getting OOM killed. However, it is then restarted and then kinda just abandoned. I wonder if this is causing more OOM kills on our system, because Nomad then thinks it can pack more tasks onto this instance?
|
We are running into an issue where nomad can't stop an allocation. Attaching the agent logs. Note how it is retrying to kill allocations
Nomad version:
Agent logs:
|
Same issue here |
@preetapan this issue bites us very regularly and it seems like a lot of other users are also impacted by this. Any updates on this? |
We have seen this kind of behaviour with the docker driver as well. We have observed the following: when this happens, nomad seems to have lost communication with docker on that client. If we stop the container manually (which works without any problems), nomad still thinks it's running. Docker version: |
We're observing the exact same behaviour as @parberge. Nomad sets desired state
CentOS 7 - Wish I could provide more info but we are rapidly reverting one of our staging clusters to 0.8.x. |
Fun fact: it's not possible to downgrade the nomad servers from 0.9.x to 0.8.x, so we've just downgraded our nodes for now. Running 0.9.3 servers + 0.8.4 nodes has resolved the unstoppable allocations, which is to say the problem can probably be safely attributed to the client. We've kept Docker 18.09.7 so we can rule that out. |
That's really good to know. Might try that as well |
Nomad: 0.9.3

Hey team, we're having a heck of a time safely reproducing this issue. We have a staging cluster that has 0.9.3 servers and 0.8.4 nodes. If I add a couple of 0.9.3 nodes and use job constraints to keep our "real" apps off of them, I can deploy and drain all day without triggering the issue. As soon as I deploy any of our real apps to 0.9.3 though, we will eventually trigger the bug (and then the bug affects real apps and test apps). Specifically what happens:
If we terminate the node, the nomad servers will recognize the allocations as "lost" (which is expected). To reiterate, we're experiencing this with the docker driver and not exec.

The only thing I can think of that's different before and after triggering the bug is that "real" apps will sometimes restart as they come up, since some try acquiring locks and don't return healthchecks until they succeed. Nomad dutifully restarts these apps and after a few minutes everything is stable. I don't know if this leaves any lingering state, but we've only observed this bug on nodes where allocations have been restarted due to health checks. It may be coincidental -- it may not.

I managed to capture logs and telemetry from one of our nodes as this happened. They're about 300+ lines long so I'm linking to a gist rather than pasting them inline. I've performed a find-and-replace on the host name and most of our service names since those details are not meant to be out on the internet, but I've left alone the alloc/node IDs since those are just giant UUIDs.

Node drain, which begins at 00:13:50 UTC, and corresponding consul (1.5.1) and nomad (0.9.3) debug logs: https://gist.github.com/byronwolfman/570ab3918c9d378a714598bc45d195ca

Some allocations do actually finish draining, but most do not. The drain log shows everything that we see: we never get our prompt back, and never progress past the last message at 00:14:43. The consul/nomad logs on the other hand have been cropped to the events just as the drain is triggered, and a few minutes after. Nothing interesting appears in the nomad/consul logs after 00:18:57; it just keeps repeating /v1/agent/health and occasionally /v1/node/95b3c3c5...

We know that resource exhaustion isn't the problem since we don't have any blocked evaluations, and there are plenty of unallocated resources on other nodes. We're thinking of compiling 0.9.3 with a few extra log messages, but I'm not sure where the best place to put them would be. If you have some recommendations on where to put them, we can try reproducing again in the evening when breaking a staging cluster doesn't affect our teams.

A final gift: we emit statsd and have telemetry. This is only a sample of what looked like some interesting metrics, but we can produce others if you'd like to see them. Metrics from the host that was being drained: the graph is in EDT but you can see at 20:14 EDT (00:14 UTC) that goroutines essentially flat line. We're not sure if this is meaningful but it's interesting. |
Hey, sorry for being a nuisance in this issue. We managed to reproduce it in a Vagrant environment today. Steps to reproduce:
The issue seems to be agnostic to exec/docker because the problem is in the task runner's pre-kill service hook. Specifically here: We did the grossest thing possible for a team of sysadmins who don't write go, and just bolted on some more logging statements:

func (h *serviceHook) PreKilling(ctx context.Context, req *interfaces.TaskPreKillRequest, resp *interfaces.TaskPreKillResponse) error {
	h.logger.Info("[ws] PreKilling invoked; top")
	h.mu.Lock()
	defer h.mu.Unlock()

	// Deregister before killing task
	h.logger.Info("[ws] PreKilling invoked; will deregister")
	h.deregister()
	h.logger.Info("[ws] PreKilling invoked; deregister returned")

When the node is working normally, we see all three messages emitted: 1. we're at the top of the PreKilling function, 2. now we're deregistering from consul, 3. now we've deregistered. When things break, we can watch in real-time as we get locked out by the mutex. Have a look at this:
Uh oh. "will deregister" -- but we don't get "deregister returned". That seems bad. And now this happens:
Well now we don't even get "will deregister" because the mutex never unlocked. Now we're stuck. |
As a preface: I'm nearly 100% certain that this issue is now driver agnostic. It definitely seems to be consul related (or rather, check watcher related) and sure enough, even though the original submission is for the exec driver, the jobspec file posted has a

Also this is probably beyond my ability to solve. I could be looking in the wrong places here, but it looks to me like the biggest change between 0.8.4 (what we're mostly running) and 0.9.x is that allocation restarts due to healthcheck failures are handled somewhat synchronously, and also very slowly if a shutdown delay is used.

I've added another (yes, sorry) logger output to https://github.com/hashicorp/nomad/blob/v0.9.3/command/agent/consul/check_watcher.go#L163 and I can see the check watcher's main watch loop grind to a halt over time. The loop is expected to be stopped when

When an allocation becomes unhealthy, we go here: https://github.com/hashicorp/nomad/blob/v0.9.3/command/agent/consul/check_watcher.go#L257
Then here: https://github.com/hashicorp/nomad/blob/v0.9.3/command/agent/consul/check_watcher.go#L63
And then here: https://github.com/hashicorp/nomad/blob/v0.9.3/command/agent/consul/check_watcher.go#L109

In Nomad 0.8.4,
(via https://github.com/hashicorp/nomad/blob/v0.8.4/client/task_runner.go#L1757-L1765)

In 0.9.x (and at least up to the master branch as of today), restart is far more synchronous: https://github.com/hashicorp/nomad/blob/v0.9.3/client/allocrunner/taskrunner/lifecycle.go#L11

In particular, we run pre-kill hooks, which has some interesting consequences. For example: most of the services that our shop runs have a long shutdown delay to allow for traffic draining. This works fine when we deploy stuff (old allocs are de-registered from consul, drain for 45 seconds, and are then shut down). In 0.8.4 a restart was just a restart; in 0.9.3 a restart fires the pre-kill hooks, which includes a 45 second delay.

If we have a bunch of allocations all restarting at once, then the check watcher has to loop over every one of them and kill them one at a time, waiting 45 seconds between each pre-kill hook returning. Each time we kill something, we throw a message into the check watcher's update channel. Unfortunately the check watcher cannot retrieve these messages yet because it's still busy killing services. Eventually the channel fills up to its max and no more messages can be inserted. It's a blocking call though, so those allocations that were taking "only" 45 seconds to get out of the way are now deadlocked.

At least, that's the working theory. I'm going to look around to see if there's a non-horrible way to skip the shutdown delay when an allocation is killed due to a bad healthcheck. Maybe it will solve the problem, maybe it won't. It may be that there's a more deeply-rooted problem and up until now Nomad has been able to chew through its restart and update queue quickly enough to hide it. |
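To make the 0.8.4-vs-0.9.x difference described above concrete, here is a small, self-contained Go sketch. It is my own reconstruction with made-up names, not Nomad's actual types: the old behaviour is a fire-and-forget signal on a channel, while the new behaviour runs the pre-kill work, including the shutdown delay, inline before returning.

package main

import (
	"fmt"
	"time"
)

// shutdownDelay is 45s for our real jobs; shortened so the sketch runs quickly.
const shutdownDelay = 45 * time.Millisecond

type taskRunner struct {
	restartCh chan struct{}
}

// signalRestart mimics the 0.8.4 behaviour: drop a signal and return
// immediately; a separate goroutine picks it up and does the actual work.
func (tr *taskRunner) signalRestart() {
	select {
	case tr.restartCh <- struct{}{}:
	default: // a restart is already pending; nothing to do
	}
}

// restartSynchronously mimics the 0.9.x behaviour: run the pre-kill hooks
// (deregister, wait out the shutdown delay) inline before killing the task.
// A loop that calls this for N unhealthy allocations stalls for N * delay.
func (tr *taskRunner) restartSynchronously() {
	fmt.Println("pre-kill: deregister from consul")
	time.Sleep(shutdownDelay)
	fmt.Println("kill and restart task")
}

func main() {
	tr := &taskRunner{restartCh: make(chan struct{}, 1)}

	start := time.Now()
	tr.signalRestart() // returns immediately
	fmt.Println("async restart signalled after", time.Since(start))

	start = time.Now()
	tr.restartSynchronously() // returns only after the shutdown delay
	fmt.Println("sync restart finished after", time.Since(start))
}

The sketch only shows the timing difference; the consequence of the synchronous version, when it is called from inside the check watcher's own loop, is worked through in the comments below.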
Well never mind -- turns out the solution is a lot easier. I need to spend some more time actually validating that this solves the problem, but on my crash-loop apparatus I've been able to run the client for 10 minutes now, which is about 9 minutes longer than usual before everything locks up. Here's the main change:

diff --git a/client/allocrunner/taskrunner/lifecycle.go b/client/allocrunner/taskrunner/lifecycle.go
index 404c9596f..b10df92df 100644
--- a/client/allocrunner/taskrunner/lifecycle.go
+++ b/client/allocrunner/taskrunner/lifecycle.go
@@ -22,8 +22,11 @@ func (tr *TaskRunner) Restart(ctx context.Context, event *structs.TaskEvent, fai
// Emit the event since it may take a long time to kill
tr.EmitEvent(event)
- // Run the pre-kill hooks prior to restarting the task
- tr.preKill()
+ // Run the pre-kill hooks prior to restarting the task, but skip in case of failure
+ // Service deregistration will be done by the check watcher so no need to deregister via pre-kill hooks
+ if !failure {
+ tr.preKill()
+ }
// Tell the restart tracker that a restart triggered the exit
	tr.restartTracker.SetRestartTriggered(failure)

This is less hacky than it looks. The pre-kill hook really only does two things:
An unhealthy allocation should probably just be restarted right away, so that takes care of the second pre-kill action. The first pre-kill action doesn't actually need to be done at all, since the check watcher already does this on its own:

restarted := check.apply(ctx, now, result.Status)
if restarted {
// Checks are registered+watched on
// startup, so it's safe to remove them
// whenever they're restarted
delete(checks, cid)
restartedTasks[check.taskKey] = struct{}{}
}

(Late night edit: this is not the case. The checks are removed, but not the service registration, so we still need to figure out a way to remove the service from consul if we skip the pre-kill hooks.)

I'd be curious to see if the other reporters on this issue are able to apply the patch, and if it resolves the issue for them. I'll have time to do more thorough testing on Monday. A bonus diff, while we're at it:

diff --git a/client/allocrunner/taskrunner/service_hook.go b/client/allocrunner/taskrunner/service_hook.go
index a89110417..2dc1fa524 100644
--- a/client/allocrunner/taskrunner/service_hook.go
+++ b/client/allocrunner/taskrunner/service_hook.go
@@ -176,12 +176,6 @@ func (h *serviceHook) Exited(context.Context, *interfaces.TaskExitedRequest, *in
func (h *serviceHook) deregister() {
taskServices := h.getTaskServices()
h.consul.RemoveTask(taskServices)
-
- // Canary flag may be getting flipped when the alloc is being
- // destroyed, so remove both variations of the service
- taskServices.Canary = !taskServices.Canary
- h.consul.RemoveTask(taskServices)
-
}
func (h *serviceHook) Stop(ctx context.Context, req *interfaces.TaskStopRequest, resp *interfaces.TaskStopResponse) error {

We don't need to flip the canary tag to check it again, since the canary status isn't used for generating task service IDs:

func makeTaskServiceID(allocID, taskName string, service *structs.Service, canary bool) string {
return fmt.Sprintf("%s%s-%s-%s", nomadTaskPrefix, allocID, taskName, service.Name)
}

Which is funny, since not including the canary tag solved the last issue I filed! :D (EDIT: I should mention that the reason this diff is riding along is that halving the number of tasks to remove from consul keeps the check watcher's channel clean -- important since the buffer is small.)

But for reals, give the diff a shot. I'll open a PR once I've made sure the fix isn't worse than the bug. |
Having had some more time to think about this, I think we're looking at a pretty classic deadlock; the shutdown delay doesn't really factor into the cause of the problem (other than making it slightly more noticeable), though I still think we should skip it in the event of a healthcheck failure. A simplified view of the check watcher's Run() loop would be:
The check watcher loops over all checks, and when it finds an unhealthy one, restarts it. The pre-kill hook in that restart wants to enqueue things back into check watcher though, so all you need to deadlock the check watcher is to have five allocations crash. The first four allocation restarts enqueue at least eight updates (2 updates per check, because of canary/non-canary). The fifth allocation restart tries to enqueue its updates but can't because the channel buffer is full. It's acquired a lock along the way too. Now that the channel is full, and a lock has been acquired, we'll never be able to loop through the remainder of the unhealthy allocations, and because of that, we never make it back to the top of the run loop. Skipping the shutdown delay would really just help surface the deadlock faster. The only real solution (which, fortunately, is accidentally solved by the same diffs above) is to not enqueue work while inside a loop that might block, and that cannot be dequeued until we exit said loop. |
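For anyone who wants to see the shape of this without reading Nomad internals, here is a minimal, self-contained Go sketch of the pattern described above. The names, buffer size, and update strings are made up for illustration; this is not Nomad's actual code, just the same loop-feeds-its-own-bounded-channel mistake in miniature.

package main

import (
	"fmt"
	"sync"
)

// watcher stands in for the check watcher: it owns a small buffered update
// channel that only its own watch loop drains.
type watcher struct {
	mu      sync.Mutex
	updates chan string
}

// restart stands in for a synchronous TaskRunner.Restart: the service hook
// re-enqueues updates into the watcher's channel while holding its mutex.
func (w *watcher) restart(check string) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.updates <- check + ":remove (canary)"
	w.updates <- check + ":remove (non-canary)"
}

func main() {
	w := &watcher{updates: make(chan string, 8)} // small buffer, like the real one

	unhealthy := []string{"a", "b", "c", "d", "e"} // five crashing allocations

	for {
		// Top of the watch loop: drain whatever updates are already buffered.
	drain:
		for {
			select {
			case u := <-w.updates:
				fmt.Println("update:", u)
			default:
				break drain
			}
		}

		// Restart every unhealthy check, one at a time, synchronously.
		// Four restarts fill the 8-slot buffer; the fifth blocks forever,
		// because the only consumer is this same loop, and the blocked
		// restart still holds the mutex that any other caller would need.
		for _, c := range unhealthy {
			fmt.Println("restarting", c)
			w.restart(c)
		}
	}
}

Running it prints four restarts and then stops dead on the fifth send; Go's runtime reports the blocked send as a deadlock, which is exactly the wedge seen on the client.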
I've a slightly more comprehensive fix up (#5957) but it probably deserves the attention of someone who actually knows golang. |
Fixes #5395
Alternative to #5957

Make task restarting asynchronous when handling check-based restarts. This matches the pre-0.9 behavior where TaskRunner.Restart was an asynchronous signal. The check-based restarting code was not designed to handle blocking in TaskRunner.Restart. 0.9 made it reentrant and could easily overwhelm the buffered update chan and deadlock.

Many thanks to @byronwolfman for his excellent debugging, PR, and reproducer!

I created this alternative as changing the functionality of TaskRunner.Restart has a much larger impact. This approach reverts to old known-good behavior and minimizes the number of places changes are made.
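To illustrate what "asynchronous" buys here, a rough sketch reusing the toy watcher from the earlier sketch (again just an illustration of the approach, not the actual Nomad patch): fire each restart in its own goroutine so the watch loop never stops draining its update channel.

package main

import (
	"fmt"
	"sync"
	"time"
)

type watcher struct {
	mu      sync.Mutex
	updates chan string
}

func (w *watcher) restart(check string) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.updates <- check + ":remove (canary)"
	w.updates <- check + ":remove (non-canary)"
}

func main() {
	w := &watcher{updates: make(chan string, 8)}
	unhealthy := []string{"a", "b", "c", "d", "e"}

	// Fire each restart in its own goroutine, so any blocking work
	// (pre-kill hooks, shutdown delay, consul calls) happens off the
	// watch loop and the update channel keeps being drained.
	for _, c := range unhealthy {
		go w.restart(c)
	}

	// The watch loop is always free to consume, so the buffer never wedges.
	deadline := time.After(2 * time.Second)
	for {
		select {
		case u := <-w.updates:
			fmt.Println("update:", u)
		case <-deadline:
			return
		}
	}
}

Because the loop is always free to consume, the buffered sends from the restart goroutines always complete, no matter how many allocations crash at once.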
#5975 fixes the deadlock @byronwolfman discovered and will ship in 0.9.4 which will be in RC stage soon. Unfortunately @byronwolfman's #5957 PR did not get merged and contained a couple QoL fixes. We wanted to keep 0.9.4 focused on stability improvements so if @byronwolfman or someone else wants to submit PRs for the dropped improvements we'd be more than happy to consider them for 0.10.0! See #5980 for a description of how ShutdownDelay was improved in @byronwolfman's PR. |
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. |
Nomad version
0.9.0-beta3
Operating system and Environment details
OS: centos
OS version: 7.6.1810
Issue
driver [exec]
The job jobA runs one instance per node.
I clicked the stop button in the nomad-ui; some allocations are [completed] immediately, but others are still [running] even though the job status is dead, like this:
Here are the events of the one that is still running:
Log from the nomad client:
2019-03-06T08:56:52.296+0800 [ERROR] client.driver_mgr.exec: error receiving stream from Stats executor RPC, closing stream:alloc_id=ce203f54-78cd-fa32-aff0-391c20e219ba driver=exec task_name=artifact error="rpc error: code = Unavailable desc = transport is closing"
When I kill the process manually, it still shows as running.
When I clicked the start button, the node which still has jobA running will have a pending jobA, like this:
The /system/gc or client/allocation/id/gc API doesn't work.
When I remove the node which has the still-running jobA and run gc, somehow it all works fine.