docker task driver overrides max_kill_timeout #17023
Comments
Hi @louievandyke and thanks for raising this detailed issue.
Just to confirm the behavior: I can see this same thing happening on v1.4.2. Exactly 5 minutes after an allocation is given the "stop" command, the container exits unexpectedly with exit code 137. Exit code 137 can indicate an OOM kill, but that has certainly not happened in this case. In the Nomad client logs I can see the following:
If I run a manual curl command to stop a container (I believe this is the same thing the Nomad Docker driver is doing):
I see that the command blocks in bash until either the container exits or the timeout is reached, so my sense is that Nomad assumes the Docker engine has become unresponsive and reverts to the 5-minute default fallback Alex Dadgar mentioned here: #2119 (comment)
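For illustration, here is a rough Go equivalent of that stop call (a sketch only, assuming the fsouza/go-dockerclient library; the endpoint, container ID, and timeout values are made up). It shows how a hard 5-minute client-side deadline cuts off an engine-side stop timeout that is meant to be much longer:

```go
package main

import (
	"context"
	"log"
	"time"

	docker "github.com/fsouza/go-dockerclient"
)

func main() {
	client, err := docker.NewClient("unix:///var/run/docker.sock")
	if err != nil {
		log.Fatal(err)
	}

	// Assumption: a hard client-side deadline of 5 minutes, standing in for
	// the driver's dockerTimeout fallback described above.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
	defer cancel()

	// Ask the engine to wait up to 2 hours (7200s) before force-killing,
	// mirroring a large kill_timeout. The call blocks until the container
	// exits, the engine-side timeout elapses, or ctx expires at 5 minutes,
	// whichever comes first.
	if err := client.StopContainerWithContext("my-container-id", 7200, ctx); err != nil {
		log.Printf("stop returned early: %v", err)
	}
}
```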
My guess is that the "max_kill_timeout" should have been applied to that "explicit 5-minute timeout". My use case is that one or more end users could be interacting with the processes running in the allocation, and I have some logic in my application to intercept the kill signal coming from Docker and Nomad and then wait until the last end user disconnects before shutting down the process (and thus allowing the allocation to fully stop). This can be 1-2 hours in some cases.
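For context, this is the general shape of that signal interception (a minimal sketch; the actual application logic is not shown in this issue): catch SIGTERM, stop accepting new users, and only exit once the last active user disconnects.

```go
package main

import (
	"log"
	"os"
	"os/signal"
	"sync"
	"syscall"
)

func main() {
	// Each connected end user would call activeUsers.Add(1) on connect and
	// activeUsers.Done() on disconnect (connection handling omitted).
	var activeUsers sync.WaitGroup

	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM, syscall.SIGINT)

	<-sigs
	log.Println("received shutdown signal; waiting for active users to disconnect")

	// This wait can legitimately take 1-2 hours, which is why the task's
	// kill timeout needs to be honored end to end.
	activeUsers.Wait()
	log.Println("last user disconnected; shutting down")
}
```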
@joshuaclausen We're also seeing this in production, with similar shutdown times (1-2 hours). I was able to reproduce it consistently with Docker 23; I've since reverted us to Docker 20.10 and it's working like it should. Currently on Nomad 1.5.5.
I had an afternoon free and started to dig into this a bit. I had a "fix" PR up, but there were still open questions around why that fix actually worked. I documented the open questions in this comment. Hope this can give somebody else a head start on getting an official fix up. I'm unassigning myself as I'll have to focus on producty things in the near future.
Will work with engineering to get this in the queue for right after the 1.6 beta goes out.
Thanks @mikenomitch. I actually opened an issue this morning, #17499, which may be related. I also filed one with Docker after the rabbit hole I went down, as I'm not entirely sure whether it's a Nomad issue or a Docker API issue.
@louievandyke and @mikenomitch can you note the docker version you are using? So far I haven't been able to reproduce on
@shoenig here you go
If you want to pair on a repro let me know. Maybe we'll catch something that we're doing differently. Or is it Mac-related?
Fixed my repro; I had a typo in the template shell script 😨 I tried running against a few versions of Docker to see what's up, and it turns out that with recent versions the symptom goes from bad to worse: starting with 23.0.0, instead of the SIGTERM at 5 minutes it sends SIGKILL.
It is still unclear whether Docker is issuing a signal at all on behalf of an erroneous instruction coming from Nomad, but I just wanted to note that upgrading Nomad or Docker doesn't help. Here is what happens from the
My next step is to decorate Nomad with a whole bunch of extra trace logging to see if we can't observe it doing something unexpected.
This PR refactors how we manage the two underlying clients used by the docker driver for communicating with the Docker daemon. We keep two clients: one with a hard-coded timeout that applies to all operations no matter what, intended for short-lived / async calls to Docker; the other has no timeout, and it is the caller's responsibility to set a context that ensures the call eventually terminates. The use of these two clients has been confusing, and mistakes were made in a number of places where calls used the wrong client. This PR makes it so that a user must explicitly call a function to get the client that makes sense for that use case. Fixes #17023
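For readers skimming the thread, here is a rough sketch of the two-client pattern the PR describes (the names are illustrative rather than the actual driver API, and it assumes the github.com/fsouza/go-dockerclient library):

```go
package driver

import (
	"context"
	"time"

	docker "github.com/fsouza/go-dockerclient"
)

// dockerTimeout is the hard-coded deadline meant only for short-lived calls.
const dockerTimeout = 5 * time.Minute

type clients struct {
	short *docker.Client // every call is bounded by dockerTimeout
	long  *docker.Client // unbounded; callers must pass a context
}

func newClients(endpoint string) (*clients, error) {
	short, err := docker.NewClient(endpoint)
	if err != nil {
		return nil, err
	}
	// Bound every request made through this client.
	short.HTTPClient.Timeout = dockerTimeout

	long, err := docker.NewClient(endpoint)
	if err != nil {
		return nil, err
	}
	return &clients{short: short, long: long}, nil
}

// shortClient is for operations that must always finish quickly (inspect,
// list, and similar); callers pick it explicitly.
func (c *clients) shortClient() *docker.Client { return c.short }

// stopContainer uses the no-timeout client; the caller owns the deadline via
// ctx, e.g. a context bounded by the task's kill timeout rather than by
// dockerTimeout.
func (c *clients) stopContainer(ctx context.Context, id string, apiTimeout uint) error {
	return c.long.StopContainerWithContext(id, apiTimeout, ctx)
}
```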
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Describe the bug
The Docker task driver's dockerTimeout setting supersedes Nomad's max_kill_timeout setting.
Steps to reproduce the behavior
Configure a max_kill_timeout for a period longer than 5 minutes. Run a Nomad job that uses the docker driver and then initiate a stop. Once an initial signal is sent via nomad alloc stop, the docker driver will send another signal after 5 minutes (the driver's 5 * time.Minute default).
Expected behavior
The expected behavior is that Nomad's max_kill_timeout will override task driver timeout settings.
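As an illustration of that expectation (a sketch only, not Nomad's actual code): the deadline for the stop operation should come from the task's kill timeout capped by the client's max_kill_timeout, and that value, not the driver's hard-coded 5-minute client timeout, should bound the call to Docker.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// effectiveKillTimeout assumes max_kill_timeout acts as an upper bound on the
// task's kill_timeout.
func effectiveKillTimeout(killTimeout, maxKillTimeout time.Duration) time.Duration {
	if killTimeout > maxKillTimeout {
		return maxKillTimeout
	}
	return killTimeout
}

func main() {
	killTimeout := 2 * time.Hour    // task's kill_timeout
	maxKillTimeout := 3 * time.Hour // client's max_kill_timeout

	deadline := effectiveKillTimeout(killTimeout, maxKillTimeout)

	// The docker stop call should be bounded by this context rather than by
	// the driver's 5-minute default.
	ctx, cancel := context.WithTimeout(context.Background(), deadline)
	defer cancel()

	deadlineAt, _ := ctx.Deadline()
	fmt.Println("stop must finish by:", deadlineAt)
}
```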
Product Version: Current 1.5.x
So we see the "Killing" event message "Sent interrupt. Waiting 10m0s before force killing", which looks good. However, if I tail the alloc logs for the container I see an additional SIGTERM in the output before the 10m has elapsed. That looks like it's coming from the Docker driver (Nomad 1.5.3 docker/config.go).
Here is the log output. The first SIGTERM is when I ran nomad alloc stop 0d. I didn't touch Nomad after that, and I observed another SIGTERM at [Tue Aug 2 15:35:52 UTC 2022] running the loop for 71 times; sleeping.
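For anyone trying to reproduce this, the container here appears to be running a simple loop script. Below is an equivalent observer written in Go (a hypothetical stand-in, not the actual script): it logs every SIGTERM it receives while looping, which makes the extra signal at the 5-minute mark easy to spot in the alloc logs.

```go
package main

import (
	"log"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM)

	go func() {
		for sig := range sigs {
			// Log and keep running instead of exiting, so a second,
			// unexpected SIGTERM shows up clearly in the alloc logs.
			log.Printf("received %v", sig)
		}
	}()

	for i := 1; ; i++ {
		log.Printf("running the loop for %d times; sleeping", i)
		time.Sleep(10 * time.Second)
	}
}
```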