Client configuration max_kill_timeout is not used. #10005

Closed
lgfa29 opened this issue Feb 10, 2021 · 2 comments · Fixed by #13626
Labels: stage/accepted, theme/config, type/bug

lgfa29 commented Feb 10, 2021

Nomad version

Nomad 0.9.0+

Issue

The client configuration max_kill_timeout is not used when determining how long to wait before force-killing a task, so a task's kill_timeout is never capped. The default value of 30s is also potentially too low.

Reproduction steps

  1. Start Nomad with default client configuration (nomad agent -dev)
  2. Run the job below.
  3. Stop the job.

Expected result

The job allocation is force-killed after 30s (the default max_kill_timeout).

Actual result

The job allocation is only force-killed after 10m (the value of kill_timeout in the task).

Job file

job "example" {
  datacenters = ["dc1"]

  group "example" {
    task "example" {
      driver       = "raw_exec"
      kill_timeout = "10m"

      config {
        command = "/bin/bash"
        args    = ["local/script.sh"]
      }

      template {
        data        = <<EOF
trap 'echo "Received SIGINT"' INT
while true
do
  echo 'Running'
  sleep 1
done
EOF
        destination = "local/script.sh"
      }
    }
  }
}

Nomad Client logs (if appropriate)

    2021-02-10T13:57:40.711-0500 [DEBUG] worker: dequeued evaluation: eval_id=67a30a7a-f430-10b1-b929-15cb528a7dc8
    2021-02-10T13:57:40.711-0500 [DEBUG] worker.service_sched: reconciled current state with desired state: eval_id=67a30a7a-f430-10b1-b929-15cb528a7dc8 job_id=example namespace=default results="Total changes: (place 0) (destructive 0) (inplace 0) (stop 1)
Desired Changes for "example": (place 0) (inplace 0) (destructive 0) (stop 1) (migrate 0) (ignore 0) (canary 0)"
    2021-02-10T13:57:40.711-0500 [DEBUG] http: request complete: method=DELETE path=/v1/job/example?purge=false duration=780.704µs
    2021-02-10T13:57:40.711-0500 [DEBUG] http: request complete: method=GET path=/v1/job/example/evaluations?index=18 duration=15.57450593s
    2021-02-10T13:57:40.711-0500 [DEBUG] http: request complete: method=GET path=/v1/job/example/evaluations?index=18 duration=12.872179523s
    2021-02-10T13:57:40.712-0500 [DEBUG] worker: submitted plan for evaluation: eval_id=67a30a7a-f430-10b1-b929-15cb528a7dc8
    2021-02-10T13:57:40.712-0500 [DEBUG] worker.service_sched: setting eval status: eval_id=67a30a7a-f430-10b1-b929-15cb528a7dc8 job_id=example namespace=default status=complete
    2021-02-10T13:57:40.712-0500 [DEBUG] http: request complete: method=GET path=/v1/job/example/allocations?index=21 duration=9.958447212s
    2021-02-10T13:57:40.712-0500 [DEBUG] http: request complete: method=GET path=/v1/allocation/a4b81f9f-ed37-ab4f-31ce-d76990305cb2?index=21 duration=9.155212188s
    2021-02-10T13:57:40.712-0500 [DEBUG] http: request complete: method=GET path=/v1/evaluation/67a30a7a-f430-10b1-b929-15cb528a7dc8 duration=213.843µs
    2021-02-10T13:57:40.712-0500 [DEBUG] http: request complete: method=GET path=/v1/job/example/allocations?index=21 duration=9.802143642s
    2021-02-10T13:57:40.712-0500 [DEBUG] worker: updated evaluation: eval="<Eval "67a30a7a-f430-10b1-b929-15cb528a7dc8" JobID: "example" Namespace: "default">"
    2021-02-10T13:57:40.712-0500 [DEBUG] worker: ack evaluation: eval_id=67a30a7a-f430-10b1-b929-15cb528a7dc8
    2021-02-10T13:57:40.713-0500 [DEBUG] client: updated allocations: index=24 total=1 pulled=1 filtered=0
    2021-02-10T13:57:40.713-0500 [DEBUG] client: allocation updates: added=0 removed=0 updated=1 ignored=0
    2021-02-10T13:57:40.713-0500 [DEBUG] http: request complete: method=GET path=/v1/evaluation/67a30a7a-f430-10b1-b929-15cb528a7dc8/allocations duration=468.484µs
    2021-02-10T13:57:40.714-0500 [DEBUG] client.driver_mgr.raw_exec.executor.nomad: shutdown requested: alloc_id=a4b81f9f-ed37-ab4f-31ce-d76990305cb2 driver=raw_exec task_name=example @module=executor grace_period_ms=6e+11 signal= timestamp=2021-02-10T13:57:40.713-0500

mikenomitch commented Jul 1, 2022

I was able to reproduce this on macOS using raw_exec and Nomad 1.3.1.

I configured a client max_kill_timeout of "30s".

I ran a job whose script had "trap 'echo "Received SIGINT"' INT".

With a job with a kill_timeout of "2m", if I click "Stop Job", I see in logs:
"Sent interrupt. Waiting 2m0s before force killing" and then two minutes pass before the job is force killed.
The "30s" is not respected.

With a job with no kill_timeout, if I click "Stop Job", I see in logs:
"Sent interrupt. Waiting 5s before force killing" and then the job is force killed in 5s.

EDIT:

Some random git spelunking brought up 9f44780 which introduced the max timeout, and in client/driver/driver.go there was a method that did this:

+       max := d.config.MaxKillTimeout.Nanoseconds()
+       desired := task.KillTimeout.Nanoseconds()
+       if desired < max {
+               return task.KillTimeout
+       }
+
+       return d.config.MaxKillTimeout
+}

I can't find any functions like this in the current codebase, though, so I think it was removed at some point and never replaced.

github-actions bot commented Nov 5, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.
