Script checks fails to update TTL #6836

Closed
jorgemarey opened this issue Dec 11, 2019 · 6 comments · Fixed by #6916

@jorgemarey
Contributor

Nomad version

v0.10.2

Issue

Script checks fail to update in Consul.

From what I could investigate, this happens because service name interpolation is not done in the script check hook.

Because the service name is not interpolated in that hook, the check ID generated by the hash function differs from the one registered in Consul.
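To illustrate the mismatch, here is a minimal, self-contained sketch. It is not Nomad's actual code: `replaceEnv` is a hypothetical stand-in for `taskEnv.ReplaceEnv`, and `checkID` is a simplified, assumed hashing scheme (SHA-1 over the service and check names) chosen only to show that hashing the raw name and the interpolated name gives different IDs.

```go
package main

import (
	"crypto/sha1"
	"fmt"
	"strings"
)

// replaceEnv is a hypothetical stand-in for taskEnv.ReplaceEnv: it
// substitutes ${KEY} placeholders with values from env.
func replaceEnv(s string, env map[string]string) string {
	for k, v := range env {
		s = strings.ReplaceAll(s, "${"+k+"}", v)
	}
	return s
}

// checkID is an assumed, simplified ID scheme (not Nomad's real one):
// a SHA-1 hash over the service name and the check name.
func checkID(serviceName, checkName string) string {
	h := sha1.New()
	h.Write([]byte(serviceName))
	h.Write([]byte(checkName))
	return fmt.Sprintf("_nomad-check-%x", h.Sum(nil))
}

func main() {
	env := map[string]string{"NOMAD_JOB_NAME": "web"}
	raw := "${NOMAD_JOB_NAME}-api"

	// ID as registered in Consul: computed from the interpolated name.
	registered := checkID(replaceEnv(raw, env), "alive")
	// ID as computed by a hook that skips interpolation.
	fromHook := checkID(raw, "alive")

	fmt.Println("registered:", registered)
	fmt.Println("from hook: ", fromHook)
	fmt.Println("match:", registered == fromHook)
}
```

In this sketch the two IDs never match, so a TTL update sent with the hook's ID would refer to a check that was never registered, which is consistent with the "Unknown check" errors in the client logs.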

Reproduction steps

  1. Create a service with a script check, using interpolation in the service name.
  2. Run the job.

Nomad Client logs

Dec 11 07:46:25 w-315cfc94-0005 nomad: {"@level":"warn","@message":"updating check failed","@module":"client.alloc_runner.task_runner.task_hook.script_checks","@timestamp":"2019-12-11T07:46:25.325032Z","alloc_id":"e3f32377-2307-9b88-2052-ddb8efa4013c","error":"Unexpected response code: 500 (Unknown check \"_nomad-check-fcdfa8b4920e8e49f8361e7a82b9d772f5c90bfa\")","task":"my-task"}
Dec 11 07:46:25 w-315cfc94-0005 nomad: {"@level":"warn","@message":"updating check failed","@module":"client.alloc_runner.task_runner.task_hook.script_checks","@timestamp":"2019-12-11T07:46:25.704519Z","alloc_id":"e3f32377-2307-9b88-2052-ddb8efa4013c","error":"Unexpected response code: 500 (Unknown check \"_nomad-check-f9d3b979a85ba3c79f9e54bed95fbc6ba49a5827\")","task":"my-task"}
Dec 11 07:46:40 w-315cfc94-0005 nomad: {"@level":"debug","@message":"updating check still failing","@module":"client.alloc_runner.task_runner.task_hook.script_checks","@timestamp":"2019-12-11T07:46:40.287374Z","alloc_id":"e3f32377-2307-9b88-2052-ddb8efa4013c","error":"Unexpected response code: 500 (Unknown check \"_nomad-check-fcdfa8b4920e8e49f8361e7a82b9d772f5c90bfa\")","task":"my-task"}
Dec 11 07:46:40 w-315cfc94-0005 nomad: {"@level":"debug","@message":"updating check still failing","@module":"client.alloc_runner.task_runner.task_hook.script_checks","@timestamp":"2019-12-11T07:46:40.721120Z","alloc_id":"e3f32377-2307-9b88-2052-ddb8efa4013c","error":"Unexpected response code: 500 (Unknown check \"_nomad-check-f9d3b979a85ba3c79f9e54bed95fbc6ba49a5827\")","task":"my-task"}
Dec 11 07:46:55 w-315cfc94-0005 nomad: {"@level":"debug","@message":"updating check still failing","@module":"client.alloc_runner.task_runner.task_hook.script_checks","@timestamp":"2019-12-11T07:46:55.307906Z","alloc_id":"e3f32377-2307-9b88-2052-ddb8efa4013c","error":"Unexpected response code: 500 (Unknown check \"_nomad-check-fcdfa8b4920e8e49f8361e7a82b9d772f5c90bfa\")","task":"my-task"}
Dec 11 07:46:55 w-315cfc94-0005 nomad: {"@level":"debug","@message":"updating check still failing","@module":"client.alloc_runner.task_runner.task_hook.script_checks","@timestamp":"2019-12-11T07:46:55.655170Z","alloc_id":"e3f32377-2307-9b88-2052-ddb8efa4013c","error":"Unexpected response code: 500 (Unknown check \"_nomad-check-f9d3b979a85ba3c79f9e54bed95fbc6ba49a5827\")","task":"my-task"}

I made the following change to make it work in our environment.

diff --git a/client/allocrunner/taskrunner/script_check_hook.go b/client/allocrunner/taskrunner/script_check_hook.go
index b40e92301..4916ef76e 100644
--- a/client/allocrunner/taskrunner/script_check_hook.go
+++ b/client/allocrunner/taskrunner/script_check_hook.go
@@ -175,12 +175,15 @@ func (h *scriptCheckHook) Stop(ctx context.Context, req *interfaces.TaskStopRequ
 func (h *scriptCheckHook) newScriptChecks() map[string]*scriptCheck {
        scriptChecks := make(map[string]*scriptCheck)
        for _, service := range h.task.Services {
+               copyService := service.Copy()
+               copyService.Name = h.taskEnv.ReplaceEnv(copyService.Name)
+               copyService.PortLabel = h.taskEnv.ReplaceEnv(service.PortLabel)
                for _, check := range service.Checks {
                        if check.Type != structs.ServiceCheckScript {
                                continue
                        }
                        serviceID := agentconsul.MakeAllocServiceID(
-                               h.alloc.ID, h.task.Name, service)
+                               h.alloc.ID, h.task.Name, copyService)
                        sc := newScriptCheck(&scriptCheckConfig{
                                allocID:    h.alloc.ID,
                                taskName:   h.task.Name,
@@ -205,6 +208,9 @@ func (h *scriptCheckHook) newScriptChecks() map[string]*scriptCheck {
        // watches Consul for status changes.
        tg := h.alloc.Job.LookupTaskGroup(h.alloc.TaskGroup)
        for _, service := range tg.Services {
+               copyService := service.Copy()
+               copyService.Name = h.taskEnv.ReplaceEnv(copyService.Name)
+               copyService.PortLabel = h.taskEnv.ReplaceEnv(service.PortLabel)
                for _, check := range service.Checks {
                        if check.Type != structs.ServiceCheckScript {
                                continue
@@ -214,7 +220,7 @@ func (h *scriptCheckHook) newScriptChecks() map[string]*scriptCheck {
                        }
                        groupTaskName := "group-" + tg.Name
                        serviceID := agentconsul.MakeAllocServiceID(
-                               h.alloc.ID, groupTaskName, service)
+                               h.alloc.ID, h.task.Name, copyService)
                        sc := newScriptCheck(&scriptCheckConfig{
                                allocID:    h.alloc.ID,
                                taskName:   groupTaskName,
@tgross
Member

tgross commented Dec 11, 2019

Hi @jorgemarey and thanks for reporting this! This may be related to what's going on in #6637 but we'll look into it.

@tgross tgross added this to the 0.10.3 milestone Dec 11, 2019
@tgross tgross self-assigned this Dec 11, 2019
@jorgemarey
Contributor Author

Hi @tgross. I'm not sure this is related: that issue occurs during validation of the job file, whereas this one happens when the agent (client) is running the allocation and trying to update the TTL in Consul.

@tgross
Member

tgross commented Jan 7, 2020

Hey @jorgemarey, just wanted to let you know I've started on the fix for this. Your patch has the right idea, but we need to move where we're doing the taskEnv interpolation to account for job updates. Once I've got that (and tests!) I'll ping you on the pull request as a heads up.
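One possible reading of "account for job updates", sketched with hypothetical types rather than Nomad's real hook: if the hook keeps the raw service name from the job file and re-interpolates it whenever the environment changes, the derived name (and any ID built from it) stays current after an update. The names `hook`, `refresh`, and `interpolate` below are assumptions for illustration only, and this is not the change made in the eventual pull request.

```go
package main

import (
	"fmt"
	"strings"
)

// interpolate is a hypothetical stand-in for taskEnv.ReplaceEnv.
func interpolate(s string, env map[string]string) string {
	for k, v := range env {
		s = strings.ReplaceAll(s, "${"+k+"}", v)
	}
	return s
}

// hook is a hypothetical, stripped-down script check hook.
type hook struct {
	rawServiceName string // name as written in the job file
	serviceName    string // name after interpolation
}

// refresh re-interpolates from the raw name. Calling it on both start
// and update avoids keeping a name derived from a stale environment.
func (h *hook) refresh(env map[string]string) {
	h.serviceName = interpolate(h.rawServiceName, env)
}

func main() {
	h := &hook{rawServiceName: "${NOMAD_META_tier}-api"}

	// Initial start of the task.
	h.refresh(map[string]string{"NOMAD_META_tier": "blue"})
	fmt.Println(h.serviceName)

	// A job update changes the environment; the hook re-interpolates.
	h.refresh(map[string]string{"NOMAD_META_tier": "green"})
	fmt.Println(h.serviceName)
}
```

Interpolating only once, when the hook is created, would leave `serviceName` at its first value after the update in this sketch.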

@tgross
Member

tgross commented Jan 8, 2020

I've opened #6916 with the fix.

@consal

consal commented Apr 27, 2020

I'm still having the same issue as the original poster, running Nomad v0.10.5

@github-actions

github-actions bot commented Nov 8, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 8, 2022