template stanza doesn't load env as expected #6112

danaps · 2019-08-12T12:10:21Z

Nomad version

Nomad v0.9.4 (a81aa846a45fb8248551b12616287cb57c418cd6)
(encountered the same issue on 0.9.3)

Operating system and Environment details

Ubuntu 16.04.6 LTS
4.4.0-1085-aws #96-Ubuntu SMP Tue Jun 11 09:08:32 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

Issue

when using range service in nomad template stanza and setting env = true it write the output to the destination file but it doesn't always load the env var as expected. the issue happened sporadically.

Reproduction steps

I have a task that watch a service (redis) in consul,
I killed the docker (docker kill) that run the redis service. the task with redis raised again, then the task that watched the redis restarted as well, (used the default change_mode restart) the file was re-renderd but the env var did not load. it happens sporadically and I must add that I had to kill the redis like 3 or 4 times until the issue was reproduced.

Nomad Client logs (if appropriate)

the task that watch redis:

docker ps |grep f5f88d7f
88500180a7b3        62b95289a942                                                         "unicorn -c config/u…"   8 seconds ago       Up 5 seconds        10.101.8.65:23101->3000/tcp, 10.101.8.65:23101->3000/udp               ads-f5f88d7f-4187-c146-eee0-0a48e895f66e

docker exec -it 88500180a7b3 env |grep REDIS
CACHE_REDIS_DB=6
CACHE_REDIS_HOST=10.101.8.200
CACHE_REDIS_PORT=31940
RESQUE_REDIS_DB=7
RESQUE_REDIS_HOST=10.101.8.200
RESQUE_REDIS_PORT=31940
docker exec -it 88500180a7b3 cat /local/services.env

CACHE_REDIS_HOST="10.101.8.200"
CACHE_REDIS_PORT="31940"
RESQUE_REDIS_HOST="10.101.8.200"
RESQUE_REDIS_PORT="31940"
  
MYSQL_HOST="10.101.8.116"
MYSQL_PORT="31599"

the task that run the redis service:

docker ps |grep 68d48fe2
c5868397b5b8        87856cc39862        "docker-entrypoint.s…"   14 minutes ago      Up 14 minutes       10.101.8.200:31940->6379/tcp, 10.101.8.200:31940->6379/udp                                                               redis-68d48fe2-0743-dd26-051a-e0102fab21bc
docker kill c5868397b5b8

docker ps |grep 68d48fe2
97109d3f2dd0        87856cc39862        "docker-entrypoint.s…"   8 seconds ago       Up 7 seconds        10.101.8.200:31940->6379/tcp, 10.101.8.200:31940->6379/udp                                                               redis-68d48fe2-0743-dd26-051a-e0102fab21bc

again the task that watch redis, you can see that there are now missing envs:

docker ps |grep f5f88d7f
8d14dab4e756        402837048690.dkr.ecr.us-east-1.amazonaws.com/ads:master              "unicorn -c config/u…"   36 seconds ago      Up 34 seconds       10.101.8.65:23101->3000/tcp, 10.101.8.65:23101->3000/udp               ads-f5f88d7f-4187-c146-eee0-0a48e895f66e
docker exec -it 8d14dab4e756 cat /local/services.env

CACHE_REDIS_HOST="10.101.8.200"
CACHE_REDIS_PORT="31940"
RESQUE_REDIS_HOST="10.101.8.200"
RESQUE_REDIS_PORT="31940"

MYSQL_HOST="10.101.8.116"
MYSQL_PORT="31599"

docker exec -it 8d14dab4e756 env |grep REDIS
CACHE_REDIS_DB=6
RESQUE_REDIS_DB=7

nomad logs:

Aug 12 10:59:04 ip-10-101-8-65 nomad[4493]:     2019/08/12 10:59:04.746979 [DEBUG] (runner) rendering "(dynamic)" => "/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/ads/local/services.env"
Aug 12 10:59:04 ip-10-101-8-65 nomad[4493]:     2019/08/12 10:59:04.754962 [INFO] (runner) rendered "(dynamic)" => "/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/ads/local/services.env"
Aug 12 10:59:08 ip-10-101-8-65 nomad[4493]:     2019-08-12T10:59:08.044Z [ERROR] client.alloc_runner.task_runner.task_hook.stats_hook: failed to start stats collection for task with unrecoverable error: alloc_id=f5f88d7f-4187-c146-eee0-0a48e895f66e task=ads error="container stopped"
Aug 12 10:59:08 ip-10-101-8-65 nomad[4493]:     2019-08-12T10:59:08.065Z [INFO ] client.alloc_runner.task_runner: restarting task: alloc_id=f5f88d7f-4187-c146-eee0-0a48e895f66e task=ads reason= delay=0s
Aug 12 10:59:08 ip-10-101-8-65 nomad[4493]:     2019-08-12T10:59:08.072Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=f5f88d7f-4187-c146-eee0-0a48e895f66e task=ads @module=logmon path=/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/alloc/logs/.ads.stdout.fifo timestamp=2019-08-12T10:59:08.070Z
Aug 12 10:59:08 ip-10-101-8-65 nomad[4493]:     2019-08-12T10:59:08.076Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=f5f88d7f-4187-c146-eee0-0a48e895f66e task=ads path=/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/alloc/logs/.ads.stderr.fifo @module=logmon timestamp=2019-08-12T10:59:08.072Z
Aug 12 10:59:08 ip-10-101-8-65 nomad[4493]:     2019-08-12T10:59:08.194Z [DEBUG] client.driver_mgr.docker: binding directories: driver=docker task_name=ads binds="[]string{"/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/alloc:/alloc", "/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/ads/local:/local", "/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/ads/secrets:/secrets"}"
Aug 12 10:59:08 ip-10-101-8-65 nomad[4493]:     2019-08-12T10:59:08.194Z [DEBUG] client.driver_mgr.docker: setting container name: driver=docker task_name=ads container_name=ads-f5f88d7f-4187-c146-eee0-0a48e895f66e
Aug 12 11:00:02 ip-10-101-8-65 nomad[4493]:     2019/08/12 11:00:02.604545 [DEBUG] (runner) rendering "(dynamic)" => "/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/ads/local/services.env"
Aug 12 11:00:02 ip-10-101-8-65 nomad[4493]:     2019/08/12 11:00:02.631092 [INFO] (runner) rendered "(dynamic)" => "/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/ads/local/services.env"
Aug 12 11:00:08 ip-10-101-8-65 nomad[4493]:     2019-08-12T11:00:08.419Z [ERROR] client.alloc_runner.task_runner.task_hook.stats_hook: failed to start stats collection for task with unrecoverable error: alloc_id=f5f88d7f-4187-c146-eee0-0a48e895f66e task=ads error="container stopped"
Aug 12 11:00:08 ip-10-101-8-65 nomad[4493]:     2019-08-12T11:00:08.712Z [INFO ] client.alloc_runner.task_runner: restarting task: alloc_id=f5f88d7f-4187-c146-eee0-0a48e895f66e task=ads reason= delay=0s
Aug 12 11:00:08 ip-10-101-8-65 nomad[4493]:     2019-08-12T11:00:08.721Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=f5f88d7f-4187-c146-eee0-0a48e895f66e task=ads path=/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/alloc/logs/.ads.stdout.fifo @module=logmon timestamp=2019-08-12T11:00:08.718Z
Aug 12 11:00:08 ip-10-101-8-65 nomad[4493]:     2019-08-12T11:00:08.724Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=f5f88d7f-4187-c146-eee0-0a48e895f66e task=ads @module=logmon path=/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/alloc/logs/.ads.stderr.fifo timestamp=2019-08-12T11:00:08.718Z
Aug 12 11:00:28 ip-10-101-8-65 nomad[4493]:     2019/08/12 11:00:28.861960 [DEBUG] (runner) rendering "(dynamic)" => "/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/ads/local/services.env"
Aug 12 11:00:29 ip-10-101-8-65 nomad[4493]:     2019/08/12 11:00:29.521514 [INFO] (runner) rendered "(dynamic)" => "/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/ads/local/services.env"
Aug 12 11:00:29 ip-10-101-8-65 nomad[4493]:     2019/08/12 11:00:29.523164 [DEBUG] (runner) rendering "(dynamic)" => "/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/ads/local/services.env"
Aug 12 11:00:30 ip-10-101-8-65 nomad[4493]:     2019/08/12 11:00:30.835442 [INFO] (runner) rendered "(dynamic)" => "/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/ads/local/services.env"
Aug 12 11:00:37 ip-10-101-8-65 nomad[4493]:     2019/08/12 11:00:37.843632 [DEBUG] (runner) rendering "(dynamic)" => "/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/ads/local/services.env"
Aug 12 11:00:37 ip-10-101-8-65 nomad[4493]:     2019/08/12 11:00:37.847395 [INFO] (runner) rendered "(dynamic)" => "/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/ads/local/services.env"
Aug 12 11:02:09 ip-10-101-8-65 nomad[4493]:     2019-08-12T11:02:09.075Z [DEBUG] client.driver_mgr.docker: binding directories: driver=docker task_name=ads binds="[]string{"/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/alloc:/alloc", "/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/ads/local:/local", "/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/ads/secrets:/secrets"}"
Aug 12 11:02:09 ip-10-101-8-65 nomad[4493]:     2019-08-12T11:02:09.075Z [DEBUG] client.driver_mgr.docker: setting container name: driver=docker task_name=ads container_name=ads-f5f88d7f-4187-c146-eee0-0a48e895f66e
Aug 12 11:14:44 ip-10-101-8-65 nomad[4493]:     2019/08/12 11:14:44.390644 [DEBUG] (runner) rendering "(dynamic)" => "/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/ads/local/services.env"
Aug 12 11:14:44 ip-10-101-8-65 nomad[4493]:     2019/08/12 11:14:44.397878 [INFO] (runner) rendered "(dynamic)" => "/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/ads/local/services.env"
Aug 12 11:14:49 ip-10-101-8-65 nomad[4493]:     2019-08-12T11:14:49.171Z [ERROR] client.alloc_runner.task_runner.task_hook.stats_hook: failed to start stats collection for task with unrecoverable error: alloc_id=f5f88d7f-4187-c146-eee0-0a48e895f66e task=ads error="container stopped"
Aug 12 11:14:52 ip-10-101-8-65 nomad[4493]:     2019-08-12T11:14:52.922Z [INFO ] client.alloc_runner.task_runner: restarting task: alloc_id=f5f88d7f-4187-c146-eee0-0a48e895f66e task=ads reason= delay=0s
Aug 12 11:14:52 ip-10-101-8-65 nomad[4493]:     2019-08-12T11:14:52.933Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=f5f88d7f-4187-c146-eee0-0a48e895f66e task=ads @module=logmon path=/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/alloc/logs/.ads.stdout.fifo timestamp=2019-08-12T11:14:52.929Z
Aug 12 11:14:52 ip-10-101-8-65 nomad[4493]:     2019-08-12T11:14:52.933Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=f5f88d7f-4187-c146-eee0-0a48e895f66e task=ads @module=logmon path=/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/alloc/logs/.ads.stderr.fifo timestamp=2019-08-12T11:14:52.929Z
Aug 12 11:15:11 ip-10-101-8-65 nomad[4493]:     2019/08/12 11:15:11.163577 [DEBUG] (runner) rendering "(dynamic)" => "/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/ads/local/services.env"
Aug 12 11:15:13 ip-10-101-8-65 nomad[4493]:     2019/08/12 11:15:13.155859 [INFO] (runner) rendered "(dynamic)" => "/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/ads/local/services.env"
Aug 12 11:15:13 ip-10-101-8-65 nomad[4493]:     2019/08/12 11:15:13.156466 [DEBUG] (runner) rendering "(dynamic)" => "/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/ads/local/services.env"
Aug 12 11:15:13 ip-10-101-8-65 nomad[4493]:     2019/08/12 11:15:13.738310 [INFO] (runner) rendered "(dynamic)" => "/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/ads/local/services.env"
Aug 12 11:15:13 ip-10-101-8-65 nomad[4493]:     2019/08/12 11:15:13.739211 [DEBUG] (runner) rendering "(dynamic)" => "/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/ads/local/services.env"
Aug 12 11:15:19 ip-10-101-8-65 nomad[4493]:     2019/08/12 11:15:19.124726 [INFO] (runner) rendered "(dynamic)" => "/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/ads/local/services.env"
Aug 12 11:17:26 ip-10-101-8-65 nomad[4493]:     2019-08-12T11:17:26.334Z [DEBUG] client.driver_mgr.docker: binding directories: driver=docker task_name=ads binds="[]string{"/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/alloc:/alloc", "/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/ads/local:/local", "/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/ads/secrets:/secrets"}"
Aug 12 11:17:26 ip-10-101-8-65 nomad[4493]:     2019-08-12T11:17:26.335Z [DEBUG] client.driver_mgr.docker: setting container name: driver=docker task_name=ads container_name=ads-f5f88d7f-4187-c146-eee0-0a48e895f66e```

The text was updated successfully, but these errors were encountered:

danaps · 2019-08-12T18:45:21Z

It is kind of similar to this:
#3498
but I also encountered this issue when there was only one template stanza (Unfortunately i didn't kept the log, it was kibana docker that was rendering elasticsearch service and only that)
@burdandrei :)

burdandrei · 2019-08-13T07:02:49Z

@danaps could you please post the job definition?

danaps · 2019-08-13T10:58:00Z

{
    "Job": {
        "AllAtOnce": false,
        "Constraints": [
            {
                "LTarget": "${attr.kernel.name}",
                "Operand": "=",
                "RTarget": "linux"
            },
            {
                "LTarget": "${node.class}",
                "Operand": "=",
                "RTarget": "093-asg-spot"
            }
        ],
        "CreateIndex": 263893356,
        "Datacenters": [
            "staging-us-east-1"
        ],
        "Dispatched": false,
        "ID": "danap-stack-ads",
        "JobModifyIndex": 263893356,
        "Meta": null,
        "Migrate": null,
        "ModifyIndex": 263895417,
        "Name": "danap-stack-ads",
        "Namespace": "default",
        "ParameterizedJob": null,
        "ParentID": "",
        "Payload": null,
        "Periodic": null,
        "Priority": 50,
        "Region": "staging-us-east-1",
        "Reschedule": null,
        "Stable": false,
        "Status": "running",
        "StatusDescription": "",
        "Stop": false,
        "SubmitTime": 1565692620717080477,
        "TaskGroups": [
            {
                "Constraints": [
                    {
                        "LTarget": "",
                        "Operand": "distinct_hosts",
                        "RTarget": "true"
                    },
                    {
                        "LTarget": "${attr.vault.version}",
                        "Operand": "version",
                        "RTarget": ">= 0.6.1"
                    }
                ],
                "Count": 1,
                "EphemeralDisk": {
                    "Migrate": false,
                    "SizeMB": 300,
                    "Sticky": false
                },
                "Meta": null,
                "Migrate": {
                    "HealthCheck": "checks",
                    "HealthyDeadline": 300000000000,
                    "MaxParallel": 1,
                    "MinHealthyTime": 10000000000
                },
                "Name": "ads",
                "ReschedulePolicy": {
                    "Attempts": 3,
                    "Delay": 30000000000,
                    "DelayFunction": "constant",
                    "Interval": 3600000000000,
                    "MaxDelay": 3600000000000,
                    "Unlimited": false
                },
                "RestartPolicy": {
                    "Attempts": 4,
                    "Delay": 25000000000,
                    "Interval": 300000000000,
                    "Mode": "delay"
                },
                "Tasks": [
                    {
                        "Artifacts": null,
                        "Config": {
                            "force_pull": true,
                            "image": "",
                            "logging": [
                                {
                                    "config": [
                                        {
                                            "gelf-address": "udp://0.0.0.0:5000",
                                            "tag": "{\"es_index\":\"danap\", \"nomad_job_name\":\"${NOMAD_JOB_NAME}\",\"nomad_group_name\":\"${NOMAD_GROUP_NAME}\",\"nomad_task_name\":\"${NOMAD_TASK_NAME}\",\"nomad_alloc_id\":\"${NOMAD_ALLOC_ID}\"}"
                                        }
                                    ],
                                    "type": "gelf"
                                }
                            ],
                            "port_map": [
                                {
                                    "unicorn": 3000.0
                                }
                            ]
                        },
                        "Constraints": null,
                        "DispatchPayload": null,
                        "Driver": "docker",
                        "Env": {
                            "RESQUE_REDIS_DB": "7",
                            "_VERSION": "2019-08-13 10:37:00.195750017 +0000 UTC m=+0.019961268",
                            "CACHE_REDIS_DB": "6",
                            "LOGSTASH_SYNC": "true",
                            "MYSQL_DATABASE": "ads_danap",
                            "RAILS_ENV": "staging"
                        },
                        "KillSignal": "",
                        "KillTimeout": 5000000000,
                        "Leader": false,
                        "LogConfig": {
                            "MaxFileSizeMB": 10,
                            "MaxFiles": 10
                        },
                        "Meta": null,
                        "Name": "ads",
                        "Resources": {
                            "CPU": 1000,
                            "DiskMB": 0,
                            "IOPS": 0,
                            "MemoryMB": 1000,
                            "Networks": [
                                {
                                    "CIDR": "",
                                    "Device": "",
                                    "DynamicPorts": [
                                        {
                                            "Label": "unicorn",
                                            "Value": 0
                                        }
                                    ],
                                    "IP": "",
                                    "MBits": 10,
                                    "ReservedPorts": null
                                }
                            ]
                        },
                        "Services": [
                            {
                                "AddressMode": "auto",
                                "CanaryTags": null,
                                "CheckRestart": null,
                                "Checks": [
                                    {
                                        "AddressMode": "",
                                        "Args": null,
                                        "CheckRestart": null,
                                        "Command": "",
                                        "GRPCService": "",
                                        "GRPCUseTLS": false,
                                        "Header": null,
                                        "Id": "",
                                        "InitialStatus": "",
                                        "Interval": 5000000000,
                                        "Method": "",
                                        "Name": "HTTP Health",
                                        "Path": "/ping",
                                        "PortLabel": "unicorn",
                                        "Protocol": "",
                                        "TLSSkipVerify": false,
                                        "Timeout": 3000000000,
                                        "Type": "http"
                                    }
                                ],
                                "Id": "",
                                "Name": "danap-stack-ads",
                                "PortLabel": "unicorn",
                                "Tags": [
               
                                ]
                            }
                        ],
                        "ShutdownDelay": 0,
                        "Templates": [
                            {
                                "ChangeMode": "restart",
                                "ChangeSignal": "",
                                "DestPath": "local/services.env",
                                "EmbeddedTmpl": "\n  {{ range service \"danap-stack-redis\" }}\n    CACHE_REDIS_HOST=\"{{ .Address }}\"\n    CACHE_REDIS_PORT=\"{{ .Port }}\"\n    RESQUE_REDIS_HOST=\"{{ .Address }}\"\n    RESQUE_REDIS_PORT=\"{{ .Port }}\"\n  {{ end }}\n\n  {{ range service \"danap-stack-mysql\" }}\n    MYSQL_HOST=\"{{ .Address }}\"\n    MYSQL_PORT=\"{{ .Port }}\"\n  {{ end }}\n  ",
                                "Envvars": true,
                                "LeftDelim": "{{",
                                "Perms": "0644",
                                "RightDelim": "}}",
                                "SourcePath": "",
                                "Splay": 5000000000,
                                "VaultGrace": 15000000000
                            }
                        ],
                        "User": "",
                        "Vault": {
                            "ChangeMode": "restart",
                            "ChangeSignal": "SIGHUP",
                            "Env": true,
                            "Policies": [
                                ""
                            ]
                        }
                    }
                ],
                "Update": null
            }
        ],
        "Type": "service",
        "Update": {
            "AutoRevert": false,
            "Canary": 0,
            "HealthCheck": "",
            "HealthyDeadline": 0,
            "MaxParallel": 0,
            "MinHealthyTime": 0,
            "ProgressDeadline": 0,
            "Stagger": 0
        },
        "VaultToken": "",
        "Version": 0
    }
}

elifish4 · 2019-08-22T08:47:36Z

@dadgar Can you assist please? :)

cgbaker · 2019-08-22T20:56:30Z

@elifish4 , I'm looking at this right now.

cgbaker · 2019-08-30T16:47:05Z

@danaps , we've found a race condition around this use case. The issue is that when Nomad registers the upstream service (redis, in your example), it initially comes up as healthy, then the checks are registered and the service is critical, then the checks start passing and the service is healthy again. The result is that the consul-template can render three times; each one of these will try to restart the downstream container (ads, in your example). Depending on how these (restart requests, template renders, and restart attempts) all line up, the race condition can result in the running container having an environment that is incorrect.

This is a bug, which will be addressed. I appreciate your help in tracking this one down. In the short term, increasing the splay parameter in the template reduces the likelihood of this happening (albeit, by giving the container multiple opportunities to restart). Similarly, increasing the check interval for the upstream services can have the same effect.

I will leave this issue open and update it later before this fix is shipped.

elifish4 · 2019-09-01T07:01:52Z

thanks @cgbaker

danaps · 2019-09-04T06:52:12Z

@cgbaker, Thank you

jtrivino95 · 2019-10-29T10:32:53Z

The workaround we went for to solve this is to add a check in the entrypoint of the container. If the envvar is not there, the container will exit. This way the container will be restarted until the template is loaded correctly in the environment.

    if [[ -z $CELERY_BROKER_URL ]]; then
      exit 1;
    fi

lbachar · 2019-12-30T10:14:17Z

When are you going to merge the fix? @schmichael @cgbaker

NetanelMoshkovitz · 2020-04-27T08:53:02Z

@schmichael @cgbaker, any news about this issue?

lbachar · 2020-09-13T13:40:51Z

The bug still occurs on Nomad ver 0.11.4

burdandrei · 2020-09-14T07:37:00Z

how is this not a critical bug?
#3498 was addressed pretty fast.

Having a template re-rendered but env not updated is crucial for anyone running 12-factor like apps

tgross · 2020-09-18T12:50:32Z

@burdandrei we have investigated this one at length and it's decidedly non-trivial, and we have to trade off resources between any given bug, support for enterprise customers, and other feature development. The workaround that @jtrivino95 provided above will handle this case, and in general it's a good practice for applications to assert their dependencies. As you might imagine, bugs with a reasonable workaround tend to fall behind those without.

Just to dump the results of some of our investigations here... the bug is reproducible, albeit not 100% of the time, with the following set of jobs:

postgres.nomad

job "database" {
  datacenters = ["dc1"]

  group "database" {

    restart {
      attempts = 100
      delay    = "3s"
    }

    task "postgres" {
      driver = "docker"

      config {
        image = "postgres"
        port_map {
          db = 5432
        }
      }

      resources {
        cpu    = 500
        memory = 256
        network {
          mbits = 10
          port "db" {}
        }
      }

      service {
        name = "postgres"
        tags = ["global", "postgres"]
        port = "db"
        check {
          name     = "alive"
          type     = "tcp"
          interval = "10s"
          timeout  = "2s"
        }
      }
    }
  }
}

nginx.nomad

job "web" {
  datacenters = ["dc1"]

  group "web" {
    task "nginx" {
      driver = "docker"

      config {
        image = "nginx:alpine"
        port_map {
          http = 80
        }
      }

      resources {
        cpu    = 500
        memory = 256
        network {
          mbits = 10
          port "http" {}
        }
      }

      template {
        change_mode = "restart"
        destination = "local/services.env"
        splay       = "0s"
        env         = true,
        data        = <<EOH
           {{ range service "redis" }}
             UPSTREAM_REDIS_HOST="{{ .Address }}"
             UPSTREAM_REDIS_PORT="{{ .Port }}"
           {{ end }}
           {{ range service "postgres" }}
             UPSTREAM_PSQL_HOST="{{ .Address }}"
             UPSTREAM_PSQL_PORT="{{ .Port }}"
           {{ end }}
        EOH
      }
    }
  }
}

redis.nomad

job "cache" {
  datacenters = ["dc1"]

  group "cache" {

    restart {
      attempts = 100
      delay    = "3s"
    }

    task "redis" {
      driver = "docker"

      config {
        image = "redis:3.2"
        port_map {
          db = 6379
        }
      }

      resources {
        cpu    = 500
        memory = 256
        network {
          mbits = 10
          port "db" {}
        }
      }

      service {
        name = "redis"
        tags = ["global", "cache"]
        port = "db"
        check {
          name     = "alive"
          type     = "tcp"
          interval = "10s"
          timeout  = "2s"
        }
      }
    }
  }
}

Run Consul and Nomad
Run postgres, redis and nginx jobs
Monitor the environment in the nginx container: watch -n 0.1 'docker ps | grep nginx | awk "{print \$1}" | xargs -I_ docker exec _ env | grep UPSTREAM'
Monitor the services in the nginx container: watch -n 0.1 'docker ps | grep nginx | awk "{print \$1}" | xargs -I_ docker exec _ cat /local/services.env'
Kill either the postgres or redis containers.

At that point, the nginx allocation will restart without the killed upstream. After the killed task is restarted, it should show up in the rendered template.

Expected behavior: the associated environment variable shows up in the container process env.
Reproduced bug: the associated environment variable does not show up in the container process env.

The diagram below illustrates the window of the race:

As you can see, it's not just a matter of atomically registering the Consul service and the health check at the same time (which would be #3935). And it's not quite as easy as merging some data structures as we did with #3498. But if this is an area that interests you, we'd love to hear your ideas! Thanks!

burdandrei · 2020-09-21T09:50:57Z

Thanks for so comprehensive answer @tgross!
Sorry if it was looking like I'm trying to manage nomad's team resources. It's clear that the team got a lot on its plate.
To be honest, I never dug the code to follow the templates flow, just assumed that after every rewrite restart is queued to ensure that all the environment variables are set to be in place.

P.S. timeline graph is awesome!

jpatidar30 · 2024-06-27T10:55:20Z

Any updates on this bug? is it fixed in the newer versions? we facing this issue in nomad version 1.8.0

jrasell · 2024-06-27T11:13:48Z

Hi @jpatidar30 there are no updates currently. When an engineer is assigned to this and working on it, updates will be provided in the issue.

cgbaker self-assigned this Aug 22, 2019

langmartin mentioned this issue Aug 22, 2019

Nomad template not reading env variables if generated dynamically. #5982

Closed

cgbaker added type/bug theme/client theme/consul theme/consul-template labels Aug 30, 2019

cgbaker added this to the near-term milestone Aug 30, 2019

cgbaker modified the milestones: near-term, 0.10.1 Sep 3, 2019

cgbaker removed their assignment Oct 29, 2019

cgbaker self-assigned this Oct 29, 2019

cgbaker modified the milestones: 0.10.1, 0.10.2 Oct 31, 2019

cgbaker removed their assignment Nov 20, 2019

cgbaker modified the milestones: 0.10.2, 0.10.3 Nov 20, 2019

schmichael removed this from the 0.10.3 milestone Dec 6, 2019

tgross added the stage/accepted Confirmed, and intend to work on. No timeline committment though. label Jan 26, 2021

jrasell added the theme/service-discovery/consul label Apr 11, 2022

OfficerKoo mentioned this issue May 23, 2024

Sometimes template stanza does not reload allocations #22224

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

template stanza doesn't load env as expected #6112

template stanza doesn't load env as expected #6112

danaps commented Aug 12, 2019

danaps commented Aug 12, 2019

burdandrei commented Aug 13, 2019

danaps commented Aug 13, 2019

elifish4 commented Aug 22, 2019

cgbaker commented Aug 22, 2019

cgbaker commented Aug 30, 2019

elifish4 commented Sep 1, 2019

danaps commented Sep 4, 2019

jtrivino95 commented Oct 29, 2019

lbachar commented Dec 30, 2019 •

edited

Loading

NetanelMoshkovitz commented Apr 27, 2020 •

edited

Loading

lbachar commented Sep 13, 2020 •

edited

Loading

burdandrei commented Sep 14, 2020

tgross commented Sep 18, 2020

burdandrei commented Sep 21, 2020

jpatidar30 commented Jun 27, 2024

jrasell commented Jun 27, 2024 •

edited

Loading

template stanza doesn't load env as expected #6112

template stanza doesn't load env as expected #6112

Comments

danaps commented Aug 12, 2019

Nomad version

Operating system and Environment details

Issue

Reproduction steps

Nomad Client logs (if appropriate)

danaps commented Aug 12, 2019

burdandrei commented Aug 13, 2019

danaps commented Aug 13, 2019

elifish4 commented Aug 22, 2019

cgbaker commented Aug 22, 2019

cgbaker commented Aug 30, 2019

elifish4 commented Sep 1, 2019

danaps commented Sep 4, 2019

jtrivino95 commented Oct 29, 2019

lbachar commented Dec 30, 2019 • edited Loading

NetanelMoshkovitz commented Apr 27, 2020 • edited Loading

lbachar commented Sep 13, 2020 • edited Loading

burdandrei commented Sep 14, 2020

tgross commented Sep 18, 2020

burdandrei commented Sep 21, 2020

jpatidar30 commented Jun 27, 2024

jrasell commented Jun 27, 2024 • edited Loading

lbachar commented Dec 30, 2019 •

edited

Loading

NetanelMoshkovitz commented Apr 27, 2020 •

edited

Loading

lbachar commented Sep 13, 2020 •

edited

Loading

jrasell commented Jun 27, 2024 •

edited

Loading