Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

template stanza doesn't load env as expected #6112

Open
danaps opened this issue Aug 12, 2019 · 17 comments
Open

template stanza doesn't load env as expected #6112

danaps opened this issue Aug 12, 2019 · 17 comments

Comments

@danaps
Copy link

danaps commented Aug 12, 2019

Nomad version

Nomad v0.9.4 (a81aa846a45fb8248551b12616287cb57c418cd6)
(encountered the same issue on 0.9.3)

Operating system and Environment details

Ubuntu 16.04.6 LTS
4.4.0-1085-aws #96-Ubuntu SMP Tue Jun 11 09:08:32 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

Issue

when using range service in nomad template stanza and setting env = true it write the output to the destination file but it doesn't always load the env var as expected. the issue happened sporadically.

Reproduction steps

I have a task that watch a service (redis) in consul,
I killed the docker (docker kill) that run the redis service. the task with redis raised again, then the task that watched the redis restarted as well, (used the default change_mode restart) the file was re-renderd but the env var did not load. it happens sporadically and I must add that I had to kill the redis like 3 or 4 times until the issue was reproduced.

Nomad Client logs (if appropriate)

the task that watch redis:

docker ps |grep f5f88d7f
88500180a7b3        62b95289a942                                                         "unicorn -c config/u…"   8 seconds ago       Up 5 seconds        10.101.8.65:23101->3000/tcp, 10.101.8.65:23101->3000/udp               ads-f5f88d7f-4187-c146-eee0-0a48e895f66e

docker exec -it 88500180a7b3 env |grep REDIS
CACHE_REDIS_DB=6
CACHE_REDIS_HOST=10.101.8.200
CACHE_REDIS_PORT=31940
RESQUE_REDIS_DB=7
RESQUE_REDIS_HOST=10.101.8.200
RESQUE_REDIS_PORT=31940
docker exec -it 88500180a7b3 cat /local/services.env

CACHE_REDIS_HOST="10.101.8.200"
CACHE_REDIS_PORT="31940"
RESQUE_REDIS_HOST="10.101.8.200"
RESQUE_REDIS_PORT="31940"
  
MYSQL_HOST="10.101.8.116"
MYSQL_PORT="31599"

the task that run the redis service:

docker ps |grep 68d48fe2
c5868397b5b8        87856cc39862        "docker-entrypoint.s…"   14 minutes ago      Up 14 minutes       10.101.8.200:31940->6379/tcp, 10.101.8.200:31940->6379/udp                                                               redis-68d48fe2-0743-dd26-051a-e0102fab21bc
docker kill c5868397b5b8

docker ps |grep 68d48fe2
97109d3f2dd0        87856cc39862        "docker-entrypoint.s…"   8 seconds ago       Up 7 seconds        10.101.8.200:31940->6379/tcp, 10.101.8.200:31940->6379/udp                                                               redis-68d48fe2-0743-dd26-051a-e0102fab21bc

again the task that watch redis, you can see that there are now missing envs:

docker ps |grep f5f88d7f
8d14dab4e756        402837048690.dkr.ecr.us-east-1.amazonaws.com/ads:master              "unicorn -c config/u…"   36 seconds ago      Up 34 seconds       10.101.8.65:23101->3000/tcp, 10.101.8.65:23101->3000/udp               ads-f5f88d7f-4187-c146-eee0-0a48e895f66e
docker exec -it 8d14dab4e756 cat /local/services.env

CACHE_REDIS_HOST="10.101.8.200"
CACHE_REDIS_PORT="31940"
RESQUE_REDIS_HOST="10.101.8.200"
RESQUE_REDIS_PORT="31940"

MYSQL_HOST="10.101.8.116"
MYSQL_PORT="31599"

docker exec -it 8d14dab4e756 env |grep REDIS
CACHE_REDIS_DB=6
RESQUE_REDIS_DB=7

nomad logs:

Aug 12 10:59:04 ip-10-101-8-65 nomad[4493]:     2019/08/12 10:59:04.746979 [DEBUG] (runner) rendering "(dynamic)" => "/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/ads/local/services.env"
Aug 12 10:59:04 ip-10-101-8-65 nomad[4493]:     2019/08/12 10:59:04.754962 [INFO] (runner) rendered "(dynamic)" => "/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/ads/local/services.env"
Aug 12 10:59:08 ip-10-101-8-65 nomad[4493]:     2019-08-12T10:59:08.044Z [ERROR] client.alloc_runner.task_runner.task_hook.stats_hook: failed to start stats collection for task with unrecoverable error: alloc_id=f5f88d7f-4187-c146-eee0-0a48e895f66e task=ads error="container stopped"
Aug 12 10:59:08 ip-10-101-8-65 nomad[4493]:     2019-08-12T10:59:08.065Z [INFO ] client.alloc_runner.task_runner: restarting task: alloc_id=f5f88d7f-4187-c146-eee0-0a48e895f66e task=ads reason= delay=0s
Aug 12 10:59:08 ip-10-101-8-65 nomad[4493]:     2019-08-12T10:59:08.072Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=f5f88d7f-4187-c146-eee0-0a48e895f66e task=ads @module=logmon path=/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/alloc/logs/.ads.stdout.fifo timestamp=2019-08-12T10:59:08.070Z
Aug 12 10:59:08 ip-10-101-8-65 nomad[4493]:     2019-08-12T10:59:08.076Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=f5f88d7f-4187-c146-eee0-0a48e895f66e task=ads path=/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/alloc/logs/.ads.stderr.fifo @module=logmon timestamp=2019-08-12T10:59:08.072Z
Aug 12 10:59:08 ip-10-101-8-65 nomad[4493]:     2019-08-12T10:59:08.194Z [DEBUG] client.driver_mgr.docker: binding directories: driver=docker task_name=ads binds="[]string{"/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/alloc:/alloc", "/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/ads/local:/local", "/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/ads/secrets:/secrets"}"
Aug 12 10:59:08 ip-10-101-8-65 nomad[4493]:     2019-08-12T10:59:08.194Z [DEBUG] client.driver_mgr.docker: setting container name: driver=docker task_name=ads container_name=ads-f5f88d7f-4187-c146-eee0-0a48e895f66e
Aug 12 11:00:02 ip-10-101-8-65 nomad[4493]:     2019/08/12 11:00:02.604545 [DEBUG] (runner) rendering "(dynamic)" => "/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/ads/local/services.env"
Aug 12 11:00:02 ip-10-101-8-65 nomad[4493]:     2019/08/12 11:00:02.631092 [INFO] (runner) rendered "(dynamic)" => "/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/ads/local/services.env"
Aug 12 11:00:08 ip-10-101-8-65 nomad[4493]:     2019-08-12T11:00:08.419Z [ERROR] client.alloc_runner.task_runner.task_hook.stats_hook: failed to start stats collection for task with unrecoverable error: alloc_id=f5f88d7f-4187-c146-eee0-0a48e895f66e task=ads error="container stopped"
Aug 12 11:00:08 ip-10-101-8-65 nomad[4493]:     2019-08-12T11:00:08.712Z [INFO ] client.alloc_runner.task_runner: restarting task: alloc_id=f5f88d7f-4187-c146-eee0-0a48e895f66e task=ads reason= delay=0s
Aug 12 11:00:08 ip-10-101-8-65 nomad[4493]:     2019-08-12T11:00:08.721Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=f5f88d7f-4187-c146-eee0-0a48e895f66e task=ads path=/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/alloc/logs/.ads.stdout.fifo @module=logmon timestamp=2019-08-12T11:00:08.718Z
Aug 12 11:00:08 ip-10-101-8-65 nomad[4493]:     2019-08-12T11:00:08.724Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=f5f88d7f-4187-c146-eee0-0a48e895f66e task=ads @module=logmon path=/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/alloc/logs/.ads.stderr.fifo timestamp=2019-08-12T11:00:08.718Z
Aug 12 11:00:28 ip-10-101-8-65 nomad[4493]:     2019/08/12 11:00:28.861960 [DEBUG] (runner) rendering "(dynamic)" => "/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/ads/local/services.env"
Aug 12 11:00:29 ip-10-101-8-65 nomad[4493]:     2019/08/12 11:00:29.521514 [INFO] (runner) rendered "(dynamic)" => "/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/ads/local/services.env"
Aug 12 11:00:29 ip-10-101-8-65 nomad[4493]:     2019/08/12 11:00:29.523164 [DEBUG] (runner) rendering "(dynamic)" => "/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/ads/local/services.env"
Aug 12 11:00:30 ip-10-101-8-65 nomad[4493]:     2019/08/12 11:00:30.835442 [INFO] (runner) rendered "(dynamic)" => "/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/ads/local/services.env"
Aug 12 11:00:37 ip-10-101-8-65 nomad[4493]:     2019/08/12 11:00:37.843632 [DEBUG] (runner) rendering "(dynamic)" => "/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/ads/local/services.env"
Aug 12 11:00:37 ip-10-101-8-65 nomad[4493]:     2019/08/12 11:00:37.847395 [INFO] (runner) rendered "(dynamic)" => "/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/ads/local/services.env"
Aug 12 11:02:09 ip-10-101-8-65 nomad[4493]:     2019-08-12T11:02:09.075Z [DEBUG] client.driver_mgr.docker: binding directories: driver=docker task_name=ads binds="[]string{"/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/alloc:/alloc", "/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/ads/local:/local", "/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/ads/secrets:/secrets"}"
Aug 12 11:02:09 ip-10-101-8-65 nomad[4493]:     2019-08-12T11:02:09.075Z [DEBUG] client.driver_mgr.docker: setting container name: driver=docker task_name=ads container_name=ads-f5f88d7f-4187-c146-eee0-0a48e895f66e
Aug 12 11:14:44 ip-10-101-8-65 nomad[4493]:     2019/08/12 11:14:44.390644 [DEBUG] (runner) rendering "(dynamic)" => "/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/ads/local/services.env"
Aug 12 11:14:44 ip-10-101-8-65 nomad[4493]:     2019/08/12 11:14:44.397878 [INFO] (runner) rendered "(dynamic)" => "/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/ads/local/services.env"
Aug 12 11:14:49 ip-10-101-8-65 nomad[4493]:     2019-08-12T11:14:49.171Z [ERROR] client.alloc_runner.task_runner.task_hook.stats_hook: failed to start stats collection for task with unrecoverable error: alloc_id=f5f88d7f-4187-c146-eee0-0a48e895f66e task=ads error="container stopped"
Aug 12 11:14:52 ip-10-101-8-65 nomad[4493]:     2019-08-12T11:14:52.922Z [INFO ] client.alloc_runner.task_runner: restarting task: alloc_id=f5f88d7f-4187-c146-eee0-0a48e895f66e task=ads reason= delay=0s
Aug 12 11:14:52 ip-10-101-8-65 nomad[4493]:     2019-08-12T11:14:52.933Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=f5f88d7f-4187-c146-eee0-0a48e895f66e task=ads @module=logmon path=/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/alloc/logs/.ads.stdout.fifo timestamp=2019-08-12T11:14:52.929Z
Aug 12 11:14:52 ip-10-101-8-65 nomad[4493]:     2019-08-12T11:14:52.933Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=f5f88d7f-4187-c146-eee0-0a48e895f66e task=ads @module=logmon path=/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/alloc/logs/.ads.stderr.fifo timestamp=2019-08-12T11:14:52.929Z
Aug 12 11:15:11 ip-10-101-8-65 nomad[4493]:     2019/08/12 11:15:11.163577 [DEBUG] (runner) rendering "(dynamic)" => "/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/ads/local/services.env"
Aug 12 11:15:13 ip-10-101-8-65 nomad[4493]:     2019/08/12 11:15:13.155859 [INFO] (runner) rendered "(dynamic)" => "/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/ads/local/services.env"
Aug 12 11:15:13 ip-10-101-8-65 nomad[4493]:     2019/08/12 11:15:13.156466 [DEBUG] (runner) rendering "(dynamic)" => "/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/ads/local/services.env"
Aug 12 11:15:13 ip-10-101-8-65 nomad[4493]:     2019/08/12 11:15:13.738310 [INFO] (runner) rendered "(dynamic)" => "/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/ads/local/services.env"
Aug 12 11:15:13 ip-10-101-8-65 nomad[4493]:     2019/08/12 11:15:13.739211 [DEBUG] (runner) rendering "(dynamic)" => "/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/ads/local/services.env"
Aug 12 11:15:19 ip-10-101-8-65 nomad[4493]:     2019/08/12 11:15:19.124726 [INFO] (runner) rendered "(dynamic)" => "/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/ads/local/services.env"
Aug 12 11:17:26 ip-10-101-8-65 nomad[4493]:     2019-08-12T11:17:26.334Z [DEBUG] client.driver_mgr.docker: binding directories: driver=docker task_name=ads binds="[]string{"/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/alloc:/alloc", "/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/ads/local:/local", "/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/ads/secrets:/secrets"}"
Aug 12 11:17:26 ip-10-101-8-65 nomad[4493]:     2019-08-12T11:17:26.335Z [DEBUG] client.driver_mgr.docker: setting container name: driver=docker task_name=ads container_name=ads-f5f88d7f-4187-c146-eee0-0a48e895f66e```
@danaps
Copy link
Author

danaps commented Aug 12, 2019

It is kind of similar to this:
#3498
but I also encountered this issue when there was only one template stanza (Unfortunately i didn't kept the log, it was kibana docker that was rendering elasticsearch service and only that)
@burdandrei :)

@burdandrei
Copy link
Contributor

@danaps could you please post the job definition?

@danaps
Copy link
Author

danaps commented Aug 13, 2019

{
    "Job": {
        "AllAtOnce": false,
        "Constraints": [
            {
                "LTarget": "${attr.kernel.name}",
                "Operand": "=",
                "RTarget": "linux"
            },
            {
                "LTarget": "${node.class}",
                "Operand": "=",
                "RTarget": "093-asg-spot"
            }
        ],
        "CreateIndex": 263893356,
        "Datacenters": [
            "staging-us-east-1"
        ],
        "Dispatched": false,
        "ID": "danap-stack-ads",
        "JobModifyIndex": 263893356,
        "Meta": null,
        "Migrate": null,
        "ModifyIndex": 263895417,
        "Name": "danap-stack-ads",
        "Namespace": "default",
        "ParameterizedJob": null,
        "ParentID": "",
        "Payload": null,
        "Periodic": null,
        "Priority": 50,
        "Region": "staging-us-east-1",
        "Reschedule": null,
        "Stable": false,
        "Status": "running",
        "StatusDescription": "",
        "Stop": false,
        "SubmitTime": 1565692620717080477,
        "TaskGroups": [
            {
                "Constraints": [
                    {
                        "LTarget": "",
                        "Operand": "distinct_hosts",
                        "RTarget": "true"
                    },
                    {
                        "LTarget": "${attr.vault.version}",
                        "Operand": "version",
                        "RTarget": ">= 0.6.1"
                    }
                ],
                "Count": 1,
                "EphemeralDisk": {
                    "Migrate": false,
                    "SizeMB": 300,
                    "Sticky": false
                },
                "Meta": null,
                "Migrate": {
                    "HealthCheck": "checks",
                    "HealthyDeadline": 300000000000,
                    "MaxParallel": 1,
                    "MinHealthyTime": 10000000000
                },
                "Name": "ads",
                "ReschedulePolicy": {
                    "Attempts": 3,
                    "Delay": 30000000000,
                    "DelayFunction": "constant",
                    "Interval": 3600000000000,
                    "MaxDelay": 3600000000000,
                    "Unlimited": false
                },
                "RestartPolicy": {
                    "Attempts": 4,
                    "Delay": 25000000000,
                    "Interval": 300000000000,
                    "Mode": "delay"
                },
                "Tasks": [
                    {
                        "Artifacts": null,
                        "Config": {
                            "force_pull": true,
                            "image": "",
                            "logging": [
                                {
                                    "config": [
                                        {
                                            "gelf-address": "udp://0.0.0.0:5000",
                                            "tag": "{\"es_index\":\"danap\", \"nomad_job_name\":\"${NOMAD_JOB_NAME}\",\"nomad_group_name\":\"${NOMAD_GROUP_NAME}\",\"nomad_task_name\":\"${NOMAD_TASK_NAME}\",\"nomad_alloc_id\":\"${NOMAD_ALLOC_ID}\"}"
                                        }
                                    ],
                                    "type": "gelf"
                                }
                            ],
                            "port_map": [
                                {
                                    "unicorn": 3000.0
                                }
                            ]
                        },
                        "Constraints": null,
                        "DispatchPayload": null,
                        "Driver": "docker",
                        "Env": {
                            "RESQUE_REDIS_DB": "7",
                            "_VERSION": "2019-08-13 10:37:00.195750017 +0000 UTC m=+0.019961268",
                            "CACHE_REDIS_DB": "6",
                            "LOGSTASH_SYNC": "true",
                            "MYSQL_DATABASE": "ads_danap",
                            "RAILS_ENV": "staging"
                        },
                        "KillSignal": "",
                        "KillTimeout": 5000000000,
                        "Leader": false,
                        "LogConfig": {
                            "MaxFileSizeMB": 10,
                            "MaxFiles": 10
                        },
                        "Meta": null,
                        "Name": "ads",
                        "Resources": {
                            "CPU": 1000,
                            "DiskMB": 0,
                            "IOPS": 0,
                            "MemoryMB": 1000,
                            "Networks": [
                                {
                                    "CIDR": "",
                                    "Device": "",
                                    "DynamicPorts": [
                                        {
                                            "Label": "unicorn",
                                            "Value": 0
                                        }
                                    ],
                                    "IP": "",
                                    "MBits": 10,
                                    "ReservedPorts": null
                                }
                            ]
                        },
                        "Services": [
                            {
                                "AddressMode": "auto",
                                "CanaryTags": null,
                                "CheckRestart": null,
                                "Checks": [
                                    {
                                        "AddressMode": "",
                                        "Args": null,
                                        "CheckRestart": null,
                                        "Command": "",
                                        "GRPCService": "",
                                        "GRPCUseTLS": false,
                                        "Header": null,
                                        "Id": "",
                                        "InitialStatus": "",
                                        "Interval": 5000000000,
                                        "Method": "",
                                        "Name": "HTTP Health",
                                        "Path": "/ping",
                                        "PortLabel": "unicorn",
                                        "Protocol": "",
                                        "TLSSkipVerify": false,
                                        "Timeout": 3000000000,
                                        "Type": "http"
                                    }
                                ],
                                "Id": "",
                                "Name": "danap-stack-ads",
                                "PortLabel": "unicorn",
                                "Tags": [
               
                                ]
                            }
                        ],
                        "ShutdownDelay": 0,
                        "Templates": [
                            {
                                "ChangeMode": "restart",
                                "ChangeSignal": "",
                                "DestPath": "local/services.env",
                                "EmbeddedTmpl": "\n  {{ range service \"danap-stack-redis\" }}\n    CACHE_REDIS_HOST=\"{{ .Address }}\"\n    CACHE_REDIS_PORT=\"{{ .Port }}\"\n    RESQUE_REDIS_HOST=\"{{ .Address }}\"\n    RESQUE_REDIS_PORT=\"{{ .Port }}\"\n  {{ end }}\n\n  {{ range service \"danap-stack-mysql\" }}\n    MYSQL_HOST=\"{{ .Address }}\"\n    MYSQL_PORT=\"{{ .Port }}\"\n  {{ end }}\n  ",
                                "Envvars": true,
                                "LeftDelim": "{{",
                                "Perms": "0644",
                                "RightDelim": "}}",
                                "SourcePath": "",
                                "Splay": 5000000000,
                                "VaultGrace": 15000000000
                            }
                        ],
                        "User": "",
                        "Vault": {
                            "ChangeMode": "restart",
                            "ChangeSignal": "SIGHUP",
                            "Env": true,
                            "Policies": [
                                ""
                            ]
                        }
                    }
                ],
                "Update": null
            }
        ],
        "Type": "service",
        "Update": {
            "AutoRevert": false,
            "Canary": 0,
            "HealthCheck": "",
            "HealthyDeadline": 0,
            "MaxParallel": 0,
            "MinHealthyTime": 0,
            "ProgressDeadline": 0,
            "Stagger": 0
        },
        "VaultToken": "",
        "Version": 0
    }
}

@elifish4
Copy link

@dadgar Can you assist please? :)

@cgbaker
Copy link
Contributor

cgbaker commented Aug 22, 2019

@elifish4 , I'm looking at this right now.

@cgbaker
Copy link
Contributor

cgbaker commented Aug 30, 2019

@danaps , we've found a race condition around this use case. The issue is that when Nomad registers the upstream service (redis, in your example), it initially comes up as healthy, then the checks are registered and the service is critical, then the checks start passing and the service is healthy again. The result is that the consul-template can render three times; each one of these will try to restart the downstream container (ads, in your example). Depending on how these (restart requests, template renders, and restart attempts) all line up, the race condition can result in the running container having an environment that is incorrect.

This is a bug, which will be addressed. I appreciate your help in tracking this one down. In the short term, increasing the splay parameter in the template reduces the likelihood of this happening (albeit, by giving the container multiple opportunities to restart). Similarly, increasing the check interval for the upstream services can have the same effect.

I will leave this issue open and update it later before this fix is shipped.

@elifish4
Copy link

elifish4 commented Sep 1, 2019

thanks @cgbaker

@cgbaker cgbaker modified the milestones: near-term, 0.10.1 Sep 3, 2019
@danaps
Copy link
Author

danaps commented Sep 4, 2019

@cgbaker, Thank you

@jtrivino95
Copy link
Contributor

The workaround we went for to solve this is to add a check in the entrypoint of the container. If the envvar is not there, the container will exit. This way the container will be restarted until the template is loaded correctly in the environment.

    if [[ -z $CELERY_BROKER_URL ]]; then
      exit 1;
    fi

@cgbaker cgbaker removed their assignment Oct 29, 2019
@cgbaker cgbaker self-assigned this Oct 29, 2019
@cgbaker cgbaker modified the milestones: 0.10.1, 0.10.2 Oct 31, 2019
@cgbaker cgbaker removed their assignment Nov 20, 2019
@cgbaker cgbaker modified the milestones: 0.10.2, 0.10.3 Nov 20, 2019
@schmichael schmichael removed this from the 0.10.3 milestone Dec 6, 2019
@lbachar
Copy link

lbachar commented Dec 30, 2019

When are you going to merge the fix? @schmichael @cgbaker

@NetanelMoshkovitz
Copy link

NetanelMoshkovitz commented Apr 27, 2020

@schmichael @cgbaker, any news about this issue?

@lbachar
Copy link

lbachar commented Sep 13, 2020

The bug still occurs on Nomad ver 0.11.4

@burdandrei
Copy link
Contributor

how is this not a critical bug?
#3498 was addressed pretty fast.

Having a template re-rendered but env not updated is crucial for anyone running 12-factor like apps

@tgross
Copy link
Member

tgross commented Sep 18, 2020

@burdandrei we have investigated this one at length and it's decidedly non-trivial, and we have to trade off resources between any given bug, support for enterprise customers, and other feature development. The workaround that @jtrivino95 provided above will handle this case, and in general it's a good practice for applications to assert their dependencies. As you might imagine, bugs with a reasonable workaround tend to fall behind those without.

Just to dump the results of some of our investigations here... the bug is reproducible, albeit not 100% of the time, with the following set of jobs:

postgres.nomad
job "database" {
  datacenters = ["dc1"]

  group "database" {

    restart {
      attempts = 100
      delay    = "3s"
    }

    task "postgres" {
      driver = "docker"

      config {
        image = "postgres"
        port_map {
          db = 5432
        }
      }

      resources {
        cpu    = 500
        memory = 256
        network {
          mbits = 10
          port "db" {}
        }
      }

      service {
        name = "postgres"
        tags = ["global", "postgres"]
        port = "db"
        check {
          name     = "alive"
          type     = "tcp"
          interval = "10s"
          timeout  = "2s"
        }
      }
    }
  }
}
nginx.nomad
job "web" {
  datacenters = ["dc1"]

  group "web" {
    task "nginx" {
      driver = "docker"

      config {
        image = "nginx:alpine"
        port_map {
          http = 80
        }
      }

      resources {
        cpu    = 500
        memory = 256
        network {
          mbits = 10
          port "http" {}
        }
      }

      template {
        change_mode = "restart"
        destination = "local/services.env"
        splay       = "0s"
        env         = true,
        data        = <<EOH
           {{ range service "redis" }}
             UPSTREAM_REDIS_HOST="{{ .Address }}"
             UPSTREAM_REDIS_PORT="{{ .Port }}"
           {{ end }}
           {{ range service "postgres" }}
             UPSTREAM_PSQL_HOST="{{ .Address }}"
             UPSTREAM_PSQL_PORT="{{ .Port }}"
           {{ end }}
        EOH
      }
    }
  }
}
redis.nomad
job "cache" {
  datacenters = ["dc1"]

  group "cache" {

    restart {
      attempts = 100
      delay    = "3s"
    }

    task "redis" {
      driver = "docker"

      config {
        image = "redis:3.2"
        port_map {
          db = 6379
        }
      }

      resources {
        cpu    = 500
        memory = 256
        network {
          mbits = 10
          port "db" {}
        }
      }

      service {
        name = "redis"
        tags = ["global", "cache"]
        port = "db"
        check {
          name     = "alive"
          type     = "tcp"
          interval = "10s"
          timeout  = "2s"
        }
      }
    }
  }
}
  1. Run Consul and Nomad
  2. Run postgres, redis and nginx jobs
  3. Monitor the environment in the nginx container: watch -n 0.1 'docker ps | grep nginx | awk "{print \$1}" | xargs -I_ docker exec _ env | grep UPSTREAM'
  4. Monitor the services in the nginx container: watch -n 0.1 'docker ps | grep nginx | awk "{print \$1}" | xargs -I_ docker exec _ cat /local/services.env'
  5. Kill either the postgres or redis containers.

At that point, the nginx allocation will restart without the killed upstream. After the killed task is restarted, it should show up in the rendered template.

  • Expected behavior: the associated environment variable shows up in the container process env.
  • Reproduced bug: the associated environment variable does not show up in the container process env.

The diagram below illustrates the window of the race:

screenshot

As you can see, it's not just a matter of atomically registering the Consul service and the health check at the same time (which would be #3935). And it's not quite as easy as merging some data structures as we did with #3498. But if this is an area that interests you, we'd love to hear your ideas! Thanks!

@burdandrei
Copy link
Contributor

Thanks for so comprehensive answer @tgross!
Sorry if it was looking like I'm trying to manage nomad's team resources. It's clear that the team got a lot on its plate.
To be honest, I never dug the code to follow the templates flow, just assumed that after every rewrite restart is queued to ensure that all the environment variables are set to be in place.

P.S. timeline graph is awesome!

@tgross tgross added the stage/accepted Confirmed, and intend to work on. No timeline committment though. label Jan 26, 2021
@jpatidar30
Copy link

Any updates on this bug? is it fixed in the newer versions? we facing this issue in nomad version 1.8.0

@jrasell
Copy link
Member

jrasell commented Jun 27, 2024

Hi @jpatidar30 there are no updates currently. When an engineer is assigned to this and working on it, updates will be provided in the issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests