[Bug] Task is not restarted in the first template render after agent restarting #6638

Closed
liujingchen opened this issue Nov 7, 2019 · 3 comments


Nomad version

Nomad v0.10.0 (25ee121)
(This probably affects every version since 0.9.)

Operating system and Environment details

Ubuntu 18.04
with Consul 1.6.1 and Vault 1.2.1

Issue

When the Nomad client agent is restarted, it renders all the templates once for the existing tasks it manages, but it does not restart a task whose rendered file content changed.
I guess the related code is here or here. The first render seems to be handled differently from later render events: it does not check change_mode to decide whether to restart the task. That makes sense for newly started tasks, but not for existing tasks that depend on Vault secrets.

What happened to me was:

A few days ago, I restarted all our Nomad agents while upgrading to 0.10.0. Here are the relevant logs from the Nomad agent (sorry, I had to mask some information because this is our production environment):

....
Nov 01 16:02:29 ip-****** systemd[1]: Stopped Nomad Cluster Manager.
Nov 01 16:02:29 ip-****** systemd[1]: Started Nomad Cluster Manager.
Nov 01 16:02:29 ip-****** nomad[32274]: ==> Loaded configuration from 
...
Nov 01 16:02:29 ip-****** nomad[32274]: ==> Starting Nomad agent...
Nov 01 16:02:29 ip-****** nomad[32274]: ==> Nomad agent configuration:
Nov 01 16:02:29 ip-****** nomad[32274]:        Advertise Addrs: HTTP: ***.***.***.***:4646
Nov 01 16:02:29 ip-****** nomad[32274]:             Bind Addrs: HTTP: 0.0.0.0:4646
Nov 01 16:02:29 ip-****** nomad[32274]:                 Client: true
Nov 01 16:02:29 ip-****** nomad[32274]:              Log Level: INFO
Nov 01 16:02:29 ip-****** nomad[32274]:                 Region: ******(DC: ******)
Nov 01 16:02:29 ip-****** nomad[32274]:                 Server: false
Nov 01 16:02:29 ip-****** nomad[32274]:                Version: 0.10.0
Nov 01 16:02:29 ip-****** nomad[32274]: ==> Nomad agent started! Log data will stream in below:
.....

Nov 01 16:02:30 ip-****** nomad[32274]:     2019/11/01 16:02:30.149003 [INFO] (runner) rendered "(dynamic)" => "/var/lib/nomad/alloc/ecef2c6b-e09b-9cbb-23e6-15d076939b8c/app/secrets/config.env"
Nov 01 16:02:30 ip-****** nomad[32274]:     2019/11/01 16:02:30.189143 [INFO] (runner) rendered "(dynamic)" => "/var/lib/nomad/alloc/99a9907c-bff3-d2dd-34e7-ef401a3d9aa4/app/secrets/config.env"
Nov 01 16:02:30 ip-****** nomad[32274]:     2019/11/01 16:02:30.209191 [INFO] (runner) rendered "(dynamic)" => "/var/lib/nomad/alloc/cd2731bc-5380-4ca5-3255-3af08125f9ee/app/secrets/config.env"
Nov 01 16:02:39 ip-****** nomad[32274]:     2019-11-01T16:02:39.283+0900 [INFO ] client: node registration complete

Nomad rendered the files, but it did not restart the tasks even though some of the rendered files had changed. The changed values were the database username and password from Vault, probably because the old lease was close to its expiration time.

Since the rendered file is used as environment variables in my task, I could easily confirm this:

jingchen.liu@ip-******:/etc$ sudo docker exec -it 15304b79e832 /bin/bash
bash-4.4$ cd /secrets/
bash-4.4$ ls -l
total 8
-rw-r--r--    1 root     root           373 Nov  1 07:02 config.env
-rw-r--r--    1 root     root            26 Sep 26 10:26 vault_token
bash-4.4$ cat config.env 
.....
DB_USER_NAME='v-token-*****-bhluS*********'
DB_PASSWORD='A1a-z21wl**************'
DB_HOST='************.ap-northeast-1.rds.amazonaws.com'
bash-4.4$ env | grep DB
DB_NAME=************
DB_PASSWORD=A1a-nCtwG********
DB_USER_NAME=v-token-********-OghC28**************
DB_HOST=************.ap-northeast-1.rds.amazonaws.com

You can see that the values rendered on disk differ from the actual env output. The application was still using the old username and password, while Nomad assumed it had already delivered the latest ones. When the old database account expired today, the application was not restarted either, and it failed to connect to the database.

I guess the solution is for the "handleFirstRender" code to treat newly started tasks and existing tasks differently.

Reproduction steps

You can do the same thing as described above: create a job that uses secrets from Vault, restart the Nomad agent when the lease is close to expiration, and observe that the file content changes but the task is not restarted.

Or there may be an easier way (I haven't tried it, though): create a job with a template that reads a value from Consul. Then do the following very quickly: stop the Nomad client agent where the job is running, change the value in Consul, and start the Nomad client agent again. If the Nomad server does not notice the client agent's short downtime, the task should still be running, with the changed file content on disk.

And please let me know if there is any other info I can provide to help. Thank you!

@liujingchen
Author

I just saw an earlier issue describing a similar problem: #4226
There is also a pull request to fix it:
#6324

Please feel free to close this if it is a complete duplicate. Thanks!

@tgross
Member

tgross commented Nov 8, 2019

I am going to mark this as duplicate and close it, but I just wanted to let you know this is on our radar.

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 17, 2022