Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

0.6.3 Issue with templates being re-rendered every few seconds. #3197

Closed
adamlc opened this issue Sep 12, 2017 · 39 comments
Closed

0.6.3 Issue with templates being re-rendered every few seconds. #3197

adamlc opened this issue Sep 12, 2017 · 39 comments

Comments

@adamlc
Copy link

adamlc commented Sep 12, 2017

Nomad version

Nomad v0.6.3

Operating system and Environment details

centos-release-7-3.1611.el7.centos.x86_64

Issue

We're currently on Nomad 0.6.0 in production with no issues. After trying a new 0.6.3 we're getting problems with our templates on jobs. Every few seconds the template is re-rendered and a SIGHUP us sent to the containers. This happens continuously even though the data in consul / vault isn't changing. I think its vault related because templates without vault secrets in seem OK.

Reproduction steps

I have a simple node AMQP consumer that connects to RabbitMQ and or internal API. The only dynamic part of the template is the AMQP settings.

Nomad Server logs (if appropriate)

Nothing seems out of the ordinary here. Just the server joining the new cluster.

Sep 12 07:50:01 nomad-server-eu-west-1a-i-0f2fb9db6f8f5cc6a nomad[4140]: 2017/09/12 07:50:01 [INFO] serf: EventMemberJoin: nomad-server-eu-west-1c-i-05356c3336b356b13.global 10.0.2.166
Sep 12 07:50:01 nomad-server-eu-west-1a-i-0f2fb9db6f8f5cc6a nomad[4140]: 2017/09/12 07:50:01.540049 [INFO] nomad: adding server nomad-server-eu-west-1c-i-05356c3336b356b13.global (Addr: 10.0.2.166:4647) (DC: dc1)
Sep 12 07:50:01 nomad-server-eu-west-1a-i-0f2fb9db6f8f5cc6a nomad[4140]: 2017/09/12 07:50:01.549204 [INFO] nomad: Found expected number of peers (10.0.2.166:4647,10.0.1.24:4647,10.0.0.45:4647), attempting to bootstrap cluster...
Sep 12 07:50:02 nomad-server-eu-west-1a-i-0f2fb9db6f8f5cc6a nomad[4140]: 2017/09/12 07:50:02 [WARN] raft: Heartbeat timeout from "" reached, starting election
Sep 12 07:50:02 nomad-server-eu-west-1a-i-0f2fb9db6f8f5cc6a nomad[4140]: 2017/09/12 07:50:02 [INFO] raft: Node at 10.0.1.24:4647 [Candidate] entering Candidate state in term 2
Sep 12 07:50:02 nomad-server-eu-west-1a-i-0f2fb9db6f8f5cc6a nomad[4140]: 2017/09/12 07:50:02 [INFO] raft: Election won. Tally: 2
Sep 12 07:50:02 nomad-server-eu-west-1a-i-0f2fb9db6f8f5cc6a nomad[4140]: 2017/09/12 07:50:02 [INFO] raft: Node at 10.0.1.24:4647 [Leader] entering Leader state
Sep 12 07:50:02 nomad-server-eu-west-1a-i-0f2fb9db6f8f5cc6a nomad[4140]: 2017/09/12 07:50:02 [INFO] raft: Added peer 10.0.2.166:4647, starting replication
Sep 12 07:50:02 nomad-server-eu-west-1a-i-0f2fb9db6f8f5cc6a nomad[4140]: 2017/09/12 07:50:02 [INFO] raft: Added peer 10.0.0.45:4647, starting replication
Sep 12 07:50:02 nomad-server-eu-west-1a-i-0f2fb9db6f8f5cc6a nomad[4140]: 2017/09/12 07:50:02.877179 [INFO] nomad: cluster leadership acquired
Sep 12 07:50:02 nomad-server-eu-west-1a-i-0f2fb9db6f8f5cc6a nomad[4140]: 2017/09/12 07:50:02 [WARN] raft: AppendEntries to {Voter 10.0.0.45:4647 10.0.0.45:4647} rejected, sending older logs (next: 1)
Sep 12 07:50:02 nomad-server-eu-west-1a-i-0f2fb9db6f8f5cc6a nomad[4140]: 2017/09/12 07:50:02 [WARN] raft: AppendEntries to {Voter 10.0.2.166:4647 10.0.2.166:4647} rejected, sending older logs (next: 1)
Sep 12 07:50:02 nomad-server-eu-west-1a-i-0f2fb9db6f8f5cc6a nomad[4140]: 2017/09/12 07:50:02 [INFO] raft: pipelining replication to peer {Voter 10.0.0.45:4647 10.0.0.45:4647}
Sep 12 07:50:02 nomad-server-eu-west-1a-i-0f2fb9db6f8f5cc6a nomad[4140]: 2017/09/12 07:50:02 [INFO] raft: pipelining replication to peer {Voter 10.0.2.166:4647 10.0.2.166:4647}

Nomad Client logs (if appropriate)

Sep 12 07:54:42 nomad-client-eu-west1-i-0ac713b705d70b918 nomad[5658]: ==> Nomad agent started! Log data will stream in below:
Sep 12 07:54:42 nomad-client-eu-west1-i-0ac713b705d70b918 nomad[5658]: 2017/09/12 07:54:42.181939 [INFO] client: using state directory /var/lib/nomad/client
Sep 12 07:54:42 nomad-client-eu-west1-i-0ac713b705d70b918 nomad[5658]: 2017/09/12 07:54:42.186149 [INFO] client: using alloc directory /var/lib/nomad/alloc
Sep 12 07:54:42 nomad-client-eu-west1-i-0ac713b705d70b918 nomad[5658]: 2017/09/12 07:54:42.187271 [INFO] fingerprint.cgroups: cgroups are available
Sep 12 07:54:42 nomad-client-eu-west1-i-0ac713b705d70b918 nomad[5658]: 2017/09/12 07:54:42.188386 [INFO] fingerprint.consul: consul agent is available
Sep 12 07:54:42 nomad-client-eu-west1-i-0ac713b705d70b918 nomad[5658]: 2017/09/12 07:54:42.193260 [WARN] fingerprint.network: Unable to parse Speed in output of '/usr/sbin/ethtool eth0'
Sep 12 07:54:42 nomad-client-eu-west1-i-0ac713b705d70b918 nomad[5658]: 2017/09/12 07:54:42.458078 [INFO] fingerprint.vault: Vault is available
Sep 12 07:54:42 nomad-client-eu-west1-i-0ac713b705d70b918 nomad[5658]: 2017/09/12 07:54:42.488889 [INFO] client: Node ID "58bba806-9bd8-b76c-947b-ab34b7d755e4"
Sep 12 07:54:42 nomad-client-eu-west1-i-0ac713b705d70b918 nomad[5658]: 2017/09/12 07:54:42.503844 [INFO] client.consul: discovered following Servers: 10.0.2.166:4647,10.0.1.24:4647,10.0.0.45:4647
Sep 12 07:54:42 nomad-client-eu-west1-i-0ac713b705d70b918 nomad[5658]: 2017/09/12 07:54:42.527839 [INFO] client: node registration complete
Sep 12 08:00:07 nomad-client-eu-west1-i-0ac713b705d70b918 nomad[5658]: 2017/09/12 08:00:07 [INFO] (runner) creating new runner (dry: false, once: false)
Sep 12 08:00:07 nomad-client-eu-west1-i-0ac713b705d70b918 nomad[5658]: 2017/09/12 08:00:07 [INFO] (runner) creating watcher
Sep 12 08:00:07 nomad-client-eu-west1-i-0ac713b705d70b918 nomad[5658]: 2017/09/12 08:00:07 [INFO] (runner) starting
Sep 12 08:00:07 nomad-client-eu-west1-i-0ac713b705d70b918 nomad[5658]: 2017/09/12 08:00:07 [INFO] (runner) initiating run
Sep 12 08:00:08 nomad-client-eu-west1-i-0ac713b705d70b918 nomad[5658]: 2017/09/12 08:00:08 [INFO] (runner) initiating run
Sep 12 08:00:08 nomad-client-eu-west1-i-0ac713b705d70b918 nomad[5658]: 2017/09/12 08:00:08 [INFO] (runner) initiating run
Sep 12 08:00:08 nomad-client-eu-west1-i-0ac713b705d70b918 nomad[5658]: 2017/09/12 08:00:08 [WARN] vault.read(rabbitmq/creds/readwriteconfigure): failed to renew: Error making API request.
Sep 12 08:00:08 nomad-client-eu-west1-i-0ac713b705d70b918 nomad[5658]: URL: PUT https://vault-internal.jumplead.com:8200/v1/sys/leases/renew
Sep 12 08:00:08 nomad-client-eu-west1-i-0ac713b705d70b918 nomad[5658]: Code: 403. Errors:
Sep 12 08:00:08 nomad-client-eu-west1-i-0ac713b705d70b918 nomad[5658]: * permission denied
Sep 12 08:00:08 nomad-client-eu-west1-i-0ac713b705d70b918 nomad[5658]: 2017/09/12 08:00:08 [WARN] vault.read(rabbitmq/creds/readwriteconfigure): renewer returned (maybe the lease expired)
Sep 12 08:00:08 nomad-client-eu-west1-i-0ac713b705d70b918 nomad[5658]: 2017/09/12 08:00:08 [INFO] (runner) rendered "(dynamic)" => "/var/lib/nomad/alloc/9d0e6f7a-c7cb-3e57-295e-621b8fe04a8f/server/local/config.json"
Sep 12 08:00:08 nomad-client-eu-west1-i-0ac713b705d70b918 nomad[5658]: 2017/09/12 08:00:08 [WARN] vault.read(rabbitmq/creds/readwriteconfigure): failed to renew: Error making API request.
Sep 12 08:00:08 nomad-client-eu-west1-i-0ac713b705d70b918 nomad[5658]: URL: PUT https://vault-internal.jumplead.com:8200/v1/sys/leases/renew
Sep 12 08:00:08 nomad-client-eu-west1-i-0ac713b705d70b918 nomad[5658]: Code: 403. Errors:
Sep 12 08:00:08 nomad-client-eu-west1-i-0ac713b705d70b918 nomad[5658]: * permission denied
Sep 12 08:00:08 nomad-client-eu-west1-i-0ac713b705d70b918 nomad[5658]: 2017/09/12 08:00:08 [WARN] vault.read(rabbitmq/creds/readwriteconfigure): renewer returned (maybe the lease expired)
Sep 12 08:00:08 nomad-client-eu-west1-i-0ac713b705d70b918 nomad[5658]: 2017/09/12 08:00:08 [WARN] vault.read(rabbitmq/creds/readwriteconfigure): failed to renew: Error making API request.
Sep 12 08:00:08 nomad-client-eu-west1-i-0ac713b705d70b918 nomad[5658]: URL: PUT https://vault-internal.jumplead.com:8200/v1/sys/leases/renew
Sep 12 08:00:08 nomad-client-eu-west1-i-0ac713b705d70b918 nomad[5658]: Code: 403. Errors:
Sep 12 08:00:08 nomad-client-eu-west1-i-0ac713b705d70b918 nomad[5658]: * permission denied
Sep 12 08:00:08 nomad-client-eu-west1-i-0ac713b705d70b918 nomad[5658]: 2017/09/12 08:00:08 [WARN] vault.read(rabbitmq/creds/readwriteconfigure): renewer returned (maybe the lease expired)
Sep 12 08:00:08 nomad-client-eu-west1-i-0ac713b705d70b918 nomad[5658]: 2017/09/12 08:00:08 [WARN] vault.read(rabbitmq/creds/readwriteconfigure): failed to renew: Error making API request.
Sep 12 08:00:08 nomad-client-eu-west1-i-0ac713b705d70b918 nomad[5658]: URL: PUT https://vault-internal.jumplead.com:8200/v1/sys/leases/renew
Sep 12 08:00:38 nomad-client-eu-west1-i-0ac713b705d70b918 nomad[5658]: 2017/09/12 08:00:38 [WARN] vault.read(rabbitmq/creds/readwriteconfigure): renewer returned (maybe the lease expired)
Sep 12 08:00:38 nomad-client-eu-west1-i-0ac713b705d70b918 nomad[5658]: 2017/09/12 08:00:38 [INFO] (runner) rendered "(dynamic)" => "/var/lib/nomad/alloc/9d0e6f7a-c7cb-3e57-295e-621b8fe04a8f/server/local/config.json"
Sep 12 08:00:38 nomad-client-eu-west1-i-0ac713b705d70b918 nomad[5658]: 2017/09/12 08:00:38 [WARN] vault.read(rabbitmq/creds/readwriteconfigure): failed to renew: Error making API request.

The failure to renew the lease then just repeats in the log. Its odd this happens because the setup is exactly the same as the current production cluster.

I've checked by leases and stuff in vault and all seems good there.

Job file (if appropriate)

job "apiv2-worker" {
    datacenters = ["dc1"]
    type = "service"
    update {
        stagger = "45s"
        max_parallel = 1
    }

    group "worker" {
        count = 2

        restart {
            attempts = 10
            interval = "5m"
            delay = "25s"
            mode = "delay"
        }

        task "server" {
            driver = "docker"

            env {
                NODE_ENV = "production"
                NODE_PATH = "/usr/src/app"
                CONFIG_LOCATION = "${NOMAD_TASK_DIR}/config.json"
            }

            config {
                image = "*****"

                dns_servers = ["172.17.0.1"]

                auth {
                    username = "jumplead-docker-deploy"
                    password = "86SYf189W-Mz4ioXLL2z"
                }

                port_map {
                    healthcheck = 10060
                }
            }

            resources {
                cpu = 100
                memory = 120

                network {
                    mbits = 10
                    port "healthcheck" {
                        static = 29196
                    }
                }
            }

            service {
                name = "apiv2-worker"
                port = "healthcheck"
                tags = ["healthcheck"]
                check {
                    type = "http"
                    name = "AMQP Connected Check"
                    path = "/"
                    interval = "10s"
                    timeout = "5s"
                }
            }

            template {
                data = <<EOH
{
    "AMQP_URIS": [{{with $auth := secret "rabbitmq/creds/readwriteconfigure"}}{{range $i, $s := service "rabbitmq"}}{{ if ne $i 0 }},{{ end }}"amqp://{{$auth.Data.username}}:{{$auth.Data.password}}@{{.Address}}"{{end}}{{end}}],

    "API_URL": "http://apiv2.service.consul:29191/v2",

    "SYSTEMEMAIL_THRIFT": {
        "HOST": "system-email.service.consul",
        "PORT": 29174
    },

    "PRODUCTION": true
}
                EOH

                destination = "local/config.json"
                change_mode = "signal"
                change_signal = "SIGHUP"
            }

            vault {
                policies = ["rabbitmq"]
                change_mode = "signal"
                change_signal = "SIGHUP"
            }
        }
    }
}

@adamlc
Copy link
Author

adamlc commented Sep 12, 2017

Just a quick note, nomad is currently using root tokens to ensure it isn't a token permission issue.

@sethvargo
Copy link
Contributor

Root tokens are not renewable, so you'll need to disable that or use a periodic root token. That's one of the errors you're seeing.

@dadgar
Copy link
Contributor

dadgar commented Sep 12, 2017

@sethvargo Nomad server has root token, the instance of Consul-template doesn't have root. It has a renewable periodic token with the rabbitmq policy.

@adamlc This is likely a consul-template bug. Can you reproduce this behavior by running an instance of consul-template with that template?

@dadgar
Copy link
Contributor

dadgar commented Sep 14, 2017

@adamlc Any update? Can you also give the Consul and Vault versions.

@stevenscg
Copy link

I found this issue in the google group and dadgar pointed me here.

For me, it looked like the old mysql database backend was all that was affected, but I think it is probably more than that.

I am not using a root token, but a token from the aws-ec2 auth backend.

I just upgraded my test cluster from 0.6.0 to 0.6.3 and hit this. The update rate on the renewer is so fast that I had to shut everything down.

@dadgar
Copy link
Contributor

dadgar commented Sep 14, 2017

@stevenscg Can you attempt to run your config with just the newest consul-template. I am almost certain it is a bug in that codebase.

@stevenscg
Copy link

@dadgar I might be able to do that in a dev environment. It won't be the exact same setup, but the template and vault database backend and vault roles and policies will be. Might be tomorrow at the earliest. Are you looking for anything in particular.

I just finished rolling back the test environment to 0.6.0 and the issue appears to be gone.

@dadgar
Copy link
Contributor

dadgar commented Sep 14, 2017

@stevenscg That would be awesome. Mainly looking for as simple of a reproduction step as possible with just consul-template. If you could grab the debug level logs as well when you invoke that would be 👍

@stevenscg
Copy link

Will do.

FWIW, this is fairly representative of what I was seeing:

Sep 14 17:47:21 ip-10-101-33-183 nomad[667]: vault.read(database/creds/billing): failed to renew: Error making API request.#012#012URL: PUT https://active.vault.service.consul:8200/v1/sys/leases/renew#012Code: 403. Errors:#012#012* permission denied
Sep 14 17:47:21 ip-10-101-33-183 nomad: 2017/09/14 17:47:21 [WARN] vault.read(database/creds/billing): renewer returned (maybe the lease expired)
Sep 14 17:47:21 ip-10-101-33-183 nomad[667]: vault.read(database/creds/billing): renewer returned (maybe the lease expired)
Sep 14 17:47:21 ip-10-101-33-183 nomad: 2017/09/14 17:47:21 [WARN] vault.read(database/creds/billing): failed to renew: Error making API request.
Sep 14 17:47:21 ip-10-101-33-183 nomad: URL: PUT https://active.vault.service.consul:8200/v1/sys/leases/renew
Sep 14 17:47:21 ip-10-101-33-183 nomad: Code: 403. Errors:
Sep 14 17:47:21 ip-10-101-33-183 nomad: * permission denied
Sep 14 17:47:21 ip-10-101-33-183 nomad[667]: vault.read(database/creds/billing): failed to renew: Error making API request.#012#012URL: PUT https://active.vault.service.consul:8200/v1/sys/leases/renew#012Code: 403. Errors:#012#012* permission denied
Sep 14 17:47:21 ip-10-101-33-183 nomad: 2017/09/14 17:47:21 [WARN] vault.read(database/creds/billing): renewer returned (maybe the lease expired)
Sep 14 17:47:21 ip-10-101-33-183 nomad[667]: vault.read(database/creds/billing): renewer returned (maybe the lease expired)
Sep 14 17:47:21 ip-10-101-33-183 nomad: 2017/09/14 17:47:21 [WARN] vault.read(database/creds/billing): failed to renew: Error making API request.
Sep 14 17:47:21 ip-10-101-33-183 nomad: URL: PUT https://active.vault.service.consul:8200/v1/sys/leases/renew
Sep 14 17:47:21 ip-10-101-33-183 nomad: Code: 403. Errors:
Sep 14 17:47:21 ip-10-101-33-183 nomad: * permission denied

Sep 14 17:48:47 ip-10-101-25-169 nomad[21424]: vault.read(database/creds/billing): renewer returned (maybe the lease expired)
Sep 14 17:48:48 ip-10-101-25-169 nomad[21424]: vault.read(database/creds/billing): failed to renew: Error making API request.#012#012URL: PUT https://active.vault.service.consul:8200/v1/sys/leases/renew#012Code: 403. Errors:#012#012* permission denied
Sep 14 17:48:48 ip-10-101-25-169 nomad[21424]: vault.read(database/creds/billing): renewer returned (maybe the lease expired)
Sep 14 17:48:48 ip-10-101-25-169 nomad[21424]: vault.read(database/creds/billing): failed to renew: Error making API request.#012#012URL: PUT https://active.vault.service.consul:8200/v1/sys/leases/renew#012Code: 403. Errors:#012#012* permission denied

That doesn't tell me much, but it might come in handy. It was fast too. The local syslog daemons couldn't keep up with it.

@dadgar
Copy link
Contributor

dadgar commented Sep 14, 2017

@stevenscg What is the lease duration on secrets from: database/creds/billing

@stevenscg
Copy link

@dadgar

lease_duration 	1h0m0s
lease_renewable	true

Max ttl is 24h. Defaults for the backend, IIR.

@adamlc
Copy link
Author

adamlc commented Sep 15, 2017

Sorry I haven't had time to look at this yet. I'll give it a go later if I have time :)

@stevenscg
Copy link

stevenscg commented Sep 16, 2017

@dadgar I just tried rendering the same template from before using the consul-template 0.19.3 release:

2017/09/16 00:51:14.581357 [INFO] (runner) initiating run
2017/09/16 00:51:14.581365 [DEBUG] (runner) checking template 2d9350d20d9821111c000d5659375707
2017/09/16 00:51:14.581677 [DEBUG] (runner) rendering "/vagrant/billing-service/provision/app/app.ctmpl" => "/vagrant/billing-service/config/app.php"
2017/09/16 00:51:14.583573 [DEBUG] (runner) diffing and updating dependencies
.... kv.get, kv.block. vault.read, etc ....
2017/09/16 00:51:14.583632 [DEBUG] (runner) watching 7 dependencies
2017/09/16 00:51:14.583637 [DEBUG] (runner) all templates rendered
2017/09/16 00:52:58.767915 [DEBUG] (runner) receiving dependency health.service(cache-alpha|passing)
2017/09/16 00:52:58.767940 [INFO] (runner) initiating run

It seems to cycle through this every 1-3 minutes.

If I downgrade C/T to 0.19.2 and run the same setup in-place, the pattern is different:

2017/09/16 01:02:30.810225 [INFO] (runner) initiating run
2017/09/16 01:02:30.810237 [DEBUG] (runner) checking template 2d9350d20d9821111c000d5659375707
2017/09/16 01:02:30.810520 [DEBUG] (runner) rendering "/vagrant/billing-service/provision/app/app.ctmpl" => "/vagrant/billing-service/config/app.php"
2017/09/16 01:02:30.814312 [DEBUG] (runner) diffing and updating dependencies
.... kv.get, kv.block. vault.read, etc ....
2017/09/16 01:02:30.814441 [DEBUG] (runner) watching 7 dependencies
2017/09/16 01:04:25.171983 [DEBUG] (runner) receiving dependency health.service(cache-alpha|passing)
2017/09/16 01:04:25.172011 [INFO] (runner) initiating run

The [DEBUG] (runner) all templates rendered line only shows up 1 time with C/T v0.19.2 vs with every 1-3 minute interval with C/T v0.19.3.

For reference, this particular template has 1 secret from the (deprecated) mysql vault backend, 3 keys from consul, 1 service from consul.

I am not seeing any of the vault permission denied errors that was the most concerning issue from the other day.

Versions:

[vagrant@localhost ~]$ consul-template -version
consul-template v0.19.3 (ebf2d3d)
[vagrant@localhost ~]$ vault version
Vault v0.8.2 ('9afe7330e06e486ee326621624f2077d88bc9511')
[vagrant@localhost ~]$ consul version
Consul v0.9.3

@stevenscg
Copy link

@dadgar Another datapoint.... I went back to C/T v0.19.3 and dropped the mysql backend TTL from 1h to 5m and was able to trigger the renewer problem I saw in our deployed environment yesterday.

In fact, any TTL smaller than about 18 minutes on this particular system would trigger the issue. 30 minute TTL was reliably stable. This system is very lightly loaded FWIW.

The C/T output when the issue is triggered is roughly this:

....
2017/09/16 02:03:27.430761 [WARN] vault.read(mysql/creds/billing): renewer returned (maybe the lease expired)
2017/09/16 02:03:27.430781 [TRACE] vault.read(mysql/creds/billing): GET /v1/mysql/creds/billing?index=1505527407&stale=true&wait=1m0s
2017/09/16 02:03:27.454556 [TRACE] (view) vault.read(mysql/creds/billing) marking successful data response
2017/09/16 02:03:27.454613 [TRACE] (view) vault.read(mysql/creds/billing) no new data (index was the same)
2017/09/16 02:03:27.454623 [TRACE] vault.read(mysql/creds/billing): starting renewer
2017/09/16 02:03:27.454699 [TRACE] (view) vault.read(mysql/creds/billing) successful contact, resetting retries
2017/09/16 02:03:27.462674 [TRACE] vault.read(mysql/creds/billing): successfully renewed
2017/09/16 02:03:27.462700 [WARN] vault.read(mysql/creds/billing): renewer returned (maybe the lease expired)
2017/09/16 02:03:27.462715 [TRACE] vault.read(mysql/creds/billing): GET /v1/mysql/creds/billing?index=1505527407&stale=true&wait=1m0s
2017/09/16 02:03:27.479821 [TRACE] (view) vault.read(mysql/creds/billing) marking successful data response
2017/09/16 02:03:27.479846 [TRACE] (view) vault.read(mysql/creds/billing) no new data (index was the same)
2017/09/16 02:03:27.479853 [TRACE] vault.read(mysql/creds/billing): starting renewer
2017/09/16 02:03:27.479914 [TRACE] (view) vault.read(mysql/creds/billing) successful contact, resetting retries
2017/09/16 02:03:27.484584 [TRACE] vault.read(mysql/creds/billing): successfully renewed
2017/09/16 02:03:27.484607 [WARN] vault.read(mysql/creds/billing): renewer returned (maybe the lease expired)
2017/09/16 02:03:27.484623 [TRACE] vault.read(mysql/creds/billing): GET /v1/mysql/creds/billing?index=1505527407&stale=true&wait=1m0s
2017/09/16 02:03:27.544348 [TRACE] (view) vault.read(mysql/creds/billing) marking successful data response
2017/09/16 02:03:27.544382 [TRACE] (view) vault.read(mysql/creds/billing) no new data (index was the same)
2017/09/16 02:03:27.544390 [TRACE] vault.read(mysql/creds/billing): starting renewer
2017/09/16 02:03:27.544464 [TRACE] (view) vault.read(mysql/creds/billing) successful contact, resetting retries
2017/09/16 02:03:27.553385 [TRACE] vault.read(mysql/creds/billing): successfully renewed
....

@adamlc
Copy link
Author

adamlc commented Sep 19, 2017

@dadgar I've just tried using consul template and it seems to work as expected, it doesn't constantly cycle the credentials like nomad does:

$ consul-template -v
consul-template v0.19.3 (e08c904)

Using the same template as my nomad file. Also using a root token here. (Just thought I'd mention I've tried using a root token in Nomad and one supplied by ec2-auth, both have the same results)

$ consul-template -consul-addr "54.229.122.21:8500" -vault-renew-token=false -template "template.tpl:template.json"

@holtwilkins
Copy link
Contributor

holtwilkins commented Sep 25, 2017

I can confirm that I'm hitting this issue as well on Nomad 0.6.3, though I never ran Nomad 0.6.0 for comparison. These error messages cause enormous raft churn and consul servers go crazy trying to keep up with creating and reaping snapshots.

It seems like this is affecting the consul and database secret backends, but does not seem to affect the aws backend. Not sure if this is helpful or not, but figured I'd post in case it is.

Don't think we're doing anything special wrt TTLs of these mountpoints, but not sure where all to check:

[devtools-staging] hwilkins ~ $ vault mounts
Path                                           Type        Accessor             Plugin  Default TTL  Max TTL    Force No Cache  Replication Behavior  Description
aws/                                           aws         aws_69fee16d         n/a     14400        14400      false           replicated
consul/                                        consul      consul_49f9719c      n/a     14400        14400      false           replicated
cubbyhole/                                     cubbyhole   cubbyhole_650dc4a0   n/a     n/a          n/a        false           local                 per-token private secret storage
database/                                      database    database_137ef1c0    n/a     14400        14400      false           replicated            generic database backend
secret/                                        generic     generic_13ee50f4     n/a     system       system     false           replicated            generic secret storage
ssh/                                           ssh         ssh_db590344         n/a     600          600        false           replicated
sys/                                           system      system_d6587c9c      n/a     n/a          n/a        false           replicated            system endpoints used for control, policy and debugging

@dadgar
Copy link
Contributor

dadgar commented Sep 25, 2017

Hey all, we will be looking into this before cutting 0.7, so the next release should resolve this.

@dadgar
Copy link
Contributor

dadgar commented Oct 11, 2017

Hey all, to mitigate this until 0.7 is out you can set the vault_grace to be much smaller than the lease of your secret. Something like 5-15 seconds.

@adamlc
Copy link
Author

adamlc commented Nov 13, 2017

@dadgar unfortunately I'm still getting this issue on 0.7.0, even after setting a low vault_grace period :(

@dadgar
Copy link
Contributor

dadgar commented Nov 13, 2017

@adamlc What is the lease duration of your secret? Your vault_grace should be set to be less than your secrets lease duration. If this isn't the case, the template will re-render the template immediately since the secret will expire within the grace period.

https://www.nomadproject.io/docs/job-specification/template.html#vault_grace

@adamlc
Copy link
Author

adamlc commented Nov 14, 2017

@dadgar I set the grace to 10s and all the leases are >24 hours. I'll have to have a dive into the logs again.

I'll see if I can come up with something producible for you to work with :)

@adamlc
Copy link
Author

adamlc commented Nov 20, 2017

I have a suspicion this is something to do with the old mysql backend. I have 14k+ leases for mysql which probably got created when the tasks where restarting. I'm currently decommissioning the backend and revoking the leases. Hopefully this is whats causing it!

@stevenscg
Copy link

@adamlc @dadgar I'm seeing this on 0.7.0 as well and also with the old mysql backend. I am going to try adding vault_grace to my template stanzas, but the docs read like the default is 15s. Is that true?

@adamlc
Copy link
Author

adamlc commented Nov 30, 2017

@stevenscg its interesting that you also use the old mysql backend. I've not actually had time to take a look at this in more detail yet.

I'm not sure what the default grace is but if you do a nomad plan before you should be able to see what the default / current one is.

@stevenscg
Copy link

@adamlc Ah. Good point about nomad plan. I'll try that. Did you just move to the new database secret backend instead of messing with it further?

@stevenscg
Copy link

I'll add that my database secret leases are 4h / 24h max:

Key      	Value
---      	-----
lease    	4h0m0s
lease_max	24h0m0s

@adamlc
Copy link
Author

adamlc commented Nov 30, 2017

I'm still rocking 0.6.3 at the moment. I have migrated to the database backend but haven't had time to test the setup in 0.7.0 yet.

@stevenscg
Copy link

Got it. FYI, the default vault_grace does seem to be 15s as documented. I'm testing with an arbitrary 10s, but don't know why it would be any better.

@stevenscg
Copy link

@dadgar I'm still seeing this. Any thoughts before I roll back to 0.6.0 ??

Here's where I'm at:

  • Nomad 0.7.0
  • Vault 0.9.0
  • Now using the vault database backend instead of the older mysql.
  • If it matters functionally, I'm using plugin_name=mysql-aurora-database-plugin for mysql 5.6 / aurora compatibility
  • database role default_ttl = 12h, max_ttl=48h
  • vault_grace = 30s on my test job

@stevenscg
Copy link

I'll add the re-render rate with this new database backend seems higher than the legacy mysql backend. Up to approx 4-5 seconds between vs 20 or 30s.

@jippi
Copy link
Contributor

jippi commented Nov 30, 2017

@stevenscg I got similar setup, with the exception of Vault, which is 0.8.3 - no problems on 0.7

@stevenscg
Copy link

@jippi That's the crazy part to me. I'd think that my setup is really a very common one.

My vault was at 0.8.3 earlier today and the issue was happening both before and after the upgrade to 0.9.0.

@stevenscg
Copy link

@jippi How do you upgrade your nomad server and worker nodes?

I typically upgrade them in-place carefully one at a time servers then workers.

@stevenscg
Copy link

I've posted some additional details in the mailing list in case anybody else is experiencing this problem, though I'm worried that it is something very specific with my environment.

@stevenscg
Copy link

I ran the templates through consul-template versions 0.19.3 and 0.19.4. Both gave the same sys/leases/renew error.

It looks like this is going to be a vault default policy issue in my environment.

The default policy for my deployed environments has been around since Vault 0.6 or something.

On our development systems that were not experiencing this issue, vault is loaded fresh and thus the default policy is also refresh.

I vaguely remembered a comment or a discussion about the vault policy changing and the task token needed access to sys/leases/renew in vault.

The piece I needed is here:
https://github.com/hashicorp/nomad/blob/master/website/source/docs/vault-integration/index.html.md#permission-denied-errors

But not on the nomad website just yet.

@adamlc
Copy link
Author

adamlc commented Jan 16, 2018

I'm just looking at this issue again and I'm still having issues with 0.7.1.

At first I thought it may be my database backend, I didn't have TTLs set for the database roles, so I thought that may be the issue, unfortunately not. I'm going to go through each secret referenced to make sure they have TTLs.

Other than the above I'm stumped! 😢

I'll post here if anything comes to light!

@adamlc
Copy link
Author

adamlc commented Jan 16, 2018

Turns out that my vault default policy was completely different to newer versions, once I updated my default policy with one from a newer version it worked perfectly! Finally glad to have this sorted!

Thanks @stevenscg for pointing me in the right direction!

For anyone else with the same issue, simply grab the latest default vault policy and then overwrite your current default policy in vault and you should be good to go!

@stevenscg
Copy link

@adamlc Glad it helped you! So much grey hair over that one.

@github-actions
Copy link

github-actions bot commented Dec 4, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 4, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

6 participants