approle "expiration: revoked lease" stderr logs #8117

Closed
atheiman opened this issue Jan 8, 2020 · 9 comments
Labels
auth/approle bug Used to indicate a potential bug

Comments

@atheiman
Contributor

atheiman commented Jan 8, 2020

We are running Vault 1.0.3 (community version): 3 Vault nodes on EC2 in HA mode, backed by 5 Consul EC2 instances.

Our vault stderr contains logs like this at an incredibly high rate, roughly 114,000 times in the last 4 hours:

2020-01-07T14:30:55.245-0600 [WARN]  expiration: lease count exceeds warning lease threshold
2020-01-07T14:30:56.855-0600 [INFO]  expiration: revoked lease: lease_id=auth/approle/dev/login/h2432bf289b398772dd75d3b82a880321b8e4edf8b1012f239d0e8f93d92cdd5e
2020-01-07T14:31:00.367-0600 [INFO]  expiration: revoked lease: lease_id=auth/approle/dev/login/hcde1285acf52a9667fc6b0f195a3dfbccf6ea8b36bf6acd846ad83ea61c2ac7f
...
2020-01-07T14:31:48.416-0600 [INFO]  expiration: revoked lease: lease_id=auth/approle/dev/login/hd32d64c2f8a202f5de9c89a04572d32c1a888432334b59927ced5c4fa81354a5
2020-01-07T14:31:56.271-0600 [WARN]  expiration: lease count exceeds warning lease threshold
2020-01-07T14:31:57.002-0600 [INFO]  expiration: revoked lease: lease_id=auth/approle/dev/login/h511e1df236bb7003b440ea067d59fc9c38febf7eb9d503051f5192c7e570449a
...
2020-01-07T14:32:07.315-0600 [INFO]  expiration: revoked lease: lease_id=auth/approle/dev/login/hb6a536297419b1cce9f56ff01977c55de026bf681cbe08c0c3fa927148904775
...

We tried disabling the "dev" approle auth method, planning to re-enable it afterwards, but the operation timed out. We then restored Consul from a snapshot taken about an hour before the disable attempt and deleted all roles from this approle auth method. The messages still occur.
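
To confirm that the revocations really are concentrated on one mount, the log lines can be tallied by lease prefix. This is a minimal sketch that assumes the log format shown above; the function name `count_revocations` is made up for illustration, and the sample lines are copied from the excerpt.

```python
import re
from collections import Counter

# Matches "revoked lease" lines and captures everything in the lease ID
# before its final path segment (the per-login hash).
REVOKED = re.compile(r"expiration: revoked lease: lease_id=(?P<prefix>\S+)/[^/\s]+$")

def count_revocations(lines):
    """Return a Counter mapping lease prefix -> number of revoked-lease lines."""
    counts = Counter()
    for line in lines:
        match = REVOKED.search(line.strip())
        if match:
            counts[match.group("prefix")] += 1
    return counts

# Three lines taken from the log excerpt above.
sample = [
    "2020-01-07T14:30:55.245-0600 [WARN]  expiration: lease count exceeds warning lease threshold",
    "2020-01-07T14:30:56.855-0600 [INFO]  expiration: revoked lease: lease_id=auth/approle/dev/login/h2432bf289b398772dd75d3b82a880321b8e4edf8b1012f239d0e8f93d92cdd5e",
    "2020-01-07T14:31:00.367-0600 [INFO]  expiration: revoked lease: lease_id=auth/approle/dev/login/hcde1285acf52a9667fc6b0f195a3dfbccf6ea8b36bf6acd846ad83ea61c2ac7f",
]

for prefix, n in count_revocations(sample).most_common():
    print(prefix, n)
```

Feeding it the full stderr log instead of `sample` would show whether any mount other than auth/approle/dev contributes.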

@atheiman atheiman changed the title approle "expiration: revoked lease" stdout logs approle "expiration: revoked lease" stderr logs Jan 8, 2020
@ncabatoff
Collaborator

It sounds like you were generating leases faster than Vault could revoke them. To prevent this in the future, consider increasing TTLs. For now, assuming the messages you cite are representative (i.e. they're mostly about auth/approle/dev), and given that you've deleted the roles, it should sort itself out on its own. You should be able to monitor the situation using the metric vault.expire.num_leases (see telemetry).

@ppestinger

@ncabatoff is there a good way to link the leases we're seeing to a specific approle? Even though we've deleted all the roles from the approle auth method in question (auth/approle/dev), we're still seeing the revoked lease messages this morning, almost a full day later.

Basically, can we link a lease id like "lease_id=auth/approle/dev/login/h9d7dea14c17ded56097bf4203ce6c6a7179213851c06f9395c43d3329d97530d" to one of our pre-existing approles?

We're still looking into the telemetry solution you suggested and how we can consume that.

@ppestinger

We were able to check the vault.expire.num_leases metric and it is... enlightening! Here's our output:

[2020-01-08 09:55:20 -0600 CST][G] 'vault.expire.num_leases': 448155.000
[2020-01-08 09:55:30 -0600 CST][G] 'vault.expire.num_leases': 448151.000
[2020-01-08 09:55:40 -0600 CST][G] 'vault.expire.num_leases': 448151.000
[2020-01-08 09:55:50 -0600 CST][G] 'vault.expire.num_leases': 448147.000
[2020-01-08 09:56:00 -0600 CST][G] 'vault.expire.num_leases': 448146.000
[2020-01-08 09:56:20 -0600 CST][G] 'vault.expire.num_leases': 448145.000
[2020-01-08 09:56:30 -0600 CST][G] 'vault.expire.num_leases': 448141.000
[2020-01-08 09:56:40 -0600 CST][G] 'vault.expire.num_leases': 448141.000
[2020-01-08 09:56:50 -0600 CST][G] 'vault.expire.num_leases': 448137.000
[2020-01-08 09:57:00 -0600 CST][G] 'vault.expire.num_leases': 448136.000

We've got nearly half a million of these left to work through, and at the rate we're seeing Vault expire them, that's really not a feasible solution. Is there a way to mass expire all of these leases?
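
For a rough sense of scale, here is a minimal sketch that extrapolates from the first and last samples above. It assumes the revocation rate stays constant and that no new leases are created, neither of which is guaranteed; `drain_estimate` is a made-up helper name.

```python
from datetime import datetime

# (timestamp, vault.expire.num_leases) pairs: first and last samples from the output above.
samples = [
    ("2020-01-08 09:55:20", 448155),
    ("2020-01-08 09:57:00", 448136),
]

def drain_estimate(samples):
    """Return (leases revoked per second, estimated days until the count reaches zero)."""
    fmt = "%Y-%m-%d %H:%M:%S"
    (t0, n0), (t1, n1) = samples[0], samples[-1]
    elapsed = (datetime.strptime(t1, fmt) - datetime.strptime(t0, fmt)).total_seconds()
    rate = (n0 - n1) / elapsed      # leases revoked per second
    days = n1 / rate / 86400        # remaining leases at that rate, in days
    return rate, days

rate, days = drain_estimate(samples)
print(f"{rate:.2f} leases/s, about {days:.0f} days to drain")
# prints: 0.19 leases/s, about 27 days to drain
```

That is 19 leases in 100 seconds, so waiting it out would take on the order of four weeks.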

@ncabatoff
Collaborator

> Is there a way to mass expire all of these leases?

You could try a prefix-based revoke: https://www.vaultproject.io/docs/commands/lease/revoke.html
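
On the CLI this would be along the lines of `vault lease revoke -prefix auth/approle/dev/login`. The same operation is exposed over HTTP as `sys/leases/revoke-prefix/<prefix>`; the sketch below only builds the request so it can be inspected before anything is sent. The address and token are placeholders, `build_revoke_prefix_request` is a hypothetical helper, and as far as I know the token needs sudo capability on that path.

```python
import urllib.request

def build_revoke_prefix_request(vault_addr, token, prefix):
    """Build (but do not send) a PUT to sys/leases/revoke-prefix/<prefix>."""
    url = f"{vault_addr.rstrip('/')}/v1/sys/leases/revoke-prefix/{prefix.strip('/')}"
    return urllib.request.Request(url, method="PUT", headers={"X-Vault-Token": token})

# Placeholder address and token; the prefix is the one seen in the logs above.
req = build_revoke_prefix_request(
    "https://vault.example.com:8200",
    "s.hypothetical-token",
    "auth/approle/dev/login",
)
print(req.get_method(), req.full_url)

# To actually send it (this revokes every lease under the prefix):
# urllib.request.urlopen(req)
```

With a backlog this large, expect the revocation itself to take a while.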

@atheiman
Contributor Author

atheiman commented Jan 8, 2020

@ncabatoff does that work for approle leases (in an auth method) as well as for leases from a secrets engine?

@calvn
Contributor

calvn commented Jan 8, 2020

> is there a good way to link the leases we're seeing to a specific approle?

If you have audit logs enabled, looking at requests against auth/approle/dev/login would get you a step closer in that direction. The process might be a bit tedious since the secret ID value will be hashed, but you could correlate it by listing the IDs for a role via https://www.vaultproject.io/api/auth/approle/index.html#list-secret-id-accessors and then calculating the hash of those using https://www.vaultproject.io/api/system/audit-hash.html.
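
A minimal sketch of those two calls, building the requests without sending them. It assumes the auth method is mounted at approle/dev (inferred from the lease IDs); the role name, audit device path, address and token in the usage lines are placeholders, and both helper names are made up.

```python
import json
import urllib.request

def list_accessors_request(vault_addr, token, mount, role):
    """Build a request listing secret ID accessors for a role (GET with ?list=true)."""
    url = f"{vault_addr}/v1/auth/{mount}/role/{role}/secret-id?list=true"
    return urllib.request.Request(url, method="GET", headers={"X-Vault-Token": token})

def audit_hash_request(vault_addr, token, audit_device, value):
    """Build a request asking Vault to hash `value` the way the given audit device does."""
    url = f"{vault_addr}/v1/sys/audit-hash/{audit_device}"
    body = json.dumps({"input": value}).encode("utf-8")
    headers = {"X-Vault-Token": token, "Content-Type": "application/json"}
    return urllib.request.Request(url, data=body, method="POST", headers=headers)

# Placeholder values for illustration only.
addr, token = "https://vault.example.com:8200", "s.hypothetical-token"
listing = list_accessors_request(addr, token, "approle/dev", "my-role")
hashing = audit_hash_request(addr, token, "file", "accessor-from-the-listing")
print(listing.get_method(), listing.full_url)
print(hashing.get_method(), hashing.full_url)
```

Each value returned by the listing would be passed through the hash call, and the resulting hash searched for in the audit log entries for auth/approle/dev/login.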

@ncabatoff
Collaborator

> does that work for approle leases (in an auth method) rather than in a secrets engine lease?

Yes it does.

@catsby catsby added auth/approle bug Used to indicate a potential bug labels Jan 9, 2020
@Lucas-C

Lucas-C commented Apr 2, 2020

We had the same issue yesterday.

A bit of context on this warning:

> We should issue a warning in the log message when too many leases are active in the system. This is usually an indication that the rate in (new leases) does not match rate out (expiration / revocation) and will eventually lead to cluster degradation or failure.

FYI, since we had an incident caused by this, we took a few actions following an internal post-mortem:

  • set a default "low" token_max_ttl for all approles
  • set a Grafana alert based on the vault.expire.num_leases metric
  • document how to revoke all leases using the CLI, in case of emergency
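
As an illustration of the alerting item, this is a minimal sketch of the kind of rule such an alert encodes: fire only when several consecutive samples of vault.expire.num_leases are above a threshold. The threshold, the window size and the function name are assumptions made for the example, not values from our setup or from Vault.

```python
def should_alert(samples, threshold, window=3):
    """Return True when the last `window` samples all exceed `threshold`."""
    recent = samples[-window:]
    return len(recent) == window and all(value > threshold for value in recent)

# The first three values are taken from the metric output earlier in this thread;
# the threshold of 100,000 is hypothetical.
print(should_alert([448155, 448151, 448151], threshold=100_000))  # True
# A single spike above the threshold does not fire the alert.
print(should_alert([50_000, 120_000, 90_000], threshold=100_000))  # False
```

Requiring a few consecutive samples avoids paging on a short burst of logins.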

Maybe this warning could be made more visible or better documented in the Vault documentation?

@heatherezell
Contributor

Due to the age of this issue and its quiescence, I'm going to go ahead and close it. Please feel free to re-open it if the behavior persists in current versions of Vault. Thanks!
