Strange token with negative TTL blocking Vault shutdown #4143
Comments
@SoMuchToGrok I am not sure what the problem could be offhand, so please don't take my suggestions as definite solutions; they are just things you might want to try, given the situation. Assuming you have a backup of the storage, can you try bringing up v0.7.3 on it again and revoking the problematic token? That might clear its lease and clear out the leases of the child tokens as well. Another possibility is to perform an incremental upgrade through the versions between 0.7.3 and the latest, keeping an eye on the upgrade guides (https://www.vaultproject.io/guides/upgrading/index.html). That might help narrow down which version fails to handle the upgrade properly.
Thanks for the response @vishalnayak. I have a backup of the storage, but I'm hesitant to restore it because doing so typically introduces new problems (it tries to revoke secrets that the "old" Vault already revoked, and the "new" Vault has no knowledge of the secrets that the "old" Vault issued, leaving around secrets that will never get cleaned up). We're using a variety of different secret backends, so this becomes a fairly cumbersome issue to deal with. If there's a reasonably easy approach that can mitigate these factors, I'd be okay trying this. I read through the Vault upgrade guide before upgrading, but nothing appeared to be particularly relevant. Given that the lease in question had a negative TTL many months before we upgraded to v0.9.5, I'm not certain it's 100% (emphasis on 100%) related to the upgrade. With that said, the new version is clearly enforcing some behavior differently than before. I have yet to try revoking/force-revoking the problematic token on v0.9.5. It's something I plan to try, but I'd like to understand what's going on here before I start deleting anything. This problem is in my staging environment, so I'm trying to be extra careful, as I only have one degree of separation from production.
It'd be helpful if you can get a trace log. Additionally, when Vault is blocked, if you can send a SIGQUIT it will produce a stack trace, which can help us figure out whether the token issue or something else is the cause of the hang -- it might not be the tokens, and might instead be due to changes in the internal locking logic that happened in 0.9.5. As for the tokens, we've fixed a number of issues over time related to tokens not being cleaned up properly once they have expired (expiration prevents them from being used; the problem is only that they aren't cleaned up). It's possible this is something that can be fixed by running https://www.vaultproject.io/api/auth/token/index.html#tidy-tokens
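For reference, the two suggestions above boil down to something like the following sketch, assuming the server process is named vault and the usual VAULT_ADDR/VAULT_TOKEN environment variables are set; the stack dump lands wherever stderr is captured (e.g. the systemd journal):

```sh
# Dump all goroutine stacks: the Go runtime handles SIGQUIT by printing every
# goroutine's stack trace to stderr and then exiting the process.
kill -QUIT "$(pidof vault)"

# Ask the token store to clean up dangling/expired token entries
# (the tidy endpoint referenced in the docs link above).
curl \
  --header "X-Vault-Token: $VAULT_TOKEN" \
  --request POST \
  "$VAULT_ADDR/v1/auth/token/tidy"
```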
@jefferai Thanks for the assistance. I tried hitting auth/token/tidy, but it didn't clean up the tokens in question.
One initial comment from the earlier part of the logs: you have a lot of users / resources, across different backends (pki, aws, etc.) where the underlying data (user, etc.) appears to have been removed without Vault's action. Or, possibly, you reloaded Vault data from a backup state after it had already deleted these values. You'll need to use revoke-force (https://www.vaultproject.io/api/system/leases.html#revoke-force) to clean those up.
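A sketch of what such a cleanup call looks like; the prefix shown is only a placeholder, so substitute whatever mount/path the dead leases live under. revoke-force ignores backend errors, so Vault stops retrying and removes the lease data locally:

```sh
# Force-revoke all leases under a prefix even when the backend can no longer
# revoke the underlying credential (e.g. an AWS user that was deleted by hand).
# Requires a sudo-capable token; the prefix below is purely illustrative.
curl \
  --header "X-Vault-Token: $VAULT_TOKEN" \
  --request PUT \
  "$VAULT_ADDR/v1/sys/leases/revoke-force/aws/creds/deleted-role"
```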
Took a look at the stacktrace. It's cut off, but based on the very last line of it, I actually think you're hitting a deadlock that has already been fixed in master and will be part of 0.9.6, which should be coming this week.
Sorry about that - systemd most likely stopped capturing stdout/stderr too early. Here is the full stack trace:
I think this should be fixed in the just-released 0.9.6...please let us know!
Thanks @jefferai I'll get back to this in a few days and will update with the results!
We're still experiencing this issue on v0.9.6. For more context, I've never manually deleted anything from the storage backend, and I've never restored from a backup (I have in other environments, but not this one). This issue appears to be identical - #4179
Can you get another SIGQUIT stacktrace from 0.9.6? It would be good to see whether the changes made between 0.9.5 and 0.9.6 have shifted the root cause.
Stacktrace from 0.9.6
Been looking at the stacktrace. One question: when shutdown is blocked, are you still seeing events in the log? E.g. are those GETs happening while you're waiting for Vault to shut down? If so, how long do you wait before deciding Vault is stuck? Anything interesting in Vault's logs during that time? How long has Vault been up? I ask because I don't yet see evidence that Vault is blocked, but I do see evidence that Vault is trying to remove large numbers of expired tokens. It may not be that Vault is blocked so much as that so many things are happening at once -- reading in leases and trying to expire old ones -- that Vault simply hasn't gotten to servicing the request to shut down yet. I don't think Go's locks give priority to write operations, so if a few hundred or thousand requests are outstanding and each request is a network event, it could take a while for the shutdown process to successfully grab Vault's state lock.
Another question -- you said it's the same issue as #4179; can you explain why you think so? Did you find that by removing that parent prefix it's instant, e.g. #4179 (comment)?
When the shutdown is blocked, I don't see anything particularly relevant in the Vault logs - but I do see the GETs from the consul monitor (letting the monitor cmd run for a few hours + some cut/uniq magic confirms that we see the same GETs over and over). I've waited at most ~4 days before killing it. Haven't come across anything interesting in the logs in general. Prior to running into this issue (pre-upgrade), Vault had been deployed for roughly 1 year. I believe it's the same issue as #4179 because I see the same loop active from the ~moment Vault turns on until I'm forced to SIGQUIT (it's not just during a shutdown/step-down). It also appears that the parent token relationship he describes is somewhat similar to what I'm seeing here. Haven't tried removing the parent prefix yet, as I'd like to be 100% confident in the issue at hand before I start touching things (this is my staging environment and I'm trying my best to treat it as production). With that said, I would be comfortable removing it manually if you think that's the best course of action. I'd be interested in knowing if #4179 "blocks" a shutdown.
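For anyone reproducing the "cut/uniq magic" above, a rough sketch of the kind of pipeline involved; the exact consul monitor log format varies by Consul version, so the grep pattern is an assumption to adjust:

```sh
# Capture Consul's debug log for a while, then count how often each Vault key
# is fetched; a handful of keys dominating the output suggests a revocation loop.
consul monitor -log-level=debug > consul-monitor.log &
MONITOR_PID=$!
sleep 3600
kill "$MONITOR_PID"

grep -o '/v1/kv/vault[^" ]*' consul-monitor.log \
  | sort | uniq -c | sort -rn | head -25
```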
Good to know -- from the above I didn't see it looping over the same IDs over and over, but if you are seeing that then it does look similar. I'd prefer you not remove it manually yet, as we're working on a patch and it would be good to see if you can try that and have it work for you.
Thanks @jefferai, sounds good.
Any chance you're up for building from master and/or a branch and testing?
Yup, can easily build from whatever.
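Roughly what that entails, as a sketch; the Makefile targets and the GOPATH layout reflect how the Vault repo was typically built around this era, so double-check against the checked-out source:

```sh
# Build a dev binary of Vault from an arbitrary branch for testing.
mkdir -p "$GOPATH/src/github.com/hashicorp"
cd "$GOPATH/src/github.com/hashicorp"
git clone https://github.com/hashicorp/vault.git
cd vault
git checkout master        # or the branch under test
make bootstrap             # install build tooling
make dev                   # dev binary ends up in ./bin/ and $GOPATH/bin
```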
Still trying to figure this out (since in the other thread you indicated that the changes we put in based on that OP's report did not help you). I have a couple of theories, but they're not certain yet...and it may be that both are true, or neither.

One is that in your logs there are a ton of leases that cannot be revoked (AWS users that have since been deleted, postgres users that pg refuses to drop because objects depend on them). Each of those leases will be retried 6 times (and then, if it still fails, Vault stops for that session but tries again later). I don't know how many leases these various tokens generated, but for each of them Vault might be trying to look up the associated token that was encoded in the lease. You could try using https://www.vaultproject.io/api/system/leases.html#revoke-force to clean those up (or, for postgres, change your revocation statements to force pg to drop the user, or the like). That tells Vault to give up trying to actually perform the revocation and allows it to continue cleaning up the leases locally. The negative TTL is actually a potential indicator here. I forget how the logic has changed between 0.7.3 and now, but I believe Vault will try to revoke the leases attached to a token before the token itself is revoked (which is separate from the token being usable -- it should not be usable once the TTL is expired). So seeing such tokens might indeed mean they aren't cleaned up because of the attached leases.

The other possibility is that the hang you're seeing when trying to shut down is related to the expiration attempts, not the tokens. Basically, maybe there is a bad state if a shutdown is triggered while expirations are ongoing.

One thing that would be very useful to understand here is whether you see the same behavior on 0.9.4. In 0.9.5 we modified some of the sealing logic, and more in 0.9.6; in general this was to remove some potential races/deadlocks, but it's possible we introduced a new one at the same time. If you see the same behavior in 0.9.4, we'll know it's not related to these changes, which would be a useful data point.
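An illustration of the postgres suggestion, not anyone's actual config: the mount path database, the role name readonly, and the db_name are placeholders (the older postgresql/ mount accepts the same revocation_statements parameter). The idea is to reassign/drop dependent objects before dropping the role so pg stops refusing the revoke:

```sh
# Hypothetical role update: the revocation SQL first reassigns and drops objects
# owned by the dynamic user so that DROP ROLE no longer fails on dependencies.
vault write database/roles/readonly \
    db_name=my-postgres \
    creation_statements="CREATE ROLE \"{{name}}\" WITH LOGIN PASSWORD '{{password}}' VALID UNTIL '{{expiration}}';" \
    revocation_statements="REASSIGN OWNED BY \"{{name}}\" TO postgres; DROP OWNED BY \"{{name}}\"; DROP ROLE IF EXISTS \"{{name}}\";"
```

Note that REASSIGN OWNED / DROP OWNED only affect the database the revocation runs against, so users owning objects in multiple databases may need additional handling.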
Sounds good @jefferai, appreciate the additional information. I'll give this a try on 0.9.4 and will report back. Unfortunately I might not be able to get around to it for a few days, but will post here as soon as it's done.
Just an update - I will be getting around to this in a few days (ETA Wednesday/Thursday). It's been a ridiculous past few weeks for me :)
Just tested it; appreciate your patience. Unfortunately, still experiencing the same issue on v0.9.4. Caveat with 0.9.4 - I had to compile the binary against Go v1.10.1 because of the PKI/DNS SAN issue in Go v1.10.
OK, so that's actually good in a way -- it means that the changes in shutdown logic are not at fault. So I think we're back at "Vault keeps trying to revoke things it can't actually revoke". Can you work on either fixing your revocation SQL or using revoke-force as appropriate?
Sounds good - I will start by auditing my postgresql revoke statements. Is there a recommended approach for "detecting" these issues? Or are logs the quickest/most accurate way to discover these revocation failures?
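Logs are indeed the most direct signal. As an illustration -- the unit name and the exact error text vary by setup and Vault version, so treat the grep pattern as an assumption to tune:

```sh
# Tally expiration/revocation errors from the last day of server logs to see
# which leases Vault keeps failing to revoke.
journalctl -u vault --since "1 day ago" \
  | grep -iE 'error.*(revoke|revocation)' \
  | sort | uniq -c | sort -rn | head -20
```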
From my testing it looks like
@jefferai thanks for pointing me here.
I'm not sure if it will help at all, but attached is an analysis of the tokens that show up in the step-down loop.
@burdandrei did you come from an older version prior to 0.9.6? If so, how far back?
We finally took the step and tried to revoke the leases connected to the looping tokens:
@dmicanzerofox unfortunately that means your state will now be different from the last log you sent. We found something odd in it and are still trying to figure out what could have caused it.
@dmicanzerofox I noticed you edited the post where you uploaded bad.token.txt; can you re-post the old version of it? We found the oddity in that file, and it is no longer present in the newer one.
I have a copy.
@jefferai this cluster had every version from 0.5, or maybe even 0.4 :)
@calvn yes! Sorry, there was a "bug" in the original - I was misattributing the Parent reference. For the original txt file, I was checking to see if the token ID was present in the parent entry. The following is an example from the first txt file.
After I posted, I tried to fix the script so that the root nodes wouldn't have any value there, instead of showing a random one of their children :( sorry
Ahh, so this was a mistake in the script that built these objects, and not in Vault itself. That eliminates the theory we were working from.
Yes! Sorry. This is top priority for us, and we can pretty much get you any data from the Vault system datastore that you think might be helpful. Thank you
@dmicanzerofox were you able to test with a branch based off #4465?
@burdandrei what version of Vault were you previously on?
@calvn |
@calvn - @dmicanzerofox and I will be able to test that branch tomorrow (05/02). Will update with the results.
We deployed #4465 and tried to revoke one of the expired leases that are looping, but were unable to do so. Something very interesting is that on our previous version the revoke commands hung until the timeout was reached; the commands below exited immediately:
After some more investigation, we were able to identify and reproduce the issue. We are actively working on this and will let you know once we have a fix for you to test.
@dmicanzerofox @burdandrei can you do a build off #4512 and give that a try? We did some internal refactoring of the revocation mechanics, which should address the blocking calls and the infinite-looping issues.
@calvn sure will, thanks!
IT WORKED FOR US!!!!! Leases were revoked almost immediately and we were able to force a step-down! @calvn++++++ @burdandrei !!!!!
Awesome, glad to hear!
@dmicanzerofox @burdandrei thanks for all of the patience you've shown (and @SoMuchToGrok of course!) with us around this -- it wasn't easy to figure out the cause, and it wasn't easy to fix either. On the other hand, we think the new revocation mechanics we put in place are better in many ways, so it should be a win all around.
Absolutely! Glad to help out. I appreciate that the team was so receptive and never gave up! @dmicanzerofox and I are definitely grateful for all the effort to get this resolved :)
Vault Version: v0.9.5
Operating System/Architecture: Ubuntu 16.04.03
Vault Config File:
Issue:
I'm running into an issue where Vault's shutdown is blocked in such a way that it stops serving secrets but never gives up the lock (so the secondary node always stays inactive). This problem started after we upgraded from v0.7.3 to v0.9.5. Vault appears to be stuck in a loop trying to get the following keys from the storage backend:
Additionally, while Vault is active and serving secrets, we constantly see the same calls being made to the storage backend. We've traced all of these back to a single parent token and the child tokens it has issued.
The first parent and child have leases, but no lease is found for any of the remaining children. These leases were created on v0.7.3.
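For anyone following along, one way to inspect the state of such a token is the lookup API; the token value below is a placeholder, and a token whose TTL has gone negative may simply return an error once it is treated as expired:

```sh
# Look up a (placeholder) token to see its TTL, accessor, policies, and whether
# it is an orphan; an already-expired entry typically returns an error instead.
curl \
  --header "X-Vault-Token: $VAULT_TOKEN" \
  --request POST \
  --data '{"token": "PLACEHOLDER-TOKEN-ID"}' \
  "$VAULT_ADDR/v1/auth/token/lookup"
```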
Has anyone seen this before? I'm trying to figure out how it got into this state in the first place, and I'm not sure how to properly clean it up. I have already tried doing a tidy, with no luck.
Thanks in advance for any assistance!