
Strange token with negative TTL blocking Vault shutdown #4143

Closed
SoMuchToGrok opened this issue Mar 16, 2018 · 66 comments
@SoMuchToGrok

SoMuchToGrok commented Mar 16, 2018

  • Vault Version: v0.9.5

  • Operating System/Architecture: Ubuntu 16.04.03

Vault Config File:

backend "consul" {
  address = "127.0.0.1:8500"
  path = "vault"
  token = ""
}

listener "tcp" {
  address = "1.1.1.1:8200"
  tls_disable = 0
  tls_cert_file = "/vault/server.crt"
  tls_key_file = "/vault/server.key"
}

max_lease_ttl = "2880h"

Issue:

I'm running into an issue where Vault's shutdown is blocked in such a way that it stops serving secrets but never gives up the lock (so the secondary node stays inactive forever). This problem started after we upgraded from v0.7.3 to v0.9.5. Vault appears to be stuck in a loop fetching the following keys from the storage backend:

2018/03/16 18:16:16 [DEBUG] http: Request GET /v1/kv/vault/sys/token/parent/cc5113c6c15b26c4c359407f35b49b0b33d8d6f7/?keys=&separator=%2F (1.026998ms) from=127.0.0.1:9954
2018/03/16 18:16:16 [DEBUG] http: Request GET /v1/kv/vault/sys/token/parent/840bcfe07c0ce34bae584de5a5de6c1857c6d90a/?keys=&separator=%2F (1.029913ms) from=127.0.0.1:4262
2018/03/16 18:16:16 [DEBUG] http: Request GET /v1/kv/vault/sys/token/parent/d85cf4528416f66409cc3b5d9433d58a39dc9cc4/?keys=&separator=%2F (956.545µs) from=127.0.0.1:9954
2018/03/16 18:16:16 [DEBUG] http: Request GET /v1/kv/vault/sys/token/parent/888b51a27895d526857553f8738162661328d74d/?keys=&separator=%2F (975.086µs) from=127.0.0.1:4262
2018/03/16 18:16:16 [DEBUG] http: Request GET /v1/kv/vault/sys/token/parent/dc067819860c1dd23f458b418fcd7aed0b1a697e/?keys=&separator=%2F (1.015359ms) from=127.0.0.1:9954
2018/03/16 18:16:16 [DEBUG] http: Request GET /v1/kv/vault/sys/token/parent/9544625267777e034883cbd8a576de268d5242b0/?keys=&separator=%2F (962.588µs) from=127.0.0.1:4262
2018/03/16 18:16:16 [DEBUG] http: Request GET /v1/kv/vault/sys/token/parent/61da133ade358ec06ae6e6f0152a57235574877f/?keys=&separator=%2F (1.139707ms) from=127.0.0.1:9954
2018/03/16 18:16:16 [DEBUG] http: Request GET /v1/kv/vault/sys/token/parent/995823d5e90d6893865bcdaba8678b1fb0939b7a/?keys=&separator=%2F (1.003417ms) from=127.0.0.1:4262
2018/03/16 18:16:16 [DEBUG] http: Request GET /v1/kv/vault/sys/token/parent/9dc76f1234d11481dcaf614885293c92c463ffe6/?keys=&separator=%2F (995.442µs) from=127.0.0.1:4262
2018/03/16 18:16:16 [DEBUG] http: Request GET /v1/kv/vault/sys/token/parent/098e2b0af50e1dd46cdb3ed4ba03a0d869c93f7f/?keys=&separator=%2F (1.047521ms) from=127.0.0.1:9954
2018/03/16 18:16:16 [DEBUG] http: Request GET /v1/kv/vault/sys/token/parent/3dd19860320685b7b34e99ac8cf8db8b2c83081c/?keys=&separator=%2F (975.041µs) from=127.0.0.1:4262
2018/03/16 18:16:16 [DEBUG] http: Request GET /v1/kv/vault/sys/token/parent/9f357234883dccdad0caeeb78265589c24d35a12/?keys=&separator=%2F (1.040412ms) from=127.0.0.1:9954
2018/03/16 18:16:16 [DEBUG] http: Request GET /v1/kv/vault/sys/token/parent/421cfdbc380512d04780ff506960cfd1743a3759/?keys=&separator=%2F (974.111µs) from=127.0.0.1:4262
2018/03/16 18:16:16 [DEBUG] http: Request GET /v1/kv/vault/sys/token/parent/be4f6170632f168a81c599276cf93c48abd11beb/?keys=&separator=%2F (974.154µs) from=127.0.0.1:9954
2018/03/16 18:16:16 [DEBUG] http: Request GET /v1/kv/vault/sys/token/parent/4d54cc468ba8b7089c51dee15dc5693209269c51/?keys=&separator=%2F (994.944µs) from=127.0.0.1:4262
2018/03/16 18:16:16 [DEBUG] http: Request GET /v1/kv/vault/sys/token/parent/be85887a1bc57a393d903cf6cede8bfa8d083ffe/?keys=&separator=%2F (949.421µs) from=127.0.0.1:9954

Additionally, while Vault is active and serving secrets, we constantly see the same calls being made to the storage backend. We’ve traced all these back to a single parent token and the child tokens it has issued.

── 8561c9e01661bbcbcc2a038f1e7787246cadf40c
   └── 61da133ade358ec06ae6e6f0152a57235574877f
       ├── 098e2b0af50e1dd46cdb3ed4ba03a0d869c93f7f
       ├── 3dd19860320685b7b34e99ac8cf8db8b2c83081c
       ├── 421cfdbc380512d04780ff506960cfd1743a3759
       ├── 4d54cc468ba8b7089c51dee15dc5693209269c51
       ├── 508c12f556d19093e9167d5dcacda376d10cc5d9
       ├── 560883d56509e907837d14d9c12f74921c7e6624
       ├── 6bacbce05311553b655a29141094fe352afa5427
       ├── 6c1ce5eec8d86db284c3aa8cf56c983614638f7c
       ├── 6e0249999410712a23d8c737a901624664f2fe94
       ├── 7b7b4283827160a365d6d9c8522103a5581b8736
       ├── 840bcfe07c0ce34bae584de5a5de6c1857c6d90a
       ├── 888b51a27895d526857553f8738162661328d74d
       ├── 9544625267777e034883cbd8a576de268d5242b0
       ├── 995823d5e90d6893865bcdaba8678b1fb0939b7a
       ├── 9dc76f1234d11481dcaf614885293c92c463ffe6
       ├── 9f357234883dccdad0caeeb78265589c24d35a12
       ├── ba619887b09ba9be87b8ab8d0a04172f6cda41c7
       ├── be4f6170632f168a81c599276cf93c48abd11beb
       ├── be85887a1bc57a393d903cf6cede8bfa8d083ffe
       ├── cc5113c6c15b26c4c359407f35b49b0b33d8d6f7
       ├── d85cf4528416f66409cc3b5d9433d58a39dc9cc4
       └── dc067819860c1dd23f458b418fcd7aed0b1a697e
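
For reference, one way to enumerate this parent/child index directly from the Consul KV store (a sketch; the `vault/` path prefix comes from the config above):

consul kv get -keys vault/sys/token/parent/8561c9e01661bbcbcc2a038f1e7787246cadf40c/
consul kv get -keys vault/sys/token/parent/61da133ade358ec06ae6e6f0152a57235574877f/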

The parent and its direct child have leases, but no lease is found for any of the remaining (grand)child tokens. These leases were created on v0.7.3.
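
The two lease entries below are shown as JSON lease lookups; a sketch of how they could be pulled (the exact command is an assumption, though the same endpoint is used later in this thread):

vault write -format=json sys/leases/lookup lease_id=auth/aws-ec2/login/8561c9e01661bbcbcc2a038f1e7787246cadf40c
vault write -format=json sys/leases/lookup lease_id=auth/token/create/61da133ade358ec06ae6e6f0152a57235574877f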

{
  "request_id": "d07e234f-a561-ce71-ff39-e768f20ee921",
  "lease_id": "",
  "renewable": false,
  "lease_duration": 0,
  "data": {
    "expire_time": "2017-10-29T18:29:31.208498887Z",
    "id": "auth/aws-ec2/login/8561c9e01661bbcbcc2a038f1e7787246cadf40c",
    "issue_time": "2017-09-27T18:29:31.208493668Z",
    "last_renewal": "2017-09-27T18:29:31.437838733Z",
    "renewable": false,
    "ttl": -11924360
  },
  "wrap_info": null,
  "warnings": null,
  "auth": null
}
{
  "request_id": "6346938a-83c4-891f-f264-1f02d2463011",
  "lease_id": "",
  "renewable": false,
  "lease_duration": 0,
  "data": {
    "expire_time": "2017-11-04T15:49:30.21318925Z",
    "id": "auth/token/create/61da133ade358ec06ae6e6f0152a57235574877f",
    "issue_time": "2017-09-27T18:29:31.578314239Z",
    "last_renewal": "2017-10-03T15:49:30.213189473Z",
    "renewable": false,
    "ttl": -11327396
  },
  "wrap_info": null,
  "warnings": null,
  "auth": null
}
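
(For scale: a TTL of -11,924,360 seconds is roughly 138 days, since 11,924,360 / 86,400 ≈ 138, which matches the gap between the 2017-10-29 expire_time and the date of this report; the lease expired months ago but was never cleaned up.)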

Has anyone seen this before? I'm trying to figure out how it got into this state in the first place, and I'm not sure how to properly clean it up. I've already tried running a tidy with no luck.

Thanks in advance for any assistance!

@vishalnayak
Member

@SoMuchToGrok I'm not sure what the problem could be offhand, so please don't take these as definitive solutions; they're just things you might want to try, given the situation.

Assuming you have a backup of the storage, can you try bringing v0.7.3 back up on it and revoking the problematic token? That might clear its lease and clear out the leases of the child tokens as well.

Another possibility is to perform incremental upgrades through the versions between 0.7.3 and the latest, keeping an eye on the upgrade guides (https://www.vaultproject.io/guides/upgrading/index.html). That might help pinpoint which version fails to handle the upgrade properly.

@SoMuchToGrok
Author

SoMuchToGrok commented Mar 16, 2018

Thanks for the response @vishalnayak

I have a backup of the storage, but I'm hesitant to restore it because restoring typically introduces new problems (the restored Vault tries to revoke secrets the "old" Vault already revoked, and it has no knowledge of the secrets the "old" Vault issued after the backup was taken, leaving secrets around that will never get cleaned up). We're using a variety of different secret backends, so this becomes a fairly cumbersome issue to deal with. If there's a reasonably easy approach that mitigates these factors, I'd be okay trying this.

I read through the Vault upgrade guide before upgrading, but nothing appeared to be particularly relevant. Given that the lease in question here had a negative TTL many months before we upgraded to v0.9.5, I'm not certain it's 100% (emphasis on 100%) related to the upgrade. With that said, the new version is clearly enforcing some behavior differently than before.

I have yet to try revoking/force-revoking the problematic token on v0.9.5. It's something I plan to try, but I'd like to understand what's going on here before I start deleting anything. This problem is in my staging environment, so I'm trying to be extra careful as I only have 1 degree of separation from production.

@jefferai
Member

It'd be helpful if you could get a trace log. Additionally, when Vault is blocked, sending it a SIGQUIT will produce a stack trace, which can help us figure out whether the token issue or something else is causing the hang -- it might not be the tokens at all, and might instead be due to some changes to internal locking logic that went into 0.9.5.
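
For reference, a goroutine dump can be triggered roughly like this (the process name is an assumption; the dump is written to Vault's stderr, not returned to the caller):

pkill -QUIT vault    # or: kill -QUIT <vault-pid>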

As for the tokens, we've fixed a number of issues over time related to expired tokens not being cleaned up properly (expiration still prevents the tokens from being used; they just linger in storage). It's possible this is something that can be fixed by running https://www.vaultproject.io/api/auth/token/index.html#tidy-tokens
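
In CLI terms, that tidy call would look something like this (a sketch; the hostname is a placeholder):

vault write -f auth/token/tidy
# or against the API directly:
curl -H "X-Vault-Token: $VAULT_TOKEN" -X POST https://vault.example.com:8200/v1/auth/token/tidy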

@jefferai jefferai added this to the 0.9.6 milestone Mar 18, 2018
@SoMuchToGrok
Author

SoMuchToGrok commented Mar 19, 2018

@jefferai Thanks for the assistance. I tried hitting auth/token/tidy but it didn't clean up the tokens in question.

  1. Here are the trace logs from the first start of Vault, to issuing a SIGQUIT (including the stack trace). Unfortunately I'm not seeing any of the tokens in question referenced in the logs, but hopefully this is helpful.
    https://pastebin.com/raw/xDyGLFes

  2. Here is the stacktrace we get when we issue a SIGQUIT after issuing a SIGINT.
    https://gist.github.com/dmicanzerofox/238d7b557951c786af9b36d654dd288f

@jefferai
Member

One initial comment from the earlier part of the logs: you have a lot of users/resources across different backends (pki, aws, etc.) where the underlying data (user, etc.) appears to have been removed without Vault's involvement. Or, possibly, you reloaded Vault data from a backup state after it had already deleted those values. You'll need to use revoke-force (https://www.vaultproject.io/api/system/leases.html#revoke-force) to clean those up.
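
Using the same CLI form that appears later in this thread, a revoke-force call looks roughly like this ("aws/creds/my-role" is a placeholder prefix; the endpoint requires a sudo-capable token):

vault revoke -force=true -prefix aws/creds/my-role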

@jefferai
Member

Took a look at the stacktrace. It's cut off, but based on the very last line of it, I actually think you're hitting a deadlock that is already fixed in master and will be part of 0.9.6, which should be out this week.

@SoMuchToGrok
Author

SoMuchToGrok commented Mar 19, 2018

Sorry about that - systemd most likely stopped capturing stdout/stderr too early.

Here is the full stack trace:
https://pastebin.com/raw/watS2FX5
(SIGQUIT after SIGINT)

@jefferai jefferai modified the milestones: 0.9.6, 0.10 Mar 20, 2018
@jefferai
Member

I think this should be fixed in the just-released 0.9.6...please let us know!

@SoMuchToGrok
Author

Thanks @jefferai

I'll get back to this in a few days and will update with the results!

@SoMuchToGrok
Author

SoMuchToGrok commented Mar 26, 2018

We're still experiencing this issue on v0.9.6. For more context, I've never manually deleted anything from the storage backend, and I've never restored from a backup (I have in other environments, but not this one).

This issue appears to be identical - #4179

@jefferai
Member

Can you get another SIGQUIT stacktrace from 0.9.6? It would be good to see whether the changes made between 0.9.5 and 0.9.6 have shifted the root cause.

@SoMuchToGrok
Author

Stacktrace from 0.9.6

https://pastebin.com/raw/MLh5jQwj

@jefferai
Member

Been looking at the stacktrace. One question: when shutdown is blocked, are you still seeing events in the log? E.g. are those GETs happening while you're waiting for Vault to shut down? If so, how long do you wait before deciding Vault is stuck? Anything interesting in Vault's logs during that time? How long has Vault been up?

I ask because I don't yet see evidence that Vault is blocked, but I do see evidence that it is trying to remove large numbers of expired tokens. It may not be that Vault is blocked so much as that so much is happening at once (reading in leases and trying to expire old ones) that it simply hasn't gotten around to servicing the shutdown request yet. I don't think Go's locks give priority to write operations, so if a few hundred or thousand requests are outstanding and each one involves a network event, it could take a while for the shutdown process to grab Vault's state lock successfully.

@jefferai
Member

Another question -- you said it's the same issue as #4179; can you explain why you think so? Did you find that removing that parent prefix makes it instant, as in #4179 (comment)?

@SoMuchToGrok
Author

SoMuchToGrok commented Mar 26, 2018

When the shutdown is blocked, I don't see anything particularly relevant in the Vault logs, but I do see the GETs from the consul monitor (letting the monitor command run for a few hours plus some cut/uniq magic confirms that we see the same GETs over and over; see the sketch below). I've waited at most ~4 days before killing it. I haven't come across anything interesting in the logs in general. Prior to running into this issue (pre-upgrade), Vault had been deployed for roughly one year.
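
(A sketch of that cut/uniq pipeline; the exact commands are an assumption:)

consul monitor -log-level=debug \
  | grep -oE 'sys/token/parent/[0-9a-f]+' \
  | sort | uniq -c | sort -rn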

I believe it's the same issue as #4179 because I see the same loop active from roughly the moment Vault starts until I'm forced to SIGQUIT (it's not just during a shutdown/step-down). The parent-token relationship he describes also appears somewhat similar to what I'm seeing here. I haven't tried removing the parent prefix yet, as I'd like to be 100% confident in the issue at hand before I start touching things (this is my staging environment and I'm trying my best to treat it as production). With that said, I'd be comfortable removing it manually if you think that's the best course of action.

I'd be interested in knowing if #4179 "blocks" a shutdown.

@jefferai
Member

Good to know -- from the logs above I didn't see it looping over the same IDs, but if you're seeing that, then it does look similar. I'd prefer you not remove it manually yet; we're working on a patch and it would be good to see whether that works for you.

@SoMuchToGrok
Author

Thanks @jefferai, sounds good.

@jefferai
Member

Any chance you're up for building from master and/or a branch and testing?

@SoMuchToGrok
Author

SoMuchToGrok commented Mar 26, 2018

Yup, can easily build from whatever.
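
For reference, building Vault from source at the time looked roughly like this (a sketch assuming a standard GOPATH layout; `make dev` drops a binary under `bin/`):

mkdir -p $GOPATH/src/github.com/hashicorp
cd $GOPATH/src/github.com/hashicorp
git clone https://github.com/hashicorp/vault.git
cd vault
make bootstrap   # install build dependencies
make dev         # build a dev binary for the local platform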

@jefferai
Member

Still trying to figure this out (since in the other thread you indicated that the changes we put in based on that OP's report did not help you).

I have a couple of theories, but they're not certain yet...and it may be that both are true, or neither.

One is that your logs show a ton of leases that cannot be revoked (AWS users that have since been deleted, Postgres users that pg refuses to drop because objects depend on them). Each of those leases will be retried 6 times per session (and if revocation still fails, Vault gives up for that session but tries again later). I don't know how many leases these various tokens generated, but for each of them Vault might be trying to look up the associated token encoded in the lease. You could try using https://www.vaultproject.io/api/system/leases.html#revoke-force to clean those up (or, for Postgres, change your revocation statements to force pg to drop the user). Revoke-force tells Vault to give up on actually performing the revocation and lets it continue cleaning up the leases locally.
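
For the Postgres case, that usually means making the role's revocation statements drop dependent objects before dropping the role; a sketch against the database secrets engine (the mount, role, and connection names are hypothetical):

vault write database/roles/app-role \
    db_name=my-postgres \
    creation_statements="CREATE ROLE \"{{name}}\" WITH LOGIN PASSWORD '{{password}}' VALID UNTIL '{{expiration}}';" \
    revocation_statements="REASSIGN OWNED BY \"{{name}}\" TO postgres; DROP OWNED BY \"{{name}}\"; DROP ROLE IF EXISTS \"{{name}}\";"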

The negative TTL is actually a potential indicator here. I forget exactly how the logic changed between 0.7.3 and now, but I believe Vault tries to revoke the leases attached to a token before revoking the token itself (which is separate from whether the token is usable; it should not be once its TTL has expired). So seeing such tokens might indeed mean they aren't cleaned up because of their attached leases.

The other possibility is that the hang you're seeing on shutdown is related to the expiration attempts, not the tokens themselves. Basically, maybe there is a bad state if a shutdown is triggered while expirations are ongoing. One thing that would be very useful to understand is whether you see the same behavior on 0.9.4. In 0.9.5 we modified some of the sealing logic, and more in 0.9.6, generally to remove potential races/deadlocks, but it's possible we introduced a new one at the same time. If you see the same behavior on 0.9.4, we'll know it's not related to those changes, which would be a useful data point.

@SoMuchToGrok
Author

Sounds good @jefferai, appreciate the additional information.

I'll give this a try on 0.9.4 and will report back. Unfortunately might not be able to get around to it for a few days, but will post here as soon as it's done.

@SoMuchToGrok
Author

SoMuchToGrok commented Apr 10, 2018

Just an update - I will be getting around to this in a few days (ETA Wednesday/Thursday). It's been a ridiculous past few weeks for me :)

@jefferai jefferai modified the milestones: 0.10, 0.10.1 Apr 10, 2018
@SoMuchToGrok
Author

SoMuchToGrok commented Apr 11, 2018

Just tested it; appreciate your patience. Unfortunately, still experiencing the same issue on v0.9.4.

Caveat with 0.9.4 - I had to compile the binary against Go v1.10.1 because of the PKI/DNS SAN issue in Go v1.10.

@jefferai
Member

OK, so that's actually good in a way -- it means the changes to the shutdown logic are not at fault. So I think we're back to "Vault keeps trying to revoke things it can't actually revoke". Can you work on either fixing your revocation SQL or using revoke-force, as appropriate?

@SoMuchToGrok
Author

Sounds good - I will start by auditing my postgresql revoke statements.

Is there a recommended approach for "detecting" these issues? Or are logs the quickest/most accurate way to discover these revocation failures?

@dmicanzerofox
Contributor

dmicanzerofox commented Apr 28, 2018

From my testing it looks like revokeTreeSalted can cycle, e.g. if a child token were somehow the parent of one of its parents. Creating the token entries through the public method prevented the cycle, but crafting them manually should illustrate it. (Also, I don't think we actually have a cycle :p I just noticed this while doing the investigation.)

+/*
+2018/03/12 12:15:51 [DEBUG] http: Request GET /v1/kv/vault/sys/token/parent/6c1ce5eec8d86db284c3aa8cf56c983614638f7c/?keys=&separator=%2F
+(1.040345ms) from=127.0.0.1:39994
+ */
+func TestTokenStore_RevokeTree_CycleLogs(t *testing.T) {
+       _, ts, _, _ := TestCoreWithTokenStore(t)
+
+       root1 := &TokenEntry{
+               Path: "parent/1/",
+               ID:   "1",
+       }
+       if err := ts.create(context.Background(), root1); err != nil {
+               t.Fatalf("err: %v", err)
+       }
+
+       root2 := &TokenEntry{
+               Path: "parent/2/",
+               ID:   "2",
+       }
+       if err := ts.create(context.Background(), root2); err != nil {
+               t.Fatalf("err: %v", err)
+       }
+
+       // intended to make the second root a child of the first; creating
+       // entries through ts.create prevents an actual cycle from forming
+       entry := &TokenEntry{
+               Parent: root1.ID,
+               ID:     "3",
+       }
+       if err := ts.create(context.Background(), entry); err != nil {
+               t.Fatalf("err: %v", err)
+       }
+       /*
+       ctx := context.Background()
+
+       entry.Policies = policyutil.SanitizePolicies(entry.Policies, policyutil.DoNotAddDefaultPolicy)
+
+       err := ts.createAccessor(ctx, entry)
+       if err != nil {
+               t.Fatalf("err: %v", err)
+       }
+
+       ts.storeCommon(ctx, entry, true)
+       */
+       err := ts.RevokeTree(context.Background(), root2.ID)
+       if err != nil {
+               t.Fatalf("err: %v", err)
+       }
 }

@burdandrei
Contributor

@jefferai thanks for pointing me here.
We updated from 0.9.6 to 0.10.0.
I noticed the issue because Vault is running on the Consul server instance, and we were using a t2 instance; after more than 24 hours of looping we ran out of CPU credits, which is when I noticed the behavior.
We never restored the backend from a backup.
But we did have a problem updating to 0.9.0 while using a database (MongoDB) mount: timeouts and really strange things.

@dmicanzerofox
Contributor

dmicanzerofox commented Apr 30, 2018

I'm not sure if it will help at all, but attached is an analysis of the tokens that show up in the step-down loop.

bad.tokens.txt

@jefferai
Member

@burdandrei did you come from an older version prior to 0.9.6? If so how far back?

@dmicanzerofox
Contributor

dmicanzerofox commented Apr 30, 2018

We finally took the step and tried to revoke the leases connected to the looping tokens:

VAULT_CLIENT_TIMEOUT=600 vault revoke -prefix auth/token/create/61da133ade358ec06ae6e6f0152a57235574877f

# Error revoking leases with prefix auth/token/create/61da133ade358ec06ae6e6f0152a57235574877f: Put https://x.x.x.x:8200/v1/sys/leases/revoke-prefix/auth/token/create/61da133ade358ec06ae6e6f0152a57235574877f: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
VAULT_CLIENT_TIMEOUT=600 vault revoke -force=true -prefix auth/token/create/61da133ade358ec06ae6e6f0152a57235574877f

# Error revoking leases with prefix auth/token/create/61da133ade358ec06ae6e6f0152a57235574877f: Put https://x.x.x.x:8200/v1/sys/leases/revoke-prefix/auth/token/create/61da133ade358ec06ae6e6f0152a57235574877f: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

Both commands hung until the timeout was reached.

@jefferai
Member

@dmicanzerofox unfortunately that means your state is now different from the last log you sent. We found something odd in it and are still trying to figure out what could have caused it.

@calvn
Contributor

calvn commented Apr 30, 2018

@dmicanzerofox I noticed you edited the post where you uploaded bad.tokens.txt; can you re-post the old version of it? We found an oddity in that file that is no longer present in the newer one.

@jefferai
Member

I have a copy.
bad.tokens.txt

@burdandrei
Contributor

@jefferai this cluster had every version from 0.5, or maybe even 0.4 :)

@dmicanzerofox
Contributor

@calvn yes! Sorry, there was a "bug" in the original: I was misattributing the Parent reference.

For the original txt file, I was checking whether the token ID was present in the parent entry. The following is an example from the first txt file.

{
  "ID": "8561c9e01661bbcbcc2a038f1e7787246cadf40c",
  "Primary": "vault/sys/token/id/8561c9e01661bbcbcc2a038f1e7787246cadf40c",
  "Lease": "vault/sys/expire/id/auth/aws-ec2/login/8561c9e01661bbcbcc2a038f1e7787246cadf40c",
  "SecondaryIndex": "",
  "ParentIndex": "vault/sys/token/parent/8561c9e01661bbcbcc2a038f1e7787246cadf40c/61da133ade358ec06ae6e6f0152a57235574877f",
  "Secret": {
    "request_id": "4a1dd037-d0c1-08be-2c14-93e8b19604ce",
    "lease_id": "",
    "lease_duration": 0,
    "renewable": false,
    "data": {
      "expire_time": "2017-10-29T18:29:31.208498887Z",
      "id": "auth/aws-ec2/login/8561c9e01661bbcbcc2a038f1e7787246cadf40c",
      "issue_time": "2017-09-27T18:29:31.208493668Z",
      "last_renewal": "2017-09-27T18:29:31.437838733Z",
      "renewable": false,
      "ttl": -15557402
    },
    "warnings": null
  }
}

After I posted, I tried to fix the script so that root nodes wouldn't have any value there, instead of showing a random one of their children :( sorry

{
  "ID": "8561c9e01661bbcbcc2a038f1e7787246cadf40c",
  "Primary": "vault/sys/token/id/8561c9e01661bbcbcc2a038f1e7787246cadf40c",
  "Lease": "vault/sys/expire/id/auth/aws-ec2/login/8561c9e01661bbcbcc2a038f1e7787246cadf40c",
  "SecondaryIndex": "",
  "ParentIndex": "",
  "Secret": {
    "request_id": "28987c46-dd71-a4dd-85dc-3b2560665764",
    "lease_id": "",
    "lease_duration": 0,
    "renewable": false,
    "data": {
      "expire_time": "2017-10-29T18:29:31.208498887Z",
      "id": "auth/aws-ec2/login/8561c9e01661bbcbcc2a038f1e7787246cadf40c",
      "issue_time": "2017-09-27T18:29:31.208493668Z",
      "last_renewal": "2017-09-27T18:29:31.437838733Z",
      "renewable": false,
      "ttl": -15796333
    },
    "warnings": null
  }
}

@calvn
Contributor

calvn commented Apr 30, 2018

Ahh, so this was a mistake in the script that built these objects, not in Vault itself. That rules out the theory we had.

@dmicanzerofox
Contributor

dmicanzerofox commented Apr 30, 2018

Yes! Sorry. This is top priority for us, and we can get you pretty much any data from the Vault system datastore that you think might be helpful.

Thank you

@calvn
Contributor

calvn commented Apr 30, 2018

@dmicanzerofox were you able to test with a branch based off #4465?

@calvn
Contributor

calvn commented Apr 30, 2018

@burdandrei what version of Vault were you previously on?

@burdandrei
Contributor

@calvn
0.9.6
...
0.6.2
=)

@SoMuchToGrok
Author

SoMuchToGrok commented May 1, 2018

@calvn - @dmicanzerofox and I will be able to test that branch tomorrow (05/02). Will update with the results.

@dmicanzerofox
Contributor

@calvn

We deployed #4465 and tried to revoke one of the expired leases that is looping, but were unable to do so. Something very interesting: on our previous version the revoke commands hung until the timeout was reached, while the commands below exited immediately:

root@v-stag-4-194:/home/dmican# vault write sys/leases/lookup lease_id=auth/token/create/61da133ade358ec06ae6e6f0152a57235574877f
Key             Value
---             -----
expire_time     2017-11-04T15:49:30.21318925Z
id              auth/token/create/61da133ade358ec06ae6e6f0152a57235574877f
issue_time      2017-09-27T18:29:31.578314239Z
last_renewal    2017-10-03T15:49:30.213189473Z
renewable       false
ttl             -15455338
root@v-stag-4-194:/home/dmican# VAULT_CLIENT_TIMEOUT=600 vault revoke -prefix auth/token/create/61da133ade358ec06ae6e6f0152a57235574877f
WARNING! The "vault revoke" command is deprecated. Please use "vault lease
revoke" instead. This command will be removed in Vault 0.11 (or later).

Success! Revoked any leases with prefix: auth/token/create/61da133ade358ec06ae6e6f0152a57235574877f
root@v-stag-4-194:/home/dmican# vault write sys/leases/lookup lease_id=auth/token/create/61da133ade358ec06ae6e6f0152a57235574877f
Key             Value
---             -----
expire_time     2017-11-04T15:49:30.21318925Z
id              auth/token/create/61da133ade358ec06ae6e6f0152a57235574877f
issue_time      2017-09-27T18:29:31.578314239Z
last_renewal    2017-10-03T15:49:30.213189473Z
renewable       false
ttl             -15455363
root@v-stag-4-194:/home/dmican# VAULT_CLIENT_TIMEOUT=600 vault revoke -force=true -prefix auth/token/create/61da133ade358ec06ae6e6f0152a57235574877f
WARNING! The "vault revoke" command is deprecated. Please use "vault lease
revoke" instead. This command will be removed in Vault 0.11 (or later).

Warning! Force-removing leases can cause Vault to become out of sync with
secret engines!
Success! Force revoked any leases with prefix: auth/token/create/61da133ade358ec06ae6e6f0152a57235574877f
root@v-stag-4-194:/home/dmican# vault write sys/leases/lookup lease_id=auth/token/create/61da133ade358ec06ae6e6f0152a57235574877f
Key             Value
---             -----
expire_time     2017-11-04T15:49:30.21318925Z
id              auth/token/create/61da133ade358ec06ae6e6f0152a57235574877f
issue_time      2017-09-27T18:29:31.578314239Z
last_renewal    2017-10-03T15:49:30.213189473Z
renewable       false
ttl             -15455420

@calvn
Contributor

calvn commented May 2, 2018

After some more investigation, we were able to identify and reproduce the issue. We are actively working on it and will let you know once we have a fix for you to test.

@calvn
Contributor

calvn commented May 8, 2018

@dmicanzerofox @burdandrei can you do a build off #4512 and give that a try? We did some internal refactoring of the revocation mechanics, which should address the blocking calls and the infinite looping.
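
For anyone following along, checking out a PR for a local build can be done with GitHub's standard pull refspec (a sketch):

cd $GOPATH/src/github.com/hashicorp/vault
git fetch origin pull/4512/head:pr-4512
git checkout pr-4512
make dev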

@burdandrei
Contributor

@calvn sure will, thanks!

@dmicanzerofox
Contributor

dmicanzerofox commented May 9, 2018

IT WORKED FOR US!!!!! Leases were revoked almost immediately and we were able to force a stepdown! @calvn++++++ @burdandrei !!!!!

@calvn
Contributor

calvn commented May 9, 2018

Awesome, glad to hear!

@jefferai
Member

jefferai commented May 9, 2018

@dmicanzerofox @burdandrei thanks for all of the patience you've shown (and @SoMuchToGrok of course!) with us on this -- it wasn't easy to figure out the cause, and it wasn't easy to fix either. On the other hand, we think the new revocation mechanics we put in place are better in many ways, so it should be a win all around.

@SoMuchToGrok
Author

Absolutely! Glad to help out. I appreciate that the team was so receptive and never gave up! @dmicanzerofox and I are definitely grateful for all the effort to get this resolved :)
