Strange token with negative TTL blocking Vault shutdown #4143
Comments
@SoMuchToGrok I am not sure what the problem could be offhand, so please don't take my suggestions as definite solutions; they are just things you might want to try, given the situation. Assuming you have a backup of the storage, can you try bringing up v0.7.3 on it again and revoking the problematic token? That might clear its lease and clear out the leases of the child tokens as well. Another possibility is to perform an incremental upgrade through the versions between 0.7.3 and the latest, keeping an eye on the upgrade guides (https://www.vaultproject.io/guides/upgrading/index.html). That might help narrow down which version fails to handle the upgrade properly.
Thanks for the response @vishalnayak. I have a backup of the storage, but I'm hesitant to restore it because doing so typically introduces new problems (it tries to revoke secrets that the "old" Vault already revoked, and the "new" Vault has no knowledge of the secrets that the "old" Vault issued, leaving around secrets that will never get cleaned up). We're using a variety of different secret backends, so this becomes a fairly cumbersome issue to deal with. If there's a reasonably easy approach that can mitigate these factors, I'd be okay trying this. I read through the Vault upgrade guide before upgrading, but nothing appeared to be particularly relevant. Given that the lease in question had a negative TTL many months before we upgraded to v0.9.5, I'm not certain it's 100% (emphasis on 100%) related to the upgrade. With that said, the new version is clearly enforcing some behavior differently than before. I have yet to try revoking/force-revoking the problematic token on v0.9.5. It's something I plan to try, but I'd like to understand what's going on here before I start deleting anything. This problem is in my staging environment, so I'm trying to be extra careful, as I only have one degree of separation from production.
It'd be helpful if you can get a trace log. Additionally, when Vault is blocked, if you can send a SIGQUIT it will produce a stack trace, which can help us figure out whether the token issue or something else is the cause of the hang -- it might not be the tokens, and might instead be due to changes in the internal locking logic that happened in 0.9.5. As for the tokens, we've fixed a number of issues over time related to tokens not being cleaned up properly once they have expired (expiration prevents them from being used; the problem is only that they aren't cleaned up). It's possible this is something that can be fixed by running https://www.vaultproject.io/api/auth/token/index.html#tidy-tokens
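For reference, the two suggestions above boil down to something like the following sketch, assuming the server process is named vault and the usual VAULT_ADDR/VAULT_TOKEN environment variables are set; the stack dump lands wherever stderr is captured (e.g. the systemd journal):

```sh
# Dump all goroutine stacks: the Go runtime handles SIGQUIT by printing every
# goroutine's stack trace to stderr and then exiting the process.
kill -QUIT "$(pidof vault)"

# Ask the token store to clean up dangling/expired token entries
# (the tidy endpoint referenced in the docs link above).
curl \
  --header "X-Vault-Token: $VAULT_TOKEN" \
  --request POST \
  "$VAULT_ADDR/v1/auth/token/tidy"
```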
@jefferai Thanks for the assistance. I tried hitting auth/token/tidy, but it didn't clean up the tokens in question.
One initial comment from the earlier part of the logs: you have a lot of users / resources, across different backends (pki, aws, etc.) where the underlying data (user, etc.) appears to have been removed without Vault's action. Or, possibly, you reloaded Vault data from a backup state after it had already deleted these values. You'll need to use revoke-force (https://www.vaultproject.io/api/system/leases.html#revoke-force) to clean those up.
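A sketch of what such a cleanup call looks like; the prefix shown is only a placeholder, so substitute whatever mount/path the dead leases live under. revoke-force ignores backend errors, so Vault stops retrying and removes the lease data locally:

```sh
# Force-revoke all leases under a prefix even when the backend can no longer
# revoke the underlying credential (e.g. an AWS user that was deleted by hand).
# Requires a sudo-capable token; the prefix below is purely illustrative.
curl \
  --header "X-Vault-Token: $VAULT_TOKEN" \
  --request PUT \
  "$VAULT_ADDR/v1/sys/leases/revoke-force/aws/creds/deleted-role"
```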
Took a look at the stacktrace. It's cut off, but based on the very last line of it, I actually think you're hitting a deadlock that has already been fixed in master and will be part of 0.9.6, which should be coming this week.
Sorry about that - systemd most likely stopped capturing stdout/stderr too early. Here is the full stack trace:
I think this should be fixed in the just-released 0.9.6...please let us know!
Thanks @jefferai I'll get back to this in a few days and will update with the results!
We're still experiencing this issue on v0.9.6. For more context, I've never manually deleted anything from the storage backend, and I've never restored from a backup (I have in other environments, but not this one). This issue appears to be identical - #4179
Can you get another SIGQUIT stacktrace from 0.9.6? It would be good to see whether the changes made between 0.9.5 and 0.9.6 have shifted the root cause.
Stacktrace from 0.9.6
Been looking at the stacktrace. One question: when shutdown is blocked, are you still seeing events in the log? E.g. are those GETs happening while you're waiting for Vault to shut down? If so, how long do you wait before deciding Vault is stuck? Anything interesting in Vault's logs during that time? How long has Vault been up? I ask because I don't yet see evidence that Vault is blocked, but I do see evidence that Vault is trying to remove large numbers of expired tokens. It may not be that Vault is blocked so much as that so many things are happening at once -- reading in leases and trying to expire old ones -- that Vault simply hasn't gotten to servicing the request to shut down yet. I don't think Go's locks give priority to write operations, so if a few hundred or thousand requests are outstanding and each request is a network event, it could take a while for the shutdown process to successfully grab Vault's state lock.
Another question -- you said it's the same issue as #4179; can you explain why you think so? Did you find that by removing that parent prefix it's instant, e.g. #4179 (comment)?
When the shutdown is blocked, I don't see anything particularly relevant in the Vault logs - but I do see the GETs from the consul monitor (letting the monitor cmd run for a few hours + some cut/uniq magic confirms that we see the same GETs over and over). I've waited at most ~4 days before killing it. Haven't come across anything interesting in the logs in general. Prior to running into this issue (pre-upgrade), Vault had been deployed for roughly 1 year. I believe it's the same issue as #4179 because I see the same loop active from the ~moment Vault turns on until I'm forced to SIGQUIT (it's not just during a shutdown/step-down). It also appears that the parent token relationship he describes is somewhat similar to what I'm seeing here. Haven't tried removing the parent prefix yet, as I'd like to be 100% confident in the issue at hand before I start touching things (this is my staging environment and I'm trying my best to treat it as production). With that said, I would be comfortable removing it manually if you think that's the best course of action. I'd be interested in knowing if #4179 "blocks" a shutdown.
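For anyone reproducing the "cut/uniq magic" above, a rough sketch of the kind of pipeline involved; the exact consul monitor log format varies by Consul version, so the grep pattern is an assumption to adjust:

```sh
# Capture Consul's debug log for a while, then count how often each Vault key
# is fetched; a handful of keys dominating the output suggests a revocation loop.
consul monitor -log-level=debug > consul-monitor.log &
MONITOR_PID=$!
sleep 3600
kill "$MONITOR_PID"

grep -o '/v1/kv/vault[^" ]*' consul-monitor.log \
  | sort | uniq -c | sort -rn | head -25
```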
Good to know -- from the above I didn't see it looping over the same IDs over and over, but if you are seeing that then it does look similar. I'd prefer you not remove it manually yet, as we're working on a patch and it would be good to see if you can try that and have it work for you.
Thanks @jefferai, sounds good.
Any chance you're up for building from master and/or a branch and testing?
Yup, can easily build from whatever.
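Roughly what that entails, as a sketch; the Makefile targets and the GOPATH layout reflect how the Vault repo was typically built around this era, so double-check against the checked-out source:

```sh
# Build a dev binary of Vault from an arbitrary branch for testing.
mkdir -p "$GOPATH/src/github.com/hashicorp"
cd "$GOPATH/src/github.com/hashicorp"
git clone https://github.com/hashicorp/vault.git
cd vault
git checkout master        # or the branch under test
make bootstrap             # install build tooling
make dev                   # dev binary ends up in ./bin/ and $GOPATH/bin
```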
Still trying to figure this out (since in the other thread you indicated that the changes we put in based on that OP's report did not help you). I have a couple of theories, but they're not certain yet...and it may be that both are true, or neither.

One is that in your logs there are a ton of leases that cannot be revoked (AWS users that have since been deleted, postgres users that pg refuses to drop because objects depend on them). Each of those leases will be retried 6 times (and then, if it still fails, Vault stops for that session but tries again later). I don't know how many leases these various tokens generated, but for each of them Vault might be trying to look up the associated token that was encoded in the lease. You could try using https://www.vaultproject.io/api/system/leases.html#revoke-force to clean those up (or, for postgres, change your revocation statements to force pg to drop the user, or the like). That tells Vault to give up trying to actually perform the revocation and allows it to continue cleaning up the leases locally. The negative TTL is actually a potential indicator here. I forget how the logic has changed between 0.7.3 and now, but I believe Vault will try to revoke the leases attached to a token before the token itself is revoked (which is separate from the token being usable -- it should not be usable once the TTL is expired). So seeing such tokens might indeed mean they aren't cleaned up because of the attached leases.

The other possibility is that the hang you're seeing when trying to shut down is related to the expiration attempts, not the tokens. Basically, maybe there is a bad state if a shutdown is triggered while expirations are ongoing.

One thing that would be very useful to understand here is whether you see the same behavior on 0.9.4. In 0.9.5 we modified some of the sealing logic, and more in 0.9.6; in general this was to remove some potential races/deadlocks, but it's possible we introduced a new one at the same time. If you see the same behavior in 0.9.4, we'll know it's not related to these changes, which would be a useful data point.
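An illustration of the postgres suggestion, not anyone's actual config: the mount path database, the role name readonly, and the db_name are placeholders (the older postgresql/ mount accepts the same revocation_statements parameter). The idea is to reassign/drop dependent objects before dropping the role so pg stops refusing the revoke:

```sh
# Hypothetical role update: the revocation SQL first reassigns and drops objects
# owned by the dynamic user so that DROP ROLE no longer fails on dependencies.
vault write database/roles/readonly \
    db_name=my-postgres \
    creation_statements="CREATE ROLE \"{{name}}\" WITH LOGIN PASSWORD '{{password}}' VALID UNTIL '{{expiration}}';" \
    revocation_statements="REASSIGN OWNED BY \"{{name}}\" TO postgres; DROP OWNED BY \"{{name}}\"; DROP ROLE IF EXISTS \"{{name}}\";"
```

Note that REASSIGN OWNED / DROP OWNED only affect the database the revocation runs against, so users owning objects in multiple databases may need additional handling.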
Sounds good @jefferai, appreciate the additional information. I'll give this a try on 0.9.4 and will report back. Unfortunately I might not be able to get around to it for a few days, but will post here as soon as it's done.
Just an update - I will be getting around to this in a few days (ETA Wednesday/Thursday). It's been a ridiculous past few weeks for me :)
Just tested it; appreciate your patience. Unfortunately, still experiencing the same issue on v0.9.4. Caveat with 0.9.4 - I had to compile the binary against Go v1.10.1 because of the PKI/DNS SAN issue in Go v1.10.
OK, so that's actually good in a way -- it means that the changes in shutdown logic are not at fault. So I think we're back at "Vault keeps trying to revoke things it can't actually revoke". Can you work on either fixing your revocation SQL or using revoke-force as appropriate?
Sounds good - I will start by auditing my postgresql revoke statements. Is there a recommended approach for "detecting" these issues? Or are logs the quickest/most accurate way to discover these revocation failures?
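Logs are indeed the most direct signal. As an illustration -- the unit name and the exact error text vary by setup and Vault version, so treat the grep pattern as an assumption to tune:

```sh
# Tally expiration/revocation errors from the last day of server logs to see
# which leases Vault keeps failing to revoke.
journalctl -u vault --since "1 day ago" \
  | grep -iE 'error.*(revoke|revocation)' \
  | sort | uniq -c | sort -rn | head -20
```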
From my testing it looks like
@jefferai thanks for pointing me here.
I'm not sure if it will help at all, but attached is an analysis of the tokens that show up in the step-down loop.
@burdandrei did you come from an older version prior to 0.9.6? If so, how far back?
We finally took the step and tried to revoke the leases connected to the looping tokens:
@dmicanzerofox unfortunately that means your state will now be different from the last log you sent. We found something odd in it and are still trying to figure out what could have caused it.
@dmicanzerofox I noticed you edited the post where you uploaded bad.token.txt; can you re-post the old version of it? We found the oddity in that file, and it is no longer present in the newer one.
I have a copy.
@jefferai this cluster had every version from 0.5, or maybe even 0.4 :)
@calvn yes! Sorry, there was a "bug" in the original - I was misattributing the Parent reference. For the original txt file, I was checking to see if the token ID was present in the parent entry. The following is an example from the first txt file.
After I posted, I tried to fix the script so that the root nodes wouldn't have any value there, instead of showing a random one of their children :( sorry
Ahh, so this was a mistake in the script that built these objects, and not in Vault itself. That eliminates the theory we were working from.
Yes! Sorry. This is top priority for us, and we can pretty much get you any data from the Vault system datastore that you think might be helpful. Thank you
@dmicanzerofox were you able to test with a branch based off #4465?
@burdandrei what version of Vault were you previously on?
@calvn |
@calvn - @dmicanzerofox and I will be able to test that branch tomorrow (05/02). Will update with the results.
We deployed #4465 and tried to revoke one of the expired leases that are looping, but were unable to do so. Something very interesting is that on our previous version the revoke commands hung until the timeout was reached; the commands below exited immediately:
After some more investigation, we were able to identify and reproduce the issue. We are actively working on this and will let you know once we have a fix for you to test.
@dmicanzerofox @burdandrei can you do a build off #4512 and give that a try? We did some internal refactoring of the revocation mechanics, which should address the blocking calls and the infinite-looping issues.
@calvn sure will, thanks!
IT WORKED FOR US!!!!! Leases were revoked almost immediately and we were able to force a step-down! @calvn++++++ @burdandrei !!!!!
Awesome, glad to hear!
@dmicanzerofox @burdandrei thanks for all of the patience you've shown (and @SoMuchToGrok of course!) with us around this -- it wasn't easy to figure out the cause, and it wasn't easy to fix either. On the other hand, we think the new revocation mechanics we put in place are better in many ways, so it should be a win all around.
Absolutely! Glad to help out. I appreciate that the team was so receptive and never gave up! @dmicanzerofox and I are definitely grateful for all the effort to get this resolved :)
Vault Version: v0.9.5
Operating System/Architecture: Ubuntu 16.04.03
Vault Config File:
Issue:
I'm running into an issue where Vault's shutdown is blocked in such a way that it stops serving secrets but never gives up the lock (so the secondary node always stays inactive). This problem started after we upgraded from v0.7.3 to v0.9.5. Vault appears to be stuck in a loop trying to get the following keys from the storage backend:
Additionally, while Vault is active and serving secrets, we constantly see the same calls being made to the storage backend. We've traced all of these back to a single parent token and the child tokens it has issued.
The first parent and child have leases, but no lease is found for any of the remaining children. These leases were created on v0.7.3.
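For anyone following along, one way to inspect the state of such a token is the lookup API; the token value below is a placeholder, and a token whose TTL has gone negative may simply return an error once it is treated as expired:

```sh
# Look up a (placeholder) token to see its TTL, accessor, policies, and whether
# it is an orphan; an already-expired entry typically returns an error instead.
curl \
  --header "X-Vault-Token: $VAULT_TOKEN" \
  --request POST \
  --data '{"token": "PLACEHOLDER-TOKEN-ID"}' \
  "$VAULT_ADDR/v1/auth/token/lookup"
```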
Has anyone seen this before? I'm trying to figure out how it got into this state in the first place, and I'm not sure how to properly clean it up. I have already tried doing a tidy, with no luck.
Thanks in advance for any assistance!