approle: invalid secret id / status 500 on lookup #4955
Comments
Hi there,

The AppRole backend does not cache values. There is a global storage cache for Vault, but it's the same one that has been in there since the beginning and is the same across storage backends, so I don't suspect an issue there. However, you could try turning off the cache.

What I do suspect is Azure storage. The code path producing this error is a straight pass-through to storage, and the only way the entry would be nil (triggering this error) is if the underlying storage says it's nil (as opposed to returning an error). My guess is that you're hitting either an eventual-consistency issue within Azure's storage, some kind of caching performed by the Azure SDK, a bug in the SDK, or a bug in the (community-supported) Azure storage plugin for Vault. When Vault is restarted, the last three are all reinitialized, and the intervening time might allow Azure to catch up if it's the first option. (Disabling Vault's cache slows it down massively, so even if it's an Azure consistency issue, disabling the cache might also appear to help.)
Just to walk you through my analysis a bit, the message you're seeing is output when this function returns nil but not an error: https://github.com/hashicorp/vault/blob/master/builtin/credential/approle/validation.go#L262 There is only one place where that happens, and that's when the storage Get call comes back empty. The key here is that it works after a reboot; if there were something wrong with, say, the calculation of the salt, it would be the same across reboots. So it suggests that it's not that things are being stored improperly, but rather that the underlying storage is acting unreliably.
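In other words, the lookup follows a pattern roughly like the sketch below (stand-in types and names, not the actual Vault source): a storage Get that returns neither an entry nor an error is treated as "not found", and that is what surfaces as "invalid secret id".

```go
package lookupsketch

import (
	"context"
	"encoding/json"
)

// entry and storage are minimal stand-ins for Vault's logical storage types;
// only what this sketch needs is declared here.
type entry struct {
	Key   string
	Value []byte
}

type storage interface {
	Get(ctx context.Context, key string) (*entry, error)
}

type secretIDEntry struct {
	SecretIDAccessor string `json:"secret_id_accessor"`
}

// lookupSecretIDEntry sketches the code path described above: a nil entry
// with a nil error from storage simply means "not found", and the caller
// turns that into the "invalid secret id" response.
func lookupSecretIDEntry(ctx context.Context, s storage, key string) (*secretIDEntry, error) {
	raw, err := s.Get(ctx, key)
	if err != nil {
		return nil, err // a genuine storage failure propagates as an error
	}
	if raw == nil {
		return nil, nil // not found -- this is what produces "invalid secret id"
	}
	var e secretIDEntry
	if err := json.Unmarshal(raw.Value, &e); err != nil {
		return nil, err
	}
	return &e, nil
}
```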
Thanks, Jeff, for looking into this. I will turn off the global Vault cache - if only to see whether this indeed makes a difference. If I understood you correctly, the performance hit might be too heavy to keep it this way in the long run, even if it makes the failures disappear. I will also try to trace what's going on in the Azure storage plugin and in the Azure SDK. I would not be surprised if writes into Azure storage are sometimes only eventually visible. I am not yet clear how this fits with the behavior I observed (reproducibly) today:
So something must be overwritten inside the Azure storage when we hit this failure scenario - otherwise everything should be fine again after a vault restart. In the worst case, we will need to migrate to another backend. I am not keen to do this because migrating the pki backends will require more planning. But we'll see. Maybe I can find the cause in the Azure storage plugin.
So you have two separate issues now? One that is fixed after a restart, and one that isn't?
Also, did you get the same exact errors or were they different? Can you paste the errors and any logs you have?
Can you also provide examples of the steps, such as role configuration and secret-ID fetching? Is it possible that the secret ID that "stayed broken" has limited uses or limited TTL and is simply expiring? Output from the role configuration would help here too.
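For what it's worth, the TTL/uses question can be checked by reading the role configuration back, for example with Vault's Go API client. A sketch; the role name `my-role` and the default `approle` mount path are placeholders, not the values from the affected deployment.

```go
package main

import (
	"fmt"
	"log"
	"os"

	"github.com/hashicorp/vault/api"
)

func main() {
	// DefaultConfig picks up VAULT_ADDR; the token is set explicitly here.
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}
	client.SetToken(os.Getenv("VAULT_TOKEN"))

	// Read the AppRole role configuration; "my-role" is a placeholder.
	role, err := client.Logical().Read("auth/approle/role/my-role")
	if err != nil {
		log.Fatal(err)
	}
	if role == nil {
		log.Fatal("role not found")
	}

	// A non-zero secret_id_ttl or secret_id_num_uses would make secret ids
	// expire or run out of uses on their own.
	fmt.Println("secret_id_ttl:     ", role.Data["secret_id_ttl"])
	fmt.Println("secret_id_num_uses:", role.Data["secret_id_num_uses"])
}
```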
To be honest, I am not sure whether it is the same issue or not. This morning I again had to restart vault; only then did the deployment of our service succeed. (The deployment script issues a new secret-id every time.) Whether it's related to cold caches, I don't know. Only this afternoon did I realize/observe the behavior I described in my last comment. Our secret ids have no explicit TTL. The "broken" secret ids were all issued for the role

While collecting the commands and output below, I found that the newly issued secret ids "broke" even without access to another secret id in the meantime. I therefore suspect that it is in fact a caching issue - something (in the Azure storage plugin??) is not properly persisted, and as soon as the cache is updated from the storage, the secret id becomes inaccessible. The access to the "already broken" secret id was probably a red herring; it simply incurred the delay or forced the cache update that triggered the issue. Anyway, here's the output I see (still with
So as shown by the secret id's
I guess the
The attached vault-messages-2018-07-20.txt contains the vault logs from the relevant time period between creation of the secret id with accessor
You asked for the auth plugin details:
In the meantime, I restarted vault with
I added log statements to the functions
I am baffled - even if there was a bug in either the Azure storage backend or the Azure SDK, why did the issue become so easily reproducible only in the past week? And why does it seem to affect only this one role?
I'm baffled too, but based on your own investigation it does seem like Vault isn't triggering the deletion, so as long as it's not writing the entry with some kind of expiration I don't see how it wouldn't be an issue on the Azure side. But I agree it's odd that you only see it for this role. Is there any way to get logs from Azure of access to that entry? It would be really nice to see if any other entity was somehow accessing that value. Maybe some automated process set up at some point ended up being triggered by the name of the role. Grasping at straws, but without logs from Azure it's all speculation.
I turned on access logging for the relevant storage account. There's always an explicit DELETE request for the accessor entry from the vault server's TCP address, so it's not a bug in the implementation of the Azure blob storage. I made two further observations:
I made the
This means
Just by reading the code, I saw only one explanation for how this could happen:
a) The anonymous go function inside
If the
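The suspected cleanup pattern amounts to: take a snapshot of the accessors belonging to the secret IDs currently visible in storage, then delete every stored accessor entry that is missing from that snapshot. A rough sketch with illustrative names (the actual Vault function names are lost in the truncated comment above):

```go
package tidysketch

import "context"

// storage is a minimal stand-in for Vault's storage view; only what the
// sketch needs is declared here.
type storage interface {
	List(ctx context.Context, prefix string) ([]string, error)
	Delete(ctx context.Context, key string) error
}

// accessorOf maps a stored secret ID entry to its accessor; the real lookup
// reads and decodes the entry, which is omitted here.
type accessorOf func(ctx context.Context, s storage, secretIDKey string) (string, error)

// tidyDanglingAccessors sketches the dangerous pattern: build a snapshot of
// "valid" accessors from the secret ID listing, then delete every accessor
// entry missing from the snapshot. If the listing is incomplete, or a secret
// ID is created after the snapshot is taken, a perfectly valid accessor gets
// removed as "dangling".
func tidyDanglingAccessors(ctx context.Context, s storage, acc accessorOf) error {
	valid := map[string]struct{}{}

	secretIDs, err := s.List(ctx, "secret_id/")
	if err != nil {
		return err
	}
	for _, key := range secretIDs {
		a, err := acc(ctx, s, "secret_id/"+key)
		if err != nil {
			return err
		}
		valid[a] = struct{}{}
	}

	accessors, err := s.List(ctx, "accessor/")
	if err != nil {
		return err
	}
	for _, a := range accessors {
		if _, ok := valid[a]; !ok {
			// Anything not seen in the snapshot is treated as dangling and
			// removed -- the kind of explicit DELETE the Azure access logs show.
			if err := s.Delete(ctx, "accessor/"+a); err != nil {
				return err
			}
		}
	}
	return nil
}
```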
Thanks for doing that tracing, that's super helpful. I'll take a look at it soon. When you say the role used to exist with that name, do you mean that it was created with that name but you now access it via lowercase, or do you mean that you had it as that name, then removed the role, and then recreated the role entirely as lowercase?
No problem - I have a vested interest in this issue being resolved. :-) The role was created (a long time ago with vault 0.6.x) as
OK so any secret IDs ought to have been from the new role. Can you see if ch-service-CHinteg still exists in your Azure storage?
You mean inside
Yep, that's what I was asking. Just trying to think through possibilities.
Outside of the error shadowing, the problem as you identified earlier is that the Get function is coming back with nil for the entry. When you turned off Vault's cache, you took that out of the equation, so it's not the cache. Any chance you can add a check in the Azure storage code in Get where you look for a key with that prefix but that comes back with nil? It would be great to understand whether it's Azure returning nil or whether something is happening along the way in Vault.
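One low-risk way to add such a check is to wrap the storage backend and log whenever a Get under the interesting prefix comes back nil without an error. A sketch with stand-in types (the real Azure backend implements Vault's physical storage interface; in practice the log line would go directly into its Get method, the wrapper only keeps the sketch self-contained):

```go
package storagedebug

import (
	"context"
	"log"
	"strings"
)

// Entry and Backend are minimal stand-ins for the physical storage types the
// Azure backend implements; only what this sketch needs is declared here.
type Entry struct {
	Key   string
	Value []byte
}

type Backend interface {
	Get(ctx context.Context, key string) (*Entry, error)
}

// nilLoggingBackend wraps another Backend and logs whenever a Get under the
// watched prefix returns neither an entry nor an error.
type nilLoggingBackend struct {
	Backend
	watchPrefix string
}

func (b *nilLoggingBackend) Get(ctx context.Context, key string) (*Entry, error) {
	entry, err := b.Backend.Get(ctx, key)
	if err == nil && entry == nil && strings.HasPrefix(key, b.watchPrefix) {
		log.Printf("[DEBUG] storage Get returned a nil entry for key %q", key)
	}
	return entry, err
}
```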
My
If a
I searched for the secret id entry mentioned in the excerpt above in the rest of yesterday's logs; it is not mentioned anymore. I can add log statements to other functions as well. If you want me to dump any specific fields of secret ids, accessors, roles etc., just let me know in which functions. Just to be sure, I can also repeat my experiment with the disabled cache. It will probably have to wait until tomorrow, though.
Looking into this again. I put in a commit into
I'm not seeing anything else obvious -- with that commit in there, can you look for instances where you see that logged message and compare the logged hmac'd secret ID with the logs you have in the Get function? If you can verify that Azure is returning a non-nil entry, and then it's getting turned to nil somewhere along the way, that would be great. (I mean, not great, but at least would help verify that the next step is to add debugging all throughout to figure out where that's changing.)
I have an idea, but it will take me a while (probably tomorrow) to get a build to you with changes. Hopefully once I do you can test it soon after so we can try to get a fix into 0.10.4.
I think I have it sorted -- can you run the
Thanks a lot! Unfortunately, deploying the branch did not make the issue go away. Nevertheless, it gave me a clue:
I therefore suspect that the initialization of
I am going to add log statements to the Azure backend's
I found it! The bug is in the
The call to
I guess since some days the role HMAC
If I prepare a pull request, do you require a Contributor License Agreement or similar? I need to obtain the "ok" from my department head anyway, but I expect this will be a mere formality. Or do you prefer to write the fix yourself?
Awesome that you found it! Glad my branch helped -- the good news is that what I found was a very real race condition so that line of investigation still bore a lot of fruit. One of our devs is happy to code it up -- we are trying to release 0.10.4 very soon so would like to see this fix in.
(removed, mistaken post)
Haha. I somehow switched tabs in the middle of typing. Redacting the above (it was meant to go onto #4981, where it is still public).
Ok, that's fine with me. I have a local version with a fix, but it's a simple loop, so your colleague will probably code it up in no time as well. For me, the most time-consuming part was to google how you write a do-while loop in Golang. :-)
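For reference, Go has no dedicated do-while statement; the usual idiom is a bare `for` loop whose body always runs once and whose exit check sits at the bottom. A minimal sketch of that shape, applied to a paged listing (the scenario and all names below are purely illustrative, not the actual patch):

```go
package main

import "fmt"

// page is a stand-in for one page of a paginated listing; more is false once
// the final page has been returned.
type page struct {
	keys []string
	next int
	more bool
}

// listPage simulates a storage API that returns at most two keys per call
// plus a continuation marker for the next call.
func listPage(all []string, marker int) page {
	end := marker + 2
	if end >= len(all) {
		return page{keys: all[marker:]}
	}
	return page{keys: all[marker:end], next: end, more: true}
}

func main() {
	all := []string{"k1", "k2", "k3", "k4", "k5"}

	var keys []string
	marker := 0
	// Do-while idiom: the body always runs at least once, and the loop only
	// exits when the "is there another page?" check at the bottom fails.
	for {
		p := listPage(all, marker)
		keys = append(keys, p.keys...)
		if !p.more {
			break
		}
		marker = p.next
	}

	fmt.Println(keys) // prints all five keys, not just the first page
}
```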
@chludwig-haufe Can you test out https://github.com/hashicorp/vault/pull/4983/files and see if it helps out?
@chrishoffman With a build of my local patch deployed, the accessor for a
Since your change to the Azure physical backend's
Closing for now, please tell us how it goes!
@chrishoffman The accessor of a secret id for
Thanks a lot to you and @jefferai for looking into this issue so quickly!
Happy to help!
Describe the bug
When our Vault instance has been up for some time (by now about a day), it starts to fail to look up (some) approle secret ids. Applications that try to log in using an affected secret id get the message "invalid secret id". If someone tries to look up the secret id metadata using the corresponding secret id accessor, the response status is 500 with error message "failed to find accessor entry for secret_id_accessor". This also affects secret ids that were issued moments before.
After a restart of the vault server, freshly issued secret ids can be accessed again.
This issue might be related to #4396.
To Reproduce
The server logs do not reveal any extra information.
After a restart of our vault:
Expected behavior
Secret id lookups should always succeed, as in the example above that was captured after the vault restart.
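For illustration, the failing accessor lookup can be issued with Vault's Go API client roughly as follows; the mount path, role name, and accessor value are placeholders rather than the actual values from the affected deployment.

```go
package main

import (
	"fmt"
	"log"
	"os"

	"github.com/hashicorp/vault/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig()) // reads VAULT_ADDR etc.
	if err != nil {
		log.Fatal(err)
	}
	client.SetToken(os.Getenv("VAULT_TOKEN"))

	// Look up the metadata of a secret id via its accessor; role name and
	// accessor below are placeholders.
	secret, err := client.Logical().Write(
		"auth/approle/role/my-role/secret-id-accessor/lookup",
		map[string]interface{}{
			"secret_id_accessor": "00000000-0000-0000-0000-000000000000",
		},
	)
	if err != nil {
		// In the failure scenario described above this call comes back with
		// a 500 and "failed to find accessor entry for secret_id_accessor".
		log.Fatal(err)
	}
	if secret == nil {
		log.Fatal("no data returned")
	}
	fmt.Printf("secret id metadata: %v\n", secret.Data)
}
```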
Environment:

- Vault server version (retrieve with `vault status` or `vault version`): Vault v0.10.3 ('533003e27840d9646cb4e7d23b3a113895da1dd0')
- Client: org.springframework.cloud:spring-cloud-vault-config:1.1.1.RELEASE
- Server Operating System/Architecture: CentOS Linux release 7.5.1804 on an Azure VM Standard DS2 v2 Promo (2 vcpus, 7 GB memory)

Vault server configuration file(s):
Additional context

The affected role was originally created as `ch-service-CHinteg` - back then, the role names were still case sensitive. Since we found the behavior with respect to existing roles confusing after the role names became case insensitive, we deleted the role and re-created it. We also tried using the role name `ch-service-CHinteg` when issuing and accessing the secret id - to no avail.