-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Panic/crash on "unknown cert key type [email protected]" #2877
Comments
Are you sure it crashed the Vault instance completely? Panics generally crash only the request. e.g. if I want to do some debugging on a code path and put in a panic, I can send request after request down there and each one will generate a trace, and I can run other requests with no issue. As for why it panics, sadly the documentation for the function makes no indication that it can panic: Any chance you could send the cert, or tell us how you created it? That's a valid cert type, so I'd like to introspect such a cert. Was it created within the backend or did you pass it in? |
Yeah, after the request that node stopped handling requests, and I had to restart Vault to get it back up. I didn't check to see if the process had fully died -- i suspect it did not, as systemd didn't catch it and restart it itself. But it wasn't listening on its port anymore when I tried to do I need to get the key from the user that triggered this, or find out more about how he created his. I haven't yet been able to reproduce it on my own without info from them. But it did crash 3 times yesterday (hurray for HA!). |
That is odd... I've never seen a panic take down the listening port. Please do let us know about the key generation, in the mean time we can add a recover to catch the panic. |
@bkrodgers Also it'd be useful to know if the stacktraces are all the same, e.g. is it always from signing the cert, or sometimes from other functions. |
Ping #2877 -- but don't close yet in case there are more places.
This is odd -- it seems like it doesn't always crash it. I see panics that didn't cause the server to go down, but when it does go down the panic stack trace is the last thing logged. The difference I see is that in the instances where it does not crash, the stack trace starts with:
In this instances where it does crash, the stack trace starts with this, which is also what comes after the above lines in non-crash events:
If you want to see full logs from all three of my instances, I'm happy to email them to you. I'm still trying to track down what's different about keys that aren't working. One of my users regenerated his before I reached him and the new one worked. And as you pointed out, |
Breakthrough! The message that I suspect that somehow a user sent a signed certificate into Vault to be signed, instead of the unsigned public key. Sure enough, if I do the command below, I can generate the panic.
So the root cause is user error. However, are we confident the |
Also I wonder if it'd be good to do something to detect that the user passed a cert in instead of a public key and return them an error that indicates that's what they did. |
Hi Brian, Thanks for all the debugging! Any chance I can get both full stacktraces? |
Crashes Vault:
Does not crash vault:
|
Also, I checked out master with your patch above, and still get a panic. Message is different though. I have not seen Vault completely crash in my testing locally. Reproduction steps, assuming you already have a cert at
Trace:
|
On the crash vs non-crash, I use request forwarding. Is it possible the difference could be that it crashes it if it's handled via request forwarding, but not if it hits the active node directly? I do see that difference now that I look at the full traces more closely. |
Yes, that's exactly what's happening. The normal http mechanism recovers from a panic by limiting the scope to that request, but apparently grpc doesn't, so I'll have to do that myself. |
@bkrodgers any chance you'll be able to test a branch/PR? |
sure! |
Hi @bkrodgers , I haven't thought of a way to test this in an automated fashion because if it kills the Vault process the tests get unhappy too, but I believe the (Note: the panic will still be logged, but it shouldn't crash the Vault listener anymore) |
Been busy this morning, should be able to test later today. |
Looks like it stops the crash. I get two error messages, depending on whether I hit an active node or a standby node that does request forwarding. Not 100% sure which is which. EDIT -- stream error is from the active node, 500 with the empty message is from the standby.
(no message shown) As you mentioned, I still see the panic as well in my logs, but no crash. Ideally can we suppress the panic from being logged? It'd also be great if we could return a specific error for this scenario to let the user know they passed in a cert instead of an unsigned key, but if that's too much effort to detect, a general error msg instead of an empty 500 or a stream error would be fine. |
If useful, stack traces: Forwarding:
HTTP:
|
I realized looking at that there was a formatting error in the log message, I've pushed up a commit to fix that; I think this should also fix having an empty message with the 500.
Panics are logged (always) when they occur so that they can be provided to us. Any panic is a bug that should be fixed, but without the stack trace it's hard to sort out what/where the issue is. |
Normally yes, but this specific case will be logged if a user sends in a cert by mistake. I don't think that's something you want people to be filing bugs on (once we've got the crash fixed). Yes, hopefully users won't do that often, but I had more than one user do it as we rolled this out. We could also prevent the panic by adding a check on the input cert before it happens. The panic itself happens within the signing process, so a check before passing it to that function would prevent the panic entirely. |
As I said, panics are bugs that need to be fixed. |
Jeff -- sorry if I'm being dense here. You're saying it's still going to log the panic if a user provides this bad input, but also that panics are bugs that need to be fixed. Are you saying that the crypto library needs to fix a bug to handle this invalid input? If so, are you filing that bug? Or are you saying that yes, you'll do something in Vault to prevent the invalid input from getting as far as triggering the panic? |
@bkrodgers The latter |
We just started using the SSH CA backend, and today Vault crashed on me with this:
I can see where in https://github.com/golang/crypto/blob/master/ssh/certs.go it throws this message with a
panic
. Without even worrying about how a user ended up passing in a cert that triggered that or considering if that should be a supported key type, my top concern is why it crashes the Vault instance completely. Seems like it should be caught and reported as an error without any impact on the process.The text was updated successfully, but these errors were encountered: