TLS failures in blocking Dial calls don't provide useful error messages #2031
Comments
Also, #1917 documents that RPCs with fail-fast behavior will somehow get access to the underlying connection error. My system isn't using fail-fast because we want the gRPC library to retry past transient network issues.
The trouble is that this isn't really a "Dial" function, per se, despite the name. Changing the returned error may be possible, but it would be a behavior change: anyone checking for `context.DeadlineExceeded` from a blocking dial would be affected. The longer-term plan is to deprecate the blocking-dial behavior.
The difference between fail-fast (the default) and wait-for-ready RPCs is that WFR RPCs remain pending through connection-establishment errors. Once assigned to a connection, RPCs that encounter transient network errors will still fail. Auth failures are considered connection errors, so when this happens and you can't connect to any backend, all your WFR RPCs will exceed their deadlines. Note that our implementation of fail-fast used to be wrong ~5 months ago: it would not block while the client was in a "connecting" state, it would just fail. Now that that is resolved, it may be suitable for your use case.
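For illustration, a wait-for-ready RPC looks roughly like this (a self-contained sketch: the target is a placeholder, the standard health-checking service stands in for any generated stub, and `grpc.WaitForReady(true)` is the current spelling of `grpc.FailFast(false)`):

```go
package main

import (
	"context"
	"crypto/tls"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
	// Non-blocking dial: connection establishment happens in the background.
	conn, err := grpc.Dial("badname.example.com:443",
		grpc.WithTransportCredentials(credentials.NewTLS(&tls.Config{})))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// grpc.WaitForReady(true) keeps the RPC pending through
	// connection-establishment errors until the deadline expires.
	_, err = healthpb.NewHealthClient(conn).Check(ctx, &healthpb.HealthCheckRequest{},
		grpc.WaitForReady(true))
	if err != nil {
		// If the handshake keeps failing, this is where the deadline-exceeded
		// error surfaces.
		log.Printf("RPC failed: %v", err)
	}
}
```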
Update: I don't think we should change the behavior of Dial to return anything besides `context.DeadlineExceeded`. However, I do think we can and should modify the text of the error that comes from a wait-for-ready RPC when it exceeds its deadline while waiting for a subchannel, so that it includes the details from the last connection error, like we do with non-WFR RPCs when we encounter transient failure. How does that sound, @jmillikin-stripe?
With that solution, would it be possible to obtain the underlying error from a failed blocking dial? As I said in the first post, we'd like error accounting such that a failure to connect to the remote is reported before we attempt to run an RPC.
No. In that case, you'd want to do a non-blocking dial, and then poll connectivity state* until you get either READY or TRANSIENT_FAILURE.

*: https://godoc.org/google.golang.org/grpc#ClientConn.GetState and https://godoc.org/google.golang.org/grpc#ClientConn.WaitForStateChange
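A rough sketch of that polling (the helper name and error messages are illustrative, not a gRPC API; note it still only reports the channel state, not the underlying connection error):

```go
import (
	"context"
	"fmt"

	"google.golang.org/grpc"
	"google.golang.org/grpc/connectivity"
)

// waitForReady polls the channel state after a non-blocking dial. It returns
// nil once the channel is READY, or an error once it hits TRANSIENT_FAILURE,
// SHUTDOWN, or the context deadline.
func waitForReady(ctx context.Context, conn *grpc.ClientConn) error {
	for {
		switch s := conn.GetState(); s {
		case connectivity.Ready:
			return nil
		case connectivity.TransientFailure, connectivity.Shutdown:
			return fmt.Errorf("connection failed, channel state is %v", s)
		default: // Idle or Connecting: wait for the next state change.
			if !conn.WaitForStateChange(ctx, s) {
				return fmt.Errorf("%v while channel state is %v", ctx.Err(), s)
			}
		}
	}
}
```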
I created a PR (#2055) as a proof of concept for this. In light of some offline discussions, I still plan on reworking it a bit.
In light of #2266, if the TLS error is non-temporary, then using `grpc.FailOnNonTempDialError(true)` together with `grpc.WithBlock()` should make the blocking dial return that error instead of timing out.
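If that pans out, opting in would presumably look like the following (imports omitted; the target and TLS config are placeholders; `FailOnNonTempDialError` only takes effect together with `WithBlock`):

```go
ctx, cancel := context.WithTimeout(context.Background(), 20*time.Second)
defer cancel()

// If the underlying dial/handshake error is non-temporary, return it from the
// blocking dial immediately instead of retrying until the context deadline.
conn, err := grpc.DialContext(ctx, "badname.example.com:443",
	grpc.WithTransportCredentials(credentials.NewTLS(&tls.Config{})),
	grpc.WithBlock(),
	grpc.FailOnNonTempDialError(true),
)
```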
Is there any progress? This problem greatly complicates diagnosing connection problems.
Unfortunately, the TLS library doesn't make it readily apparent when a definite TLS error occurs vs. a transient connection error. One idea: we could attempt to detect temporary network errors and treat any other errors as permanent. (This would be done by the TLS wrapper in the `credentials` package.)
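Very roughly, the shape of that classification (illustrative only, not how the credentials package is actually structured):

```go
import (
	"errors"
	"net"
)

// isTemporaryNetworkError treats timeouts and errors that declare themselves
// temporary as transient; everything else (e.g. x509 validation failures)
// would be called permanent under this idea.
func isTemporaryNetworkError(err error) bool {
	var nerr net.Error
	if errors.As(err, &nerr) {
		return nerr.Timeout() || nerr.Temporary()
	}
	return false
}
```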
I'm not sure the re:
I'm not sure I understand how what you're describing is different from that. I understand that #3412 is an interface change. However, given the number of issues linked to this one from other Go projects, it sounds like an interface that's desirable. Would you consider putting something like that functionality behind a dial option?
Also, per #3406 (comment) I believe this applies to more than just TLS misconfigurations – in Cluster API's case, we're providing a custom dialer that might exhibit a number of different types of non-TLS-related failures, and we'd want to surface those to the caller (ourselves).
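For context, roughly the shape of a custom dialer here (assumes a `ctx` with a deadline and omits imports; the dialer body is just a stand-in for what Cluster API actually does):

```go
// Failures from this dialer are equally invisible behind a blocking dial's
// "context deadline exceeded" unless they are surfaced some other way.
dialer := func(ctx context.Context, addr string) (net.Conn, error) {
	c, err := (&net.Dialer{}).DialContext(ctx, "tcp", addr)
	if err != nil {
		// Without extra plumbing, this error only shows up in gRPC's logs.
		return nil, fmt.Errorf("custom dialer: %v", err)
	}
	return c, nil
}

conn, err := grpc.DialContext(ctx, "placeholder.example.com:443",
	grpc.WithContextDialer(dialer),
	grpc.WithTransportCredentials(credentials.NewTLS(&tls.Config{})),
	grpc.WithBlock(),
)
```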
Each user that wants to block until a connection is ready after client creation would need to do this, yes. (We could make a utility function that does the polling, if that helps, but it should be pretty simple.) This is not necessary, however. Many applications will do a non-blocking `Dial` and simply let the first RPCs drive connection establishment.
Potentially. Longer-term I would like to get rid of the blocking-dial behavior (`WithBlock`) entirely.
I think our guidance here would be the same. If you consider the error a permanent one, then return it in a non-temporary form (i.e. don't implement `Temporary() bool` on it, or have it return false) so the blocking dial can fail immediately.
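If I'm reading that right, a sketch of what that could look like for a custom dialer (the names are mine; the key point is that the error does not implement `Temporary() bool`, so `FailOnNonTempDialError` treats it as permanent):

```go
import (
	"context"
	"net"

	"google.golang.org/grpc"
)

// permanentDialError intentionally does not implement Temporary() bool, so
// gRPC classifies it as a non-temporary error.
type permanentDialError struct{ msg string }

func (e *permanentDialError) Error() string { return e.msg }

func dialOrFailFast(ctx context.Context, target string) (*grpc.ClientConn, error) {
	// Placeholder dialer: a failure we consider permanent (e.g. the tunnel is gone).
	dialer := func(ctx context.Context, addr string) (net.Conn, error) {
		return nil, &permanentDialError{msg: "backend is permanently unreachable"}
	}
	// With WithBlock + FailOnNonTempDialError, the blocking dial should return
	// the (wrapped) permanent error instead of a bare context deadline exceeded.
	return grpc.DialContext(ctx, target,
		grpc.WithContextDialer(dialer),
		grpc.WithInsecure(),
		grpc.WithBlock(),
		grpc.FailOnNonTempDialError(true),
	)
}
```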
At least in my testing, I could not get an RPC error to behave differently than the blocking dial – if I included a deadline on the RPC, it failed the same way.
Agreed. However, it is a strong indication that my connection configuration embeds the potential for successful communication: in the cases where that's not true, what I desire from the system is help understanding what action I need to take, and a bare deadline-exceeded error doesn't give me that. That's why I'm not sure I grok the current guidance. What I (and, it seems, others) think I want out of a networking API is exactly something like a dial that surfaces the underlying connection error.
I should add that I'm coming from a very short-lived context, in the sense that it's not a client with an indefinite lifespan like a long-running application. It's more like a CLI tool in that it's going to attempt to do one or two things with the RPC client before shutting down – in the long-lived application case, I would strongly agree with that guidance. Tools like ours, though, want connection failures surfaced up front.
While we discuss the longer-term piece, I did put together a draft of what a dial option for this could look like: #3430
What version of gRPC are you using?
v1.9.2, but I've verified the same issue exists at the latest release (v1.11.3).
What version of Go are you using (`go version`)?
1.10
What operating system (Linux, Windows, …) and version?
Linux
What did you do?
I'm using a blocking Dial (`grpc.DialContext(..., grpc.WithBlock())`) to better account errors to the dial phase vs. RPC calls. One of the backends had an unexpected CN in its TLS certificate, so the connection was failing. The error message returned from `DialContext` was only `context deadline exceeded`, and the useful errors were printed out into the info log.

I expected a blocking dial to return some indication of why it had failed, instead of only timing out. In particular, including the most recent transport error in the returned error value would be very useful for reporting connection problems.
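For reference, a minimal version of the pattern described above (imports omitted; the TLS config is a placeholder and the hostname matches the logs below):

```go
ctx, cancel := context.WithTimeout(context.Background(), 20*time.Second)
defer cancel()

conn, err := grpc.DialContext(ctx, "badname.example.com:443",
	grpc.WithTransportCredentials(credentials.NewTLS(&tls.Config{})),
	grpc.WithBlock(), // block until connected, or until ctx expires
)
if err != nil {
	// With a certificate CN mismatch this is just "context deadline exceeded";
	// the handshake failure only appears in gRPC's info/warning logs.
	log.Printf("Error dialing backend: %v", err)
}
_ = conn
```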
Based on the conversation in #1855 it sounds like this behavior is intentional (or, at least, expected). Is there any way to either get access to the transport-layer errors, or somehow propagate them up into the returned error value?
The very useful error printed to the logs:
W0427 21:43:33.738993 35711 clientconn.go:1167] grpc: addrConn.createTransport failed to connect to {badname.example.com 0 <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for goodname.example.com, not badname.example.com". Reconnecting...
The not very useful error returned from `DialContext`:
W0427 21:43:33.773061 35711 handlers.go:180] Error dialing backend "badname.example.com": context deadline exceeded