TLS failures in blocking Dial calls don't provide useful error messages #2031

jmillikin-stripe · 2018-04-27T22:07:08Z

What version of gRPC are you using?

v1.9.2, but I've verified the same issue exists at the latest release (v1.11.3).

What version of Go are you using (`go version`)?

1.10

What operating system (Linux, Windows, …) and version?

Linux

What did you do?

I'm using a blocking Dial (grpc.DialContext(..., grpc.WithBlock())) to better account errors to the dial phase vs RPC calls. One of the backends had an unexpected CN in its TLS certificate, so the connection was failing. The error message returned from DialContext was only context deadline exceeded, and the useful errors were printed out into the info log.

I expected a blocking dial to return some indication of why it had failed, instead of only timing out. In particular, including the most recent transport error in the returned error value would be very useful for reporting connection problems.

Based on the conversation in #1855 it sounds like this behavior is intentional (or, at least, expected). Is there any way to either get access to the transport-layer errors, or somehow propagate them up into the returned error value?

The very useful error printed to the logs:
W0427 21:43:33.738993 35711 clientconn.go:1167] grpc: addrConn.createTransport failed to connect to {badname.example.com 0 <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for goodname.example.com, not badname.example.com". Reconnecting...

The not very useful error returned from DialContext:
W0427 21:43:33.773061 35711 handlers.go:180] Error dialing backend "badname.example.com": context deadline exceeded

The text was updated successfully, but these errors were encountered:

jmillikin-stripe · 2018-04-27T22:11:36Z

Also, #1917 documents that RPCs with fail-fast behavior will somehow get access to the underlying connection error. My system isn't using fail-fast because we want the gRPC library to retry past transient network issues.

dfawley · 2018-04-30T17:27:43Z

I expected a blocking dial to return some indication of why it had failed, instead of only timing out. In particular, including the most recent transport error in the returned error value would be very useful for reporting connection problems.

The trouble is that this isn't really a "Dial" function, per se, despite the name. grpc.Dial creates a ClientConn containing a managed pool of connections. The blocking behavior of Dial is essentially to start up this pool and wait until a connection is established to one, or the deadline is met. Authentication problems could be a misconfiguration of a server or a client, and they could resolve themselves transparently to the client (e.g. if the server restarts and receives new certs). For this reason, we do not give up when authentication problems are encountered; we keep trying until the deadline is reached.

Changing the returned error may be possible, but it would be a behavior change. Anyone checking for Dial returning context.DeadlineExceeded could be broken by that change, because context.DeadlineExceeded is a value and not an interface, so it cannot be extended.

The longer-term plan is to deprecate Dial and replace it with a NewClient that is both clearer in its behavior, and can have the semantics we prefer.

My system isn't using fail-fast because we want the gRPC library to retry past transient network issues.

The difference between fail-fast (the default) and wait-for-ready RPCs is that WFR will remain pending through connection establishment errors. Once they are assigned to a connection, RPCs that encounter transient network errors will still result in failures. Auth failures are considered connection errors, so all your WFR RPCs will deadline exceed when this happens and you can't connect to any backend.

Note that our implementation of fail-fast used to be wrong ~5 months ago: it would not block when the client was in a "connecting" state, it would just fail. Now that that is resolved, it may be suitable for your use case.

dfawley · 2018-05-08T21:19:44Z

Update: I don't think we should change the behavior of Dial to return anything besides context.DeadlineExceeded in this case. That would be a behavior change and not a true bug fix, and it could break some users.

However, I do think we can and should modify the text of the error that comes from a wait-for-ready RPC when it exceeds its deadline while waiting for a subchannel to include the details from the last connection error, like we do with non-WFR RPCs when we encounter transient failure.

How does that sound, @jmillikin-stripe?

jmillikin-stripe · 2018-05-08T21:22:56Z

With that solution, would it be possible to obtain the underlying error from a failed blocking dial? As I said in the first post, we'd like to be able to have error accounting such that failure to connect to the remote will be reported before attempting to run an RPC.

dfawley · 2018-05-08T21:44:05Z

No. In that case, you'd want to do a non-blocking dial, and then poll connectivity state* until you get either TransientFailure or Connected (or some deadline). On TransientFailure, we could then add a method to our connectivity state API similar to what Java has** to provide the relevant connection error.

*: https://godoc.org/google.golang.org/grpc#ClientConn.GetState and https://godoc.org/google.golang.org/grpc#ClientConn.WaitForStateChange
**: https://grpc.io/grpc-java/javadoc/io/grpc/ConnectivityStateInfo.html#getStatus--

dfawley · 2018-06-14T19:57:26Z

I created a PR (#2055) as a proof of concept for this. In light of some offline discussions, I still plan on reworking it a bit.

dfawley · 2018-08-29T20:50:33Z

In light of #2266, if the TLS error is non-temporary, then using WithBlock in combination with FailOnNonTempDialError should get you the behavior you want.

skipor · 2019-07-04T12:33:33Z

Is there any progress? The problem greatly complicates the analysis of connection problems.

dfawley · 2020-03-06T16:38:05Z

Unfortunately, the TLS library doesn't make it readily apparent when a definite TLS error occurs vs. a transient connection error. One idea: we could attempt to detect temporary network errors and call any other errors permanent. (This would be done by the TLS wrapper in the credentials package.)

sethp-nr · 2020-03-06T18:44:01Z

I'm not sure the FailOnNonTempDialError interface is going to help us too much here. Without modifying the defer it's racy (the context deadline error will always overwrite any returned error, if it exists). It also presumes that the definition of Temporary is universal – and that any errors that are Temporary are therefore uninteresting to callers.

re:

In that case, you'd want to do a non-blocking dial, and then poll connectivity state* until you get either TransientFailure or Connected (or some deadline). On TransientFailure, we could then add a method to our connectivity state API similar to what Java has** to provide the relevant connection error.

I'm not sure I understand how what you're describing is different from grpc.Block(). What is the benefit to asking each user of the library to independently re-implement the same polling loop?

I understand that #3412 is an interface change. However, given the number of issues linked to this one from other go projects, it sounds like an interface that's desirable – would you consider something like that functionality behind a dial option?

sethp-nr · 2020-03-06T19:37:15Z

Also, per #3406 (comment) I believe this applies to more than just TLS misconfigurations – in Cluster API's case, we're providing a custom dialer that might exhibit a number of different types of non-TLS-related failures, and we'd want to surface those to the caller (ourselves).

dfawley · 2020-03-06T20:59:12Z

I'm not sure I understand how what you're describing is different from grpc.Block(). What is the benefit to asking each user of the library to independently re-implement the same polling loop?

Each user that wants to block until a connection is ready after client creation would need to do this, yes. (We could make a utility function that does the polling, if that helps, but it should be pretty simple.) This is not necessary, however. Many applications will do a non-blocking Dial and just start attempting RPCs, seeing the connection errors at that time. Note that being connected, while an important step in startup, is not necessarily an indication that the backend you've connected to is healthy and able to serve RPCs for the client. And the connection may be lost in the future, so it's not really necessary to do a blocking dial in gRPC in general. Clients will always need to be tolerant of RPC errors which may be connection errors.

would you consider something like that functionality behind a dial option?

Potentially. Longer-term I would like to get rid of Dial entirely (#1786), and not provide a blocking option at all for the replacement (NewClient).

I believe this applies to more than just TLS misconfigurations – in Cluster API's case, we're providing a custom dialer that might exhibit a number of different types of non-TLS-related failures, and we'd want to surface those to the caller (ourselves).

I think our guidance here would be the same. If you consider the error a permanent one, then implement Temporary() bool { return false } and a blocking Dial will stop and return the error to the caller immediately. Or don't do a blocking dial and inspect the errors returned when making the RPCs.

sethp-nr · 2020-03-06T22:22:50Z

At least in my testing, I could not get an RPC error to behave differently than the blocking dial – if I included a context.WithTimeout then all I'd get back from various RPCs is a context.DeadlineExceeded error. A complicating factor is that I don't actually hold a reference to the client, the etcd client interface I'm using does: I have no easy way to ask it about the state of the connection without a PR to their client library to expose that. As it stands, the shortest path I have to information about the health of the connection is the err value from either the RPC or Dial calls, which is always context deadline exceeded.

Note that being connected, while an important step in startup, is not necessarily an indication that the backend you've connected to is healthy and able to serve RPCs for the client.

Agreed. However, it is a strong indication that my connection configuration embeds the potential for successful communication: in the cases where that's not true, what I desire from the system is to help me understand what action I need to take, and context deadline exceeded is fairly sparse in that sense.

That's why I'm not sure I grok the Temporary axis – almost all connection failures are temporary, in my experience, but some of them require me to do something before my next connection attempt succeeds. Connection refused sometimes implies I should go start that service back up. Read timeout sometimes suggests I misconfigured part of the network between my client and my server. DNS resolution failures are frequently because the name of the server I'm connecting to has a typo in it.

What I (&, it seems, others) think I want out of a networking API is exactly something like grpc.Block() – I want to make a call that, upon success, tells me that there is at least the possibility that my future RPCs will succeed & gives me good context on failure as to why. Returning a non-Temporary error immediately is a nice optimization, but even the Temporary ones are usually informative.

sethp-nr · 2020-03-06T22:47:06Z

I should add that I'm coming from a very short-lived context, in the sense that it's not a client with an indefinite lifespan like a long-running application. It's more like a CLI tool in that it's going to attempt to do one or two things with the RPC client before shutting down – in the long-lived application case, I would strongly agree that grpc.Block() is not what I want.

Tools like etcdctl & crictl don't really have a way to respond to failure, temporary or not, than "well, here's what happened, goodbye."

sethp-nr · 2020-03-06T23:11:29Z

would you consider something like that functionality behind a dial option?

Potentially. Longer-term I would like to get rid of Dial entirely (#1786), and not provide a blocking option at all for the replacement (NewClient).

While we discuss the longer-term piece, I did put together a draft of what a dial option for this could look like: #3430

dfawley · 2020-04-23T22:18:58Z

I believe this issue can be closed:

Blocking dial calls can now use WithReturnConnectionError to see what happened after #3430, and RPCs should return the last connection error with #2777.

Let us know if there's anything else not covered by these changes.

dfawley added Type: Feature New features or improvements in behavior P2 labels Apr 30, 2018

dfawley mentioned this issue May 8, 2018

ClientConn: Introduce GetStatus method to get the latest connection error #2055

Closed

cyli mentioned this issue May 29, 2018

Bumps GRPC to 1.12.0 moby/swarmkit#2649

Merged

dfawley self-assigned this Jun 14, 2018

dweomer mentioned this issue Aug 24, 2018

grpc.WithBlock() in containerd.New() swallows underlying issue when attempting to connect containerd/containerd#2576

Closed

lyuxuan mentioned this issue Jun 11, 2019

Is there anyway to prevent reconnection attempt behavior in Credentials? #2863

Closed

stale bot added the stale label Sep 6, 2019

dfawley removed the stale label Sep 6, 2019

grpc deleted a comment from stale bot Sep 12, 2019

dfawley mentioned this issue Mar 6, 2020

grpc provides uninformative error messages, even when set to "block" #3406

Closed

dfawley mentioned this issue Mar 6, 2020

client: surface connection errors to callers #3412

Closed

sethp-nr mentioned this issue Mar 6, 2020

🐛 etcd client terseness kubernetes-sigs/cluster-api#2486

Closed

palourde mentioned this issue Apr 14, 2020

The init command does not return SSL errors sensu/sensu-go#3663

Closed

dfawley removed their assignment Apr 23, 2020

dfawley closed this as completed Apr 23, 2020

sethp-nr mentioned this issue May 1, 2020

[kubeadm control plane]: etcd communication errors are being swallowed kubernetes-sigs/cluster-api#2454

Closed

andrejtokarcik mentioned this issue Oct 27, 2020

mTLS authentication w/ tests andrejtokarcik/jobworker#4

Merged

github-actions bot locked as resolved and limited conversation to collaborators May 25, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TLS failures in blocking Dial calls don't provide useful error messages #2031

TLS failures in blocking Dial calls don't provide useful error messages #2031

jmillikin-stripe commented Apr 27, 2018 •

edited

Loading

jmillikin-stripe commented Apr 27, 2018

dfawley commented Apr 30, 2018

dfawley commented May 8, 2018

jmillikin-stripe commented May 8, 2018

dfawley commented May 8, 2018

dfawley commented Jun 14, 2018

dfawley commented Aug 29, 2018

skipor commented Jul 4, 2019

dfawley commented Mar 6, 2020

sethp-nr commented Mar 6, 2020

sethp-nr commented Mar 6, 2020

dfawley commented Mar 6, 2020

sethp-nr commented Mar 6, 2020

sethp-nr commented Mar 6, 2020 •

edited

Loading

sethp-nr commented Mar 6, 2020

dfawley commented Apr 23, 2020

TLS failures in blocking Dial calls don't provide useful error messages #2031

TLS failures in blocking Dial calls don't provide useful error messages #2031

Comments

jmillikin-stripe commented Apr 27, 2018 • edited Loading

What version of gRPC are you using?

What version of Go are you using (go version)?

What operating system (Linux, Windows, …) and version?

What did you do?

jmillikin-stripe commented Apr 27, 2018

dfawley commented Apr 30, 2018

dfawley commented May 8, 2018

jmillikin-stripe commented May 8, 2018

dfawley commented May 8, 2018

dfawley commented Jun 14, 2018

dfawley commented Aug 29, 2018

skipor commented Jul 4, 2019

dfawley commented Mar 6, 2020

sethp-nr commented Mar 6, 2020

sethp-nr commented Mar 6, 2020

dfawley commented Mar 6, 2020

sethp-nr commented Mar 6, 2020

sethp-nr commented Mar 6, 2020 • edited Loading

sethp-nr commented Mar 6, 2020

dfawley commented Apr 23, 2020

jmillikin-stripe commented Apr 27, 2018 •

edited

Loading

What version of Go are you using (`go version`)?

sethp-nr commented Mar 6, 2020 •

edited

Loading