Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

grpc.FailOnNonTempDialError doesn't fail with wrong TLS certificate #6593

Closed
olegbespalov opened this issue Aug 29, 2023 · 11 comments
Closed

Comments

@olegbespalov
Copy link

olegbespalov commented Aug 29, 2023

What version of gRPC are you using?

google.golang.org/grpc v1.55.0

What version of Go are you using (go version)?

go version go1.21.0 linux/amd64

What operating system (Linux, Windows, …) and version?

x86_64 GNU/Linux

What did you do?

So we have a code dependent on the grpc library, and we were trying to establish a connection by using the:

conn, err := grpc.DialContext(ctx, addr, options...)
if err != nil {
	return nil, err
}

where options looks like:

tcred := credentials.NewTLS(tlsCfg)

options := []grpc.DialOption{
   grpc.WithBlock(),
   grpc.FailOnNonTempDialError(true),
   grpc.WithReturnConnectionError(),
   grpc.WithTransportCredentials(tcred)
}

Some of our tests intentionally used the wrong TLS certificate.

What did you expect to see?

So I expected to see the error about the bad certificate almost immediately.

What did you see instead?

But instead, I found that we were waiting till the context to reach the deadline, and only after that could I get the error.

I started to investigate, and it turned out that the error of a bad certificate is treated as a temporary error, which means that the transport will try to do some retries on that.

image

Looking into the code, I see that the place

return connectionErrorf(true, err, "error reading server preface: %v", err)

doesn't use the helper isTemporary, but instead passes the static true to the connectionError's constructor.

So I believe the fix is pretty straightforward like:

-return connectionErrorf(true, err, "error reading server preface: %v", err)
+return connectionErrorf(isTemporary(err), err, "error reading server preface: %v", err)
@olegbespalov
Copy link
Author

If needed, I could also try to prepare a code/repo that shows the issue, or I could also try to write a test, but I am not sure about the time frame of it since I am not familiar with the grpc-go code base.

@atollena
Copy link
Collaborator

My guess is that this is working as intended: DialContext with WithBlock only returns an error if context has expired AND there is still no usable server connection established. It can't return before context has expired.

@olegbespalov
Copy link
Author

olegbespalov commented Aug 30, 2023

I assume that it shouldn't wait for the context since we also use the grpc.FailOnNonTempDialError(true).

Actually, the code is trying to check and exit earlier:

grpc-go/clientconn.go

Lines 285 to 294 in aa6ce35

} else if cc.dopts.copts.FailOnNonTempDialError && s == connectivity.TransientFailure {
if err = cc.connectionError(); err != nil {
terr, ok := err.(interface {
Temporary() bool
})
if ok && !terr.Temporary() {
return nil, err
}
}
}

But the issue is that the error is wrapped without paying attention to the fact that it's a permanent one (obviously, we can't change the wrong certificate on the fly) and treated as temporary (see a screenshot of my debugger from above).

For reference, the tls.permanentError error implements the interface properly https://cs.opensource.google/go/go/+/master:src/crypto/tls/conn.go;l=192?q=permanentError&ss=go%2Fgo

@atollena
Copy link
Collaborator

Ha, I didn't realize you are also using FailOnNonTempDialError , I'm sorry. The flag itself doesn't make much sense to me, but your point about the flag respecting the underlying error definition does (Temporary() never had clear semantic always and was deprecated since).

@olegbespalov
Copy link
Author

Right, it could be a lousy naming of the issue. I'll try to adjust it.

Yeah, I see that the net.Error's Temporary is deprecated, but still, I see that there are many usages of just Temporary interface inside the grpc-go.

Also, the suggested fix works for me. At least I've done some testing after applying it to our codebase and successfully passed the grpc-go tests.

@olegbespalov olegbespalov changed the title Wrong TLS certificate doesn't cause the Dial's interruption grpc.FailOnNonTempDialError doesn't work with wrong TLS certificate Aug 30, 2023
@olegbespalov olegbespalov changed the title grpc.FailOnNonTempDialError doesn't work with wrong TLS certificate grpc.FailOnNonTempDialError doesn't fail with wrong TLS certificate Aug 30, 2023
@easwars
Copy link
Contributor

easwars commented Aug 31, 2023

With help from the community, we put together this doc about anti-patterns: https://github.com/grpc/grpc-go/blob/master/Documentation/anti-patterns.md.

Is not using the three dial options grpc.WithBlock(), grpc.FailOnNonTempDialError(true), and grpc.WithReturnConnectionError() an option for you?

And if you don't use them and instead rely on errors at RPC time, do you see the error that you expect?

@olegbespalov
Copy link
Author

olegbespalov commented Sep 1, 2023

Hi, @easwars!

Thank you for getting back to me. Yes, I saw the https://github.com/grpc/grpc-go/blob/master/Documentation/anti-patterns.md and found using these gRPC options is anti-pattern. I've started thinking about how we can re-design our API in the sense that it avoids using this anti-pattern, but unfortunately, the migration could take some time.

But I could see the case where FailOnNonTempDialError is still valuable, as I've described, when the user tries to connect with the wrong credentials, which is a non-temporary dial error. And for that particular one, it's good to tell the user earlier. Since it's experimental, I recognize the risks of using it, but we already use it and are faced with something that looks incorrect and could be fixed with one line, this approach seems preferable to re-designing the API.

As I commented, I'm happy to open PR with the fix, but I need more time to get familiar with the codebase to write the proper test (right now, all tests are passing with and without the fix). But on the other side, I don't know what's your vision around the FailOnNonTempDialError, like maybe you're going to deprecate it soon 🤷

@easwars
Copy link
Contributor

easwars commented Sep 5, 2023

@olegbespalov : We currently don't have any plans of marking FailOnNonTempDialError as deprecated.

If needed, I could also try to prepare a code/repo that shows the issue, or I could also try to write a test

Would it be possible for you to setup a repro or a test for this issue? That would be great.

I think I buy your argument that calls to connectionError should use the isTemporary() function to determine if it is a temporary error or not. Maybe we need to audit other callsites of connectionError as well.

If you have a PR with test(s), feel free to send that over as well. Thanks for reporting this issue.

@github-actions
Copy link

This issue is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed.

@olegbespalov
Copy link
Author

Hey, @easwars !

I've opened PR #6639, but actually, before signing CLA, I have to double-check with my employer, and that happen only next week. Hopefully having this comment is enough to remove the stale label and the issue won't be closed 😅

@github-actions github-actions bot removed the stale label Sep 15, 2023
@github-actions
Copy link

This issue is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants