
Fix "deadline exceeded" issues discovered in conformance tests #643

Merged
12 commits merged from jh/deadline-exceeded-issues into main on Dec 7, 2023

Conversation

jhump (Member) commented Nov 29, 2023:

When adding timeout tests to the conformance suite in connectrpc/conformance#715 and connectrpc/conformance#716, @smaye81 encountered several issues where the responses from the connect-go client would sometimes have a code of unavailable or unknown.

The unavailable code was happening here. The wrapIfContextError above it was not successful because, sadly, a timeout error during dialing can get returned as os.ErrDeadlineExceeded instead of context.DeadlineExceeded 😭. (I filed a Go bug for this and added a work-around to wrapIfContextError.)

The unknown codes were when a deadline exceeded error was returned from a call to write a message. The code was extremely inconsistent about handling of context errors and connect errors that could be returned from a call to Write. I've updated it so that all code that writes or reads now checks for context and connect errors, though many places (which read and write from a buffer or a pipe) can't actually trigger it. I figured it was safer (and more consistent) to have the checks everywhere instead of trying to be selective where it could or couldn't happen (which requires non-trivial analysis and could also be brittle if this code is changed in the future).

I tried creating a repro test case that reconstructed what was happening in the failing conformance test case. But I wasn't able to find a formulation that could actually do so consistently. So what I did to repro/test this was put the conformance tests in a loop, running them until they failed. Before these fixes, they semi-regularly failed (at least once every five runs). With this fix, I let it run over 500 times and still zero failures.

@@ -111,8 +115,12 @@ func (c *compressionPool) Compress(dst *bytes.Buffer, src *bytes.Buffer) *Error
if err != nil {
return errorf(CodeUnknown, "get compressor: %w", err)
}
-if _, err := io.Copy(compressor, src); err != nil {
+if _, err := src.WriteTo(compressor); err != nil {
jhump (Member Author):

This was just for consistency/symmetry with the decompress code above, which directly calls dst.ReadFrom instead of using io.Copy.

@@ -252,7 +261,7 @@ func (r *envelopeReader) Read(env *envelope) *Error {
if r.readMaxBytes > 0 && size > int64(r.readMaxBytes) {
_, err := io.CopyN(io.Discard, r.reader, size)
if err != nil && !errors.Is(err, io.EOF) {
-return errorf(CodeUnknown, "read enveloped message: %w", err)
+return errorf(CodeResourceExhausted, "message is larger than configured max %d - unable to determine message size: %w", r.readMaxBytes, err)
jhump (Member Author):

This was made consistent with the exact same logic that lives in the decompression code. Its handling (always returning "resource exhausted" since we know the message is too big) seemed better, and its error message more helpful, so I copy+pasted it here.

jhump requested a review from akshayjshah on November 29, 2023 at 21:29
akshayjshah (Member) left a comment:

LGTM. Before merging, please do respond to my question about os.ErrDeadlineExceeded.

Also, if the best way to flush this error out is to run the conformance tests repeatedly...should we do that on one platform in CI? If it's too painful to do on every PR, we could at least do a cron job and run it nightly.

// https://github.com/golang/go/issues/64449
if errors.Is(err, os.ErrDeadlineExceeded) {
return NewError(CodeDeadlineExceeded, err)
}
akshayjshah (Member):

Confirming: whenever we hit this case, the actual context has been cancelled - correct?

jhump (Member Author) replied Dec 4, 2023:

It cannot be guaranteed without also checking the context. However, the same is true of the check above: errors.Is(err, context.DeadlineExceeded) could also return true even if the context is not cancelled:

  1. A network timeout that is not caused by a context cancellation will also return true from errors.Is(err, context.DeadlineExceeded): https://github.com/golang/go/blob/master/src/net/net.go#L600-L602
  2. Handler code could create a child context with some other, arbitrary deadline, and that can result in a context.DeadlineExceeded that does not correlate with the lifetime of the parent RPC context.

So this check does not make the situation any worse IMO. And, if the referenced Go bug is fixed, it would behave exactly this way, even with only the errors.Is(err, context.DeadlineExceeded) check above.

The only way to guarantee that the context has been cancelled is to also pass in the context and check ctx.Err(), too.

jhump (Member Author):

Also, if the best way to flush this error out is to run the conformance tests repeatedly...should we do that on one platform in CI?

While trying to create a repro test for the Go bug, I think I may have found a test formulation that can reveal this issue more consistently. It isn't pretty, since it's timing related, and different execution environments will have different timing that is likely to tickle it (10-core MBP vs. VM/container in CI infra). Let me put that together and verify it can repro this issue on main and then see what you think of it.

jhump (Member Author):

I just pushed a commit with a repro test. And... sadly... it fails on this branch. So there is some other place a deadline exceeded error is miscategorized as "unknown" 😢

jhump (Member Author):

Okay, I think I've now squashed all of the bugs.

The upside is that the test seemed to fail pretty consistently without all of these fixes, so I think it will be a reasonably good way to identify issues in CI. But there are a couple of downsides:

  1. It never repro'ed the issues with the race detector enabled. So I have the test currently running only when the race detector is disabled. And so there is now an extra command run for make test that just runs this one test, without -race.
  2. When there are no bugs, it slows down CI by a full 20 seconds, hoping to find a non-existent bug.

See the last three commits for details:

  1. d5965e8 adds the test
  2. 39a162c updates the test so the output is more useful in tracking down the source of the error; it also makes sure to exercise the gRPC protocol, in case there are protocol-specific places that fail to classify the error correctly. (It just uses gRPC-Web right now, since that is the same protocol handler implementation and covers basically a superset of activity, including additional body writes for the trailers.)
  3. d42edba fixes the three add'l places found by the test that were incorrectly classifying errors. One of them was wrapping a *connect.Error with another *connect.Error and changing the code to "unknown". Oops!

akshayjshah (Member):

One of them was wrapping a *connect.Error with another *connect.Error and changing the code to "unknown". Oops!

😬 Yikes!

emcfarlane (Contributor) left a comment:

Could we potentially move these new test cases, which rely on thrashing, to tests on the writers and readers by controlling when errors are injected? This wouldn't help in uncovering new issues, but we could build a suite of known error cases and protect against regressions. It would also ensure the test cases can reproduce the issues under -race.

jhump (Member Author) commented Dec 5, 2023:

This wouldn't help in uncovering new issues but we could build a suite of known error cases and protect against them.

🤷. While the test would certainly be faster, it wouldn't be nearly as robust or confidence-inspiring. Also, I worry that adding the ability to inject arbitrary errors into all of the different places could make the non-test code much more complicated.

jhump (Member Author) commented Dec 6, 2023:

@emcfarlane, @akshayjshah, since the test that reproduces the issue is slow, I've disabled it for now: technically it's still enabled, but only when the race detector is off. And since the Makefile and CI always run the tests with -race, it's effectively a manual-only test.

Seem good enough to merge as is?

jhump (Member Author) commented Dec 7, 2023:

After a chat with Akshay, I added it as a separate CI job, so it runs in parallel with the other tests, and only for the latest Go on Linux. Looking at the timing above, it is faster than the main CI job, so it isn't making CI take any longer overall (time to merge is unaffected); it just uses a little more resources.

# only the latest
go-version: 1.21.x
- name: Run Slow Tests
run: make slowtest
akshayjshah (Member):

Talked to Josh live about this, but I'd love for this to be carved up a little differently. Locally, I'd really like make and make test to run these slower smoke tests. In CI, the highly-matrixed tests should run all the tests except the smoke tests. The new job (slowtest) should run the slow tests, but it would be okay with me if it also ran the faster tests.

I don't care that much whether we carve up the tests with an environment variable, testing.Short(), or with build tags.

jhump (Member Author) replied Dec 7, 2023:

make test now runs both short and slow tests. CI "Test" job only runs short tests, and there's a separate "Slow Tests" job for the others. Did this with go test -short and a check in the test (if testing.Short() { t.Skip(...) }).

Also, while factoring out slow tests to a parallel job, I did the same for conformance tests, to shorten the critical path through the CI workflow.

jhump (Member Author) replied Dec 7, 2023:

Because the main CI job must re-build the code (to instrument with race detector) and also does linting, the new slowtest job is actually faster than that other job (for now at least). So this new job does not increase the total duration. The longest job is still Windows, by a good margin.

jhump merged commit c9408f4 into main on Dec 7, 2023
11 checks passed
jhump deleted the jh/deadline-exceeded-issues branch on December 7, 2023
jhump added the "bug" label on Dec 8, 2023