Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix "deadline exceeded" issues discovered in conformance tests #643

Merged
merged 12 commits into from
Dec 7, 2023
10 changes: 9 additions & 1 deletion compression.go
Original file line number Diff line number Diff line change
Expand Up @@ -90,6 +90,10 @@ func (c *compressionPool) Decompress(dst *bytes.Buffer, src *bytes.Buffer, readM
bytesRead, err := dst.ReadFrom(reader)
if err != nil {
_ = c.putDecompressor(decompressor)
err = wrapIfContextError(err)
if connectErr, ok := asError(err); ok {
return connectErr
}
return errorf(CodeInvalidArgument, "decompress: %w", err)
}
if readMaxBytes > 0 && bytesRead > readMaxBytes {
Expand All @@ -111,8 +115,12 @@ func (c *compressionPool) Compress(dst *bytes.Buffer, src *bytes.Buffer) *Error
if err != nil {
return errorf(CodeUnknown, "get compressor: %w", err)
}
if _, err := io.Copy(compressor, src); err != nil {
if _, err := src.WriteTo(compressor); err != nil {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was just for consistency/symmetry with the decompress code above, which directly calls dst.ReadFrom instead of using io.Copy.

_ = c.putCompressor(compressor)
err = wrapIfContextError(err)
if connectErr, ok := asError(err); ok {
return connectErr
}
return errorf(CodeInternal, "compress: %w", err)
}
if err := c.putCompressor(compressor); err != nil {
Expand Down
15 changes: 14 additions & 1 deletion envelope.go
Original file line number Diff line number Diff line change
Expand Up @@ -141,12 +141,17 @@ func (w *envelopeWriter) write(env *envelope) *Error {
prefix[0] = env.Flags
binary.BigEndian.PutUint32(prefix[1:5], uint32(env.Data.Len()))
if _, err := w.writer.Write(prefix[:]); err != nil {
err = wrapIfContextError(err)
if connectErr, ok := asError(err); ok {
return connectErr
}
return errorf(CodeUnknown, "write envelope: %w", err)
}
if _, err := io.Copy(w.writer, env.Data); err != nil {
err = wrapIfContextError(err)
if connectErr, ok := asError(err); ok {
return connectErr
}
return errorf(CodeUnknown, "write message: %w", err)
}
return nil
Expand Down Expand Up @@ -235,6 +240,10 @@ func (r *envelopeReader) Read(env *envelope) *Error {
// add any alarming text about protocol errors, though.
return NewError(CodeUnknown, err)
}
err = wrapIfContextError(err)
if connectErr, ok := asError(err); ok {
return connectErr
}
// Something else has gone wrong - the stream didn't end cleanly.
if connectErr, ok := asError(err); ok {
return connectErr
Expand All @@ -252,7 +261,7 @@ func (r *envelopeReader) Read(env *envelope) *Error {
if r.readMaxBytes > 0 && size > int64(r.readMaxBytes) {
_, err := io.CopyN(io.Discard, r.reader, size)
if err != nil && !errors.Is(err, io.EOF) {
return errorf(CodeUnknown, "read enveloped message: %w", err)
return errorf(CodeResourceExhausted, "message is larger than configured max %d - unable to determine message size: %w", r.readMaxBytes, err)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was made consistent with the exact same logic that lives in the decompression code. Its handling (always returning "resource exhausted" since we know the message is too big) seemed better, and its error message more helpful, so I copy+pasted it here.

}
return errorf(CodeResourceExhausted, "message size %d is larger than configured max %d", size, r.readMaxBytes)
}
Expand All @@ -274,6 +283,10 @@ func (r *envelopeReader) Read(env *envelope) *Error {
readN,
)
}
err = wrapIfContextError(err)
if connectErr, ok := asError(err); ok {
return connectErr
}
return errorf(CodeUnknown, "read enveloped message: %w", err)
}
env.Flags = prefixes[0]
Expand Down
7 changes: 7 additions & 0 deletions error.go
Original file line number Diff line number Diff line change
Expand Up @@ -20,6 +20,7 @@ import (
"fmt"
"net/http"
"net/url"
"os"
"strings"

"google.golang.org/protobuf/proto"
Expand Down Expand Up @@ -292,6 +293,12 @@ func wrapIfContextError(err error) error {
if errors.Is(err, context.DeadlineExceeded) {
return NewError(CodeDeadlineExceeded, err)
}
// Ick, some dial errors can be returned as os.ErrDeadlineExceeded
// instead of context.DeadlineExceeded :(
// https://github.com/golang/go/issues/64449
if errors.Is(err, os.ErrDeadlineExceeded) {
return NewError(CodeDeadlineExceeded, err)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confirming: whenever we hit this case, the actual context has been cancelled - correct?

Copy link
Member Author

@jhump jhump Dec 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It cannot be guaranteed without also checking the context. However, the same is true of the check above: errors.Is(err, context.DeadlineExceeded) could also return true even if the context is not cancelled:

  1. A network timeout that is not caused by a context cancellation will also return true with errors.Is(err. context.DeadlineExceeded): https://github.com/golang/go/blob/master/src/net/net.go#L600-L602
  2. Handler code could create a child context with some other, arbitrary deadline, and that result in a context.DeadlineExceeded that does not correlate with the lifetime of the parent RPC context.

So this check does not make the situation any worse IMO. And, if the referenced Go bug is fixed, it would behave exactly this way, even with only the errors.Is(err, context.DeadlineExceeded) check above.

The only way to guarantee that the context has been cancelled is to also pass in the context and check ctx.Err(), too.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, if the best way to flush this error out is to run the conformance tests repeatedly...should we do that on one platform in CI?

While trying to create a repro test for the Go bug, I think I may have found a test formulation that can reveal this issue more consistently. It isn't pretty, since it's timing related, and different execution environments will have different timing that is likely to tickle it (10-core MBP vs. VM/container in CI infra). Let me put that together and verify it can repro this issue on main and then see what you think of it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just pushed a commit with a repro test. And... sadly... it fails on this branch. So there is some other place a deadline exceeded error is miscategorized as "unknown" 😢

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I think I've now squashed all of the bugs.

The upside is that the test seemed to fail pretty consistently without all of these fixes, so I think it will be a reasonably good way to identify issues in CI. But there are a couple of downsides:

  1. It never repro'ed the issues with the race detector enabled. So I have the test currently running only when the race detector is disabled. And so there is now an extra command run for make test that just runs this one test, without -race.
  2. When there are no bugs, it slows down CI by a full 20 seconds, hoping to find a non-existent bug.

See the last three commits for details:

  1. d5965e8 adds the test
  2. 39a162c updates the test so the output is more useful in tracking down the source of the error; this also makes sure to use the gRPC protocol, too, in case there protocol-specific places that fail to classify the error correctly. (It just uses gRPC-Web right now, since that is the same protocol handler implementation and basically a superset of activity, since it includes add'l body writes for the trailers.)
  3. d42edba fixes the three add'l places found by the test that were incorrectly classifying errors. One of them was wrapping a *connect.Error with another *connect.Error and changing the code to "unknown". Oops!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of them was wrapping a *connect.Error with another *connect.Error and changing the code to "unknown". Oops!

😬 Yikes!

return err
}

Expand Down
2 changes: 2 additions & 0 deletions protocol_connect.go
Original file line number Diff line number Diff line change
Expand Up @@ -939,6 +939,7 @@ func (m *connectUnaryMarshaler) Marshal(message any) *Error {

func (m *connectUnaryMarshaler) write(data []byte) *Error {
if _, err := m.writer.Write(data); err != nil {
err = wrapIfContextError(err)
if connectErr, ok := asError(err); ok {
return connectErr
}
Expand Down Expand Up @@ -1081,6 +1082,7 @@ func (u *connectUnaryUnmarshaler) UnmarshalFunc(message any, unmarshal func([]by
// ReadFrom ignores io.EOF, so any error here is real.
bytesRead, err := data.ReadFrom(reader)
if err != nil {
err = wrapIfContextError(err)
if connectErr, ok := asError(err); ok {
return connectErr
}
Expand Down