client: fix potential panic during RPC retries #5323
Conversation
stream.go
Outdated
cs.finish(err)
return nil, err
}

op := func(a *csAttempt) error { return a.newStream() }
// Because this operation is always called either here (while creating the
Comment what this op does? That it creates a transport and a stream, and only replaces the attempt if both are successful.
Done
stream.go
Outdated
op := func(a *csAttempt) error { return a.newStream() }
// Because this operation is always called either here (while creating the
// clientStream) or by the retry code while locked when replaying the
// operation, it is safe to access cs.attempt directly.
Move this comment right above cs.attempt = a?
Done
return err
}
- if lastErr = cs.replayBufferLocked(); lastErr == nil {
+ if lastErr = cs.replayBufferLocked(attempt); lastErr == nil {
Comment that the first op will set the attempt if it gets a transport and a stream.
Done
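For readers skimming the threads above, here is a rough, hypothetical sketch of the shape being discussed: the first op creates the stream on a fresh attempt and only then is the attempt installed on the clientStream, and the buffered operations are replayed against an attempt that is passed in explicitly rather than read from cs.attempt. The type and field names below are simplified stand-ins, not the actual grpc-go implementation.

```go
// Hypothetical, much-simplified sketch of the pattern discussed above; the
// real grpc-go clientStream and csAttempt carry many more fields and locks.
package main

import "fmt"

type csAttempt struct {
	s string // stands in for the underlying transport stream
}

// newStream stands in for csAttempt.newStream: pick a transport, create a
// stream. If it fails, the attempt must not become cs.attempt.
func (a *csAttempt) newStream() error {
	a.s = "stream"
	return nil
}

type clientStream struct {
	attempt *csAttempt                 // only ever holds attempts with a usable stream
	buffer  []func(a *csAttempt) error // replayable operations (headers, sent messages, ...)
}

// replayBufferLocked replays the buffered operations on the attempt it is
// handed explicitly, instead of reading cs.attempt, which may not be set yet.
func (cs *clientStream) replayBufferLocked(attempt *csAttempt) error {
	for _, op := range cs.buffer {
		if err := op(attempt); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	cs := &clientStream{}
	// The first op creates the transport and stream; only if that succeeds is
	// the attempt installed, so cs.attempt's stream is never nil for users.
	op := func(a *csAttempt) error { return a.newStream() }
	a := &csAttempt{}
	if err := op(a); err != nil {
		fmt.Println("stream creation failed:", err)
		return
	}
	cs.attempt = a
	if err := cs.replayBufferLocked(a); err != nil {
		fmt.Println("replay failed:", err)
	}
	fmt.Println("attempt installed with stream:", cs.attempt.s)
}
```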
Before this patch, our client-side tracing interceptor for streaming RPC calls was exposed to gRPC bugs currently being fixed in github.com/grpc/grpc-go/pull/5323. This had to do with calls to clientStream.Context() panicking with an NPE under certain races with RPCs failing. We've recently gotten two crashes seemingly because of this. It's unclear why this hasn't shown up before, as nothing seems new (either on our side or on the grpc side). In 22.2 we do use more streaming RPCs than before (for example for span configs), so maybe that does it.

This patch eliminates the problem by eliminating the problematic call into ClientStream.Context(). The background is that our interceptor needs to watch for ctx cancellation and consider the RPC done at that point. But there was no reason for that call; we can more simply use the RPC caller's ctx for the purposes of figuring out if the caller cancels the RPC. In fact, calling ClientStream.Context() is bad for other reasons, beyond exposing us to the bug:
1) ClientStream.Context() pins the RPC attempt to a lower-level connection, and inhibits gRPC's ability to sometimes transparently retry failed calls. In fact, there's a comment on ClientStream.Context() that tells you not to call it before using the stream through other methods like Recv(), which implies that the RPC is already "pinned" and transparent retries are no longer possible anyway. We were breaking this.
2) One of the grpc-go maintainers suggested that, due to the bugs referenced above, this call could actually sometimes give us "the wrong context", although how wrong exactly is not clear to me (i.e. could we have gotten a ctx that doesn't inherit from the caller's ctx? Or a ctx that's canceled independently from the caller?)

This patch also removes a paranoid catch-all finalizer in the interceptor that assured that the RPC client span is always eventually closed (at a random GC time), regardless of what the caller does - i.e. even if the caller forgets about interacting with the response stream or canceling the ctx. This kind of paranoia is not needed. The code in question was copied from grpc-opentracing[1], which quoted a StackOverflow answer[2] about whether or not a client is allowed to simply walk away from a streaming call. As a result of conversations triggered by this patch [3], that SO answer was updated to reflect the fact that it is not, in fact, OK for a client to do so, as it will leak gRPC resources. The client's contract is specified in [4] (although arguably that place is not the easiest to find by a casual gRPC user). In any case, this patch gets rid of the finalizer. This could in theory result in leaked spans if our own code is buggy in the way that the paranoia prevented, but all our TestServers check that spans don't leak like that, so we are pretty protected. FWIW, a newer re-implementation of the OpenTracing interceptor[5] doesn't have the finalizer (although it also doesn't listen for ctx cancellation, so I think it's buggy), and neither does the equivalent OpenTelemetry interceptor[6].

Fixes cockroachdb#80689

[1] https://github.com/grpc-ecosystem/grpc-opentracing/blob/8e809c8a86450a29b90dcc9efbf062d0fe6d9746/go/otgrpc/client.go#L174
[2] https://stackoverflow.com/questions/42915337/are-you-required-to-call-recv-until-you-get-io-eof-when-interacting-with-grpc-cl
[3] grpc/grpc-go#5324
[4] https://pkg.go.dev/google.golang.org/grpc#ClientConn.NewStream
[5] https://github.com/grpc-ecosystem/go-grpc-middleware/blob/master/tracing/opentracing/client_interceptors.go#L37-L52
[6] https://github.com/open-telemetry/opentelemetry-go-contrib/blame/main/instrumentation/google.golang.org/grpc/otelgrpc/interceptor.go#L193

Release note: A rare crash indicating a nil-pointer dereference in google.golang.org/grpc/internal/transport.(*Stream).Context(...) was fixed.
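As a side note for readers of the referenced commit message, the fix it describes amounts to watching the caller's ctx instead of calling ClientStream.Context(). Below is a minimal, hypothetical sketch of that shape; the span type and helpers are placeholders rather than CockroachDB's actual tracing API, and a real interceptor would also wrap the returned ClientStream so the span is finished when the RPC completes normally, not only on cancellation.

```go
// Hypothetical sketch of the approach described in the commit message above:
// a streaming client interceptor that watches the caller's ctx for
// cancellation instead of calling ClientStream.Context().
package tracingexample

import (
	"context"

	"google.golang.org/grpc"
)

// span and its helpers are stand-ins for a real tracing library.
type span struct{ name string }

func startSpan(method string) *span { return &span{name: method} }
func (s *span) finish()             {}

// StreamClientInterceptor opens a span for the RPC and finishes it when the
// caller's ctx is canceled. It never calls cs.Context(), which would pin the
// attempt to a transport and, per grpc-go#5323, could panic under races.
func StreamClientInterceptor() grpc.StreamClientInterceptor {
	return func(
		ctx context.Context,
		desc *grpc.StreamDesc,
		cc *grpc.ClientConn,
		method string,
		streamer grpc.Streamer,
		opts ...grpc.CallOption,
	) (grpc.ClientStream, error) {
		sp := startSpan(method)
		cs, err := streamer(ctx, desc, cc, method, opts...)
		if err != nil {
			sp.finish()
			return nil, err
		}
		go func() {
			// Use the caller's ctx, not cs.Context(), to decide when the
			// caller has given up on the RPC. A production interceptor would
			// also finish the span when Recv observes the end of the stream.
			<-ctx.Done()
			sp.finish()
		}()
		return cs, nil
	}
}
```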
Thanks for the review. All comments addressed.
return err
}
- if lastErr = cs.replayBufferLocked(); lastErr == nil {
+ if lastErr = cs.replayBufferLocked(attempt); lastErr == nil {
Done
stream.go
Outdated
cs.finish(err)
return nil, err
}

op := func(a *csAttempt) error { return a.newStream() }
// Because this operation is always called either here (while creating the
Done
stream.go
Outdated
op := func(a *csAttempt) error { return a.newStream() }
// Because this operation is always called either here (while creating the
// clientStream) or by the retry code while locked when replaying the
// operation, it is safe to access cs.attempt directly.
Done
Just curious if this could get backported to 1.46.x or is the change too large.
Yes, I believe this should be backported.
Quick question: in the case of server returns
Fixes #5315
Before this change, there was a very narrow window where a stream could fail to be created on a retry attempt, and subsequent stream operations could panic when this occurred. This change is a fairly substantial restructuring of the retry code in order to guarantee that for all clientStreams returned to the user, cs.attempt.s is always non-nil. New attempts that don't yet have a stream are created in cs.newAttempt until they pick a transport and create a stream, at which time they are promoted to cs.attempt.

This also allows us to run balancer pick results through the retry code, meaning UNAVAILABLE errors can be retried by the retry policy, as intended. It also handles drop results that should explicitly avoid the retry code.
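To make the invariant above concrete, here is a hedged, much-simplified sketch of keeping a not-yet-usable attempt in cs.newAttempt and only promoting it to cs.attempt once it has both a transport and a stream. The promotion helper and the toy transport/stream types are illustrative, not the real grpc-go implementation.

```go
// Much-simplified sketch of the newAttempt -> attempt promotion described in
// the PR description; not the actual grpc-go code.
package main

import "fmt"

type transport struct{ name string }
type stream struct{ id int }

type csAttempt struct {
	t *transport
	s *stream // guaranteed non-nil once the attempt is promoted
}

type clientStream struct {
	attempt    *csAttempt // what user-facing methods (Context, SendMsg, ...) read
	newAttempt *csAttempt // attempt that does not yet have a transport/stream
}

// newAttemptLocked creates a fresh attempt that is not yet visible to users.
func (cs *clientStream) newAttemptLocked() {
	cs.newAttempt = &csAttempt{}
}

// getTransportAndStream stands in for picking a transport and calling
// NewStream on it; either step can fail (e.g. an UNAVAILABLE pick result),
// in which case the attempt must not be promoted.
func (a *csAttempt) getTransportAndStream() error {
	a.t = &transport{name: "t1"}
	a.s = &stream{id: 1}
	return nil
}

// promoteAttemptLocked makes the new attempt the current one only after it
// has both a transport and a stream, so cs.attempt.s is always non-nil.
func (cs *clientStream) promoteAttemptLocked() error {
	a := cs.newAttempt
	if err := a.getTransportAndStream(); err != nil {
		return err // cs.attempt is left untouched; the error may be retried
	}
	cs.attempt = a
	cs.newAttempt = nil
	return nil
}

func main() {
	cs := &clientStream{}
	cs.newAttemptLocked()
	if err := cs.promoteAttemptLocked(); err != nil {
		fmt.Println("retryable failure:", err)
		return
	}
	fmt.Printf("current attempt has stream %d on %s\n", cs.attempt.s.id, cs.attempt.t.name)
}
```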
RELEASE NOTES: