
gRPC code Unavailable instead of Canceled #10289

Closed
arslanbekov opened this issue Nov 29, 2018 · 29 comments · Fixed by #12196

Comments

arslanbekov commented Nov 29, 2018

After this PR #9178 we get:

This is from etcd running with the parameter --log-package-levels=debug:

...
etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Unavailable desc = stream error: stream ID 61; CANCEL")
etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Unavailable desc = stream error: stream ID 78; CANCEL")
...

In my Prometheus I see:

sum by(grpc_type, grpc_service, grpc_code) (grpc_server_handled_total{grpc_code="Unavailable", job="etcd", grpc_type="bidi_stream"}) = 4172

But it seems to me that this is not an error and should not fall under the Unavailable code:

Unavailable indicates the service is currently unavailable. This is most likely a transient condition and may be corrected by retrying with a backoff. See the litmus test above for deciding between FailedPrecondition, Aborted, and Unavailable.

It looks like the Canceled code:

Canceled indicates the operation was canceled (typically by the caller).

Now everyone who uses prometheus-operator + Alertmanager gets this alert, because CANCEL falls under Unavailable.

See also #9725, #9576, #9166, openshift/origin#20311, prometheus-operator/prometheus-operator#2046, and reports in Google Groups.
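To make the concern concrete, here is a minimal Go sketch (not etcd code): the grpc_code label that go-grpc-prometheus attaches to grpc_server_handled_total follows the status code, not the description text, so a CANCEL description carried under an Unavailable code still fires Unavailable-based alerts.

    package main

    import (
        "fmt"

        "google.golang.org/grpc/codes"
        "google.golang.org/grpc/status"
    )

    func main() {
        // A client-side watch cancellation would ideally surface as codes.Canceled.
        canceled := status.Error(codes.Canceled, "watch canceled by client")

        // The error reported above instead carries codes.Unavailable, even though
        // its description says CANCEL; the metric label follows the code only.
        unavailable := status.Error(codes.Unavailable, "stream error: stream ID 61; CANCEL")

        for _, err := range []error{canceled, unavailable} {
            if s, ok := status.FromError(err); ok {
                fmt.Printf("grpc_code=%s desc=%q\n", s.Code(), s.Message())
            }
        }
    }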

xiang90 (Contributor) commented Nov 29, 2018

@arslanbekov From #9178, it seems that the gRPC returned code is Unavailable, but the description is canceled. Can you open an issue in grpc-go to ask?

arslanbekov (Author)

@xiang90

but the description is canceled

Are you talking about this?
https://github.com/etcd-io/etcd/blob/master/etcdserver/api/v3rpc/util.go#L113

I will create an issue in grpc-go, thanks.

xiang90 (Contributor) commented Nov 29, 2018

yes. thank you.

menghanl commented Dec 3, 2018

The error string is generated from x/net/http2. The error description comes from method Error().

The error conversion should be handled by mapRecvMsgError, which also calls Error().

Can you add a print to mapRecvMsgError to print the error type and error details?
Something like fmt.Printf("%T %v", err, err). Thanks.
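For illustration, a minimal sketch of where such a print could go. debugMapRecvMsgError below is a hypothetical stand-in for grpc-go's internal mapRecvMsgError, and the code mapping it performs is illustrative only, not grpc-go's actual logic:

    package main

    import (
        "fmt"

        "golang.org/x/net/http2"
        "google.golang.org/grpc/codes"
        "google.golang.org/grpc/status"
    )

    // debugMapRecvMsgError is a hypothetical stand-in for grpc-go's internal
    // mapRecvMsgError; the only point is to show where the suggested
    // fmt.Printf("%T %v", err, err) would go before the error is converted.
    func debugMapRecvMsgError(err error) error {
        fmt.Printf("%T %v\n", err, err) // concrete type and details of the inbound error
        if se, ok := err.(http2.StreamError); ok {
            // Illustrative mapping only: treat a stream-level CANCEL as Canceled.
            return status.Errorf(codes.Canceled, "%v", se)
        }
        return status.Errorf(codes.Unavailable, "%v", err)
    }

    func main() {
        // An x/net/http2 stream error like the one in the logs above:
        // "stream error: stream ID 61; CANCEL".
        err := http2.StreamError{StreamID: 61, Code: http2.ErrCodeCancel}
        fmt.Println(debugMapRecvMsgError(err))
    }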

hexfusion (Contributor)

@arslanbekov just touching base on this: did you see @menghanl's (a grpc-go maintainer) comment above?

arslanbekov (Author)

Yes, @hexfusion, thanks for the ping; I have no free time to do it right now. Maybe someone else can help?

hexfusion self-assigned this Dec 7, 2018

hexfusion (Contributor)

Yes, @hexfusion, thanks for the ping; I have no free time to do it right now. Maybe someone else can help?

@arslanbekov I will try to take a look soon.

menghanl

@hexfusion Did you get time to try this? Thanks!

hexfusion (Contributor)

@menghanl thanks for the reminder, I will work on it before the end of the week.

hexfusion (Contributor)

@spzala could you please take a look?

spzala (Member) commented Jan 21, 2019

@hexfusion I sure will, thanks!

spzala (Member) commented Jan 27, 2019

@hexfusion I sure will, thanks!

Quick update - I had to spend some time creating a cluster with TLS, but I have it created now, so I should be providing more updates soon. Thanks!

spzala (Member) commented Jan 30, 2019

I am not sure if I am doing something incorrect or just differently, but I could not reproduce the issue exactly. I do see rpc error: code = Unavailable (i.e. https://github.com/etcd-io/etcd/blob/master/etcdserver/api/v3rpc/util.go#L108 instead of L113 as mentioned in the issue above).
This is how I am testing:
Run this on server:
./bin/etcd --data-dir=data.etcd --name ${ETCD_NAME} --advertise-client-urls ${ADVERTISE_CLIENT_URLS} --listen-client-urls ${LISTEN_CLIENT_URLS} --initial-cluster ${INITIAL_CLUSTER} --initial-cluster-state ${INITIAL_CLUSTER_STATE} --initial-cluster-token ${INITIAL_CLUSTER_TOKEN} --cert-file $CERT_FILE --trusted-ca-file $TRUSTED_CA_FILE --key-file $KEY_FILE --debug
Run the commands below on the client (a separate VM) in two different CLI sessions:
./bin/etcdctl --endpoints https://<ip>:2379 watch foo --cacert /root/cfssl/ca.pem --cert /root/cfssl/client.pem --key /root/cfssl/client-key.pem
./bin/etcdctl --endpoints https://<ip>:2379 put foo bar --cacert /root/cfssl/ca.pem --cert /root/cfssl/client.pem --key /root/cfssl/client-key.pem
Then cancel the watch command (Ctrl+C), which gives me the following error most of the time:
2019-01-30 03:35:31.618639 D | etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Unavailable desc = client disconnected")

Also, @menghanl, thanks, but I couldn't see the print output in the etcd log after I added the line to mapRecvMsgError. Is it logged somewhere other than the etcd server output? I also tried adding GRPC_GO_LOG_VERBOSITY_LEVEL=2 and other flags in the environment. Since I couldn't reproduce the issue as explained above, I also tried a couple of other Println calls in different package functions just to play around, but didn't see anything in the log. Thanks!

menghanl

@spzala
Re logging:
Did you call grpclog.Info (or other grpclog functions)? If so:

  • If grpclogger is the default, then GRPC_GO_LOG_VERBOSITY_LEVEL=99 GRPC_GO_LOG_SEVERITY_LEVEL=info should print everything.
  • If grpclogger is replaced (by calling SetLogger or SetLoggerV2), then you need to configure the custom logger to log everything.

If you called fmt.Print, it's logged to stdout, and gRPC doesn't do anything special with that.
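As a hedged sketch of the SetLoggerV2 point above: once a program replaces the default gRPC logger, the GRPC_GO_LOG_* environment variables no longer apply and the replacement logger decides what gets printed, which might be why the output was not visible. During debugging you could install a maximally verbose LoggerV2 yourself:

    package main

    import (
        "os"

        "google.golang.org/grpc/grpclog"
    )

    func main() {
        // Replace the gRPC-internal logger with one that writes everything to
        // stdout/stderr at verbosity 99, so grpclog.Info/V(...) calls are visible
        // regardless of the GRPC_GO_LOG_* environment variables.
        grpclog.SetLoggerV2(grpclog.NewLoggerV2WithVerbosity(os.Stdout, os.Stdout, os.Stderr, 99))

        grpclog.Info("grpclog output is now visible")
    }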


The function (mapRecvMsgError) is just where I guess the problem is. If nothing is printed, it's totally possible that the bug happens in another codepath. But we can only confirm that when we reproduce the bug.

In the reproducing command, what should be the values of the configs (like ${ADVERTISE_CLIENT_URLS})? I will try to run it myself, but I'm not very familiar with etcd. Thanks!

spzala (Member) commented Jan 30, 2019

@menghanl thanks so much for the great detail! And yes, etcd uses SetLoggerV2, so that might be the reason. I will run again, modifying it not to override the default grpclog. Also, for testing I used both fmt.Printf and fmt.Println, so as you guessed the problem might be somewhere else, but I will try again with grpclogger.

About the etcd server flags and their values: I am copying the ones I used below, but you can find these and more flags at https://github.com/etcd-io/etcd/blob/master/etcdmain/help.go. With TLS enabled, I used https instead of http, e.g. ADVERTISE_CLIENT_URLS=${HOST}:2379 where HOST is https://<fqdn>

--advertise-client-urls 'http://localhost:2379'
List of this member's client URLs to advertise to the public.
--listen-client-urls 'http://localhost:2379'
List of URLs to listen on for client traffic.
--initial-cluster 'default=http://localhost:2380'
Initial cluster configuration for bootstrapping.
--initial-cluster-state 'new'
Initial cluster state ('new' or 'existing').
--initial-cluster-token 'etcd-cluster'
Initial cluster token for the etcd cluster during bootstrap.
Specifying this can protect you from unintended cross-cluster interaction when running multiple clusters.
--cert-file ''
Path to the client server TLS cert file.
--key-file ''
Path to the client server TLS key file.

For the etcd client - https://github.com/etcd-io/etcd/tree/master/etcdctl

--cacert=/tmp/ca.pem
verify certificates of TLS-enabled secure servers using this CA bundle
--cert=/tmp/cert.pem
identify secure client using this TLS certificate file
--key=/tmp/key.pem
identify secure client using this TLS key file

jingyih (Contributor) commented Feb 11, 2019

cc @jpbetz

hexfusion (Contributor)

Thanks for digging, @spzala. I am going to dig into this further.

spzala (Member) commented Mar 5, 2019

Thanks @hexfusion !!

hexfusion (Contributor)

Update: @menghanl was right about the source of the issue; the actual inbound error to mapRecvMsgError is:

rpc error: code = Unknown desc = client disconnected

I will connect the rest of the dots and come up with some resolution hopefully soon.

hexfusion (Contributor) commented Apr 29, 2019

I have thought about this for a while, and it does feel like there could be a middle ground between the Unknown and Unavailable error codes in this situation, for example by allowing Unknown to conditionally remain as the error code, similar to [1], instead of forcing Unavailable. If we look at the gRPC spec [2], a client-generated Unavailable seems sane versus a server-generated one. So perhaps this is not a matter of using the wrong error code but a result of gRPC having two server transports [3]? @menghanl, curious about your thoughts?

/cc @brancz

Some data transmitted (e.g., request metadata written to TCP connection) before connection breaks: UNAVAILABLE (Client)
Server shutting down: UNAVAILABLE (Server)

[1] https://github.com/grpc/grpc-go/blob/66cd5249103cfe4638864f08462537504b07ff1d/internal/transport/handler_server.go#L426
[2] https://github.com/grpc/grpc/blob/master/doc/statuscodes.md
[3] grpc/grpc-go#1819 (comment) (won't fix)

menghanl

@hexfusion Thanks for the findings!

actual inbound error to mapRecvMsgError is.
rpc error: code = Unknown desc = client disconnected

Do you know where this error is from? This looks like an error returned from a service handler with fmt.Errorf, instead of status.Errorf with an error code.
Also, the error reported originally is rpc error: code = Unavailable desc = stream error: stream ID 61; CANCEL, which has a different description.


I think the issue raised here is not about which error code to return.
It is more about the confusion caused by the error code Unavailable while the error description says CANCEL.

What we want to figure out is why (how) rpc error: code = Unavailable desc = stream error: stream ID 61; CANCEL is generated.

hexfusion (Contributor) commented Apr 30, 2019

@menghanl thanks for the notes; your questions are valid. I will prioritize some time to get you those answers.

brancz added a commit to brancz/cluster-monitoring-operator that referenced this issue Apr 30, 2019
These alerts are firing constantly due to some issues around how etcd
increases its metrics. See etcd-io/etcd#10289
darwink commented Sep 4, 2019

@hexfusion Is there any progress on this issue? Certain etcd3 alert rules are being triggered because of it: https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/etcd3_alert.rules.yml#L32

Any update or plan to fix it?

siwyd commented Nov 13, 2019

FYI, for those using the etcd mixin, I worked around this by tweaking the etcd selector:

        local etcd = import "etcd-mixin/mixin.libsonnet";

        etcd {
          _config+:: {
            etcd_selector: 'job="etcd",grpc_service!="etcdserverpb.Watch"'
          },
        }

hexfusion (Contributor)

I am picking this back up now, hopefully for the last time, to resolve it.

hexfusion (Contributor) commented Nov 19, 2019

I have found the source of Unavailable instead of Canceled.

https://github.com/etcd-io/etcd/blob/master/etcdserver/api/v3rpc/watch.go#L198-L201

When the watch is canceled, we pass the error:

ErrGRPCNoLeader = status.New(codes.Unavailable, "etcdserver: no leader").Err()

This is not valid, resulting in codes.Unavailable being passed to status.New() and metrics reporting Unavailable. In my local testing, adding a new error that uses codes.Canceled allows the metrics to report properly.

I will raise a PR in a bit to resolve.
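As a rough sketch of the change being described (the replacement error's name and message below are placeholders, not necessarily what the eventual PR uses), switching the cancellation path to a Canceled-coded status would make the metric label come out as Canceled instead of Unavailable:

    package main

    import (
        "fmt"

        "google.golang.org/grpc/codes"
        "google.golang.org/grpc/status"
    )

    var (
        // Existing error, quoted from the comment above: code Unavailable.
        ErrGRPCNoLeader = status.New(codes.Unavailable, "etcdserver: no leader").Err()
        // Hypothetical replacement for the client-cancellation path: code Canceled.
        ErrGRPCWatchCanceled = status.New(codes.Canceled, "etcdserver: watch canceled").Err()
    )

    func main() {
        for _, err := range []error{ErrGRPCNoLeader, ErrGRPCWatchCanceled} {
            s, _ := status.FromError(err)
            // The grpc_code metric label follows the status code, so only the
            // second error is counted as Canceled rather than Unavailable.
            fmt.Println(s.Code(), s.Message())
        }
    }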

howardjohn added a commit to howardjohn/istio that referenced this issue Jan 2, 2020
Right now when Envoy shuts down it sometimes sends an Unavailable error
code instead of Cancelled - see
etcd-io/etcd#10289 for details, but it's pretty
complex. The end result is this shows up as "internal error" in the
dashboards, when there is nothing going wrong. Instead, we should filter
these out.

Checking the string directly feels a bit like a hack, but etcd does it,
so it seems reasonable:
https://github.com/etcd-io/etcd/blob/master/etcdserver/api/v3rpc/util.go#L113
istio-testing pushed a commit to istio/istio that referenced this issue Jan 3, 2020
* Stop logging client disconnect as a pilot error

Right now when Envoy shuts down it sometimes sends an Unavailable error
code instead of Cancelled - see
etcd-io/etcd#10289 for details, but it's pretty
complex. The end result is this shows up as "internal error" in the
dashboards, when there is nothing going wrong. Instead, we should filter
these out.

Checking the string directly feels a bit like a hack, but etcd does it,
so it seems reasonable:
https://github.com/etcd-io/etcd/blob/master/etcdserver/api/v3rpc/util.go#L113

* Add deadline exceeded
imunhatep

Any news on merging fixes?!

ironcladlou added a commit to ironcladlou/etcd that referenced this issue Aug 3, 2020
Before this patch, a client which cancels the context for a watch results in the
server generating a `rpctypes.ErrGRPCNoLeader` error that leads to the recording of
a gRPC `Unavailable` metric in association with the client watch cancellation.
The metric looks like this:

    grpc_server_handled_total{grpc_code="Unavailable",grpc_method="Watch",grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"}

So, the watch server has misidentified the error as a server error and then
propagates the mistake to metrics, leading to a false indicator that the leader
has been lost. This false signal then leads to false alerting.

This patch improves the behavior by:

1. Performing a deeper analysis during stream closure to more conclusively
determine that a leader has actually been lost before propagating a
ErrGRPCNoLeader error.

2. Returning a ErrGRPCWatchCanceled error if no conclusion can be drawn
regarding leader loss.

There remains an assumption that absence of leader loss evidence represents a
client cancellation, but in practice this seems less likely to break down
whereas client cancellations are frequent and expected.

This is a continuation of the work already done in etcd-io#11375.

Fixes etcd-io#10289, etcd-io#9725, etcd-io#9576, etcd-io#9166
ironcladlou added a commit to ironcladlou/etcd that referenced this issue Sep 29, 2020
Before this patch, a client which cancels the context for a watch results in the
server generating a `rpctypes.ErrGRPCNoLeader` error that leads to the recording of
a gRPC `Unavailable` metric in association with the client watch cancellation.
The metric looks like this:

    grpc_server_handled_total{grpc_code="Unavailable",grpc_method="Watch",grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"}

So, the watch server has misidentified the error as a server error and then
propagates the mistake to metrics, leading to a false indicator that the leader
has been lost. This false signal then leads to false alerting.

The commit 9c103dd introduced an interceptor which wraps
watch streams requiring a leader, causing those streams to be actively canceled
when leader loss is detected.

However, the error handling code assumes all stream context cancellations are
from the interceptor. This assumption is broken when the context was canceled
because of a client stream cancellation.

The core challenge is lack of information conveyed via `context.Context` which
is shared by both the send and receive sides of the stream handling and is
subject to cancellation by all paths (including the gRPC library itself). If any
piece of the system cancels the shared context, there's no way for a context
consumer to understand who cancelled the context or why.

To solve the ambiguity of the stream interceptor code specifically, this patch
introduces a custom context struct which the interceptor uses to expose a custom
error through the context when the interceptor decides to actively cancel a
stream. Now the consuming side can more safely assume a generic context
cancellation can be propagated as a cancellation, and the server generated
leader error is preserved and propagated normally without any special inference.

When a client cancels the stream, there remains a race in the error handling
code between the send and receive goroutines whereby the underlying gRPC error
is lost in the case where the send path returns and is handled first, but this
issue can be taken separately as, no matter which path wins, we can detect a
generic cancellation.

This is a replacement of etcd-io#11375.

Fixes etcd-io#10289, etcd-io#9725, etcd-io#9576, etcd-io#9166
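A hedged Go sketch of the custom-context idea described in the commit message above; the type and method names are illustrative, not etcd's actual implementation:

    package main

    import (
        "context"
        "errors"
        "fmt"
        "sync"
    )

    var errNoLeader = errors.New("etcdserver: no leader")

    // cancelReasonCtx wraps a stream context and records why the interceptor
    // canceled it, so the consumer can distinguish an interceptor-initiated
    // leader-loss cancellation from a plain client cancellation.
    type cancelReasonCtx struct {
        context.Context
        mu     sync.Mutex
        reason error
    }

    // cancelWith records the reason, then cancels the underlying context.
    func (c *cancelReasonCtx) cancelWith(cancel context.CancelFunc, reason error) {
        c.mu.Lock()
        c.reason = reason
        c.mu.Unlock()
        cancel()
    }

    // Err returns the recorded reason if one was set; otherwise it falls back to
    // the generic context error (context.Canceled for a client cancellation).
    func (c *cancelReasonCtx) Err() error {
        c.mu.Lock()
        defer c.mu.Unlock()
        if c.reason != nil {
            return c.reason
        }
        return c.Context.Err()
    }

    func main() {
        ctx, cancel := context.WithCancel(context.Background())
        wrapped := &cancelReasonCtx{Context: ctx}

        // Interceptor-initiated cancellation due to leader loss:
        wrapped.cancelWith(cancel, errNoLeader)
        fmt.Println(wrapped.Err()) // etcdserver: no leader

        // A plain client cancellation would leave reason unset, so Err() would
        // return context.Canceled, which the watch server can map to a Canceled
        // gRPC code rather than Unavailable.
    }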
immanuelfodor

This could solve some ongoing issues with Prometheus monitoring. Will this PR ever be merged? 😃
