etcdserver: fix watch metrics #11375

hexfusion · 2019-11-20T05:09:30Z

Currently, when a client closes its context during a watch, we pass codes.Unavailable to status.New() via rpctypes.ErrGRPCNoLeader [1], [2]. This inadvertently registers Unavailable in the Prometheus metrics, which is a problem because Unavailable indicates that the service is currently unavailable [3]. This PR changes how we conclude that the leader is lost: we check RaftStatusGetter.Leader() [4] for raft.None. Only then do we return Unavailable (no leader); otherwise we return Canceled.

Fixes #10289 #9725 #9576 #9166

[1] https://github.com/etcd-io/etcd/pull/11375/files#diff-8a4ebdea7c0a8a8926fca73c3058b0b9L200
[2] etcd/etcdserver/api/v3rpc/rpctypes/error.go, line 68 in 0fb26df:

    ErrGRPCNoLeader = status.New(codes.Unavailable, "etcdserver: no leader").Err()

[3] https://github.com/grpc/grpc-go/blob/master/codes/codes.go#L140
[4] etcd/etcdserver/server.go, line 1907 in bbe1e78:

    Leader() types.ID

codecov-io · 2019-11-22T20:57:20Z

Codecov Report

Merging #11375 into master will increase coverage by 0.2%.
The diff coverage is 100%.

@@            Coverage Diff            @@
##           master   #11375     +/-   ##
=========================================
+ Coverage   64.11%   64.31%   +0.2%     
=========================================
  Files         403      403             
  Lines       37973    37996     +23     
=========================================
+ Hits        24348    24439     +91     
+ Misses      11995    11929     -66     
+ Partials     1630     1628      -2

Impacted Files	Coverage Δ
etcdserver/api/v3rpc/rpctypes/error.go	`90.47% <ø> (ø)`	⬆️
etcdserver/api/v3rpc/watch.go	`80.06% <100%> (+1.63%)`	⬆️
etcdserver/api/v3rpc/lease.go	`67.04% <0%> (-7.96%)`	⬇️
auth/store.go	`44.82% <0%> (-2.56%)`	⬇️
lease/leasehttp/http.go	`64.23% <0%> (-1.46%)`	⬇️
etcdserver/api/v2http/client.go	`84.3% <0%> (-1.21%)`	⬇️
pkg/proxy/server.go	`60.2% <0%> (-1.02%)`	⬇️
etcdserver/v3_server.go	`72.86% <0%> (-0.86%)`	⬇️
mvcc/watchable_store.go	`82.51% <0%> (-0.7%)`	⬇️
mvcc/metrics_txn.go	`100% <0%> (ø)`	⬆️
... and 20 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ec52217...91042e2. Read the comment docs.

hexfusion · 2019-11-22T23:46:00Z

/cc @brancz

Signed-off-by: Sam Batschelet <[email protected]>

hexfusion · 2019-11-23T01:05:01Z

/cc @jingyih @gyuho PTAL

brancz · 2019-11-23T17:20:22Z

I have very little idea about the code changes, they look fine to me but I really don’t know the code very well. If it does what’s promised then I’m extremely excited to finally turn the alerts back on! :)

hexfusion · 2019-11-26T23:25:23Z

@gyuho would you mind taking a peek please:).

hexfusion · 2019-12-02T22:55:08Z

@xiang90 would you mind looking please.

xiang90 · 2019-12-02T23:03:48Z

etcdserver/api/v3rpc/watch.go

 		if err == context.Canceled {
-			err = rpctypes.ErrGRPCNoLeader
+			md, ok := metadata.FromIncomingContext(stream.Context())


can we add a test case for this change?

xiang90 · 2019-12-02T23:05:17Z

etcdserver/api/v3rpc/watch.go

 	select {
 	case err = <-errc:
 		close(sws.ctrlStream)

 	case <-stream.Context().Done():
 		err = stream.Context().Err()
-		// the only server-side cancellation is noleader for now.


is there a better way to decide if the cancel is from server side or from client side?

hexfusion · 2019-12-02

Yeah, good question. I dug pretty hard, but let me review again.

metost · 2020-02-24T13:00:49Z

Hi guys, any update on this? Thanks in advance!

philips · 2020-04-29T12:37:15Z

bump @hexfusion. There are TODOs on this PR from @xiang90’s feedback.

hexfusion · 2020-04-29T12:38:31Z

I hope to get back to this soon.

codecov-commenter · 2020-05-27T11:58:46Z

Codecov Report

Merging #11375 into master will decrease coverage by 0.64%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master   #11375      +/-   ##
==========================================
- Coverage   64.96%   64.31%   -0.65%     
==========================================
  Files         403      403              
  Lines       37969    37996      +27     
==========================================
- Hits        24666    24439     -227     
- Misses      11680    11929     +249     
- Partials     1623     1628       +5

Impacted Files	Coverage Δ
etcdserver/api/v3rpc/rpctypes/error.go	`90.47% <ø> (ø)`
etcdserver/api/v3rpc/watch.go	`80.06% <100.00%> (-1.97%)`	⬇️
auth/store.go	`44.82% <0.00%> (-23.50%)`	⬇️
proxy/grpcproxy/register.go	`69.44% <0.00%> (-11.12%)`	⬇️
etcdserver/api/v3rpc/lease.go	`67.04% <0.00%> (-7.96%)`	⬇️
pkg/tlsutil/tlsutil.go	`86.20% <0.00%> (-6.90%)`	⬇️
pkg/fileutil/purge.go	`66.00% <0.00%> (-6.00%)`	⬇️
proxy/grpcproxy/watcher.go	`89.79% <0.00%> (-4.09%)`	⬇️
clientv3/maintenance.go	`40.81% <0.00%> (-2.05%)`	⬇️
... and 16 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 63dd73c...91042e2. Read the comment docs.

Before this patch, a client that cancels the context for a watch causes the server to generate a `rpctypes.ErrGRPCNoLeader` error, which leads to the recording of a gRPC `Unavailable` metric in association with the client watch cancellation. The metric looks like this:

    grpc_server_handled_total{grpc_code="Unavailable",grpc_method="Watch",grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"}

So, the watch server has misidentified the error as a server error and then propagates the mistake to metrics, leading to a false indicator that the leader has been lost. This false signal then leads to false alerting.

This patch improves the behavior by:

1. Performing a deeper analysis during stream closure to more conclusively determine that a leader has actually been lost before propagating an ErrGRPCNoLeader error.
2. Returning an ErrGRPCWatchCanceled error if no conclusion can be drawn regarding leader loss.

There remains an assumption that absence of leader loss evidence represents a client cancellation, but in practice this seems less likely to break down, whereas client cancellations are frequent and expected.

This is a continuation of the work already done in etcd-io#11375.

Fixes etcd-io#10289, etcd-io#9725, etcd-io#9576, etcd-io#9166

ironcladlou · 2020-08-03T21:17:29Z

I'm picking this up in #12196

Before this patch, a client that cancels the context for a watch causes the server to generate a `rpctypes.ErrGRPCNoLeader` error, which leads to the recording of a gRPC `Unavailable` metric in association with the client watch cancellation. The metric looks like this:

    grpc_server_handled_total{grpc_code="Unavailable",grpc_method="Watch",grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"}

So, the watch server has misidentified the error as a server error and then propagates the mistake to metrics, leading to a false indicator that the leader has been lost. This false signal then leads to false alerting.

The commit 9c103dd introduced an interceptor which wraps watch streams requiring a leader, causing those streams to be actively canceled when leader loss is detected. However, the error handling code assumes all stream context cancellations are from the interceptor. This assumption is broken when the context was canceled because of a client stream cancellation.

The core challenge is the lack of information conveyed via `context.Context`, which is shared by both the send and receive sides of the stream handling and is subject to cancellation by all paths (including the gRPC library itself). If any piece of the system cancels the shared context, there's no way for a context consumer to understand who canceled the context or why.

To solve the ambiguity of the stream interceptor code specifically, this patch introduces a custom context struct which the interceptor uses to expose a custom error through the context when the interceptor decides to actively cancel a stream. Now the consuming side can more safely assume a generic context cancellation can be propagated as a cancellation, and the server-generated leader error is preserved and propagated normally without any special inference.

When a client cancels the stream, there remains a race in the error handling code between the send and receive goroutines whereby the underlying gRPC error is lost in the case where the send path returns and is handled first, but this issue can be taken separately because, no matter which path wins, we can detect a generic cancellation.

This is a replacement of etcd-io#11375.

Fixes etcd-io#10289, etcd-io#9725, etcd-io#9576, etcd-io#9166

stale · 2020-11-01T22:39:51Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

adampl · 2020-11-20T10:09:22Z

I guess it shouldn't go stale, not to mention closing.

abhi1693 · 2021-01-11T13:47:11Z

@xiang90 @hexfusion @ironcladlou Any news on merging this PR?

ptabor · 2021-02-16T08:40:43Z

Closing as this got replaced by: #12196

hexfusion added the WIP label Nov 20, 2019

hexfusion force-pushed the fx_w branch from 6275a91 to 4170606 Compare November 22, 2019 20:21

hexfusion force-pushed the fx_w branch 4 times, most recently from 4ff0b9c to aa1c165 Compare November 22, 2019 23:04

hexfusion removed the WIP label Nov 22, 2019

hexfusion changed the title ~~WIP etcdserver: fix watch metrics~~ etcdserver: fix watch metrics Nov 22, 2019

etcdserver: fix watch metrics reporting when context is closed

91042e2

Signed-off-by: Sam Batschelet <[email protected]>

hexfusion force-pushed the fx_w branch from aa1c165 to 91042e2 Compare November 23, 2019 01:03

hexfusion added backport/v3.3 backport/v3.4 Release-Note labels Nov 23, 2019

xiang90 reviewed Dec 2, 2019

View reviewed changes

ironcladlou mentioned this pull request Aug 3, 2020

etcdserver: fix incorrect metrics generated when clients cancel watches #12196

Merged

stale bot added the stale label Nov 1, 2020

stale bot removed the stale label Nov 20, 2020

ptabor closed this Feb 16, 2021
