Clarify/define clientv3 retry logic (fix broken retries) #8691
Comments
Is this fixed?
Yes.
Hi. The Watch client still contains a failFast=False with a TODO to switch it to failFast=True. Was that meant to be part of this issue? Or is there an issue tracking that elsewhere?
@gyuho that is incorrect, see https://github.com/coreos/etcd/blob/master/clientv3/watch.go#L778
A search of the codebase also reveals a FailFast=false in the grpcproxy.
@devnev Indeed. I meant to say
Somehow got confused. So we do
For watches in particular, FailFast=False is problematic, as they usually do not have RPC timeouts. In our case the connections are failing for extended periods of time, but we cannot set a timeout on the request as it is a watch. However, because FailFast=False, we're also not given any indication that the watch is in fact broken.
The etcd watch API is not meant for detecting connection issues. Disconnects are handled in the client balancer layer. We've added HTTP/2 keepalive and client balancer health checking (only available in >= v3.2.10). Please try the HTTP/2 keepalive ping. If it still doesn't work, please open a new issue.
Now that retry logic is such a critical part of our balancer logic, we need to document clearly when the Go client retries its RPCs; currently we don't have any documentation, other than some comments in `clientv3/retry.go`. This should be helpful for other client language bindings.

release-3.2, as of f1d7dd8

Mutable/immutable RPCs share the same error handling logic:

- If the error is of `rpctypes.EtcdError` type, then no retry
  - `rpctypes.ErrEmptyKey`, `rpctypes.ErrNoSpace`, `rpctypes.ErrTimeout`
- If the error is of `grpc/status.(*statusError)` type and its error code is `codes.Unavailable`, which means the service is not currently available, then retry
- `rpctypes.EtcdError` and `grpc/status.(*statusError)` errors are mutually exclusive.

This only works with `grpc/grpc-go` v1.2.1.

master branch, as of 764a0f7

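As a baseline for the master-branch changes, the release-3.2 rules can be sketched in a few lines of Go. This is only an illustration, not the code in `clientv3/retry.go`: `Code`, `EtcdError`, `StatusError`, and `shouldRetry` are hypothetical stand-ins for `codes.Code`, `rpctypes.EtcdError`, and `grpc/status.(*statusError)`, chosen so the example runs with the standard library alone. Treating any other error as non-retriable is an assumption as well.

```go
package main

import (
	"errors"
	"fmt"
)

// Code is a hypothetical stand-in for grpc's codes.Code.
type Code int

const (
	OK Code = iota
	Unavailable
	Internal
)

// EtcdError is a hypothetical stand-in for rpctypes.EtcdError.
type EtcdError struct{ Desc string }

func (e EtcdError) Error() string { return "etcdserver: " + e.Desc }

// StatusError is a hypothetical stand-in for grpc/status.(*statusError).
type StatusError struct {
	Code Code
	Msg  string
}

func (e *StatusError) Error() string {
	return fmt.Sprintf("rpc error: code = %d desc = %s", e.Code, e.Msg)
}

// shouldRetry applies the release-3.2 rules, which are the same for
// mutable and immutable RPCs.
func shouldRetry(err error) bool {
	var ee EtcdError
	if errors.As(err, &ee) {
		return false // etcd server error: never retried
	}
	var se *StatusError
	if errors.As(err, &se) {
		return se.Code == Unavailable // retried only when the service is unavailable
	}
	return false // assumption: anything else is returned to the caller
}

func main() {
	fmt.Println(shouldRetry(EtcdError{"key is not provided"}))                  // false
	fmt.Println(shouldRetry(&StatusError{Unavailable, "transport is closing"})) // true
	fmt.Println(shouldRetry(&StatusError{Internal, "unexpected failure"}))      // false
}
```

Because the two error types are mutually exclusive, the order of the two checks does not change the result.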
`grpc/grpc-go` was upgraded to >v1.6.x, which changed the error handling behavior.

During network disconnections, the error `clientv3.ErrNoAddrAvilable` can happen, and its error code is `codes.Unavailable`, so it should be retried (same as `grpc.Errorf(codes.Unavailable, "there is no address available")`).

However, due to the change in grpc-go, `transport.ErrStreamDrain` can be returned. etcd mutable RPCs should not be retried on `transport.ErrStreamDrain`, and only retried on `clientv3.ErrNoAddrAvilable`. This is fixed via #8335 (fix "put at most once", not in 3.2).

Plus, with the health checking balancer #8545, the retry error handling logic is now:

- If the error is of `grpc/status.(*statusError)` type and its error code is `codes.Unavailable`, then mark unhealthy, endpoint-switch, wait for connection notify, and retry
- If the error is of `grpc/status.(*statusError)` type and its error code is `codes.Unavailable` and the error message is `there is no address available`, then mark unhealthy, endpoint-switch, wait for connection notify, and retry
- If the error is of `rpctypes.EtcdError` type, then mark unhealthy, endpoint-switch, and exit
  - `rpctypes.ErrEmptyKey`, `rpctypes.ErrNoSpace`, `rpctypes.ErrTimeout`
- If the error is of `rpctypes.EtcdError` type, then mark unhealthy, endpoint-switch, and exit (proper handling is missing though)

TODOs

- `Maintenance` API, and others (done via "clientv3: clean up retry wrapper, remove all FailFast=false" #8717)
- `status.FromError`: are all errors returned by gRPC wrapped by `status.Status`? grpc/grpc-go#1581
- `rpctypes.EtcdError` type error handling should be consistent across mutable/immutable RPCs
- `functional-tester/stresser`
  - `rpctypes.ErrEmptyKey` or `rpctypes.ErrNoSpace`
  - `rpctypes.ErrTimeout`
  - `rpctypes.ErrEmptyKey` or `rpctypes.ErrNoSpace`
  - `rpctypes.ErrTimeout`
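
To make the master-branch rules above concrete, here is a minimal sketch of the decision they describe. It is not the actual client code: `Code`, `EtcdError`, `StatusError`, `Decision`, and `decide` are hypothetical names, and the error types are local stand-ins so the example needs only the standard library. It also assumes that the plain `codes.Unavailable` rule applies to immutable RPCs and the message-restricted rule to mutable RPCs, which is a reading of the "put at most once" note rather than something the list states. What happens to other errors is likewise an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// Code is a hypothetical stand-in for grpc's codes.Code.
type Code int

const (
	OK Code = iota
	Unavailable
)

// EtcdError is a hypothetical stand-in for rpctypes.EtcdError.
type EtcdError struct{ Desc string }

func (e EtcdError) Error() string { return "etcdserver: " + e.Desc }

// StatusError is a hypothetical stand-in for grpc/status.(*statusError).
type StatusError struct {
	Code Code
	Msg  string
}

func (e *StatusError) Error() string { return e.Msg }

// noAddrMsg is the error message quoted in the issue.
const noAddrMsg = "there is no address available"

// Decision records what the client does after a failed RPC.
type Decision struct {
	MarkUnhealthy bool // mark the endpoint unhealthy and switch endpoints
	Retry         bool // wait for connection notify, then retry
}

// decide sketches the master-branch rules. The mutable/immutable split
// is an assumption, as noted in the text above.
func decide(err error, mutable bool) Decision {
	var ee EtcdError
	if errors.As(err, &ee) {
		return Decision{MarkUnhealthy: true} // switch endpoint and exit
	}
	var se *StatusError
	if errors.As(err, &se) && se.Code == Unavailable {
		if mutable && se.Msg != noAddrMsg {
			// The request may already have been applied (for example on a
			// drained stream), so it is not retried. Assumption: no
			// endpoint switch either, since the issue does not say.
			return Decision{}
		}
		return Decision{MarkUnhealthy: true, Retry: true}
	}
	return Decision{} // assumption: other errors go straight to the caller
}

func main() {
	drained := &StatusError{Unavailable, "the stream is draining"}
	noAddr := &StatusError{Unavailable, noAddrMsg}
	fmt.Println(decide(drained, false).Retry) // true
	fmt.Println(decide(drained, true).Retry)  // false
	fmt.Println(decide(noAddr, true).Retry)   // true
}
```

Matching on the message string is brittle; it is used here only because the rule in the list is phrased in terms of the message.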