Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

clientv3/balancer: handle network partition in health check #8669

Merged
merged 5 commits into from
Oct 9, 2017

Conversation

gyuho
Copy link
Contributor

@gyuho gyuho commented Oct 9, 2017

Only health-check when timeout elapses since last failure.
Otherwise network-partitioned member with active health-check
server would not be gray-listed, making health-balancer stuck
with isolated endpoint. Also clarifies some log messages.

Mark partitioned member as unhealthy in retry code paths.

Previous behavior is when server returns errors, retry
wrapper does not do anything, while passively expecting
balancer to gray-list the isolated endpoint. This is
problematic when multiple endpoints are passed, and
network partition happens.

This patch adds endpointError method to balancer interface
to actively(possibly even before health-check API gets called)
handle RPC errors and gray-list endpoints for the time being,
thus speeding up the endpoint switch.

This is safe in a single-endpoint case, because balancer will
retry no matter what in such case.

Added integration tests fail without this patch.
Also manually tested to confirm this fixes #8660.

Logs

GET

2017-10-09 13:55:28.265074433 +0000 UTC m=+148.050834478 sleeping...
2017-10-09 13:55:33.265267969 +0000 UTC m=+153.051028081 starts get...
INFO: 2017/10/09 13:55:38 clientv3/retry: error rpc error: code = DeadlineExceeded desc = context deadline exceeded on pinned endpoint 10.138.0.21:2379
INFO: 2017/10/09 13:55:38 clientv3/health-balancer: marking 10.138.0.21:2379 as unhealthy (rpc error: code = DeadlineExceeded desc = context deadline exceeded)
2017-10-09 13:55:38.26572891 +0000 UTC m=+158.051488955 failed context deadline exceeded

2017-10-09 13:55:38.265765899 +0000 UTC m=+158.051525944 sleeping...
INFO: 2017/10/09 13:55:39 clientv3/balancer: unpin 10.138.0.21:2379 (grpc: the connection is drained)
INFO: 2017/10/09 13:55:39 clientv3/health-balancer: 10.138.0.21:2379 becomes unhealthy (grpc: the connection is drained)
INFO: 2017/10/09 13:55:39 clientv3/balancer: pin 10.138.0.22:2379
INFO: 2017/10/09 13:55:39 clientv3/balancer: 10.138.0.23:2379 is up but not pinned (already pinned 10.138.0.22:2379)
INFO: 2017/10/09 13:55:42 clientv3/health-balancer: removes 10.138.0.21:2379 from unhealthy after 3s
2017-10-09 13:55:43.265951664 +0000 UTC m=+163.051711812 starts get...
2017-10-09 13:55:43.269703717 +0000 UTC m=+163.055463762 success!

PUT

INFO: 2017/10/09 13:58:00 clientv3/balancer: pin 10.138.0.22:2379
INFO: 2017/10/09 13:58:00 clientv3/balancer: 10.138.0.23:2379 is up but not pinned (already pinned 10.138.0.22:2379)
INFO: 2017/10/09 13:58:00 clientv3/balancer: 10.138.0.21:2379 is up but not pinned (already pinned 10.138.0.22:2379)

2017-10-09 13:58:03.830506173 +0000 UTC m=+3.002894666 sleeping
2017-10-09 13:58:08.83076711 +0000 UTC m=+8.003155592 starts put...
2017-10-09 13:58:08.83494924 +0000 UTC m=+8.007337654 success!

2017-10-09 13:58:18.842551525 +0000 UTC m=+18.014939934 sleeping
2017-10-09 13:58:23.842763057 +0000 UTC m=+23.015151547 starts put...
INFO: 2017/10/09 13:58:28 clientv3/retry: error rpc error: code = DeadlineExceeded desc = context deadline exceeded on pinned endpoint 10.138.0.22:2379
INFO: 2017/10/09 13:58:28 clientv3/health-balancer: marking 10.138.0.22:2379 as unhealthy (rpc error: code = DeadlineExceeded desc = context deadline exceeded)
2017-10-09 13:58:28.843253482 +0000 UTC m=+28.015641904 failed: context deadline exceeded

2017-10-09 13:58:28.84326634 +0000 UTC m=+28.015654761 sleeping
INFO: 2017/10/09 13:58:30 clientv3/balancer: unpin 10.138.0.22:2379 (grpc: the connection is drained)
INFO: 2017/10/09 13:58:30 clientv3/health-balancer: 10.138.0.22:2379 becomes unhealthy (grpc: the connection is drained)
INFO: 2017/10/09 13:58:30 clientv3/balancer: pin 10.138.0.21:2379
WARNING: 2017/10/09 13:58:30 Failed to dial 10.138.0.23:2379: grpc: the connection is closing; please retry.
WARNING: 2017/10/09 13:58:30 Failed to dial 10.138.0.23:2379: grpc: the connection is closing; please retry.
INFO: 2017/10/09 13:58:33 clientv3/health-balancer: removes 10.138.0.22:2379 from unhealthy after 3s
2017-10-09 13:58:33.843439088 +0000 UTC m=+33.015827591 starts put...
2017-10-09 13:58:33.846525885 +0000 UTC m=+33.018914309 success!

@gyuho gyuho requested a review from xiang90 October 9, 2017 16:16
@gyuho gyuho force-pushed the balancer branch 2 times, most recently from bd030f2 to 1286a4e Compare October 9, 2017 16:32
@@ -44,6 +44,8 @@ type balancer interface {
endpoints() []string
// pinned returns the current pinned endpoint.
pinned() string
// errored handles error from server-side.
errored(addr string, err error)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should not we call through the down func returned from up?

@gyuho gyuho force-pushed the balancer branch 2 times, most recently from 5e0f81d to f6a3809 Compare October 9, 2017 16:55
@@ -87,14 +88,10 @@ func (c *Client) newRetryWrapper(isStop retryStopErrFunc) retryRpcFunc {
func (c *Client) newAuthRetryWrapper() retryRpcFunc {
return func(rpcCtx context.Context, f rpcFunc) error {
for {
pinned := c.balancer.pinned()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i guess this is still useful. let us keep this for now?

@@ -65,6 +65,7 @@ func (c *Client) newRetryWrapper(isStop retryStopErrFunc) retryRpcFunc {
}
if logger.V(4) {
logger.Infof("clientv3/retry: error %v on pinned endpoint %s", err, pinned)
c.balancer.down(pinned, err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the only error can reach here is timeout error?


func TestNetworkPartitionBalancerWithGet(t *testing.T) {
testNetworkPartitionBalancer(t, func(cli *clientv3.Client, ctx context.Context) error {
_, err := cli.Get(ctx, "foo")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for get, we should see no error. client should auto retry


func TestNetworkPartitionBalancerWithPut(t *testing.T) {
testNetworkPartitionBalancer(t, func(cli *clientv3.Client, ctx context.Context) error {
_, err := cli.Put(ctx, "foo", "bar")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for put we should expect deadline exceed error.

clus.Members[0].InjectPartition(t, clus.Members[1:])
defer clus.Members[0].RecoverPartition(t, clus.Members[1:])

for i := 0; i < 5; i++ {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get and put expect different behavior in testing. let not try to mix them together. probably we only need to reuse the code for setting up client and partitioning, not the actual test/check part.

@@ -44,6 +44,8 @@ type balancer interface {
endpoints() []string
// pinned returns the current pinned endpoint.
pinned() string
// errored handles error from server-side.
down(addr string, err error)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok... endpointError is more appropriate actually.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry for the back and forth on this one.

hb.mu.RUnlock()
if skip || !bad {
return true
}
// prevent isolated member's endpoint from being infinitely retried, as follows:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how does this change affect what we are fixing in this PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Often, timed-out endpoint gets retried and passes this check, so gets pinned again. This is problematic when the member is the isolated one in network partitions still serving the health server as SERVING. In such case, we keep it in the unhealthy list so that other endpoints can be prioritized within the next time-out.

Basically we need this as well to fix the network partition issue.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this testable?

Copy link
Contributor Author

@gyuho gyuho Oct 9, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hard to test since we can't control gRPC side. But with debugging enabled, I can confirm manually this happens.

client, err := NewClientV3(m)
if err != nil {
t.Fatalf("cannot create client: %v", err)
if !cfg.NoClient {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we really need this for this PR?

Copy link
Contributor Author

@gyuho gyuho Oct 9, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me remove it in the next push.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you.

@gyuho gyuho force-pushed the balancer branch 2 times, most recently from 08a9f4f to a043fbc Compare October 9, 2017 18:42
_, err := cli.Get(ctx, "a")
cancel()
if err != nil {
t.Fatal(err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

t.error. also we need a better failure description. want err = nil, got %v

func TestNetworkPartitionBalancerWithPut(t *testing.T) {
testNetworkPartitionBalancer(t, func(cli *clientv3.Client) {
var err error
for i := 0; i < 5; i++ {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why 5 times? would not a simple second retry go through?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Slow CIs take several seconds to switch endpoints.
Instead, increased wait time with only 2 tries.

t.Fatalf("#%d: expected %v, got %v", i, context.DeadlineExceeded, err)
}
if i == 0 { // give enough time for endpoint switch
time.Sleep(5 * time.Second)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need to get rid of this eventually. probably need a wait endpoint switch func.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can expose balancer interface with ConnectNotify. Should we?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for this pr. simply add a todo is good enough.

func (hb *healthBalancer) mayPin(addr grpc.Address) bool {
hb.mu.RLock()
skip := len(hb.addrs) == 1 || len(hb.unhealthy) == 0
_, bad := hb.unhealthy[addr.Addr]
failed, bad := hb.unhealthy[addr.Addr]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

failedTime?

// instead, return before grpc-healthcheck if failed within healthcheck timeout
if elapsed := time.Since(failed); elapsed < dur {
if logger.V(4) {
logger.Infof("clientv3/health-balancer: %s is up but not pinned (elapsed %v < timeout %v)", addr.Addr, elapsed, dur)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably also log explicitly that we are not going to pin this since xxx.

@xiang90
Copy link
Contributor

xiang90 commented Oct 9, 2017

LGTM

gyuho added 2 commits October 9, 2017 13:40
Otherwise network-partitioned member with active health-check
server would not be gray-listed, making health-balancer stuck
with isolated endpoint.

Also clarifies some log messages.

Signed-off-by: Gyu-Ho Lee <[email protected]>
Previous behavior is when server returns errors, retry
wrapper does not do anything, while passively expecting
balancer to gray-list the isolated endpoint. This is
problematic when multiple endpoints are passed, and
network partition happens.

This patch adds 'endpointError' method to 'balancer' interface
to actively(possibly even before health-check API gets called)
handle RPC errors and gray-list endpoints for the time being,
thus speeding up the endpoint switch.

This is safe in a single-endpoint case, because balancer will
retry no matter what in such case.

Signed-off-by: Gyu-Ho Lee <[email protected]>
@gyuho gyuho merged commit 8329963 into etcd-io:master Oct 9, 2017
@gyuho gyuho deleted the balancer branch October 9, 2017 23:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

Client balancer with linearized request does not work under network partition
2 participants