
kvserver: remove extraneous circuit breaker check in Raft transport #69405

Merged (1 commit, Aug 27, 2021)

Conversation

tbg (Member) commented Aug 26, 2021

See #68419. We now use
DialNoBreaker for the raft transport, taking into account the previous
Ready() check.

DialNoBreaker was previously bypassing the breaker as it ought to but
was also not reporting to the breaker the result of the operation;
this is not ideal and was caught by the tests. This commit changes
DialNoBreaker to report the result (i.e. fail or success).

Release justification: bug fix
Release note (bug fix): Previously, after a temporary node outage, other
nodes in the cluster could fail to connect to the restarted node due to
their circuit breakers not resetting. This would manifest in the logs
via messages "unable to dial nXX: breaker open", where XX is the ID
of the restarted node. (Note that such errors are expected for nodes
that are truly unreachable, and may still occur around the time of
the restart, but for no longer than a few seconds).

@tbg tbg requested a review from a team as a code owner August 26, 2021 08:27
cockroach-teamcity (Member) commented

This change is Reviewable

erikgrinaker (Contributor) left a comment

This is going to result in the queue filling up when a node is unavailable, and all of those thousands of queued messages getting sent when it comes back up. Do we need some additional TTL logic or queue draining or something to prevent this, or do we already have safeguards for this?

tbg (Member, Author) commented Aug 26, 2021

That's not quite what will happen I think. Because of the second check

conn, err := t.dialer.Dial(ctx, toNodeID, class)

the newly created queue will quickly be torn down again (or slowly, depending on whether the breaker lets us through and, if so, whether we fail fast, fail slow, or, who knows, succeed). This isn't ideal, but it seems better than the status quo. I'm currently looking into adding a control loop to the breakers (at least in our usage of them in nodedialer), but I doubt that's something we'll want to land now.

Also, the queue has a bounded size (at least in the number of messages), so there is some protection, though not the best one. Either way, the protection is no worse than for nodes which are live.
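
For reference, the bounded-queue protection mentioned here boils down to a non-blocking send into a fixed-capacity channel: if the per-node queue is full, the message is dropped and counted rather than buffered indefinitely. A minimal, self-contained sketch (the capacity and names are assumptions for illustration, not taken from the transport code):

package main

import (
	"fmt"
	"sync/atomic"
)

// Assumed queue capacity; the real transport uses a constant of a similar
// order of magnitude, but the exact value is not copied from the code.
const sendBufferSize = 10000

type RaftMessageRequest struct{ RangeID int64 }

// trySend enqueues without blocking: when the per-node queue is full, the
// message is dropped and accounted for, which bounds memory usage while
// the remote node is unreachable.
func trySend(ch chan *RaftMessageRequest, req *RaftMessageRequest, dropped *int64) bool {
	select {
	case ch <- req:
		return true
	default:
		atomic.AddInt64(dropped, 1)
		return false
	}
}

func main() {
	ch := make(chan *RaftMessageRequest, sendBufferSize)
	var dropped int64
	for i := 0; i < sendBufferSize+5; i++ {
		trySend(ch, &RaftMessageRequest{RangeID: int64(i)}, &dropped)
	}
	fmt.Printf("queued=%d dropped=%d\n", len(ch), atomic.LoadInt64(&dropped))
}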

erikgrinaker (Contributor) commented Aug 26, 2021

Right, I see that we actually drain the queue on dial failures, that's the sort of protection I was curious about.

cleanup := func(ch chan *RaftMessageRequest) {
	// Account for the remainder of `ch` which was never sent.
	// NB: we deleted the queue above, so within a short amount
	// of time nobody should be writing into the channel any
	// more. We might miss a message or two here, but that's
	// OK (there's nobody who can safely close the channel the
	// way the code is written).
	for {
		select {
		case <-ch:
			atomic.AddInt64(&stats.clientDropped, 1)
		default:
			return
		}
	}
}

Also, the queue has a bounded size (at least in the number of messages), so there is some protection, though not the best one. Either way, the protection is no worse than for nodes which are live.

I seem to recall the buffer size is something like 10-20k messages though. A live node will generally be able to keep up, and so the queue is generally small, but it's true that it would pile up if the remote node should struggle.

tbg (Member, Author) commented Aug 26, 2021

Mind giving this another look? The tests failed, and I decided to go the other route of bypassing the second circuit breaker check (while still reporting success to the breaker if things go smoothly). This should keep the behavioral change in this diff very small: it fixes exactly the bug, and no more. It was also forced on me by the tests, which asserted both that the queue doesn't even get started with the breaker open and that the second use of the breaker reports success. Surprisingly comprehensive, and we're better off for it imo.

erikgrinaker (Contributor) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @tbg)


pkg/kv/kvserver/raft_transport.go, line 620 at r1 (raw file):

		// checked the breaker. Checking it again can cause livelock, see:
		// https://github.com/cockroachdb/cockroach/issues/68419
		conn, err := t.dialer.DialNoBreaker(ctx, toNodeID, class)

This will bypass some internal health logging, which seems unfortunate. We may want to add some corresponding logging here.

// Enforce a minimum interval between warnings for failed connections.
if err != nil && ctx.Err() == nil && breaker != nil && breaker.ShouldLog() {
	log.Health.Infof(ctx, "unable to connect to n%d: %s", nodeID, err)
}


pkg/kv/kvserver/raft_transport.go, line 624 at r1 (raw file):

		breaker := t.dialer.GetCircuitBreaker(toNodeID, class)
		if err != nil {
			breaker.Fail(err)

We need to check ctx.Err() here, otherwise context cancellation will trip the breaker.
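
Roughly, the guard being suggested would extend the snippet above like this (a sketch of the suggestion, reusing the names already shown, not the final code):

if err != nil {
	// Don't let a canceled or expired caller context count against the
	// node: only genuine dial failures should trip the breaker.
	if ctx.Err() == nil {
		breaker.Fail(err)
	}
	return
}
breaker.Success()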

erikgrinaker (Contributor) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @tbg)


pkg/rpc/nodedialer/nodedialer.go, line 94 at r2 (raw file):

}

// DialNoBreaker ignores the breaker if there is an error dialing. This function

This comment should be updated to say it ignores the breaker state and always tries to dial, but reports the result.
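
Something along these lines might capture it (a sketch of the wording, not the final comment):

// DialNoBreaker is like Dial, but it does not check the breaker before
// dialing: the connection attempt is always made, and its outcome
// (success or failure) is reported back to the breaker.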

@tbg tbg force-pushed the fix-breaker branch 2 times, most recently from e0b6f54 to d5bf49a Compare August 26, 2021 12:33
@tbg tbg requested a review from erikgrinaker August 26, 2021 12:33
tbg (Member, Author) left a comment

Almost there...

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)


pkg/kv/kvserver/raft_transport.go, line 620 at r1 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

This will bypass some internal health logging, which seems unfortunate. We may want to add some corresponding logging here.

// Enforce a minimum interval between warnings for failed connections.
if err != nil && ctx.Err() == nil && breaker != nil && breaker.ShouldLog() {
	log.Health.Infof(ctx, "unable to connect to n%d: %s", nodeID, err)
}

I updated DialNoBreaker to notify the breaker instead. This makes everything work as expected. I needed to update DialNoBreaker to also break on resolver errors. The only other caller is the distSQL outbox, which doesn't care much about this, and #40691 didn't offer any explanation. I think it was just flat-out trying not to touch the breaker at all, which in hindsight was not ideal. Now we do what should be the right thing: notify the breaker just like Dial would, but avoid failing fast.
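
In other words, the flow is now roughly the following. This is a self-contained sketch with stand-in types (breaker, conn, and the function signature here are placeholders, not the real nodedialer API):

package main

import (
	"context"
	"fmt"
)

// Stand-ins for the real types: the actual nodedialer has its own breaker
// and connection types. This only illustrates the control flow described
// above.
type breaker struct{ tripped bool }

func (b *breaker) Fail(err error) { b.tripped = true }
func (b *breaker) Success()       { b.tripped = false }

type conn struct{ addr string }

// dialNoBreaker never consults the breaker before dialing (so it cannot
// fail fast), but it reports both resolver and dial outcomes to the
// breaker so that the breaker's state keeps tracking the node's health.
func dialNoBreaker(
	ctx context.Context,
	b *breaker,
	resolve func() (string, error),
	dial func(ctx context.Context, addr string) (*conn, error),
) (*conn, error) {
	addr, err := resolve()
	if err != nil {
		b.Fail(err) // resolver errors now feed the breaker as well
		return nil, err
	}
	c, err := dial(ctx, addr)
	if err != nil {
		b.Fail(err)
		return nil, err
	}
	b.Success()
	return c, nil
}

func main() {
	// Even with the breaker already open, the dial is attempted.
	b := &breaker{tripped: true}
	c, err := dialNoBreaker(context.Background(), b,
		func() (string, error) { return "n2:26257", nil },
		func(ctx context.Context, addr string) (*conn, error) {
			return &conn{addr: addr}, nil
		},
	)
	fmt.Printf("conn=%+v err=%v breaker tripped=%v\n", c, err, b.tripped)
}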


pkg/kv/kvserver/raft_transport.go, line 624 at r1 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

We need to check ctx.Err() here, otherwise context cancellation will trip the breaker.

Obsolete.


pkg/rpc/nodedialer/nodedialer.go, line 94 at r2 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

This comment should be updated to say it ignores the breaker state and always tries to dial, but reports the result.

Rewrote the comment.

erikgrinaker (Contributor) left a comment

:lgtm:

Maybe consider rewording the commit message, although I suppose it's accurate as is too.

Reviewed 1 of 2 files at r2, 2 of 2 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @tbg)

tbg (Member, Author) commented Aug 27, 2021

bors r=erikgrinaker
Thanks for the quick turnaround!

craig bot (Contributor) commented Aug 27, 2021

Build succeeded:

tbg (Member, Author) commented Sep 16, 2021

blathers backport 21.1

erikgrinaker (Contributor) commented

Can we get a backport to 20.2 as well? The original incident was with 20.2.13, so it applies.

tbg (Member, Author) commented Sep 17, 2021

blathers backport 20.2

blathers-crl bot commented Sep 17, 2021

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 4304289 to blathers/backport-release-20.2-69405: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 20.2 failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

tbg (Member, Author) commented Sep 17, 2021 via email
