
kvserver: remove extraneous circuit breaker check in Raft transport #69405

Merged (1 commit, Aug 27, 2021)

Conversation

tbg (Member) commented Aug 26, 2021

See #68419. We now use
DialNoBreaker for the raft transport, taking into account the previous
Ready() check.

DialNoBreaker was previously bypassing the breaker as it ought to but
was also not reporting to the breaker the result of the operation;
this is not ideal and was caught by the tests. This commit changes
DialNoBreaker to report the result (i.e. fail or success).

Release justification: bug fix
Release note (bug fix): Previously, after a temporary node outage, other
nodes in the cluster could fail to connect to the restarted node due to
their circuit breakers not resetting. This would manifest in the logs
via messages "unable to dial nXX: breaker open", where XX is the ID
of the restarted node. (Note that such errors are expected for nodes
that are truly unreachable, and may still occur around the time of
the restart, but for no longer than a few seconds).

@tbg tbg requested a review from a team as a code owner August 26, 2021 08:27
cockroach-teamcity (Member) commented

This change is Reviewable

erikgrinaker (Contributor) left a comment

This is going to result in the queue filling up when a node is unavailable, and all of those thousands of queued messages getting sent when it comes back up. Do we need some additional TTL logic or queue draining or something to prevent this, or do we already have safeguards for this?

tbg (Member, Author) commented Aug 26, 2021

That's not quite what will happen I think. Because of the second check

conn, err := t.dialer.Dial(ctx, toNodeID, class)

the newly created queue will quickly be torn down again (or slowly, depending on whether the breaker lets us through and, if so, whether we fail fast, fail slow, or, who knows, succeed). This isn't ideal, but it seems better than the status quo. I'm currently looking into adding a control loop to the breakers (at least in our usage of them in nodedialer), but I doubt that's something we'll want to land now.

Also, the queue has a bounded size (at least in the number of messages), so there is some protection, though not the best one. Either way, the protection is no worse than for nodes which are live.
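
For reference, the bounded-queue protection mentioned here boils down to a non-blocking send into a fixed-capacity channel: if the per-node queue is full, the message is dropped and counted rather than buffered indefinitely. A minimal, self-contained sketch (the capacity and names are assumptions for illustration, not taken from the transport code):

package main

import (
	"fmt"
	"sync/atomic"
)

// Assumed queue capacity; the real transport uses a constant of a similar
// order of magnitude, but the exact value is not copied from the code.
const sendBufferSize = 10000

type RaftMessageRequest struct{ RangeID int64 }

// trySend enqueues without blocking: when the per-node queue is full, the
// message is dropped and accounted for, which bounds memory usage while
// the remote node is unreachable.
func trySend(ch chan *RaftMessageRequest, req *RaftMessageRequest, dropped *int64) bool {
	select {
	case ch <- req:
		return true
	default:
		atomic.AddInt64(dropped, 1)
		return false
	}
}

func main() {
	ch := make(chan *RaftMessageRequest, sendBufferSize)
	var dropped int64
	for i := 0; i < sendBufferSize+5; i++ {
		trySend(ch, &RaftMessageRequest{RangeID: int64(i)}, &dropped)
	}
	fmt.Printf("queued=%d dropped=%d\n", len(ch), atomic.LoadInt64(&dropped))
}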

erikgrinaker (Contributor) commented Aug 26, 2021

Right, I see that we actually drain the queue on dial failures, that's the sort of protection I was curious about.

cleanup := func(ch chan *RaftMessageRequest) {
	// Account for the remainder of `ch` which was never sent.
	// NB: we deleted the queue above, so within a short amount
	// of time nobody should be writing into the channel any
	// more. We might miss a message or two here, but that's
	// OK (there's nobody who can safely close the channel the
	// way the code is written).
	for {
		select {
		case <-ch:
			atomic.AddInt64(&stats.clientDropped, 1)
		default:
			return
		}
	}
}

Also, the queue has a bounded size (at least in the number of messages), so there is some protection, though not the best one. Either way, the protection is no worse than for nodes which are live.

I seem to recall the buffer size is something like 10-20k messages though. A live node will generally be able to keep up, and so the queue is generally small, but it's true that it would pile up if the remote node should struggle.

tbg (Member, Author) commented Aug 26, 2021

Mind giving this another look? The tests failed, and I decided to go the other route of bypassing the second circuit breaker check (while still reporting success to the breaker if things go smoothly). This should keep the behavioral change in this diff very small: it fixes exactly the bug, and no more. It was also forced on me by the tests, which asserted both that the queue doesn't even get started with the breaker open and that the second use of the breaker reports success. Surprisingly comprehensive, and we're better off for it imo.

erikgrinaker (Contributor) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @tbg)


pkg/kv/kvserver/raft_transport.go, line 620 at r1 (raw file):

		// checked the breaker. Checking it again can cause livelock, see:
		// https://github.com/cockroachdb/cockroach/issues/68419
		conn, err := t.dialer.DialNoBreaker(ctx, toNodeID, class)

This will bypass some internal health logging, which seems unfortunate. We may want to add some corresponding logging here.

// Enforce a minimum interval between warnings for failed connections.
if err != nil && ctx.Err() == nil && breaker != nil && breaker.ShouldLog() {
	log.Health.Infof(ctx, "unable to connect to n%d: %s", nodeID, err)
}


pkg/kv/kvserver/raft_transport.go, line 624 at r1 (raw file):

		breaker := t.dialer.GetCircuitBreaker(toNodeID, class)
		if err != nil {
			breaker.Fail(err)

We need to check ctx.Err() here, otherwise context cancellation will trip the breaker.
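
Roughly, the guard being suggested would extend the snippet above like this (a sketch of the suggestion, reusing the names already shown, not the final code):

if err != nil {
	// Don't let a canceled or expired caller context count against the
	// node: only genuine dial failures should trip the breaker.
	if ctx.Err() == nil {
		breaker.Fail(err)
	}
	return
}
breaker.Success()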

erikgrinaker (Contributor) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @tbg)


pkg/rpc/nodedialer/nodedialer.go, line 94 at r2 (raw file):

}

// DialNoBreaker ignores the breaker if there is an error dialing. This function

This comment should be updated to say it ignores the breaker state and always tries to dial, but reports the result.
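
Something along these lines might capture it (a sketch of the wording, not the final comment):

// DialNoBreaker is like Dial, but it does not check the breaker before
// dialing: the connection attempt is always made, and its outcome
// (success or failure) is reported back to the breaker.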

@tbg tbg force-pushed the fix-breaker branch 2 times, most recently from e0b6f54 to d5bf49a Compare August 26, 2021 12:33
@tbg tbg requested a review from erikgrinaker August 26, 2021 12:33
tbg (Member, Author) left a comment

Almost there...

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)


pkg/kv/kvserver/raft_transport.go, line 620 at r1 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

This will bypass some internal health logging, which seems unfortunate. We may want to add some corresponding logging here.

// Enforce a minimum interval between warnings for failed connections.
if err != nil && ctx.Err() == nil && breaker != nil && breaker.ShouldLog() {
	log.Health.Infof(ctx, "unable to connect to n%d: %s", nodeID, err)
}

I updated DialNoBreaker to notify the breaker instead. This makes everything work as expected. I needed to update DialNoBreaker to also break on resolver errors. The only other caller is the distSQL outbox, which doesn't care much about this, and #40691 didn't offer any explanation. I think it was just flat-out trying not to touch the breaker at all, which in hindsight was not ideal. Now we do what should be the right thing: notify the breaker just like Dial would, but avoid failing fast.
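
In other words, the flow is now roughly the following. This is a self-contained sketch with stand-in types (breaker, conn, and the function signature here are placeholders, not the real nodedialer API):

package main

import (
	"context"
	"fmt"
)

// Stand-ins for the real types: the actual nodedialer has its own breaker
// and connection types. This only illustrates the control flow described
// above.
type breaker struct{ tripped bool }

func (b *breaker) Fail(err error) { b.tripped = true }
func (b *breaker) Success()       { b.tripped = false }

type conn struct{ addr string }

// dialNoBreaker never consults the breaker before dialing (so it cannot
// fail fast), but it reports both resolver and dial outcomes to the
// breaker so that the breaker's state keeps tracking the node's health.
func dialNoBreaker(
	ctx context.Context,
	b *breaker,
	resolve func() (string, error),
	dial func(ctx context.Context, addr string) (*conn, error),
) (*conn, error) {
	addr, err := resolve()
	if err != nil {
		b.Fail(err) // resolver errors now feed the breaker as well
		return nil, err
	}
	c, err := dial(ctx, addr)
	if err != nil {
		b.Fail(err)
		return nil, err
	}
	b.Success()
	return c, nil
}

func main() {
	// Even with the breaker already open, the dial is attempted.
	b := &breaker{tripped: true}
	c, err := dialNoBreaker(context.Background(), b,
		func() (string, error) { return "n2:26257", nil },
		func(ctx context.Context, addr string) (*conn, error) {
			return &conn{addr: addr}, nil
		},
	)
	fmt.Printf("conn=%+v err=%v breaker tripped=%v\n", c, err, b.tripped)
}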


pkg/kv/kvserver/raft_transport.go, line 624 at r1 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

We need to check ctx.Err() here, otherwise context cancellation will trip the breaker.

Obsolete.


pkg/rpc/nodedialer/nodedialer.go, line 94 at r2 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

This comment should be updated to say it ignores the breaker state and always tries to dial, but reports the result.

Rewrote the comment.

erikgrinaker (Contributor) left a comment

:lgtm:

Maybe consider rewording the commit message, although I suppose it's accurate as is too.

Reviewed 1 of 2 files at r2, 2 of 2 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @tbg)

tbg (Member, Author) commented Aug 27, 2021

bors r=erikgrinaker
Thanks for the quick turnaround!

craig bot (Contributor) commented Aug 27, 2021

Build succeeded:

tbg (Member, Author) commented Sep 16, 2021

blathers backport 21.1

erikgrinaker (Contributor) commented

Can we get a backport to 20.2 as well? The original incident was with 20.2.13, so it applies.

tbg (Member, Author) commented Sep 17, 2021

blathers backport 20.2

blathers-crl bot commented Sep 17, 2021

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 4304289 to blathers/backport-release-20.2-69405: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 20.2 failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

tbg (Member, Author) commented Sep 17, 2021 via email
