[POC] kvserver: crash once and cordon replica in case of an apply fatal error #78092

Closed

Conversation

joshimhoff
Collaborator

#75944

kvserver: crash once and cordon replica in case of an apply fatal error

This commit is a POC of an approach to reducing blast radius of panics or
fatal errors that happen during application.

Without this commit, kvserver repeatedly restarts. Repeatedly restarting
means a cluster-wide impact on reliability. It also makes debugging hard,
as tools like the admin UI are affected by the restarts. We especially note
the impact on serverless. Repeatedly restarting means that an apply fatal error
results in an outage that affects multiple tenants, even if the range with the
apply fatal error is in a single tenant's keyspace (which is likely).

With this commit, kvserver crashes a single time and cordons the replica at
start up time. Cordoning means the raft scheduler doesn't schedule the replica.
Since cordoning happens post restart, and since a node can't resume its leases
post restart, the node will shed its lease of the cordoned range. As a result,
if only one replica of 3+ experiences an apply fatal error, the range will stay
available. If a quorum experiences an apply fatal error, on the other hand, the
range will become unavailable. Reads and writes to the range will fail fast,
rather than hang, to improve the user experience.

There has been some discussion about removing the replica in case a single
replica experiences an apply fatal error. This commit doesn't take that on.
Even without that, this commit is an improvement over the status quo.

This commit introduces an integration test that tests the behavior of a one
node CRDB cluster in the face of an apply fatal error. I have yet to test the
behavior of a multi-node cluster.

Release note: None.

@joshimhoff joshimhoff requested a review from a team as a code owner March 18, 2022 17:04

@tbg tbg left a comment (Member)

Looks good! I really like the general idea. I gave the test a "real review" since even though this is a POC, the test will be relatively independent of the final implementation.
For the rest of the code, I tried to keep it high level and I didn't inspect in detail some bits that I suspect we won't need anyway. The main point is that I think the destroyStatus might be the right place to store the cordoning state, as this is a pre-existing way to keep a Replica around but to essentially prevent it from getting used. Whenever we make a Replica we'd read the cordoned state, and if found, populate the destroyStatus before handing the replica to the caller. The right place to do so would be around here:

var err error
if r.mu.state, err = r.mu.stateLoader.Load(ctx, r.Engine(), desc); err != nil {
	return err
}

As the surrounding method is always called before a replica can be used.
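
For concreteness, a rough sketch of what that could look like at this spot, assuming a hypothetical loadCordonRecord helper and a new destroyReasonCordoned value (neither exists today):

var err error
if r.mu.state, err = r.mu.stateLoader.Load(ctx, r.Engine(), desc); err != nil {
	return err
}
// Sketch only: loadCordonRecord and destroyReasonCordoned are assumed names.
cordoned, err := loadCordonRecord(ctx, r.Engine(), desc.RangeID)
if err != nil {
	return err
}
if cordoned {
	// Mark the replica as unusable via the pre-existing destroyStatus machinery.
	r.mu.destroyStatus.Set(
		errors.Newf("r%d is cordoned after an apply fatal error", desc.RangeID),
		destroyReasonCordoned,
	)
}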

I know that you've talked to Erik a bunch, so it would be good to wait for his feedback as well before you act on my advice, as he has more context.


ctx := context.Background()

dir, cleanupFn := testutils.TempDir(t)
Member

If you use this Sticky stuff, you don't even need to touch disk (which makes this test a lot faster especially when stressing):

stickyEngineRegistry := server.NewStickyInMemEnginesRegistry()
defer stickyEngineRegistry.CloseAllStickyInMemEngines()
const numServers int = 3
stickyServerArgs := make(map[int]base.TestServerArgs)
for i := 0; i < numServers; i++ {
	stickyServerArgs[i] = base.TestServerArgs{
		CacheSize: 1 << 20, /* 1 MiB */
		StoreSpecs: []base.StoreSpec{
			{
				InMemory:               true,
				StickyInMemoryEngineID: strconv.FormatInt(int64(i), 10),

Collaborator Author

Oooh cool!

for _, ru := range args.Req.Requests {
	key := ru.GetInner().Header().Key
	// The time-series range is continuously written to.
	if bytes.HasPrefix(key, keys.TimeseriesPrefix) {
Member

Explicitly trigger the problem from a test by special-casing a request here. For example,

if bytes.HasSuffix(key, []byte("boom")) { ... }

then you should be able to inject panics on any range you like by doing a put on append(desc.StartKey.AsRawKey(), "boom"...).
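
For example, filling in the suggestion as a minimal sketch (the "boom" suffix and the injected panic are test-only scaffolding assumed here, not part of the PR):

for _, ru := range args.Req.Requests {
	key := ru.GetInner().Header().Key
	if bytes.HasSuffix(key, []byte("boom")) {
		// Trigger the apply-time failure path under test.
		panic("injected apply-time panic for testing")
	}
}

The test can then target any range it likes by writing to append(desc.StartKey.AsRawKey(), "boom"...).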

Collaborator Author

Will do!

// instead? It allows follower reads when possible. See below:
// https://github.com/cockroachdb/cockroach/pull/76858
if r.store.cordoned[_forStacks] {
	return nil, roachpb.NewError(errors.Newf("range %v is cordoned", _forStacks))
Member

The final code will probably want to call r.replicaUnavailableError(...) here to make a nice error.

Collaborator Author

Nice. Should we be using your circuit breaker here? See comment up above about rationale. TLDR: I guess even with two replicas cordoned, we could still serve follower reads.

@@ -168,6 +168,7 @@ type raftScheduler struct {
processor raftProcessor
latency *metric.Histogram
numWorkers int
cordoned map[roachpb.RangeID]bool
Member

This might be better as a new state for replica.mu.destroyStatus:

const (
	// The replica is alive.
	destroyReasonAlive DestroyReason = iota
	// The replica has been GCed or is in the process of being synchronously
	// removed.
	destroyReasonRemoved
	// The replica has been merged into its left-hand neighbor, but its left-hand
	// neighbor hasn't yet subsumed it.
	destroyReasonMergePending
)
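
For illustration, the new state could be a fourth value here (destroyReasonCordoned is a hypothetical name, not an existing constant):

const (
	// The replica is alive.
	destroyReasonAlive DestroyReason = iota
	// The replica has been GCed or is in the process of being synchronously
	// removed.
	destroyReasonRemoved
	// The replica has been merged into its left-hand neighbor, but its left-hand
	// neighbor hasn't yet subsumed it.
	destroyReasonMergePending
	// Hypothetical: the replica has been cordoned after an apply fatal error
	// and must not be scheduled or serve traffic until uncordoned.
	destroyReasonCordoned
)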

Contributor

I agree, this state should be on the replica. Don't know all the implications of destroyStatus off the bat.

Collaborator Author

I'll check this out!

// per-replica proposal quota hold back writes to the range? That is not
// desirable! Some relevant discussion:
// https://github.com/cockroachdb/cockroach/issues/77251
if s.cordoned[id] {
Member

I think the destroyStatus is a good way to streamline this, as "destroyed" replicas (for a suitable definition, which will need to be expanded) are never handled by raft:

func (r *Replica) withRaftGroupLocked(
	mayCampaignOnWake bool, f func(r *raft.RawNode) (unquiesceAndWakeLeader bool, _ error),
) error {
	if r.mu.destroyStatus.Removed() {
		// Callers know to detect errRemoved as non-fatal.
		return errRemoved
	}

Instead of Removed() we'd want a method Usable() error or something, which would either return nil, errRemoved, or errCordoned.
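
A rough sketch of what that accessor might look like, assuming the hypothetical destroyReasonCordoned value above and an errCordoned sentinel (neither exists today):

// Usable returns nil if the replica can be used, and otherwise an error
// explaining why it can't be. Sketch only, not existing code.
func (s destroyStatus) Usable() error {
	switch s.reason {
	case destroyReasonAlive:
		return nil
	case destroyReasonCordoned:
		// Hypothetical sentinel, analogous to errRemoved.
		return errCordoned
	default:
		return errRemoved
	}
}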

var err error
s.cordoned, err = replicasToCordon(ctx, eng)
if err != nil {
	// TODO(josh): Is it okay to fatal here in case the read to storage fails?
Member

I think in the final code the cordoning will be decided on a per-replica basis when it is instantiated, so you wouldn't need this code.
If we needed this code, we'd want proper error handling.

 s.draining.Store(false)
-s.scheduler = newRaftScheduler(cfg.AmbientCtx, s.metrics, s, storeSchedulerConcurrency)
+s.scheduler = newRaftScheduler(cfg.AmbientCtx, s.metrics, s, storeSchedulerConcurrency, s.cordoned)
Member

With destroyStatus in place, the raft scheduler should not need to know about a list of cordoned ranges in the end.

@@ -503,6 +505,18 @@ func (s *Store) processReady(rangeID roachpb.RangeID) {
}

ctx := r.raftCtx
// If we panic (or hit a fatal error, since maybeFatalOnRaftReadyErr converts
// these into panics), we mark the replica for cordoning and crash.
Member

This is actually a really good approach. When you recover from panics, you always have to worry about what state you're leaving the system in. Since raft processing may also acquire store mutexes and change store data structures (splits/merges), there could be cases in which we recover the panic and cordon the replica, but damage has really been done store-wide. Crashing once elegantly avoids this problem.
I think this is worth capturing in a prominent comment somewhere in the final version, as it highlights how avoiding the crash entirely would be a bigger undertaking.

Collaborator Author

💯

}

func (s *Store) markReplicaForCordoningIfPanic(ctx context.Context, rangeID roachpb.RangeID) {
	if re := recover(); re != nil {
Member

nit: it's nice to use early-returns to avoid nested ifs, so you could

r := recover()
if r == nil {
	return
}
// rest of code
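
Put together, a sketch of that shape for this helper (the log message and the re-panic placement are illustrative assumptions, not part of the PR):

func (s *Store) markReplicaForCordoningIfPanic(ctx context.Context, rangeID roachpb.RangeID) {
	re := recover()
	if re == nil {
		return
	}
	log.Errorf(ctx, "marking r%d for cordoning after apply panic: %v", rangeID, re)
	// ... persist the cordon record for rangeID (as elsewhere in this PR) ...
	// Re-panic so the process still crashes once; the replica is cordoned on
	// the next start.
	panic(re)
}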

StoreSpecs: []base.StoreSpec{{Path: dir + "/store-1"}},
Knobs: base.TestingKnobs{
	Store: &kvserver.StoreTestingKnobs{
		DontPanicOnApplyPanicOrFatalError: dontPanicOnApplyPanicOrFatalError,
Member

The need for this awkward testing knob would disappear once we trigger the panic using a special request as suggested above.

@erikgrinaker erikgrinaker left a comment (Contributor)

Just dumping a quick initial review, since I have meetings the rest of the day. Overall, this makes sense, and I agree with tbg's comments. Two things:

  1. There needs to be a mechanism to uncordon the range. The simplest would be a new cockroach CLI subcommand that deletes the cordon key.

  2. We need to ensure the replica is properly isolated. What happens if we send a read/write request to it? Will it get scheduled in any queues? Etc.


// TODO(josh): For the POC, we only allow cordoning a single range at a time.
c := cordonRecord{RangeID: rangeID}
// TODO(josh): Use proto instead of yaml.
Contributor

We should put the range ID into the key as a suffix. In that case I don't think we need a serialized value at all, nil should do just fine.
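
For concreteness, a sketch of that keying scheme (keys.StoreCordonedReplicaKey is an assumed helper that appends the range ID to a store-local prefix; it doesn't exist today):

// Sketch only: one store-local key per cordoned range, with an empty value.
if err := s.engine.PutUnversioned(keys.StoreCordonedReplicaKey(rangeID), nil); err != nil {
	log.Errorf(ctx, "couldn't write cordon key for r%d: %v", rangeID, err)
	return
}
// At startup, cordoned ranges would be discovered by scanning the key prefix.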

Collaborator Author

If we do it the way you suggest, then wouldn't we need to GC a cordonRecord in case a range is removed? If we have a single cordonRecord per store, maybe we can avoid this complexity? Not sure how desirable that is, but I think it's worth mentioning at least!

Also, if we stick with how I have this now, we can expand that cordonRecord structure to enable N=5 cordons per store max, by making the structure a list of range IDs.

log.Errorf(ctx, "couldn't marshall cordon record: %v", err)
return
}
// TODO(josh): I don't think we need an fsync here? The process will restart,
Contributor

I agree, an fsync seems unnecessary here.

Collaborator Author

💯

Is there an easy way to tell storage to NOT fsync? I can also look around myself.
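
One option might be to write through a batch and skip the sync on commit; a sketch, assuming the storage Batch API's sync flag is the right lever here:

// Sketch only: write the cordon record via a batch and commit without
// requesting a sync, so we don't wait on an fsync before crashing.
b := s.engine.NewBatch()
defer b.Close()
if err := b.PutUnversioned(keys.StoreCordonRangeKey(), mc); err != nil {
	log.Errorf(ctx, "couldn't write cordon record: %v", err)
	return
}
if err := b.Commit(false /* sync */); err != nil {
	log.Errorf(ctx, "couldn't commit cordon record: %v", err)
}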

// discussion about the possibility of a read being served after a panic /
// error but before the process restarts. This code increases that time
// window, but perhaps we can skip the fsync to reduce it again.
if err := s.engine.PutUnversioned(keys.StoreCordonRangeKey(), mc); err != nil {
Contributor

I'm a bit worried about this hanging if there is a disk-related issue that caused the apply failure. Since we don't use contexts for disk IO, we should likely spawn off a goroutine with a small timeout and continue to crash if it doesn't make it in time.
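
A minimal sketch of that suggestion, assuming a one-second budget (the exact timeout and log messages are placeholders):

// Sketch only: attempt the cordon write in a goroutine and give up after a
// short timeout so a wedged disk can't block the crash.
done := make(chan error, 1)
go func() {
	done <- s.engine.PutUnversioned(keys.StoreCordonRangeKey(), mc)
}()
select {
case err := <-done:
	if err != nil {
		log.Errorf(ctx, "couldn't write cordon record: %v", err)
	}
case <-time.After(1 * time.Second):
	log.Errorf(ctx, "timed out writing cordon record; crashing without it")
}
// Either way, the caller re-panics so the process still crashes once.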

Collaborator Author

Ooo good point! Ack re: context.

@joshimhoff
Collaborator Author

joshimhoff commented Mar 21, 2022

Thanks for the feedback, @erikgrinaker and @tbg, and thanks for all the help, @erikgrinaker! I'm happy the high level approach makes sense.

@joshimhoff
Collaborator Author

joshimhoff commented Mar 21, 2022

There needs to be a mechanism to uncordon the range. The simplest would be a new cockroach CLI subcommand that deletes the cordon key.

Ack! I'd like it if the CLI tool didn't require exclusive access to the store itself, as this is a very annoying requirement in certain operational contexts (k8s, and so CC). So maybe it needs to send an RPC. Another idea is a cluster setting.

We need to ensure the replica is properly isolated. What happens if we send a read/write request to it? Will it get scheduled in any queues?

Re: read & writes, I think the changes to relica_send.go should be sufficient? The integration test tests this too.

Re: queues, I'll poke at that. I have no concrete mental model there ATM.

@joshimhoff joshimhoff closed this Mar 7, 2023