-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[POC] kvserver: crash once and cordon replica in case of an apply fatal error #78092
Conversation
This commit is a POC of an approach to reducing blast radius of panics or fatal errors that happen during application. Without this commit, kvserver repeatedly restarts. Repeateldy restarting means a cluster wide impact on reliability. It also makes debugging hard, as tools like the admin UI are affected by the restarts. We especially note the impact on serverless. Repeatedly restarting means that an apply fatal error results in an outage that affects multiple tenants, even if the range with the apply fatal error is in a single tenant's keyspace (which is likely). With this commit, kvserver crashes a single time and cordons the replica at start up time. Cordoning means the raft scheduler doesn't schedule the replica. Since cordoning happens post restart, and since a node can't resume its leases post restart, the node will shed its lease of the cordoned range. As a result, if only one replica of 3+ experiences an apply fatal error, the range will stay available. If a quorum experience an apply fatal error on the other hand, the range will become unavailable. Reads and writes to the range will fail fast, rather than hang, to improve the user experience. There has been some discussion about removing the replica in case a single replica experiences an apply fatal error. This commit doesn't chew that off. Even without that, this commit is an improvement over the status quo. This commit introduces an integration test that tests the behavior of a one node CRDB cluster in face of an apply fatal error. I have yet to test the behavor of a multi-node cluster. Release note: None. Release note (<category, see below>): <what> <show> <why>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! I really like the general idea. I gave the test a "real review" since even though this is a POC, the test will be relatively independent of the final implementation.
For the rest of the code, I tried to keep it high level and I didn't inspect in detail some bits that I suspect we won't need anyway. The main point is that I think the destroyStatus
might be the right place to store the cordoning state, as this is a pre-existing way to keep a Replica around but to essentially prevent it from getting used. Whenever we make a Replica
we'd read the cordoned state, and if found, populate the destroyStatus
before handing the replica to the caller. The right place to do so would be around here:
cockroach/pkg/kv/kvserver/replica_init.go
Lines 176 to 179 in 625ee8b
var err error | |
if r.mu.state, err = r.mu.stateLoader.Load(ctx, r.Engine(), desc); err != nil { | |
return err | |
} |
As the surrounding method is always called before a replica can be used.
I know that you've talked to Erik a bunch, so it would be good to wait for his feedback as well before you act on my advice, as he has more context.
|
||
ctx := context.Background() | ||
|
||
dir, cleanupFn := testutils.TempDir(t) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you use this Sticky stuff, you don't even need to touch disk (which makes this test a lot faster especially when stressing):
cockroach/pkg/kv/kvserver/client_metrics_test.go
Lines 254 to 264 in 208e2b4
stickyEngineRegistry := server.NewStickyInMemEnginesRegistry() | |
defer stickyEngineRegistry.CloseAllStickyInMemEngines() | |
const numServers int = 3 | |
stickyServerArgs := make(map[int]base.TestServerArgs) | |
for i := 0; i < numServers; i++ { | |
stickyServerArgs[i] = base.TestServerArgs{ | |
CacheSize: 1 << 20, /* 1 MiB */ | |
StoreSpecs: []base.StoreSpec{ | |
{ | |
InMemory: true, | |
StickyInMemoryEngineID: strconv.FormatInt(int64(i), 10), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oooh cool!
for _, ru := range args.Req.Requests { | ||
key := ru.GetInner().Header().Key | ||
// The time-series range is continuously written to. | ||
if bytes.HasPrefix(key, keys.TimeseriesPrefix) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Explicitly trigger the problem from a test by special-casing a request here. For example,
if bytes.HasSuffix(key, "boom") { ... }
then you should be able to inject panics on any range you like by doing a put on append(desc.StartKey.AsRawKey(), "boom"...)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do!
// instead? It allow follower reads when possible. See below: | ||
// https://github.com/cockroachdb/cockroach/pull/76858 | ||
if r.store.cordoned[_forStacks] { | ||
return nil, roachpb.NewError(errors.Newf("range %v is cordoned", _forStacks)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The final code will probably want to call r.replicaUnavailableError(...)
here to make a nice error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice. Should we be using your circuit breaker here? See comment up above about rationale. TLDR: I guess even with two replicas cordoned, we could still serve follower reads.
@@ -168,6 +168,7 @@ type raftScheduler struct { | |||
processor raftProcessor | |||
latency *metric.Histogram | |||
numWorkers int | |||
cordoned map[roachpb.RangeID]bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might be better as a new state for replica.mu.destroyStatus
:
cockroach/pkg/kv/kvserver/replica_destroy.go
Lines 31 to 40 in 84e3fdf
const ( | |
// The replica is alive. | |
destroyReasonAlive DestroyReason = iota | |
// The replica has been GCed or is in the process of being synchronously | |
// removed. | |
destroyReasonRemoved | |
// The replica has been merged into its left-hand neighbor, but its left-hand | |
// neighbor hasn't yet subsumed it. | |
destroyReasonMergePending | |
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, this state should be on the replica. Don't know all the implications of destroyStatus
off the bat.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll check this out!
// per-replica proposal quota hold back writes to the range? That is not | ||
// desirable! Some relevant discussion: | ||
// https://github.com/cockroachdb/cockroach/issues/77251 | ||
if s.cordoned[id] { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the destroyStatus is a good way to streamline this, as "destroyed" (for a suitable definition, which will need to be expanded) are never raft handled:
cockroach/pkg/kv/kvserver/replica_raft.go
Lines 1508 to 1514 in 33fe9d1
func (r *Replica) withRaftGroupLocked( | |
mayCampaignOnWake bool, f func(r *raft.RawNode) (unquiesceAndWakeLeader bool, _ error), | |
) error { | |
if r.mu.destroyStatus.Removed() { | |
// Callers know to detect errRemoved as non-fatal. | |
return errRemoved | |
} |
Instead of Removed()
we'd want a method Usable() error
or something, which would either return nil
, errRemoved
, or errCordoned
.
var err error | ||
s.cordoned, err = replicasToCordon(ctx, eng) | ||
if err != nil { | ||
// TODO(josh): Is it okay to fatal here in case the read to storage fails? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think in the final code the cordoning will be decided on a per-replica basis when it is instantiated, so you wouldn't need this code.
If we needed this code, we'd want proper error handling.
s.draining.Store(false) | ||
s.scheduler = newRaftScheduler(cfg.AmbientCtx, s.metrics, s, storeSchedulerConcurrency) | ||
s.scheduler = newRaftScheduler(cfg.AmbientCtx, s.metrics, s, storeSchedulerConcurrency, s.cordoned) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With destroyStatus in place the raft sched should not need to know about a list of cordoned ranges in the end.
@@ -503,6 +505,18 @@ func (s *Store) processReady(rangeID roachpb.RangeID) { | |||
} | |||
|
|||
ctx := r.raftCtx | |||
// If panic (or fatal error since maybeFatalOnRaftReadyErr converts | |||
// these into panics), we mark the replica for cordoning and crash. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is actually a really good approach. When you recover from panics, you always have to worry about what state you're leaving the system in. Since raft processing may also acquire store mutexes and change store data structures (splits/merges), there could be cases in which we recover the panic and cordon the replica, but damage has really been done store-wide. Crashing once elegantly avoids this problem.
I think this is worth as a prominent comment somewhere in the final version as it highlights how avoiding the crash is a bigger undertaking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💯
} | ||
|
||
func (s *Store) markReplicaForCordoningIfPanic(ctx context.Context, rangeID roachpb.RangeID) { | ||
if re := recover(); re != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: it's nice to use early-returns to avoid nested ifs, so you could
r := recover()
if r == nil { return }
// rest of code
StoreSpecs: []base.StoreSpec{{Path: dir + "/store-1"}}, | ||
Knobs: base.TestingKnobs{ | ||
Store: &kvserver.StoreTestingKnobs{ | ||
DontPanicOnApplyPanicOrFatalError: dontPanicOnApplyPanicOrFatalError, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The need for this awkward testing knob would disappear once we trigger the panic using a special request as suggested above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just dumping a quick initial review, since I have meetings the rest of the day. Overall, this makes sense, and I agree with tbg's comments. Two things:
-
There needs to be a mechanism to uncordon the range. The simplest would be a new
cockroach
CLI subcommand that deletes the cordon key. -
We need to ensure the replica is properly isolated. What happens if we send a read/write request to it? Will it get scheduled in any queues? Etc.
|
||
// TODO(josh): For the POC, we only allow cordoning a single range at a time. | ||
c := cordonRecord{RangeID: rangeID} | ||
// TODO(josh): Use proto instead of yaml. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should put the range ID into the key as a suffix. In that case I don't think we need a serialized value at all, nil
should do just fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we do it the way you suggest, then wouldn't we need to GC a cordonRecord
in case a range is removed? If we have a single cordonRecord
per store, we maybe we can avoid this complexity? Not sure how desirable that is, but I think it's worth mentioning at least!
Also, if we stick with how I have this now, we can expand that cordonRecord
structure to enable N=5 cordons per store max, by making the structure a list of range IDs.
log.Errorf(ctx, "couldn't marshall cordon record: %v", err) | ||
return | ||
} | ||
// TODO(josh): I don't think we need an fsync here? The process will restart, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, an fsync seems unnecessary here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💯
Is there an easy way to tell storage to NOT fsync
? I can also look around myself.
// discussion about the possibility of a read being served after a panic / | ||
// error but before the process restarts. This code increases that time | ||
// window, but perhaps we can skip the fsync to reduce it again. | ||
if err := s.engine.PutUnversioned(keys.StoreCordonRangeKey(), mc); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit worried about this hanging if there is a disk-related issue that caused the apply failure. Since we don't use contexts for disk IO, we should likely spawn off a goroutine with a small timeout and continue to crash if it doesn't make it in time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ooo good point! Ack re: context.
@@ -168,6 +168,7 @@ type raftScheduler struct { | |||
processor raftProcessor | |||
latency *metric.Histogram | |||
numWorkers int | |||
cordoned map[roachpb.RangeID]bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, this state should be on the replica. Don't know all the implications of destroyStatus
off the bat.
Thanks for the feedback, @erikgrinaker and @tbg, and thanks for all the help, @erikgrinaker! I'm happy the high level approach makes sense. |
Ack! I'd like it if the CLI tool didn't require exclusive access to the store itself, as this is a very annoying requirement in certain operational contexts (k8s so CC). So maybe it needs to send an RPC. Another idea is a cluster setting.
Re: read & writes, I think the changes to Re: queues, I'll poke at that. I have no concrete mental model there ATM. |
#75944
kvserver: crash once and cordon replica in case of an apply fatal error
This commit is a POC of an approach to reducing blast radius of panics or
fatal errors that happen during application.
Without this commit, kvserver repeatedly restarts. Repeateldy restarting
means a cluster wide impact on reliability. It also makes debugging hard,
as tools like the admin UI are affected by the restarts. We especially note
the impact on serverless. Repeatedly restarting means that an apply fatal error
results in an outage that affects multiple tenants, even if the range with the
apply fatal error is in a single tenant's keyspace (which is likely).
With this commit, kvserver crashes a single time and cordons the replica at
start up time. Cordoning means the raft scheduler doesn't schedule the replica.
Since cordoning happens post restart, and since a node can't resume its leases
post restart, the node will shed its lease of the cordoned range. As a result,
if only one replica of 3+ experiences an apply fatal error, the range will stay
available. If a quorum experience an apply fatal error on the other hand, the
range will become unavailable. Reads and writes to the range will fail fast,
rather than hang, to improve the user experience.
There has been some discussion about removing the replica in case a single
replica experiences an apply fatal error. This commit doesn't chew that off.
Even without that, this commit is an improvement over the status quo.
This commit introduces an integration test that tests the behavior of a one
node CRDB cluster in face of an apply fatal error. I have yet to test the
behavor of a multi-node cluster.
Release note: None.