[POC] kvserver: crash once and cordon replica in case of an apply fatal error #78092

Closed

Conversation

joshimhoff
Collaborator

#75944

kvserver: crash once and cordon replica in case of an apply fatal error

This commit is a POC of an approach to reducing blast radius of panics or
fatal errors that happen during application.

Without this commit, kvserver repeatedly restarts. Repeatedly restarting
means a cluster-wide impact on reliability. It also makes debugging hard,
as tools like the admin UI are affected by the restarts. We especially note
the impact on serverless. Repeatedly restarting means that an apply fatal error
results in an outage that affects multiple tenants, even if the range with the
apply fatal error is in a single tenant's keyspace (which is likely).

With this commit, kvserver crashes a single time and cordons the replica at
start up time. Cordoning means the raft scheduler doesn't schedule the replica.
Since cordoning happens post restart, and since a node can't resume its leases
post restart, the node will shed its lease of the cordoned range. As a result,
if only one replica of 3+ experiences an apply fatal error, the range will stay
available. If a quorum experiences an apply fatal error, on the other hand, the
range will become unavailable. Reads and writes to the range will fail fast,
rather than hang, to improve the user experience.

There has been some discussion about removing the replica in case a single
replica experiences an apply fatal error. This commit doesn't take that on.
Even without that, this commit is an improvement over the status quo.

This commit introduces an integration test that tests the behavior of a one
node CRDB cluster in the face of an apply fatal error. I have yet to test the
behavior of a multi-node cluster.

Release note: None.

@joshimhoff joshimhoff requested a review from a team as a code owner March 18, 2022 17:04

@tbg tbg left a comment (Member)

Looks good! I really like the general idea. I gave the test a "real review" since even though this is a POC, the test will be relatively independent of the final implementation.
For the rest of the code, I tried to keep it high level and I didn't inspect in detail some bits that I suspect we won't need anyway. The main point is that I think the destroyStatus might be the right place to store the cordoning state, as this is a pre-existing way to keep a Replica around but to essentially prevent it from getting used. Whenever we make a Replica we'd read the cordoned state, and if found, populate the destroyStatus before handing the replica to the caller. The right place to do so would be around here:

var err error
if r.mu.state, err = r.mu.stateLoader.Load(ctx, r.Engine(), desc); err != nil {
	return err
}

As the surrounding method is always called before a replica can be used.
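
For concreteness, a rough sketch of what that could look like at this spot, assuming a hypothetical loadCordonRecord helper and a new destroyReasonCordoned value (neither exists today):

var err error
if r.mu.state, err = r.mu.stateLoader.Load(ctx, r.Engine(), desc); err != nil {
	return err
}
// Sketch only: loadCordonRecord and destroyReasonCordoned are assumed names.
cordoned, err := loadCordonRecord(ctx, r.Engine(), desc.RangeID)
if err != nil {
	return err
}
if cordoned {
	// Mark the replica as unusable via the pre-existing destroyStatus machinery.
	r.mu.destroyStatus.Set(
		errors.Newf("r%d is cordoned after an apply fatal error", desc.RangeID),
		destroyReasonCordoned,
	)
}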

I know that you've talked to Erik a bunch, so it would be good to wait for his feedback as well before you act on my advice, as he has more context.


ctx := context.Background()

dir, cleanupFn := testutils.TempDir(t)
Member

If you use this Sticky stuff, you don't even need to touch disk (which makes this test a lot faster especially when stressing):

stickyEngineRegistry := server.NewStickyInMemEnginesRegistry()
defer stickyEngineRegistry.CloseAllStickyInMemEngines()
const numServers int = 3
stickyServerArgs := make(map[int]base.TestServerArgs)
for i := 0; i < numServers; i++ {
	stickyServerArgs[i] = base.TestServerArgs{
		CacheSize: 1 << 20, /* 1 MiB */
		StoreSpecs: []base.StoreSpec{
			{
				InMemory:               true,
				StickyInMemoryEngineID: strconv.FormatInt(int64(i), 10),

Collaborator Author

Oooh cool!

for _, ru := range args.Req.Requests {
	key := ru.GetInner().Header().Key
	// The time-series range is continuously written to.
	if bytes.HasPrefix(key, keys.TimeseriesPrefix) {
Member

Explicitly trigger the problem from a test by special-casing a request here. For example,

if bytes.HasSuffix(key, []byte("boom")) { ... }

then you should be able to inject panics on any range you like by doing a put on append(desc.StartKey.AsRawKey(), "boom"...).
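
For example, filling in the suggestion as a minimal sketch (the "boom" suffix and the injected panic are test-only scaffolding assumed here, not part of the PR):

for _, ru := range args.Req.Requests {
	key := ru.GetInner().Header().Key
	if bytes.HasSuffix(key, []byte("boom")) {
		// Trigger the apply-time failure path under test.
		panic("injected apply-time panic for testing")
	}
}

The test can then target any range it likes by writing to append(desc.StartKey.AsRawKey(), "boom"...).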

Collaborator Author

Will do!

// instead? It allows follower reads when possible. See below:
// https://github.com/cockroachdb/cockroach/pull/76858
if r.store.cordoned[_forStacks] {
	return nil, roachpb.NewError(errors.Newf("range %v is cordoned", _forStacks))
Member

The final code will probably want to call r.replicaUnavailableError(...) here to make a nice error.

Collaborator Author

Nice. Should we be using your circuit breaker here? See comment up above about rationale. TLDR: I guess even with two replicas cordoned, we could still serve follower reads.

@@ -168,6 +168,7 @@ type raftScheduler struct {
processor raftProcessor
latency *metric.Histogram
numWorkers int
cordoned map[roachpb.RangeID]bool
Member

This might be better as a new state for replica.mu.destroyStatus:

const (
	// The replica is alive.
	destroyReasonAlive DestroyReason = iota
	// The replica has been GCed or is in the process of being synchronously
	// removed.
	destroyReasonRemoved
	// The replica has been merged into its left-hand neighbor, but its left-hand
	// neighbor hasn't yet subsumed it.
	destroyReasonMergePending
)
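
For illustration, the new state could be a fourth value here (destroyReasonCordoned is a hypothetical name, not an existing constant):

const (
	// The replica is alive.
	destroyReasonAlive DestroyReason = iota
	// The replica has been GCed or is in the process of being synchronously
	// removed.
	destroyReasonRemoved
	// The replica has been merged into its left-hand neighbor, but its left-hand
	// neighbor hasn't yet subsumed it.
	destroyReasonMergePending
	// Hypothetical: the replica has been cordoned after an apply fatal error
	// and must not be scheduled or serve traffic until uncordoned.
	destroyReasonCordoned
)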

Contributor

I agree, this state should be on the replica. Don't know all the implications of destroyStatus off the bat.

Collaborator Author

I'll check this out!

// per-replica proposal quota hold back writes to the range? That is not
// desirable! Some relevant discussion:
// https://github.com/cockroachdb/cockroach/issues/77251
if s.cordoned[id] {
Member

I think the destroyStatus is a good way to streamline this, as "destroyed" replicas (for a suitable definition, which will need to be expanded) are never handled by raft:

func (r *Replica) withRaftGroupLocked(
	mayCampaignOnWake bool, f func(r *raft.RawNode) (unquiesceAndWakeLeader bool, _ error),
) error {
	if r.mu.destroyStatus.Removed() {
		// Callers know to detect errRemoved as non-fatal.
		return errRemoved
	}

Instead of Removed() we'd want a method Usable() error or something, which would either return nil, errRemoved, or errCordoned.
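
A rough sketch of what that accessor might look like, assuming the hypothetical destroyReasonCordoned value above and an errCordoned sentinel (neither exists today):

// Usable returns nil if the replica can be used, and otherwise an error
// explaining why it can't be. Sketch only, not existing code.
func (s destroyStatus) Usable() error {
	switch s.reason {
	case destroyReasonAlive:
		return nil
	case destroyReasonCordoned:
		// Hypothetical sentinel, analogous to errRemoved.
		return errCordoned
	default:
		return errRemoved
	}
}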

var err error
s.cordoned, err = replicasToCordon(ctx, eng)
if err != nil {
	// TODO(josh): Is it okay to fatal here in case the read to storage fails?
Member

I think in the final code the cordoning will be decided on a per-replica basis when it is instantiated, so you wouldn't need this code.
If we needed this code, we'd want proper error handling.

 s.draining.Store(false)
-s.scheduler = newRaftScheduler(cfg.AmbientCtx, s.metrics, s, storeSchedulerConcurrency)
+s.scheduler = newRaftScheduler(cfg.AmbientCtx, s.metrics, s, storeSchedulerConcurrency, s.cordoned)
Member

With destroyStatus in place, the raft scheduler should not need to know about a list of cordoned ranges in the end.

@@ -503,6 +505,18 @@ func (s *Store) processReady(rangeID roachpb.RangeID) {
}

ctx := r.raftCtx
// If we panic (or hit a fatal error, since maybeFatalOnRaftReadyErr converts
// these into panics), we mark the replica for cordoning and crash.
Member

This is actually a really good approach. When you recover from panics, you always have to worry about what state you're leaving the system in. Since raft processing may also acquire store mutexes and change store data structures (splits/merges), there could be cases in which we recover the panic and cordon the replica, but damage has really been done store-wide. Crashing once elegantly avoids this problem.
I think this is worth capturing in a prominent comment somewhere in the final version, as it highlights how avoiding the crash entirely would be a bigger undertaking.

Collaborator Author

💯

}

func (s *Store) markReplicaForCordoningIfPanic(ctx context.Context, rangeID roachpb.RangeID) {
	if re := recover(); re != nil {
Member

nit: it's nice to use early-returns to avoid nested ifs, so you could

r := recover()
if r == nil {
	return
}
// rest of code
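
Put together, a sketch of that shape for this helper (the log message and the re-panic placement are illustrative assumptions, not part of the PR):

func (s *Store) markReplicaForCordoningIfPanic(ctx context.Context, rangeID roachpb.RangeID) {
	re := recover()
	if re == nil {
		return
	}
	log.Errorf(ctx, "marking r%d for cordoning after apply panic: %v", rangeID, re)
	// ... persist the cordon record for rangeID (as elsewhere in this PR) ...
	// Re-panic so the process still crashes once; the replica is cordoned on
	// the next start.
	panic(re)
}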

StoreSpecs: []base.StoreSpec{{Path: dir + "/store-1"}},
Knobs: base.TestingKnobs{
	Store: &kvserver.StoreTestingKnobs{
		DontPanicOnApplyPanicOrFatalError: dontPanicOnApplyPanicOrFatalError,
Member

The need for this awkward testing knob would disappear once we trigger the panic using a special request as suggested above.

@erikgrinaker erikgrinaker left a comment (Contributor)

Just dumping a quick initial review, since I have meetings the rest of the day. Overall, this makes sense, and I agree with tbg's comments. Two things:

  1. There needs to be a mechanism to uncordon the range. The simplest would be a new cockroach CLI subcommand that deletes the cordon key.

  2. We need to ensure the replica is properly isolated. What happens if we send a read/write request to it? Will it get scheduled in any queues? Etc.


// TODO(josh): For the POC, we only allow cordoning a single range at a time.
c := cordonRecord{RangeID: rangeID}
// TODO(josh): Use proto instead of yaml.
Contributor

We should put the range ID into the key as a suffix. In that case I don't think we need a serialized value at all, nil should do just fine.
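
For concreteness, a sketch of that keying scheme (keys.StoreCordonedReplicaKey is an assumed helper that appends the range ID to a store-local prefix; it doesn't exist today):

// Sketch only: one store-local key per cordoned range, with an empty value.
if err := s.engine.PutUnversioned(keys.StoreCordonedReplicaKey(rangeID), nil); err != nil {
	log.Errorf(ctx, "couldn't write cordon key for r%d: %v", rangeID, err)
	return
}
// At startup, cordoned ranges would be discovered by scanning the key prefix.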

Collaborator Author

If we do it the way you suggest, then wouldn't we need to GC a cordonRecord in case a range is removed? If we have a single cordonRecord per store, maybe we can avoid this complexity? Not sure how desirable that is, but I think it's worth mentioning at least!

Also, if we stick with how I have this now, we can expand that cordonRecord structure to enable N=5 cordons per store max, by making the structure a list of range IDs.

log.Errorf(ctx, "couldn't marshall cordon record: %v", err)
return
}
// TODO(josh): I don't think we need an fsync here? The process will restart,
Contributor

I agree, an fsync seems unnecessary here.

Collaborator Author

💯

Is there an easy way to tell storage to NOT fsync? I can also look around myself.
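
One option might be to write through a batch and skip the sync on commit; a sketch, assuming the storage Batch API's sync flag is the right lever here:

// Sketch only: write the cordon record via a batch and commit without
// requesting a sync, so we don't wait on an fsync before crashing.
b := s.engine.NewBatch()
defer b.Close()
if err := b.PutUnversioned(keys.StoreCordonRangeKey(), mc); err != nil {
	log.Errorf(ctx, "couldn't write cordon record: %v", err)
	return
}
if err := b.Commit(false /* sync */); err != nil {
	log.Errorf(ctx, "couldn't commit cordon record: %v", err)
}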

// discussion about the possibility of a read being served after a panic /
// error but before the process restarts. This code increases that time
// window, but perhaps we can skip the fsync to reduce it again.
if err := s.engine.PutUnversioned(keys.StoreCordonRangeKey(), mc); err != nil {
Contributor

I'm a bit worried about this hanging if there is a disk-related issue that caused the apply failure. Since we don't use contexts for disk IO, we should likely spawn off a goroutine with a small timeout and continue to crash if it doesn't make it in time.
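
A minimal sketch of that suggestion, assuming a one-second budget (the exact timeout and log messages are placeholders):

// Sketch only: attempt the cordon write in a goroutine and give up after a
// short timeout so a wedged disk can't block the crash.
done := make(chan error, 1)
go func() {
	done <- s.engine.PutUnversioned(keys.StoreCordonRangeKey(), mc)
}()
select {
case err := <-done:
	if err != nil {
		log.Errorf(ctx, "couldn't write cordon record: %v", err)
	}
case <-time.After(1 * time.Second):
	log.Errorf(ctx, "timed out writing cordon record; crashing without it")
}
// Either way, the caller re-panics so the process still crashes once.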

Collaborator Author

Ooo good point! Ack re: context.

@joshimhoff
Collaborator Author

joshimhoff commented Mar 21, 2022

Thanks for the feedback, @erikgrinaker and @tbg, and thanks for all the help, @erikgrinaker! I'm happy the high level approach makes sense.

@joshimhoff
Collaborator Author

joshimhoff commented Mar 21, 2022

There needs to be a mechanism to uncordon the range. The simplest would be a new cockroach CLI subcommand that deletes the cordon key.

Ack! I'd like it if the CLI tool didn't require exclusive access to the store itself, as this is a very annoying requirement in certain operational contexts (k8s, and so CC). So maybe it needs to send an RPC. Another idea is a cluster setting.

We need to ensure the replica is properly isolated. What happens if we send a read/write request to it? Will it get scheduled in any queues?

Re: read & writes, I think the changes to relica_send.go should be sufficient? The integration test tests this too.

Re: queues, I'll poke at that. I have no concrete mental model there ATM.

@joshimhoff joshimhoff closed this Mar 7, 2023