
loqrecovery: tolerates descriptor changes that cause nodes to panic after recovery #91271

Closed
aliher1911 opened this issue Nov 4, 2022 · 4 comments · Fixed by #91388
Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.

Comments

@aliher1911
Contributor

aliher1911 commented Nov 4, 2022

Failures in the test #91016 were caused by nodes panicking immediately after restarting.

The cause of panic is:

I221031 08:35:30.595993 171 1@kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 49  the server is terminating due to a fatal error (see the DEV channel for details)
W221031 08:35:30.603964 222 kv/txn.go:705 ⋮ [intExec=‹protectedts-GetMetadata›] 51  failure aborting transaction: ‹node unavailable; try another peer›; abort caused by: ‹failed to send RPC: sending to all replicas failed; last error: [NotLeaseHolderError] can't determine lease status of (n1,s1):1 due to node liveness error: liveness record not found in cache; r31: replica (n4,s4):3 not lease holder; lease holder unknown›
W221031 08:35:30.604248 221 kv/txn.go:705 ⋮ [-] 52  failure aborting transaction: ‹node unavailable; try another peer›; abort caused by: failed to fetch protectedts metadata: failed to read metadata: ‹protectedts-GetMetadata›: context canceled
I221031 08:35:30.604665 249 kv/kvserver/liveness/liveness.go:767 ⋮ [n5,liveness-hb] 53  unable to get liveness record from KV: unable to get liveness: aborted during DistSender.Send: context canceled
E221031 08:35:30.604644 416 kv/kvserver/queue.go:1127 ⋮ [n5,replicaGC,s5,r38/15:‹/Table/3{7-8}›] 54  ‹failed to send RPC: sending to all replicas failed; last error: rpc error: code = Unavailable desc = error reading from server: read tcp 10.142.1.50:48606->10.142.0.155:26257: use of closed network connection›
F221031 08:35:30.596073 171 kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 50  attempted to change replica's ID from 15 to 4
F221031 08:35:30.596073 171 kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 50 !goroutine 171 [running]:
F221031 08:35:30.596073 171 kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 50 !github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0x1)
F221031 08:35:30.596073 171 kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 50 !  github.com/cockroachdb/cockroach/pkg/util/log/get_stacks.go:25 +0x89
F221031 08:35:30.596073 171 kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 50 !github.com/cockroachdb/cockroach/pkg/util/log.(*loggerT).outputLogEntry(0xc000ce71c0, {{{0xc0004b4e10, 0x24}, {0x527d379, 0x1}, {0x0, 0x0}, {0x0, 0x0}}, 0x17231a6a39d10e7d, ...})
F221031 08:35:30.596073 171 kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 50 !  github.com/cockroachdb/cockroach/pkg/util/log/clog.go:262 +0x97
F221031 08:35:30.596073 171 kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 50 !github.com/cockroachdb/cockroach/pkg/util/log.logfDepthInternal({0x6081c08, 0xc0037048d0}, 0x2, 0x4, 0x0, 0x0?, {0x52d1051, 0x2e}, {0xc003633730, 0x2, ...})
F221031 08:35:30.596073 171 kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 50 !  github.com/cockroachdb/cockroach/pkg/util/log/channels.go:106 +0x645
F221031 08:35:30.596073 171 kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 50 !github.com/cockroachdb/cockroach/pkg/util/log.logfDepth(...)
F221031 08:35:30.596073 171 kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 50 !  github.com/cockroachdb/cockroach/pkg/util/log/channels.go:39
F221031 08:35:30.596073 171 kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 50 !github.com/cockroachdb/cockroach/pkg/util/log.Fatalf(...)
F221031 08:35:30.596073 171 kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 50 !  github.com/cockroachdb/cockroach/bazel-out/k8-opt/bin/pkg/util/log/log_channels_generated.go:848
F221031 08:35:30.596073 171 kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 50 !github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).setDescLockedRaftMuLocked(0xc002ebec00, {0x6081c08, 0xc0037048d0}, 0xc0035af730)
F221031 08:35:30.596073 171 kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 50 !  github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_init.go:329 +0x410
F221031 08:35:30.596073 171 kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 50 !github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).setDescRaftMuLocked(0xc002ebec00, {0x6081c08, 0xc0037048d0}, 0x0?)
F221031 08:35:30.596073 171 kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 50 !  github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_init.go:297 +0x97
F221031 08:35:30.596073 171 kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 50 !github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleDescResult(...)
F221031 08:35:30.596073 171 kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 50 !  github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_application_result.go:270
F221031 08:35:30.596073 171 kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 50 !github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*replicaStateMachine).handleNonTrivialReplicatedEvalResult(0xc002ebed00, {0x6081c08, 0xc0037048d0}, 0xc000d69040)
F221031 08:35:30.596073 171 kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 50 !  github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_application_state_machine.go:1381 +0x59e
F221031 08:35:30.596073 171 kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 50 !github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*replicaStateMachine).ApplySideEffects(0xc002ebed00, {0x6081c08, 0xc0037048d0}, {0x60c1ab8?, 0xc000d69008})
F221031 08:35:30.596073 171 kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 50 !  github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_application_state_machine.go:1245 +0x6b9
F221031 08:35:30.596073 171 kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 50 !github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply.mapCheckedCmdIter({0x7f55585f9c10?, 0xc002ebf0a8}, 0xc003634948)
F221031 08:35:30.596073 171 kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 50 !  github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply/cmd.go:206 +0x158
F221031 08:35:30.596073 171 kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 50 !github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply.(*Task).applyOneBatch(0xc00334afd0, {0x6081c08, 0xc002ec5170}, {0x60b0e58, 0xc002ebf048})
F221031 08:35:30.596073 171 kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 50 !  github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply/task.go:290 +0x1f8
F221031 08:35:30.596073 171 kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 50 !github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply.(*Task).ApplyCommittedEntries(0xc003634fd0, {0x6081c08, 0xc002ec5170})
F221031 08:35:30.596073 171 kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 50 !  github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply/task.go:246 +0x9a
F221031 08:35:30.596073 171 kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 50 !github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleRaftReadyRaftMuLocked(_, {_, _}, {{0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...}, ...})
F221031 08:35:30.596073 171 kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 50 !  github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:1038 +0x1c05
F221031 08:35:30.596073 171 kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 50 !github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleRaftReady(_, {_, _}, {{0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...}, ...})
F221031 08:35:30.596073 171 kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 50 !  github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:657 +0x17b
F221031 08:35:30.596073 171 kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 50 !github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).processReady(0xc001a79500, 0xc000a5c7e0?)
F221031 08:35:30.596073 171 kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 50 !  github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go:641 +0xb8
F221031 08:35:30.596073 171 kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 50 !github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftScheduler).worker(0xc002b3fa40, {0x6081c08, 0xc002f59d40})
F221031 08:35:30.596073 171 kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 50 !  github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/scheduler.go:308 +0x25d
F221031 08:35:30.596073 171 kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 50 !github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2()
F221031 08:35:30.596073 171 kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 50 !  github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:489 +0x146
F221031 08:35:30.596073 171 kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 50 !created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx
F221031 08:35:30.596073 171 kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 50 !  github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:480 +0x43b
Note that the real case in the logs has 4 voters and one added learner, but that doesn't make much difference for this case.

The panic is the result of a descriptor change written by the LOQ recovery tool.

Example scenario for the crash:

1. We have stores 1, 2, 3, 4 with replicas on stores 1, 2, 3, and a pending descriptor change to add a learner on node 4.
2. Nodes 2 and 3 are killed.
3. Recovery proceeds and picks the replica on store 1 despite its unapplied descriptor change.
4. The recovery update removes all other replicas from the descriptor and bumps the replica ID from 4 to 15.
5. After restart, the node applies the committed descriptor change from the raft log, which attempts to change the replica ID back to its previous value and hits the fatal error above.

This failure is allowed by the following piece of code

```go
case loqrecoverypb.DescriptorChangeType_ReplicaChange:
	// Check if our own replica is being removed as part of descriptor
	// change.
	_, ok := change.Desc.GetReplicaDescriptor(rankedDescriptors.storeID())
	if !ok {
		problems = append(problems, rangeReplicaRemoval{
			rangeID: rankedDescriptors.rangeID(),
			span:    rankedDescriptors.span(),
		})
	}
```

where we treat certain uncommitted descriptor changes in the raft log as safe when in fact they are not: only changes that remove our own replica are reported as problems.

Jira issue: CRDB-21183

@aliher1911 aliher1911 added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-kv-replication Relating to Raft, consensus, and coordination. T-kv-replication labels Nov 4, 2022
@aliher1911 aliher1911 self-assigned this Nov 4, 2022
@blathers-crl

blathers-crl bot commented Nov 4, 2022

cc @cockroachdb/replication

@aliher1911
Contributor Author

@tbg what do you think? This issue replaces the discussion I previously started in the test failure issue, which is not closed.

@tbg
Member

tbg commented Nov 7, 2022

We probably need to disallow them all. Or, maybe the better avenue to pursue is to reset the raft log?
Note that in the (near-ish) future we might also be able to apply unapplied entries offline: #75729

For "today" it seems pragmatic to bail out on all unapplied desc changes. It is sort of sad to know that forcing it through anyway will only do more harm, though, so if removing the log is easy (it should be) maybe we should do that.

@erikgrinaker erikgrinaker changed the title loss of quorum recovery tolerates descriptor changes that cause nodes to panic after recovery loqrecovery: tolerates descriptor changes that cause nodes to panic after recovery Nov 7, 2022
@aliher1911
Contributor Author

I think making a change to disallow it makes sense for now.
To remove the entry, we would need to update the change plan to include this info. It might be trivial, but that would require either changing the protobufs for the plan or redoing the whole search during application. I think adding an action to remove part of the log makes more sense. We can do that as part of our half-online approach.
