loqrecovery: tolerates descriptor changes that cause nodes to panic after recovery #91271
Labels
A-kv-replication
Relating to Raft, consensus, and coordination.
C-bug
Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
Failures in the test #91016 were caused by nodes panicking immediately after restarting.
The cause of the panic is:
and is a result of a descriptor change that was written by the LOQ recovery tool.
Example scenario that leads to the crash:
We have stores 1, 2, 3, and 4, with replicas on stores 1, 2, and 3. There is a pending descriptor change that adds a learner to node 4. Nodes 2 and 3 are killed. Recovery proceeds and picks the replica on store 1 even though it has an unapplied descriptor change. The recovery update removes all other replicas from the descriptor and bumps the replica ID from 4 to 15. The node then restarts and tries to apply the committed log entry containing the descriptor change, which tries to change the replica ID back to its previous value.
This failure is allowed by the following piece of code:
cockroach/pkg/kv/kvserver/loqrecovery/plan.go
Lines 505 to 514 in 2675c7c
where we allow certain uncommitted descriptor changes in the raft log, treating them as safe when in fact they are not.
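As an illustration only, a more conservative planning check could refuse any candidate replica that still has a descriptor change in its raft log, instead of allowing some kinds through. The sketch below is not the code in `plan.go` and is not necessarily the fix adopted for this issue; `ChangeKind`, `PendingChange`, `Candidate`, and `checkCandidate` are hypothetical names introduced for the example.

```go
package main

import "fmt"

// ChangeKind is a hypothetical classification of descriptor changes
// found in a replica's raft log.
type ChangeKind int

const (
	ChangeSplit ChangeKind = iota
	ChangeMerge
	ChangeReplicas
)

// PendingChange is a hypothetical record of an unapplied descriptor change.
type PendingChange struct {
	Kind      ChangeKind
	RaftIndex uint64
}

// Candidate is a hypothetical surviving replica considered by the planner.
type Candidate struct {
	StoreID int
	Pending []PendingChange
}

// checkCandidate treats every unapplied descriptor change as unsafe,
// regardless of its kind, and reports the first one it finds.
func checkCandidate(c Candidate) error {
	if len(c.Pending) == 0 {
		return nil
	}
	p := c.Pending[0]
	return fmt.Errorf(
		"store %d has an unapplied descriptor change (kind %d) at raft index %d",
		c.StoreID, p.Kind, p.RaftIndex)
}

func main() {
	// Store 1 from the example: it still has the learner addition in its log.
	c := Candidate{
		StoreID: 1,
		Pending: []PendingChange{{Kind: ChangeReplicas, RaftIndex: 12}},
	}
	if err := checkCandidate(c); err != nil {
		fmt.Println("cannot plan recovery:", err)
	}
}
```

The trade-off of such a policy is that some ranges that could be recovered automatically would instead need manual attention, in exchange for never writing a descriptor that a later log entry contradicts.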
Jira issue: CRDB-21183