roachtest: tpcc/mixed-headroom/n5cpu16 failed #39460
SHA: https://github.com/cockroachdb/cockroach/commits/e8faca611a902766154ed82581d6d3a7483ad231
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1460982&tab=buildLog

SHA: https://github.com/cockroachdb/cockroach/commits/66bd279c9aa682c2b7adcec87ec0c639b8039a33
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1461635&tab=buildLog

SHA: https://github.com/cockroachdb/cockroach/commits/e8faca611a902766154ed82581d6d3a7483ad231
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1462518&tab=buildLog

SHA: https://github.com/cockroachdb/cockroach/commits/d51fa78ff90a113c9009d263dfaf58d3672670a6
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1463583&tab=buildLog
I'm seeing two major failure modes here.

The first is unexpected because we never expect a raw

19.2: cockroach/pkg/storage/replica_command.go, lines 361 to 372 in 53e6089
19.1: cockroach/pkg/storage/replica_command.go, lines 319 to 333 in 6d189c5

The second failure is that this assertion gets hit when a preemptive snapshot gets sent by a 19.1 node to a 19.2 node:

That seems relatively straightforward to fix, if there isn't some unexpected snag.
I have a theory for the Lease crash. I think that when a preemptive snapshot is sent from a 19.2 node to a 19.1 node, the snapshot's disk data contains a Lease whose ReplicaDescriptor likely has the replica type VOTER_FULL (instead of the equivalent, deprecated nil). Now if that old node sends a snapshot to a new node, it will send it with a descriptor that has the lease replica's type unspecified, i.e. nil, but the corresponding disk state does have a value that's non-nil: boom. Still have to verify that this is the case. I'm more mystified by the split error, which I can't explain yet.
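Roughly, the suspected mismatch is the following. This is a self-contained Go sketch using made-up stand-in types; the real roachpb.ReplicaDescriptor and the failing assertion differ in detail:

```go
package main

import (
	"fmt"
	"reflect"
)

// Simplified stand-ins for the relevant descriptor shapes; these are not the
// real roachpb definitions, just enough structure to show the mismatch.
type ReplicaType int32

const VOTER_FULL ReplicaType = 0

type ReplicaDescriptor struct {
	NodeID, StoreID, ReplicaID int32
	// 19.2 added an optional Type field; nil is defined to mean "voter",
	// i.e. it is semantically equivalent to an explicit VOTER_FULL.
	Type *ReplicaType
}

func main() {
	t := VOTER_FULL

	// What ends up on disk after a 19.2 node writes the lease: the replica
	// type is spelled out explicitly.
	onDisk := ReplicaDescriptor{NodeID: 1, StoreID: 1, ReplicaID: 1, Type: &t}

	// What a 19.1 node puts in the snapshot it sends: it has no notion of
	// the field, so the type arrives unset, i.e. nil.
	fromSnapshot := ReplicaDescriptor{NodeID: 1, StoreID: 1, ReplicaID: 1}

	// A strict equality assertion treats the two as different even though
	// they describe the same replica; that mismatch is the suspected cause
	// of the crash.
	fmt.Println("strictly equal:", reflect.DeepEqual(onDisk, fromSnapshot)) // false

	// Normalizing nil to VOTER_FULL before comparing makes them agree.
	norm := func(t *ReplicaType) ReplicaType {
		if t == nil {
			return VOTER_FULL
		}
		return *t
	}
	fmt.Println("equivalent:", norm(onDisk.Type) == norm(fromSnapshot.Type)) // true
}
```

In other words, the two descriptors describe the same replica, but a byte-for-byte or DeepEqual-style check flags them as divergent.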
I think this fixes the lease crash:

```diff
diff --git a/pkg/storage/replica_raftstorage.go b/pkg/storage/replica_raftstorage.go
index bf3a7bf687..f9bd4ee50d 100644
--- a/pkg/storage/replica_raftstorage.go
+++ b/pkg/storage/replica_raftstorage.go
@@ -740,6 +740,7 @@ func (r *Replica) applySnapshot(
 	if s.Desc.RangeID != r.RangeID {
 		log.Fatalf(ctx, "unexpected range ID %d", s.Desc.RangeID)
 	}
+	inSnap.State = nil // prevent accidental reuse below
 
 	// Strip the DeprecatedTxnSpanGCThreshold. We don't care about it.
 	// TODO(nvanbenschoten): Remove in 20.1.
@@ -913,6 +914,25 @@ func (r *Replica) applySnapshot(
 	// has not yet been updated. Any errors past this point must therefore be
 	// treated as fatal.
 
+	// Re-load the ReplicaState from disk. The state so far was handed to us by
+	// some remote node that may not have been able to correctly represent the
+	// on-disk state. For an example of this, consider the ReplicaState.Type
+	// field which was added in 19.2. A 19.1 node receiving a 19.2 snapshot
+	// could have an on-disk ReplicaDescriptor with a non-nil type, but in
+	// memory this would not be represented, and thus encoded on the wire as
+	// the nil value. If that 19.1 node later sent a snapshot to a 19.2 node,
+	// it would send the node a nil type in inSnap.State, but the on-disk data
+	// would represent the corresponding value, which would trip assertions.
+	// See https://github.com/cockroachdb/cockroach/issues/39460 for details.
+	{
+		sl := stateloader.Make(s.Desc.RangeID)
+		var err error
+		s, err = sl.Load(ctx, r.store.Engine(), s.Desc)
+		if err != nil {
+			return errors.Wrap(err, "loading state")
+		}
+	}
+
 	if err := r.clearSubsumedReplicaInMemoryData(ctx, subsumedRepls, subsumedNextReplicaID); err != nil {
 		log.Fatalf(ctx, "failed to clear in-memory data of subsumed replicas while applying snapshot: %+v", err)
 	}
```
Oh, regarding the failed splits -- this is unfortunately very likely just the whole CPut mess that wasn't fixed on the 19.x branch: #38302. The problem here is that once a descriptor has a 19.2-specific field, 19.1 nodes will always fail their CPut:

```go
func TestFoo(t *testing.T) {
	s := []byte("\035\244\226 \003\010\205\023\022\t\300\211\367\001I\367\2205\210\032\010\300\211\367\001`\367\\\250\"\006\010\003\020\003\030\001\"\010\010\001\020\001\030\002 \000\"\006\010\004\020\004\030\003(\0040\025")
	v := Value{RawBytes: s}
	var desc RangeDescriptor
	require.NoError(t, v.GetProto(&desc))
	t.Error(pretty.Sprint(&desc))
}
```

The 19.1 node sees a descriptor without any replica type, but 19.2 reveals that the proto really has a replica with its type set.
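Why that guarantees a CPut failure: roughly, the 19.1 node builds the CPut's expected value by re-marshaling its own lossy in-memory copy of the descriptor, and that re-encoding can never reproduce bytes carrying a field it doesn't know about. Here is a toy Go sketch of the mechanism; the real code uses protobufs and the KV ConditionalPut, and none of the names below exist in the code base:

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
)

// A toy model: the "encoding" is just a byte slice, and the 19.2-only
// replica type is an optional trailing byte.
type descriptor struct {
	replicaID byte
	typ       *byte // nil on 19.1, which does not know the field exists
}

func marshal(d descriptor) []byte {
	b := []byte{d.replicaID}
	if d.typ != nil {
		b = append(b, *d.typ) // the 19.2-only field
	}
	return b
}

// decode191 models a 19.1 node parsing the stored bytes: it silently drops
// the field it does not understand.
func decode191(b []byte) descriptor { return descriptor{replicaID: b[0]} }

// cput succeeds only if the caller's expected bytes match what is stored.
func cput(stored, exp, newVal []byte) ([]byte, error) {
	if !bytes.Equal(stored, exp) {
		return stored, errors.New("ConditionFailedError: unexpected value")
	}
	return newVal, nil
}

func main() {
	typ := byte(1) // say, a replica type added by a 19.2 node
	stored := marshal(descriptor{replicaID: 7, typ: &typ})

	// The 19.1 node reads the descriptor, loses the unknown field, and then
	// re-marshals its lossy in-memory copy to use as the CPut expectation.
	exp := marshal(decode191(stored))

	// The expectation can never match the stored bytes, so every descriptor
	// update attempted by the 19.1 node fails.
	if _, err := cput(stored, exp, marshal(descriptor{replicaID: 8})); err != nil {
		fmt.Println(err)
	}
}
```

This is also why the failure is persistent: every retry re-derives the same wrong expectation.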
On the plus side, the diff in my last post seems to have resolved the assertion crashes.
Going to run 10 instances in which I modified the

They're all passing the import now, but then immediately going bust with errors such as

paging @jordanlewis
I think this would be caused by the server returning 4 bytes for a 4-byte integer instead of 8 like we used to. Does this use fixtures import or fixtures load? What's the data source? My guess is that we just need to update the test to use
Hm, though it looks like all of the schemas say integer/int8. I'm not quite sure how you're reproducing this @tbg - let me know please, I can't tell if I need some non-master build.
Looking at the second one through a 19.1 UI (n1): things look pretty grim. This node somehow can't even figure out the replicas in the descriptor? If I open the same page through n2, it still looks unhappy, but not as broken. Predictably, the logs are full of messages related to the fact that the range descriptor has turned to garbage.
./cockroach debug range-descriptors seems to give out reasonable data on both the 19.1 and 19.2 nodes: https://gist.github.com/tbg/bd89db53e7b1e852ffca27f758800567

I can see from http://tobias-1567451338-04-n5cpu16-0002.roachprod.crdb.io:26258/_status/raft that the 19.2 nodes have this populated properly in memory, but the 19.1 nodes return it empty. Phew, whatever is going on here, things are majorly broken.
@jordan here's a repro; I extended that cluster to 24h.

Note that this one works:

This is weird to me because I initially thought that I had probably messed up the workload binary, but that doesn't seem to be it; I replaced it with one built fresh from my branch (i.e. master).
I'd hazard a guess that this explains something:
n2 (19.2) runs a change replicas, and it looks like it's actually proposed on n3 (which probably has the lease). n3 is a 19.1 node, and it looks like it botches things pretty badly, as the "updated=[] next=0" indicates (we're supposed to see the new replica set and the NextReplicaID there). I'm seeing this in the other broken cluster too (the -03 one):

Now to figure out what this is. Pretty sure I must have broken this; I refactored a number of things about the change replicas trigger recently.
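The log line hints that, in a mixed-version cluster, the proposer and the consumer of the change-replicas trigger disagree about which fields carry the data. Here is a toy sketch of that kind of mismatch, with made-up types and field names; the real ChangeReplicasTrigger proto is different:

```go
package main

import "fmt"

// Hypothetical, heavily simplified stand-in for a trigger that carries both
// an old (19.1) and a new (19.2) representation of the same information.
type changeTrigger struct {
	deprecatedReplicas []int // what a 19.1 proposer fills in
	deprecatedNextID   int
	updatedReplicas    []int // what a 19.2 proposer fills in
	nextReplicaID      int
}

// A consumer that only reads the new fields prints "updated=[] next=0" when
// the proposal came from an old-style proposer, which is roughly the symptom
// in the log line above.
func format(t changeTrigger) string {
	return fmt.Sprintf("updated=%v next=%d", t.updatedReplicas, t.nextReplicaID)
}

// A mixed-version-safe consumer falls back to the deprecated fields when the
// new ones are empty.
func replicas(t changeTrigger) ([]int, int) {
	if len(t.updatedReplicas) == 0 && t.nextReplicaID == 0 {
		return t.deprecatedReplicas, t.deprecatedNextID
	}
	return t.updatedReplicas, t.nextReplicaID
}

func main() {
	from191 := changeTrigger{deprecatedReplicas: []int{1, 2, 3}, deprecatedNextID: 4}
	from192 := changeTrigger{updatedReplicas: []int{1, 2, 3}, nextReplicaID: 4}

	fmt.Println(format(from191)) // updated=[] next=0
	fmt.Println(format(from192)) // updated=[1 2 3] next=4

	fmt.Println(replicas(from191)) // [1 2 3] 4, via the fallback
}
```

If the 19.1 proposer only fills the deprecated fields and a newer reader only looks at the new ones, you get exactly an empty replica set and a zero next ID in the output.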
^- ok, I think the problem is actually not of my own doing, but it's 52333d1. That commit "migrated" the

Fix should be easy, I'll add something to my branch.
@jordanlewis it seems to be working now (i.e. no more int4/int8 errors). It's relatively likely that I used some old workload binary initially to restore the stuff, so I'll blame it on that.

Even better, the imports all went through without a hitch. I might get lucky and actually get all these tests to pass, who knows?

They all passed.
SHA: https://github.com/cockroachdb/cockroach/commits/1051f376924f18c5782887ed824ab5ede5027493
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1429376&tab=buildLog