storage: consistency check failed on cyan #29252

tbg · 2018-08-29T10:36:15Z

https://sentry.io/cockroach-labs/cockroachdb/issues/666877375/

*log.safeError: replica_consistency.go:142: consistency check failed with %d inconsistent replicas | int
  File "github.com/cockroachdb/cockroach/pkg/storage/replica_consistency.go", line 151, in CheckConsistency
  File "github.com/cockroachdb/cockroach/pkg/storage/consistency_queue.go", line 114, in process
  File "github.com/cockroachdb/cockroach/pkg/storage/queue.go", line 744, in processReplica
  File "github.com/cockroachdb/cockroach/pkg/storage/queue.go", line 627, in 2
  File "github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go", line 324, in func1

replica_consistency.go:142: consistency check failed with %d inconsistent replicas | int

The text was updated successfully, but these errors were encountered:

tbg · 2018-08-29T10:36:52Z

1/cockroach.cockroach-cyan-01.root.2018-08-27T18_02_25Z.060057.log

E180828 09:35:08.646984 42109141 storage/replica_consistency.go:158  [consistencyChecker,n1,s1,r32/8:/Table/5{3-4}] consistency check failed with 1 inconsistent replicas; fetching details
I180828 09:35:09.431344 1777 server/status/runtime.go:433  [n1] runtime stats: 5.2 GiB RSS, 538 goroutines, 375 MiB/247 MiB/778 MiB GO alloc/idle/total, 3.5 GiB/4.5 GiB CGO alloc/total, 14926.76cgo/sec, 1.65/0.36 %(u/s)time, 0.00 %gc (3x)
E180828 09:35:10.657683 42109141 storage/replica_consistency.go:104  [n1,s1,r32/8:/Table/5{3-4}] replica (n5,s5):7 is inconsistent: expected checksum b8f12a1959232ebe9632ecbd87c8bad38b1b52307803eeb2840b713dda044b45d8997df9ea2dffb5146358649ad121aa5adafc3af343829077b9b874d0c87c1b, got fe9c23f49f08099229f1acf8f5c01cd69accaeedc75c953cf45d1eabba422888626c39fac4b241002a4af2e6bf468c43ff30e7e0b4dd71a836eefbb8e23cacc4
--- leaseholder
+++ follower
-0.000000000,0 /Local/Range/Table/53/1/-8009759333344935612/Transaction/"733fb094-e9aa-4d67-974c-71486b9823b9"
-    ts:<zero>
-    value:"\x12\x04\b\x00\x10\x00\x18\x00 \x00(\x002\x85\x01i\xac\xeaK\x03\n:\n\x10s?\xb0\x94\xe9\xaaMg\x97LqHk\x98#\xb9\x1a\x14\x01k\x12\xbd\x89\x80\x90מR\xe7\x9b\x01D\x00\x01rdsc*\n\b\xd4\xca\xff\xc0\x95ˮ\xa7\x150\x92\x9538\x01\x12\x05merge*\n\b\xc1ϓ\xc1\x95ˮ\xa7\x152\n\b\xd4\xca\xff\xc0\x95ˮ\xa7\x15:\n\b\xd4\xca\xff\xc0\x95ˮ\xa7\x15B\x0e\b\x06\x12\n\b\xd4\xca\xff\xc0\x95ˮ\xa7\x15H\x01r\x00z\x00\x80\x01\x01"
-    raw mvcc_key/value: 016b12bd898090d79e52e79b0144000174786e2d733fb094e9aa4d67974c71486b9823b900 12040800100018002000280032850169acea4b030a3a0a10733fb094e9aa4d67974c71486b9823b91a14016b12bd898090d79e52e79b01440001726473632a0a08d4caffc095cbaea71530929533380112056d657267652a0a08c1cf93c195cbaea715320a08d4caffc095cbaea7153a0a08d4caffc095cbaea715420e0806120a08d4caffc095cbaea715480172007a00800101
E180828 09:35:10.817920 42109141 util/log/crash_reporting.go:477  [n1,s1,r32/8:/Table/5{3-4}] Reported as error 7a67fc8e1f9d436cab357453cd33ed2d

tbg · 2018-08-29T10:40:59Z

$ ./cockroach debug decode-value 016b12bd898090d79e52e79b0144000174786e2d733fb094e9aa4d67974c71486b9823b900 12040800100018002000280032850169acea4b030a3a0a10733fb094e9aa4d67974c71486b9823b91a14016b12bd898090d79e52e79b01440001726473632a0a08d4caffc095cbaea71530929533380112056d657267652a0a08c1cf93c195cbaea715320a08d4caffc095cbaea7153a0a08d4caffc095cbaea715420e0806120a08d4caffc095cbaea715480172007a00800101
0.000000000,0 /Local/Range/Table/53/1/-8009759333344935612/Transaction/"733fb094-e9aa-4d67-974c-71486b9823b9": "merge" id=733fb094 key=/Local/Range/Table/53/1/-8009759333344935612/RangeDescriptor rw=true pri=0.03903592 iso=SERIALIZABLE stat=PENDING epo=0 ts=1535369415.873258836,0 orig=1535369415.873258836,0 max=1535369415.873258836,0 wto=false rop=false seq=1

so it looks like this time one of the followers has a pending merge transaction, whereas the other follower and the leader don't have it.

tbg · 2018-08-29T10:45:09Z

wait a minute, it's the other way around. Presumably leader and one follower have this pending txn, but not another follower?

tbg · 2018-08-29T11:09:32Z

There's a second, identical, consistency failure in the logs later (which makes sense - once inconsistent, twice inconsistent).

The transaction timestamp boils down to 180827 at 11:30:15, at which point we find this (note that the start key of r5911 is the anchor of the inconsistent txn record):

180827 11:30:14.986610 5269578 storage/replica_command.go:430  [merge,n6,s6,r5911/3:/Table/53/1/-{800975…-798923…}] initiating a merge of r3412:/Table/53/1/-798{9234154929088300-4682932150010947} [(n6,s6):5, (n5,s5):2, (n3,s3):4, next=6, gen=1] into this range

and that seems to go through, for we see all replicas disappear:

80827 11:30:15.022004 20152 storage/store.go:3748  [n1,s1,r3412/1:/Table/53/1/-798{9234…-4682…}] added to replica GC queue (peer suggestion)
80827 11:30:15.022514 20152 storage/store.go:3748  [n1,s1,r3412/1:/Table/53/1/-798{9234…-4682…}] added to replica GC queue (peer suggestion)
80827 11:30:15.098173 3274529 storage/store.go:2656  [replicaGC,n1,s1,r3412/1:/Table/53/1/-798{9234…-4682…}] removing replica r3412/1
80827 11:30:15.134214 3274529 storage/replica.go:880  [replicaGC,n1,s1,r3412/1:/Table/53/1/-798{9234…-4682…}] removed 11 (0+11) keys in 36ms [clear=1ms commit=35ms]
80827 11:30:16.493037 556 storage/store.go:2656  [n6,s6,r5911/3:/Table/53/1/-{800975…-798923…}] removing replica r3412/5
80827 11:30:16.496645 2253 storage/store.go:2656  [n5,s5,r5911/2:/Table/53/1/-{800975…-798923…}] removing replica r3412/2
80827 11:30:16.503067 338 storage/store.go:2656  [n3,s3,r5911/4:/Table/53/1/-{800975…-798923…}] removing replica r3412/4

shortly thereafter we see r5911 itself get eaten:

180827 11:31:49.022537 5787986 storage/replica_command.go:430  [merge,n4,s4,r982/4:/Table/53/1/-80{48444…-09759…}] initiating a merge of r5911:/Table/53/1/-{8009759333344935612-7952945758738317278} [(n6,s6):3, (n5,s5):2, (n4,s4):5, next=6, gen=5] into this range
180827 11:31:50.518419 2168 storage/store.go:2656  [n4,s4,r982/4:/Table/53/1/-80{48444…-09759…}] removing replica r5911/5
180827 11:31:50.520475 559 storage/store.go:2656  [n6,s6,r982/5:/Table/53/1/-80{48444…-09759…}] removing replica r5911/3
180827 11:31:50.524089 2264 storage/store.go:2656  [n5,s5,r982/2:/Table/53/1/-80{48444…-09759…}] removing replica r5911/2

then r982 gets eaten (note r7900s extended bounds in the last line)...

180827 11:43:11.490413 2518453 storage/replica_command.go:430  [merge,n3,s3,r7900/2:/Table/53/1/-80{77894…-48444…}] initiating a merge of r982:/Table/53/1/-{8048444779259766956-7909958576638586199} [(n1,s1):9, (n5,s5):2, (n3,s3):6, next=10, gen=8] into this range
180827 11:43:11.629098 284 storage/store.go:2656  [n3,s3,r7900/2:/Table/53/1/-80{77894…-48444…}] removing replica r982/6
180827 11:43:11.632123 2262 storage/store.go:2656  [n5,s5,r7900/4:/Table/53/1/-80{77894…-48444…}] removing replica r982/2
180827 11:43:11.681154 988 storage/store.go:2656  [n1,s1,r7900/5:/Table/53/1/-80{77894…-48444…}] removing replica r982/9
180827 11:48:14.624472 2654023 storage/replica_command.go:430  [merge,n3,s3,r7900/2:/Table/53/1/-{807789…-790995…}] initiating a merge of r7349

we can chase this down all the way, but I think it's fair to say that the damage has already been done at this point (in fact, the damage likely gets done pretty much at 180827 11:30:16).

tbg · 2018-08-29T12:30:39Z

With some newly developed options to debug keys:

$ crl-ssh cyan all './cockroach-tobias debug keys --values --enterprise-encryption=path=/mnt/data1,key=keys/active.key,old-key=keys/old.key --from hex:016b12bd898090d79e52e79b0144000174786e2d733fb094e9aa4d67974c71486b9823b900 /mnt/data1 | head -n 1
'
1: [email protected]
0.000000000,0 /Local/Range/Table/53/1/-8009759333344935612/Transaction/"733fb094-e9aa-4d67-974c-71486b9823b9": "merge" id=733fb094 key=/Local/Range/Table/53/1/-8009759333344935612/RangeDescriptor rw=true pri=0.03903592 iso=SERIALIZABLE stat=PENDING epo=0 ts=1535369415.873258836,0 orig=1535369415.873258836,0 max=1535369415.873258836,0 wto=false rop=false seq=1

2: [email protected]
0.000000000,0 /Local/Range/Table/53/1/-8009759333344935612/Transaction/"733fb094-e9aa-4d67-974c-71486b9823b9": "merge" id=733fb094 key=/Local/Range/Table/53/1/-8009759333344935612/RangeDescriptor rw=true pri=0.03903592 iso=SERIALIZABLE stat=PENDING epo=0 ts=1535369415.873258836,0 orig=1535369415.873258836,0 max=1535369415.873258836,0 wto=false rop=false seq=1

3: [email protected]
0.000000000,0 /Local/Range/Table/53/1/-8009759333344935612/Transaction/"733fb094-e9aa-4d67-974c-71486b9823b9": "merge" id=733fb094 key=/Local/Range/Table/53/1/-8009759333344935612/RangeDescriptor rw=true pri=0.03903592 iso=SERIALIZABLE stat=PENDING epo=0 ts=1535369415.873258836,0 orig=1535369415.873258836,0 max=1535369415.873258836,0 wto=false rop=false seq=1

I verified that node i has nodeID i, so our mystery transaction appears to around on nodeIDs 1-3.

So my confusion above is confirmed: r32 is on 1,3,5 at the time of the crash, and it seems that 1 and 3 actually have this txn, it's just that it's "missing" on 5.

Obviously it's weird and a bit concerning to have that transaction open in the first place, after all, it's been more than a day and the corresponding merge doesn't seem to be active any more?

Time to look for a merge that would've been anchored on a replica on nodes 1-3. This is confusing, because as far as I can tell none of the merges I looked at so far live on these nodes. Here are a few that do, but none of them really match up with the timestamp (but there could've been retries; they're not logged). I don't really think that there's a big point in looking into these transactions. The merge above has the matching start key and time, so it's more than likely that it wrote the txn record, but not on n1-3.

$ rg -a --no-line-number --no-filename 'initiating a merge[^\d]' | cut -c '2-' | sort | grep n1 | grep n2 | grep n3 | grep -E 180827\ 11:30\|11:29
180827 11:29:36.496676 2259765 storage/replica_command.go:430  [merge,n3,s3,r2826/3:/Table/53/1/1{198203…-200426…}] initiating a merge of r10017:/Table/53/1/120{0426956837177347-2644061213040745} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=0] into this range
180827 11:29:37.717360 3255526 storage/replica_command.go:430  [merge,n1,s1,r5308/5:/Table/53/1/177{03845…-69966…}] initiating a merge of r10242:/Table/53/1/17{76996673402615308-96867429618878746} [(n3,s3):4, (n2,s2):2, (n1,s1):3, next=5, gen=3] into this range
180827 11:30:36.689365 3283728 storage/replica_command.go:430  [merge,n1,s1,r9113/1:/Table/53/1/5{577175…-615383…}] initiating a merge of r1333:/Table/53/1/56{15383261386172522-28833217815307096} [(n1,s1):1, (n3,s3):5, (n2,s2):4, next=6, gen=3] into this range
180827 11:30:46.770576 3288949 storage/replica_command.go:430  [merge,n1,s1,r10076/4:/Table/53/1/{198046…-200040…}] initiating a merge of r5286:/Table/53/1/20{00405052489571948-65584417125190863} [(n1,s1):6, (n2,s2):2, (n3,s3):5, next=7, gen=3] into this range
180827 11:30:48.147848 1867893 storage/replica_command.go:430  [merge,n5,s5,r3561/2:/Table/53/1/-67{52795…-18942…}] initiating a merge of r6439:/Table/53/1/-671{8942456141413932-6677564676282143} [(n2,s2):6, (n5,s5):2, (n3,s3):3, (n1,s1):5, next=7, gen=0] into this range

tbg · 2018-08-29T12:36:56Z

I'm seeing this

80827 11:30:16.494360 5272116 storage/replica_command.go:812  [n6,s6,r5911/3:/Table/53/1/-{800975…-798468…}] change replicas (ADD_REPLICA (n4,s4):5): read exis
ting descriptor r5911:/Table/53/1/-{8009759333344935612-7984682932150010947} [(n6,s6):3, (n5,s5):2, (n3,s3):4, next=5, gen=4]
80827 11:30:16.496143 5700331 storage/queue.go:788  [merge,n4,s4,r982/4:/Table/53/1/-80{48444…-09759…}] while adding target n4,s4: change replicas of r5911 fai
led: descriptor changed: [expected] r5911:/Table/53/1/-{8009759333344935612-7989234154929088300} [(n6,s6):3, (n5,s5):2, (n3,s3):4, next=5, gen=3] != [actual] r
5911:/Table/53/1/-{8009759333344935612-7984682932150010947} [(n6,s6):3, (n5,s5):2, (n3,s3):4, next=5, gen=4]: unexpected value: raw_bytes:"\317A\241B\003\010\2
27.\022\013\275\211\200\220\327\236R\347\233\001D\032\013\275\211\200\2210\265-\211\022\373\275\"\006\010\006\020\006\030\003\"\006\010\005\020\005\030\002\"\0
06\010\003\020\003\030\004(\0050\004" timestamp:<wall_time:1535369415873258836 >

shortly before 11:30:16 which tells me that mostly likely during the problematic merge transaction at 11:30:14 r5911 had descriptor [(n6,s6):3, (n5,s5):2, (n3,s3):4, next=5, gen=X].

But isn't something backwards here:

[expected] r5911:/Table/53/1/-{8009759333344935612-7989234154929088300} [(n6,s6):3, (n5,s5):2, (n3,s3):4, next=5, gen=3] != [actual] r5911:/Table/53/1/-{8009759333344935612-7984682932150010947} [(n6,s6):3, (n5,s5):2, (n3,s3):4, next=5, gen=4]:

Check out the end keys, the expected one is larger than the actual one, but the expected one has a smaller generation than the actual one:

7989234154929088300 # exp
7984682932150010947 # act

I'm not finding anything about any splits in these logs.

tbg · 2018-08-29T12:46:00Z

Some of my analysis above might be muddled. I'm just realizing that on 180827, we didn't have #29079 yet (the SHA was e7d570e), though we did have #29067.

Either way, the problem here can't be pinned on the consistency checker. We have these keys in three places, but these three places aren't the three places where it should (or should not) be.

tbg · 2018-08-29T12:46:55Z

@benesch I've got to run out for a while (and I've also stared at this for too long). Time for you to hopefully figure it out :-)

This was used in cockroachdb#29252 and I imagine I'll want to use it again whenever we see the consistency checker fail in the future. Release note: None

29328: cli: add hex option to debug keys r=a-robinson a=tschottdorf This was used in #29252 and I imagine I'll want to use it again whenever we see the consistency checker fail in the future. Release note: None Co-authored-by: Tobias Schottdorf <[email protected]>

This was used in cockroachdb#29252 and I imagine I'll want to use it again whenever we see the consistency checker fail in the future. Release note: None

29390: backport-2.1: storage, roachtest, cli: assorted backports r=petermattis a=tschottdorf storage: remove test-only data race The test was stopping the engine before the stopper, so the compactor was sometimes able to use the engine while it was being closed. Fixes #29302. roachtest: improve TestMonitor Add two more test cases about the exit status of the `monitor` invocation itself. roachtest: fix flake in TestMonitor Accidentally changed this in #29178. cli: add hex option to debug keys This was used in #29252 and I imagine I'll want to use it again whenever we see the consistency checker fail in the future. storage: skip TestClosedTimestampCanServe It won't be the first one I'm looking into. Release note: None Co-authored-by: Tobias Schottdorf <[email protected]>

tbg · 2018-09-05T15:57:22Z

$ ./cockroach debug range-data --enterprise-encryption=path=/mnt/data1,key=keys/active.key,old-key=keys/old.key --replicated /mnt/data1/ 32 > r32dump.delme
Error: value type is not BYTES: UNKNOWN
Failed running "debug range-data"

Hmm, I assume this is some bug in our mostly untested debug tools. Looking.

tbg · 2018-09-05T16:23:10Z

Oh, great. Something is broken unmarshaling keys of the following form. I doubt those are actual range descriptors, they probably just look enough like them.

$ ./cockroach-tobias debug range-data --enterprise-encryption=path=/mnt/data1,key=keys/active.key,old-key=keys/old.key --replicated /mnt/data1/ 32
Error: unmarshaling range descriptor at key {Key:/Local/Range/System/tsd/cr.node.sql.mem.sql.max-p75/4/10s/2018-06-18T21:00:00Z/RangeDescriptor/1535369258.738346692,0 Value:[]}: value type is not BYTES: UNKNOWN
Failed running "debug range-data"

tbg · 2018-09-05T16:44:34Z

Working around this error, we can see the history of range descriptors on r32 on node 1's dump.

First of all, here's the offending txn again:

/Local/Range/Table/53/1/-8009759333344935612/Transaction/"733fb094-e9aa-4d67-974c-71486b9823b9"
0.000000000,0 /Local/Range/Table/53/1/-8009759333344935612/Transaction/"733fb094-e9aa-4d67-974c-71486b9823b9": "merge" id=733fb094 key=/Local/Range/Table/53/1/-8009759333344935612/RangeDescriptor rw=true pri=0.03903592 iso=SERIALIZABLE stat=PENDING epo=0 ts=1535369415.873258836,0 orig=1535369415.873258836,0 max=1535369415.873258836,0 wto=false rop=false seq=1

The following range descriptors are newer than that transaction:

/Local/Range/Table/53/1/-8009759333344935612/RangeDescriptor/1535369509.778111977,0
1535369509.778111977,0 /Local/Range/Table/53/1/-8009759333344935612/RangeDescriptor: ""

/Local/Range/Table/53/1/-8009759333344935612/RangeDescriptor/1535369508.540458480,0
1535369508.540458480,0 /Local/Range/Table/53/1/-8009759333344935612/RangeDescriptor: [/Table/53/1/-8009759333344935612, /Table/53/1/-7952945758738317278)
        Raw:r5911:/Table/53/1/-{8009759333344935612-7952945758738317278} [(n6,s6):3, (n5,s5):2, (n4,s4):5, next=6, gen=5]


/Local/Range/Table/53/1/-8009759333344935612/RangeDescriptor/1535369507.891083783,0
1535369507.891083783,0 /Local/Range/Table/53/1/-8009759333344935612/RangeDescriptor: [/Table/53/1/-8009759333344935612, /Table/53/1/-7952945758738317278)
        Raw:r5911:/Table/53/1/-{8009759333344935612-7952945758738317278} [(n6,s6):3, (n5,s5):2, (n3,s3):4, (n4,s4):5, next=6, gen=5]


/Local/Range/Table/53/1/-8009759333344935612/RangeDescriptor/1535369467.773388533,0
1535369467.773388533,0 /Local/Range/Table/53/1/-8009759333344935612/RangeDescriptor: [/Table/53/1/-8009759333344935612, /Table/53/1/-7952945758738317278)
        Raw:r5911:/Table/53/1/-{8009759333344935612-7952945758738317278} [(n6,s6):3, (n5,s5):2, (n3,s3):4, next=5, gen=5]

The above basically says that r32 moved from (n6,n5,n3) to (n6,n5,n4).

Now the interesting range descriptor which must have been written by the offending transactions (since it's a merge and the timestamps match):

/Local/Range/Table/53/1/-8009759333344935612/RangeDescriptor/1535369415.873258836,0
1535369415.873258836,0 /Local/Range/Table/53/1/-8009759333344935612/RangeDescriptor: [/Table/53/1/-8009759333344935612, /Table/53/1/-7984682932150010947)
        Raw:r5911:/Table/53/1/-{8009759333344935612-7984682932150010947} [(n6,s6):3, (n5,s5):2, (n3,s3):4, next=5, gen=4]

So while located on (n6,n5,n3), the range subsumed a neighbor. The fact that the pending txn record and this committed value exist is already - to me - impossible to explain.

The following descriptors are older. Looking at them in chronological order (i.e. reverse from what's here) the range first lived on (n2,n5,n6), moved to (n6,n5,n3), and then subsumed its right neighbor three times.

/Local/Range/Table/53/1/-8009759333344935612/RangeDescriptor/1535368210.638844179,0
1535368210.638844179,0 /Local/Range/Table/53/1/-8009759333344935612/RangeDescriptor: [/Table/53/1/-8009759333344935612, /Table/53/1/-7989234154929088300)
        Raw:r5911:/Table/53/1/-{8009759333344935612-7989234154929088300} [(n6,s6):3, (n5,s5):2, (n3,s3):4, next=5, gen=3]


/Local/Range/Table/53/1/-8009759333344935612/RangeDescriptor/1535356708.050223658,0
1535356708.050223658,0 /Local/Range/Table/53/1/-8009759333344935612/RangeDescriptor: [/Table/53/1/-8009759333344935612, /Table/53/1/-8002918179127402730)
        Raw:r5911:/Table/53/1/-800{9759333344935612-2918179127402730} [(n6,s6):3, (n5,s5):2, (n3,s3):4, next=5, gen=2]


/Local/Range/Table/53/1/-8009759333344935612/RangeDescriptor/1535356402.383522155,0
1535356402.383522155,0 /Local/Range/Table/53/1/-8009759333344935612/RangeDescriptor: [/Table/53/1/-8009759333344935612, /Table/53/1/-8005204013617227225)
        Raw:r5911:/Table/53/1/-800{9759333344935612-5204013617227225} [(n6,s6):3, (n5,s5):2, (n3,s3):4, next=5, gen=1]


/Local/Range/Table/53/1/-8009759333344935612/RangeDescriptor/1534252593.780488949,0
1534252593.780488949,0 /Local/Range/Table/53/1/-8009759333344935612/RangeDescriptor: [/Table/53/1/-8009759333344935612, /Table/53/1/-8007483484511996492)
        Raw:r5911:/Table/53/1/-800{9759333344935612-7483484511996492} [(n6,s6):3, (n5,s5):2, (n3,s3):4, next=5, gen=0]


/Local/Range/Table/53/1/-8009759333344935612/RangeDescriptor/1534252590.540289140,0
1534252590.540289140,0 /Local/Range/Table/53/1/-8009759333344935612/RangeDescriptor: [/Table/53/1/-8009759333344935612, /Table/53/1/-8007483484511996492)
        Raw:r5911:/Table/53/1/-800{9759333344935612-7483484511996492} [(n6,s6):3, (n5,s5):2, next=4, gen=0]


/Local/Range/Table/53/1/-8009759333344935612/RangeDescriptor/1530909478.560743969,0
1530909478.560743969,0 /Local/Range/Table/53/1/-8009759333344935612/RangeDescriptor: [/Table/53/1/-8009759333344935612, /Table/53/1/-8007483484511996492)
        Raw:r5911:/Table/53/1/-800{9759333344935612-7483484511996492} [(n2,s2):1, (n5,s5):2, (n6,s6):3, next=4, gen=0]

tbg · 2018-09-05T17:14:59Z

Oh, great. Something is broken unmarshaling keys of the following form

I bet these are just the range descriptors for ranges that have been merged away.

tbg · 2018-09-06T17:00:58Z

I don't see where the operation would wait for the RHS to have caught up. If the n9 replica of r10033 hasn't caught up to the merge yet, what in SubsumeRequest would do it? That request seems to only flush out any pending writes on the leaseholder, but it doesn't wait for the followers to be up to date (I thought it did - am I missing this)?

benesch · 2018-09-06T17:04:47Z

The SubsumeRequest doesn't wait; that's done later in the merge transaction:

cockroach/pkg/storage/replica_command.go

Lines 528 to 531 in acc8f83

    
           err := waitForApplication(ctx, r.store.cfg.NodeDialer, rightDesc, rhsSnapshotRes.LeaseAppliedIndex) 
        
           if err != nil { 
        
           	return errors.Wrap(err, "waiting for all right-hand replicas to catch up") 
        
           }

tbg · 2018-09-06T17:24:09Z

Good point. I looked at the server-side impl of that request and one thing that stood out is that it's not synchronizing with the raft mu (just the replica mutex). Looking at how this is updated, that shouldn't make a difference, though - we write the data first, and then bump the lease applied index (both on regular application and snapshots).

I also looked into interleavings in which somehow the second merge could use a stale descriptor that simply wouldn't be checking n9, but found no way that this could happen. The second merge places intents on both range descriptors before doing anything critical, as it should. I still think this is somehow where the bug is, though perhaps I'm wrong.

benesch · 2018-09-06T17:44:05Z

Extracted and interleaved logs from all nodes: https://gist.github.com/benesch/bbdb986899ca5e972ce5576149b88fbe

tbg · 2018-09-06T17:44:59Z

The timing of these snapshots is instructive.

$ grep -E 'r9836|r10033' logs/cockroach.log | less

I180906 16:06:26.250154 162811 storage/replica_raftstorage.go:784  [n9,s9,r10033/?:{-}] applying preemptive snapshot at index 79 (id=63e9e1ae, encoded size=20413, 1 rocksdb batches, 56 log entries)
I180906 16:06:26.256535 162811 storage/replica_raftstorage.go:790  [n9,s9,r10033/?:/Table/53/1/316{27334-60422}] applied preemptive snapshot in 6ms [clear=0ms batch=0ms entries=0ms commit=5ms]
I180906 16:06:58.705516 163509 storage/replica_raftstorage.go:784  [n9,s9,r10033/?:/Table/53/1/316{27334-60422}] applying preemptive snapshot at index 84 (id=82de396e, encoded size=36673, 1 rocksdb batches, 61 log entries)
I180906 16:06:58.715247 163509 storage/replica_raftstorage.go:790  [n9,s9,r10033/?:/Table/53/1/31{660422-753605}] applied preemptive snapshot in 10ms [clear=1ms batch=0ms entries=0ms commit=8ms]
I180906 16:06:58.917969 223 storage/store.go:2717  [n9,s9,r9836/2:/Table/53/1/31{599388-627334}] removing replica r10033/5
E180906 16:06:58.928164 163478 storage/stores_server.go:68  consistency check failed on range r9836: expected checksum 14158912b68d5426c90eb8f9791794efdfb153d79a0fccb96f6981ad0bee30db87abe8cf45dcb2c41fa1ee38163cfde32066eb986d1e812354947dfc699deaf7, got b2de02b49291ab8fc38f572f8786ab5cc9a1d71ca0f9787faa14c23763d8ed9a0d9fb730b7896dd97d5dfffc3627642102137471bd4daa067ff592290920a3c4

We see a first preemptive snapshot for r10033 right when r10033 is merging its neighbor, which is probably a first attempt at RelocateRange, and then a second one right before the crash (which is also pretty precisely when the second merge happened).

benesch · 2018-09-06T18:04:23Z

Well well well. n9 is getting a preemptive snapshot for the right start key... but it applies the snapshot with a totally wrong start key:

I180906 16:06:58.705516 163509 storage/replica_raftstorage.go:784  [n9,s9,r10033/?:/Table/53/1/316{27334-60422}] applying preemptive snapshot at index 84 (id=82de396e, encoded size=36673, 1 rocksdb batches, 61 log entries)
I180906 16:06:58.715247 163509 storage/replica_raftstorage.go:790  [n9,s9,r10033/?:/Table/53/1/31{660422-753605}] applied preemptive snapshot in 10ms [clear=1ms batch=0ms entries=0ms commit=8ms]

benesch · 2018-09-06T18:10:27Z

Specifically it is somehow using its old end key (from the first snapshot) as the start key. o_O

tbg · 2018-09-06T18:10:32Z

31660422 is the end key of r10033 before it ate its right neighbor, 31753605 is after.
31627334 is its real start key.

So somehow r10033 transformed itself into the right neighbor it subsumed via a snapshot.

tbg · 2018-09-06T18:11:01Z

I'm honestly a little surprised that a consistency check is where this fails. Shouldn't some assertion fire?

benesch · 2018-09-06T18:11:47Z

Replica.setDescWithoutProcessUpdate doesn't look at the descriptor you hand it at all. It just blindly accepts it.

benesch · 2018-09-06T18:12:12Z

Store.AssertInvariants() would catch it, but that's only called in tests.

tbg · 2018-09-06T18:14:31Z

So is it likely that this snapshot

[n3,s3,r10033/1:/Table/53/1/31{627334-753605}] streamed snapshot to (n9,s9):?: kv pairs: 275, log entries: 61, rate-limit: 2.0 MiB/sec, 2ms

which seems to reflect the correct post-first-merge r10033, somehow addressed only the subsumed portion? Or is the error on the receiving end?

Noticed during cockroachdb#29252. Release note: None

benesch · 2018-09-06T19:24:12Z

Ok, my mind is blown:

I180906 19:15:08.684945 407329 storage/store_snapshot.go:615  [n10,s10,r21820/3:/Table/53/1/620{38265-46570}] sending preemptive snapshot 69a0384f at applied index 40 with desc r21820:/Table/53/1/620{38265-46570} [(n5,s5):1, (n9,s9):4, (n10,s10):3, next=5, gen=1]
I180906 19:15:08.788794 407329 storage/store_snapshot.go:657  [n10,s10,r21820/3:/Table/53/1/620{38265-46570}] streamed snapshot to (n3,s3):?: kv pairs: 28, log entries: 25, rate-limit: 2.0 MiB/sec, 104ms
I180906 19:15:08.825122 196788 storage/store.go:3555  [n3,s3,r21820/2:/Table/53/1/620{38265-40026}] processing snapshot request (preemptive? true) with desc r21820:/Table/53/1/6204{0026-6570} [(n5,s5):1, (n9,s9):4, (n10,s10):3, next=5, gen=1]
I180906 19:15:08.825281 196788 storage/replica_raftstorage.go:787  [n3,s3,r21820/2:/Table/53/1/620{38265-40026}] applying preemptive snapshot at index 40 (id=69a0384f, encoded size=13907829, 1 rocksdb batches, 25 log entries)
I180906 19:15:08.985761 196788 storage/replica_raftstorage.go:793  [n3,s3,r21820/2:/Table/53/1/6204{0026-6570}] applied preemptive snapshot in 160ms [clear=0ms batch=0ms entries=152ms commit=8ms]

The outgoing snapshot looks correct and yet the incoming snapshot has the busted range descriptor! o_O

tbg · 2018-09-06T19:53:41Z

Are you trusting this code?

https://github.com/cockroachdb/cockroach/blob/master/pkg/storage/replica_raftstorage.go#L500-L535

In particular, there's snapData.RangeDescriptor vs state.mu.Desc. I wonder if those diverge.

benesch · 2018-09-06T19:54:38Z

HA! @nvanbenschoten managed to spot the bug in like 30s of looking at this code. It's here:

cockroach/pkg/storage/store_snapshot.go

Line 412 in 6ff2995

desc.StartKey = existingDesc.EndKey

We're just mutating the header range descriptor directly instead of making a copy of it. 🤦‍♂️

tbg · 2018-09-06T19:54:40Z

Hmm, rsl.Load just clones the desc you pass in. So if they diverged, it'd be in a really weird way.

tbg · 2018-09-06T19:56:37Z

We're just mutating the header range descriptor directly instead of making a copy of it. 🤦‍♂️

❤️ ❤️ ❤️

Glad this got done so quickly.

…hots Merges can cause preemptive snapshots that widen existing replicas. For example, consider the following sequence of events: 1. A replica of range A is removed from store S, but is not garbage collected. 2. Range A subsumes its right neighbor B. 3. Range A is re-added to store S. In step 3, S will receive a preemptive snapshot for A that requires widening its existing replica, thanks to the intervening merge. Problematically, the code to check whether this widening was possible, in Store.canApplySnapshotLocked, was incorrectly mutating the range descriptor in the snapshot header! Applying the snapshot would then fail to clear all of the data from the old incarnation of the replica, since the bounds on the range deletion tombstone were wrong. This often resulted in replica inconsistency. Plus, the in-memory copy of the range descriptor would be incorrect until the next descriptor update--though this usually happened quickly, as the replica would apply the change replicas command, which updates the descriptor, soon after applying the preemptive snapshot. To fix the problem, teach Store.canApplySnapshotLocked to make a copy of the range descriptor before it mutates it. To prevent regressions, add an assertion that a range's start key is never changed to the descriptor update path. With this assertion in place, but without the fix itself, TestStoreRangeMergeReadoptedLHSFollower reliably fails. Fixes cockroachdb#29252. Release note: None

@benesch

29646: roachtest: better merge testing in clearrange r=benesch a=tschottdorf Unfortunately, the method to determine the range count is quite slow since crdb_internal.ranges internally sends an RPC for each range to determine the leaseholder. Anecdotally, I've seen ~25% of the merges completed after less than 15 minutes. I know that it's slowing down over time, but @benesch will fix that. Also throws in aggressive consistency checks so that when something goes out of sync, we find out right there. Release note: None 29677: storage: preserve consistency when applying widening preemptive snapshots r=benesch a=benesch Merges can cause preemptive snapshots that widen existing replicas. For example, consider the following sequence of events: 1. A replica of range A is removed from store S, but is not garbage collected. 2. Range A subsumes its right neighbor B. 3. Range A is re-added to store S. In step 3, S will receive a preemptive snapshot for A that requires widening its existing replica, thanks to the intervening merge. Problematically, the code to check whether this widening was possible, in Store.canApplySnapshotLocked, was incorrectly mutating the range descriptor in the snapshot header! Applying the snapshot would then fail to clear all of the data from the old incarnation of the replica, since the bounds on the range deletion tombstone were wrong. This often resulted in replica inconsistency. Plus, the in-memory copy of the range descriptor would be incorrect until the next descriptor update--though this usually happened quickly, as the replica would apply the change replicas command, which updates the descriptor, soon after applying the preemptive snapshot. To fix the problem, teach Store.canApplySnapshotLocked to make a copy of the range descriptor before it mutates it. To prevent regressions, add an assertion that a range's start key is never changed to the descriptor update path. With this assertion in place, but without the fix itself, TestStoreRangeMergeReadoptedLHSFollower reliably fails. Fixes #29252. Release note: None Co-authored-by: Tobias Schottdorf <[email protected]> Co-authored-by: Nikhil Benesch <[email protected]>

…hots Merges can cause preemptive snapshots that widen existing replicas. For example, consider the following sequence of events: 1. A replica of range A is removed from store S, but is not garbage collected. 2. Range A subsumes its right neighbor B. 3. Range A is re-added to store S. In step 3, S will receive a preemptive snapshot for A that requires widening its existing replica, thanks to the intervening merge. Problematically, the code to check whether this widening was possible, in Store.canApplySnapshotLocked, was incorrectly mutating the range descriptor in the snapshot header! Applying the snapshot would then fail to clear all of the data from the old incarnation of the replica, since the bounds on the range deletion tombstone were wrong. This often resulted in replica inconsistency. Plus, the in-memory copy of the range descriptor would be incorrect until the next descriptor update--though this usually happened quickly, as the replica would apply the change replicas command, which updates the descriptor, soon after applying the preemptive snapshot. To fix the problem, teach Store.canApplySnapshotLocked to make a copy of the range descriptor before it mutates it. To prevent regressions, add an assertion that a range's start key is never changed to the descriptor update path. With this assertion in place, but without the fix itself, TestStoreRangeMergeReadoptedLHSFollower reliably fails. Fixes cockroachdb#29252. Release note: None

We saw a consistency failure in cockroachdb#29252 that would've been much more useful had it occurred close to the time around which the inconsistency must have been introduced. Instead of leaving it to chance, add a switch that runs aggressive checks in (roach) tests that want them such as the clearrange test. Release note: None

…hots Merges can cause preemptive snapshots that widen existing replicas. For example, consider the following sequence of events: 1. A replica of range A is removed from store S, but is not garbage collected. 2. Range A subsumes its right neighbor B. 3. Range A is re-added to store S. In step 3, S will receive a preemptive snapshot for A that requires widening its existing replica, thanks to the intervening merge. Problematically, the code to check whether this widening was possible, in Store.canApplySnapshotLocked, was incorrectly mutating the range descriptor in the snapshot header! Applying the snapshot would then fail to clear all of the data from the old incarnation of the replica, since the bounds on the range deletion tombstone were wrong. This often resulted in replica inconsistency. Plus, the in-memory copy of the range descriptor would be incorrect until the next descriptor update--though this usually happened quickly, as the replica would apply the change replicas command, which updates the descriptor, soon after applying the preemptive snapshot. To fix the problem, teach Store.canApplySnapshotLocked to make a copy of the range descriptor before it mutates it. To prevent regressions, add an assertion that a range's start key is never changed to the descriptor update path. With this assertion in place, but without the fix itself, TestStoreRangeMergeReadoptedLHSFollower reliably fails. Fixes cockroachdb#29252. Release note: None

We saw a consistency failure in cockroachdb#29252 that would've been much more useful had it occurred close to the time around which the inconsistency must have been introduced. Instead of leaving it to chance, add a switch that runs aggressive checks in (roach) tests that want them such as the clearrange test. Release note: None

tbg changed the title *log.safeError: replica_consistency.go:142: consistency check failed with %d inconsistent replicas | int storage: consistency check failed on cyan Aug 29, 2018

tbg assigned benesch Aug 29, 2018

tbg added the S-0-corruption-or-data-loss Unrecoverable corruption, data loss, or other catastrophic issues that can’t be fixed by upgrading. label Aug 29, 2018

tbg added this to the 2.1 milestone Aug 30, 2018

tbg added the A-kv-replication Relating to Raft, consensus, and coordination. label Aug 30, 2018

tbg mentioned this issue Aug 30, 2018

cli: add hex option to debug keys #29328

Merged

tbg added a commit to tbg/cockroach that referenced this issue Aug 30, 2018

cli: add hex option to debug keys

8aca169

This was used in cockroachdb#29252 and I imagine I'll want to use it again whenever we see the consistency checker fail in the future. Release note: None

tbg added a commit to tbg/cockroach that referenced this issue Aug 30, 2018

cli: add hex option to debug keys

d0d9cc5

This was used in cockroachdb#29252 and I imagine I'll want to use it again whenever we see the consistency checker fail in the future. Release note: None

tbg mentioned this issue Aug 30, 2018

backport-2.1: storage, roachtest, cli: assorted backports #29390

Merged

tbg added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Sep 4, 2018

tbg self-assigned this Sep 4, 2018

tbg mentioned this issue Sep 5, 2018

storage: pick up upstream fix for Raft entries handling #28918

Closed

tbg mentioned this issue Sep 6, 2018

teamcity: failed tests on master: testrace/TestImportData, test/TestImportData #29106

Closed

tbg added a commit to tbg/cockroach that referenced this issue Sep 6, 2018

cli: handle merged range descriptors in debug keys

1be9890

Noticed during cockroachdb#29252. Release note: None

benesch mentioned this issue Sep 6, 2018

storage: preserve consistency when applying widening preemptive snapshots #29677

Merged

craig bot closed this as completed in #29677 Sep 7, 2018

benesch mentioned this issue Sep 7, 2018

release-2.1: storage: preserve consistency when applying widening preemptive snapshots #29694

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

storage: consistency check failed on cyan #29252

storage: consistency check failed on cyan #29252

tbg commented Aug 29, 2018

tbg commented Aug 29, 2018

tbg commented Aug 29, 2018

tbg commented Aug 29, 2018

tbg commented Aug 29, 2018 •

edited

Loading

tbg commented Aug 29, 2018

tbg commented Aug 29, 2018

tbg commented Aug 29, 2018

tbg commented Aug 29, 2018

tbg commented Sep 5, 2018

tbg commented Sep 5, 2018 •

edited

Loading

tbg commented Sep 5, 2018

tbg commented Sep 5, 2018

tbg commented Sep 6, 2018

benesch commented Sep 6, 2018

tbg commented Sep 6, 2018

benesch commented Sep 6, 2018

tbg commented Sep 6, 2018

benesch commented Sep 6, 2018

benesch commented Sep 6, 2018

tbg commented Sep 6, 2018

tbg commented Sep 6, 2018

benesch commented Sep 6, 2018

benesch commented Sep 6, 2018

tbg commented Sep 6, 2018

benesch commented Sep 6, 2018

tbg commented Sep 6, 2018

benesch commented Sep 6, 2018

tbg commented Sep 6, 2018

tbg commented Sep 6, 2018

storage: consistency check failed on cyan #29252

storage: consistency check failed on cyan #29252

Comments

tbg commented Aug 29, 2018

tbg commented Aug 29, 2018

tbg commented Aug 29, 2018

tbg commented Aug 29, 2018

tbg commented Aug 29, 2018 • edited Loading

tbg commented Aug 29, 2018

tbg commented Aug 29, 2018

tbg commented Aug 29, 2018

tbg commented Aug 29, 2018

tbg commented Sep 5, 2018

tbg commented Sep 5, 2018 • edited Loading

tbg commented Sep 5, 2018

tbg commented Sep 5, 2018

tbg commented Sep 6, 2018

benesch commented Sep 6, 2018

tbg commented Sep 6, 2018

benesch commented Sep 6, 2018

tbg commented Sep 6, 2018

benesch commented Sep 6, 2018

benesch commented Sep 6, 2018

tbg commented Sep 6, 2018

tbg commented Sep 6, 2018

benesch commented Sep 6, 2018

benesch commented Sep 6, 2018

tbg commented Sep 6, 2018

benesch commented Sep 6, 2018

tbg commented Sep 6, 2018

benesch commented Sep 6, 2018

tbg commented Sep 6, 2018

tbg commented Sep 6, 2018

tbg commented Aug 29, 2018 •

edited

Loading

tbg commented Sep 5, 2018 •

edited

Loading