storage: panic: tocommit(28) is out of range [lastIndex(0)] in StartTestCluster #20764

Closed · tbg opened this issue Dec 15, 2017 · 11 comments

@tbg (Member) commented Dec 15, 2017

This happened in #20625. Our CI harness detects new or changed tests and stress-tests them for five minutes. That run produced the following crash, which I haven't been able to reproduce in >2h of stress runs, though the code that triggered it was essentially:

func TestAdminAPITableStats(t *testing.T) {
	const nodeCount = 3
	tc := testcluster.StartTestCluster(t, nodeCount, base.TestClusterArgs{
		ReplicationMode: base.ReplicationAuto,
		ServerArgs: base.TestServerArgs{
			// Very aggressive scanner settings.
			ScanInterval:    time.Millisecond,
			ScanMaxIdleTime: time.Millisecond,
		},
	})
	defer tc.Stopper().Stop(context.TODO())
	// ... remainder of the test elided ...
}

What I see in the logs is a bit confusing. The range in question is r3 and it has the following activity:

1. n1 generates a preemptive snapshot and streams it to n3.

[23:34:06]W:	 [Step 3/4] I171211 23:34:05.732544 338 storage/replica_raftstorage.go:527  [replicate,n1,s1,r3/1:/System/NodeLiveness{-Max}] generated preemptive snapshot 3fdd03e7 at index 25
[23:34:06]W:	 [Step 3/4] I171211 23:34:05.744716 338 storage/store.go:3552  [replicate,n1,s1,r3/1:/System/NodeLiveness{-Max}] streamed snapshot to (n3,s3):?: kv pairs: 13, log entries: 15, rate-limit: 8.0 MiB/sec, 12ms
[23:34:06]W:	 [Step 3/4] I171211 23:34:05.745142 1229 storage/replica_raftstorage.go:733  [n3,s3,r3/?:{-}] applying preemptive snapshot at index 25 (id=3fdd03e7, encoded size=6742, 1 rocksdb batches, 15 log entries)
[23:34:06]W:	 [Step 3/4] I171211 23:34:05.745415 1229 storage/replica_raftstorage.go:739  [n3,s3,r3/?:/System/NodeLiveness{-Max}] applied preemptive snapshot in 0ms [clear=0ms batch=0ms entries=0ms commit=0ms]

2. n1 starts a replica change that adds n3.

[23:34:06]W:	 [Step 3/4] I171211 23:34:05.745867 338 storage/replica_command.go:2153  [replicate,n1,s1,r3/1:/System/NodeLiveness{-Max}] change replicas (ADD_REPLICA (n3,s3):2): read existing descriptor r3:/System/NodeLiveness{-Max} [(n1,s1):1, next=2]
[23:34:06]W:	 [Step 3/4] I171211 23:34:05.766201 338 storage/replica.go:3161  [n1,s1,r3/1:/System/NodeLiveness{-Max}] proposing ADD_REPLICA((n3,s3):2): updated=[(n1,s1):1 (n3,s3):2] next=3

3. n2 (which hasn't been involved in any of this activity) blows up?

[23:34:06]W:	 [Step 3/4] E171211 23:34:05.973144 615 vendor/github.com/coreos/etcd/raft/log.go:191  [n2,s2,r3/2:{-}] tocommit(28) is out of range [lastIndex(0)]. Was the raft log corrupted, truncated, or lost?

What strikes me as odd (though maybe I'm confusing which is which) is that replicaID=2 should belong to node 3. Yet, in its failure message, n2 claims to have a replica of r3 with replicaID=2 but no data ([n2,s2,r3/2:{-}])?

This makes me suspect something about coalesced heartbeats (since the stack below has handleHeartbeat), but this could be a red herring. Any ideas, @bdarnell?
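
For context, here is a minimal sketch of the check in etcd/raft that fires here, paraphrased from the vendored raft/log.go (types and field names below are simplified and hypothetical, not the vendored ones): a follower whose commit index is advanced past its own lastIndex panics, which is exactly what an empty, uninitialized replica does when a heartbeat hands it commit=28.

```go
package raftsketch

// Hypothetical stand-in for the raft logger interface.
type logger interface {
	Panicf(format string, args ...interface{})
}

// Simplified stand-in for the etcd/raft log; the real raftLog is backed by
// storage plus an unstable tail.
type raftLog struct {
	committed uint64
	last      uint64
	logger    logger
}

func (l *raftLog) lastIndex() uint64 { return l.last }

// commitTo mirrors the vendored check: a heartbeat may only move the commit
// index forward, and never past what the local log actually contains.
func (l *raftLog) commitTo(tocommit uint64) {
	if l.committed < tocommit {
		if l.lastIndex() < tocommit {
			// An uninitialized replica has lastIndex() == 0, so a heartbeat
			// carrying commit=28 trips exactly the panic in this report.
			l.logger.Panicf("tocommit(%d) is out of range [lastIndex(%d)]. Was the raft log corrupted, truncated, or lost?",
				tocommit, l.lastIndex())
		}
		l.committed = tocommit
	}
}
```

In other words, the panic itself is raft's expected reaction to a replica that claims membership (here r3/2) but has no log data; the open question is how such a replica came to exist on n2 at all.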

Full log: https://gist.github.com/tschottdorf/3960b61bc7bc6f67c713d61107db13b8

[23:34:06]W:	 [Step 3/4] E171211 23:34:05.973358 615 storage/raft.go:113  [n2,s2,r3/2:{-}] Server crashing due to missing data. Was the server restarted with a
[23:34:06]W:	 [Step 3/4] different store directory than before? A --join parameter must be specified
[23:34:06]W:	 [Step 3/4] in order to make restarts with a new store directory safe. Please try again
[23:34:06]W:	 [Step 3/4] with a new directory and a valid --join flag. If this server was started with
[23:34:06]W:	 [Step 3/4] its old data or with a valid --join parameter and you are still seeing this,
[23:34:06]W:	 [Step 3/4] please report an issue at https://github.com/cockroachdb/cockroach/issues/new
[23:34:06]W:	 [Step 3/4] panic: tocommit(28) is out of range [lastIndex(0)]. Was the raft log corrupted, truncated, or lost? [recovered]
[23:34:06]W:	 [Step 3/4] 	panic: tocommit(28) is out of range [lastIndex(0)]. Was the raft log corrupted, truncated, or lost?
[23:34:06]W:	 [Step 3/4] 
[23:34:06]W:	 [Step 3/4] goroutine 615 [running]:
[23:34:06]W:	 [Step 3/4] github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).Recover(0xc420f69440, 0x2cc9620, 0xc4211f8bd0)
[23:34:06]W:	 [Step 3/4] 	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:181 +0x136
[23:34:06]W:	 [Step 3/4] panic(0x1bac400, 0xc420340fa0)
[23:34:06]W:	 [Step 3/4] 	/usr/local/go/src/runtime/panic.go:491 +0x283
[23:34:06]W:	 [Step 3/4] github.com/cockroachdb/cockroach/pkg/storage.(*raftLogger).Panicf(0xc420340ee0, 0x1e950e9, 0x5d, 0xc422651cc0, 0x2, 0x2)
[23:34:06]W:	 [Step 3/4] 	/go/src/github.com/cockroachdb/cockroach/pkg/storage/raft.go:115 +0x1dc
[23:34:06]W:	 [Step 3/4] github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft.(*raftLog).commitTo(0xc420d31ce0, 0x1c)
[23:34:06]W:	 [Step 3/4] github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft.(*raft).handleHeartbeat(0xc4210485a0, 0x8, 0x2, 0x1, 0x6, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
[23:34:06]W:	 [Step 3/4] 	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft/raft.go:1139 +0x54
[23:34:06]W:	 [Step 3/4] github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft.stepFollower(0xc4210485a0, 0x8, 0x2, 0x1, 0x6, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
[23:34:06]W:	 [Step 3/4] 	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft/raft.go:1085 +0x45e
[23:34:06]W:	 [Step 3/4] github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft.(*raft).Step(0xc4210485a0, 0x8, 0x2, 0x1, 0x6, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
[23:34:06]W:	 [Step 3/4] 	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft/raft.go:817 +0x106d
[23:34:06]W:	 [Step 3/4] github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft.(*RawNode).Step(0xc421231b40, 0x8, 0x2, 0x1, 0x6, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
[23:34:06]W:	 [Step 3/4] 	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft/rawnode.go:195 +0xc9
[23:34:06]W:	 [Step 3/4] github.com/cockroachdb/cockroach/pkg/storage.(*Replica).stepRaftGroup.func1(0xc421231b40, 0x0, 0x0, 0x0)
[23:34:06]W:	 [Step 3/4] 	/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:3279 +0xd4
[23:34:06]W:	 [Step 3/4] github.com/cockroachdb/cockroach/pkg/storage.(*Replica).withRaftGroupLocked(0xc42103cc00, 0x1ef6300, 0xc4208f3b70, 0x7be5c7, 0xc4211f8390)
[23:34:06]W:	 [Step 3/4] 	/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:603 +0x10b
[23:34:06]W:	 [Step 3/4] github.com/cockroachdb/cockroach/pkg/storage.(*Replica).withRaftGroup(0xc42103cc00, 0xc4208f3b70, 0x0, 0x0)
[23:34:06]W:	 [Step 3/4] 	/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:620 +0x92
[23:34:06]W:	 [Step 3/4] github.com/cockroachdb/cockroach/pkg/storage.(*Replica).stepRaftGroup(0xc42103cc00, 0xc422872480, 0x8d4158, 0xc421231900)
[23:34:06]W:	 [Step 3/4] 	/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:3270 +0x50
[23:34:06]W:	 [Step 3/4] github.com/cockroachdb/cockroach/pkg/storage.(*Store).processRaftRequestWithReplica(0xc4211e3400, 0x2cc9620, 0xc4211f8390, 0xc42103cc00, 0xc422872480, 0xc4211986e0)
[23:34:06]W:	 [Step 3/4] 	/go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:3009 +0xeb
[23:34:06]W:	 [Step 3/4] github.com/cockroachdb/cockroach/pkg/storage.(*Store).processRequestQueue.func1(0x2cc9620, 0xc4211f8390, 0xc42103cc00, 0x2cc9620)
[23:34:06]W:	 [Step 3/4] 	/go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:3609 +0x97
[23:34:06]W:	 [Step 3/4] github.com/cockroachdb/cockroach/pkg/storage.(*Store).withReplicaForRequest(0xc4211e3400, 0x2cc9620, 0xc4211f8390, 0xc422872480, 0xc4208f3ec8, 0x0)
[23:34:06]W:	 [Step 3/4] 	/go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:2955 +0x135
[23:34:06]W:	 [Step 3/4] github.com/cockroachdb/cockroach/pkg/storage.(*Store).processRequestQueue(0xc4211e3400, 0x2cc9620, 0xc4211f8bd0, 0x3)
...
@bdarnell (Contributor):

Hmm. Coalesced heartbeats sound like a reasonable suspicion. Another shot in the dark is that it has something to do with the way the test servers are started on port 0 and get assigned their real port after starting. Could that be confusing node IDs somehow?

The one unusual thing about this test is the very short ScanInterval. This must be making the replication queue race with something. I would say it's racing with another ChangeReplicas, but I think that would make more noise in the logs.

@a-robinson (Contributor):

This smells a lot like port reuse combined with #15801 / #20537, although it'd be very odd that we haven't seen it before if that's really the root cause. If it doesn't repro locally, I'd add some extra logging and have TeamCity stress it more.

@bdarnell (Contributor):

How does the automatic stress of new tests work? I can't find where it happens. Is there something different about that environment compared to our usual one?

@tbg (Member, Author) commented Dec 19, 2017

@bdarnell it's here, but it looks pretty much like what you'd expect.

@tbg (Member, Author) commented Dec 26, 2017

To dispel any doubt, this now also happened in one of our other tests: #21045

This test sets up a mixed cluster, but I'm unconvinced that that is relevant.

@mrtracy (Contributor) commented Jan 18, 2018

I got a repro of this on a GCE worker while trying to bisect TestAdminAPITableStats (it seems to suffer from a number of different flakes, although after looking through the history it appears the test was only recently re-enabled).

Full log: https://gist.github.com/mrtracy/d677275be57c2c5b2764dbc6b9e17791

@mrtracy (Contributor) commented Jan 19, 2018

Another gist with an interesting, different failure: a segfault in raft.

https://gist.github.com/mrtracy/c1d1e3ebcf70c84d0656080f62fd4d74

panic: runtime error: invalid memory address or nil pointer dereference [recovered]
        panic: runtime error: invalid memory address or nil pointer dereference [recovered]
        panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1109b07]

goroutine 3318 [running]:
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).Recover(0xc421177d40, 0x2cd7fe0, 0xc4229a4630)
        /home/matt/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:180 +0x136
panic(0x1bd25e0, 0x2c91710)
        /home/matt/go1.9/src/runtime/panic.go:491 +0x283
github.com/cockroachdb/cockroach/pkg/storage.(*Store).Send.func1(0xc4219fd678, 0xc4219fd6f0, 0x150b0b9161adadb4, 0x0, 0xc4219fd6e8, 0xc42059e000)
        /home/matt/go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:2581 +0x283
panic(0x1bd25e0, 0x2c91710)
        /home/matt/go1.9/src/runtime/panic.go:491 +0x283
github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft.(*Progress).maybeUpdate(...)
        /home/matt/go/src/github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft/progress.go:120
github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft.(*raft).appendEntry(0xc421a0e700, 0xc422bdc460, 0x1, 0x1)
        /home/matt/go/src/github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft/raft.go:598 +0xe7
github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft.(*raft).becomeLeader(0xc421a0e700)
        /home/matt/go/src/github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft/raft.go:700 +0x272
github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft.(*raft).campaign(0xc421a0e700, 0x1de783b, 0x10)
        /home/matt/go/src/github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft/raft.go:723 +0x664
github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft.(*raft).Step(0xc421a0e700, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /home/matt/go/src/github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft/raft.go:835 +0x347
github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft.(*RawNode).Campaign(0xc420c22d00, 0xc42155fce0, 0x3)
        /home/matt/go/src/github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft/rawnode.go:140 +0x61
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).withRaftGroupLocked(0xc420c54a80, 0x1, 0x1eb4028, 0xc421f18c60, 0x100000001)
        /home/matt/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:600 +0x3a2
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).maybeInitializeRaftGroup(0xc420c54a80, 0x2cd7fe0, 0xc4229a46c0)

@a-robinson (Contributor):

@bdarnell, mind taking a look at @mrtracy's segfault? It's in becomeLeader, which I know you recently changed.

@bdarnell (Contributor):

We've seen the nil pointer panic before, most recently in #20629. It may or may not be related to the tocommit issue (odds are it's the same root cause, whatever race this short ScanInterval is hitting). I'm pretty sure it's unrelated to my recent becomeLeader change.

To decipher the stack trace a bit: when you see raft.campaign() calling becomeLeader and then panicking, it means that raft believes it is the only replica of the range (so it can become leader as soon as it tries to campaign), but its own replicaID is not part of that configuration, so there is no progress entry for it to update.
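
For illustration, a rough sketch of that code path, assuming the behavior of the vendored etcd/raft at the time (types and names below are simplified and hypothetical, not the vendored ones): on winning the election, the new leader appends an entry and then updates its own progress entry, looked up by its own ID; if that ID is missing from the configuration, the lookup yields nil and the update dereferences it, matching the raft.go:598 / progress.go:120 frames in the trace above.

```go
package raftsketch

// Hypothetical, simplified stand-in for per-replica progress tracking.
type progress struct{ match uint64 }

// maybeUpdate advances the recorded match index; calling it on a nil
// *progress is the nil dereference seen in the trace.
func (pr *progress) maybeUpdate(n uint64) {
	if pr.match < n {
		pr.match = n
	}
}

// Hypothetical, simplified raft state.
type raft struct {
	id   uint64
	prs  map[uint64]*progress // one entry per replica in the current config
	last uint64               // last log index
}

// appendEntry sketches what happens right after becomeLeader: append the
// entry, then bump the leader's own progress.
func (r *raft) appendEntry() {
	r.last++
	// If r.id is not in the configuration (the group thinks its single
	// replica is some other ID), r.prs[r.id] is nil and this call panics
	// with a nil pointer dereference.
	r.prs[r.id].maybeUpdate(r.last)
}
```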

@tbg (Member, Author) commented Jan 25, 2018

I can't confirm this, but #21771 fixed some instances of panic: tocommit(XXX) is out of range [lastIndex(0)]; perhaps some of the explosions here were actually that bug rather than port reuse. (Though port reuse is something we do expect to actually happen as well.)

@bdarnell (Contributor) commented Feb 9, 2018

I was never able to reproduce this, but I'm optimistically closing this as fixed by #22518.
