storage: panic: tocommit(28) is out of range [lastIndex(0)] in StartTestCluster #20764

Closed · tbg opened this issue Dec 15, 2017 · 11 comments

@tbg (Member) commented Dec 15, 2017

This happened in #20625. Our CI harness detects new or changed tests and stress-tests them for five minutes. That run produced the following crash, which I haven't been able to reproduce in >2h of stress runs, though the code that triggered it was essentially:

func TestAdminAPITableStats(t *testing.T) {
	const nodeCount = 3
	tc := testcluster.StartTestCluster(t, nodeCount, base.TestClusterArgs{
		ReplicationMode: base.ReplicationAuto,
		ServerArgs: base.TestServerArgs{
			// Very aggressive scanner settings.
			ScanInterval:    time.Millisecond,
			ScanMaxIdleTime: time.Millisecond,
		},
	})
	defer tc.Stopper().Stop(context.TODO())
	// ... remainder of the test elided ...
}

What I see in the logs is a bit confusing. The range in question is r3 and it has the following activity:

1. n1 generates a preemptive snapshot and streams it to n3.

[23:34:06]W:	 [Step 3/4] I171211 23:34:05.732544 338 storage/replica_raftstorage.go:527  [replicate,n1,s1,r3/1:/System/NodeLiveness{-Max}] generated preemptive snapshot 3fdd03e7 at index 25
[23:34:06]W:	 [Step 3/4] I171211 23:34:05.744716 338 storage/store.go:3552  [replicate,n1,s1,r3/1:/System/NodeLiveness{-Max}] streamed snapshot to (n3,s3):?: kv pairs: 13, log entries: 15, rate-limit: 8.0 MiB/sec, 12ms
[23:34:06]W:	 [Step 3/4] I171211 23:34:05.745142 1229 storage/replica_raftstorage.go:733  [n3,s3,r3/?:{-}] applying preemptive snapshot at index 25 (id=3fdd03e7, encoded size=6742, 1 rocksdb batches, 15 log entries)
[23:34:06]W:	 [Step 3/4] I171211 23:34:05.745415 1229 storage/replica_raftstorage.go:739  [n3,s3,r3/?:/System/NodeLiveness{-Max}] applied preemptive snapshot in 0ms [clear=0ms batch=0ms entries=0ms commit=0ms]

2. n1 starts a replica change that adds n3.

[23:34:06]W:	 [Step 3/4] I171211 23:34:05.745867 338 storage/replica_command.go:2153  [replicate,n1,s1,r3/1:/System/NodeLiveness{-Max}] change replicas (ADD_REPLICA (n3,s3):2): read existing descriptor r3:/System/NodeLiveness{-Max} [(n1,s1):1, next=2]
[23:34:06]W:	 [Step 3/4] I171211 23:34:05.766201 338 storage/replica.go:3161  [n1,s1,r3/1:/System/NodeLiveness{-Max}] proposing ADD_REPLICA((n3,s3):2): updated=[(n1,s1):1 (n3,s3):2] next=3

3. n2 (which hasn't been involved in any of this activity) blows up?

[23:34:06]W:	 [Step 3/4] E171211 23:34:05.973144 615 vendor/github.com/coreos/etcd/raft/log.go:191  [n2,s2,r3/2:{-}] tocommit(28) is out of range [lastIndex(0)]. Was the raft log corrupted, truncated, or lost?

What strikes me as odd (though maybe I'm confusing which is which) is that replicaID=2 should belong to node 3. Yet, in its failure message, n2 claims to have a replica of r3 with replicaID=2 but no data ([n2,s2,r3/2:{-}])?

This makes me suspect something about coalesced heartbeats (since the stack below has handleHeartbeat), but this could be a red herring. Any ideas, @bdarnell?
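
For context, here is a minimal sketch of the check in etcd/raft that fires here, paraphrased from the vendored raft/log.go (types and field names below are simplified and hypothetical, not the vendored ones): a follower whose commit index is advanced past its own lastIndex panics, which is exactly what an empty, uninitialized replica does when a heartbeat hands it commit=28.

```go
package raftsketch

// Hypothetical stand-in for the raft logger interface.
type logger interface {
	Panicf(format string, args ...interface{})
}

// Simplified stand-in for the etcd/raft log; the real raftLog is backed by
// storage plus an unstable tail.
type raftLog struct {
	committed uint64
	last      uint64
	logger    logger
}

func (l *raftLog) lastIndex() uint64 { return l.last }

// commitTo mirrors the vendored check: a heartbeat may only move the commit
// index forward, and never past what the local log actually contains.
func (l *raftLog) commitTo(tocommit uint64) {
	if l.committed < tocommit {
		if l.lastIndex() < tocommit {
			// An uninitialized replica has lastIndex() == 0, so a heartbeat
			// carrying commit=28 trips exactly the panic in this report.
			l.logger.Panicf("tocommit(%d) is out of range [lastIndex(%d)]. Was the raft log corrupted, truncated, or lost?",
				tocommit, l.lastIndex())
		}
		l.committed = tocommit
	}
}
```

In other words, the panic itself is raft's expected reaction to a replica that claims membership (here r3/2) but has no log data; the open question is how such a replica came to exist on n2 at all.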

Full log: https://gist.github.com/tschottdorf/3960b61bc7bc6f67c713d61107db13b8

[23:34:06]W:	 [Step 3/4] E171211 23:34:05.973358 615 storage/raft.go:113  [n2,s2,r3/2:{-}] Server crashing due to missing data. Was the server restarted with a
[23:34:06]W:	 [Step 3/4] different store directory than before? A --join parameter must be specified
[23:34:06]W:	 [Step 3/4] in order to make restarts with a new store directory safe. Please try again
[23:34:06]W:	 [Step 3/4] with a new directory and a valid --join flag. If this server was started with
[23:34:06]W:	 [Step 3/4] its old data or with a valid --join parameter and you are still seeing this,
[23:34:06]W:	 [Step 3/4] please report an issue at https://github.com/cockroachdb/cockroach/issues/new
[23:34:06]W:	 [Step 3/4] panic: tocommit(28) is out of range [lastIndex(0)]. Was the raft log corrupted, truncated, or lost? [recovered]
[23:34:06]W:	 [Step 3/4] 	panic: tocommit(28) is out of range [lastIndex(0)]. Was the raft log corrupted, truncated, or lost?
[23:34:06]W:	 [Step 3/4] 
[23:34:06]W:	 [Step 3/4] goroutine 615 [running]:
[23:34:06]W:	 [Step 3/4] github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).Recover(0xc420f69440, 0x2cc9620, 0xc4211f8bd0)
[23:34:06]W:	 [Step 3/4] 	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:181 +0x136
[23:34:06]W:	 [Step 3/4] panic(0x1bac400, 0xc420340fa0)
[23:34:06]W:	 [Step 3/4] 	/usr/local/go/src/runtime/panic.go:491 +0x283
[23:34:06]W:	 [Step 3/4] github.com/cockroachdb/cockroach/pkg/storage.(*raftLogger).Panicf(0xc420340ee0, 0x1e950e9, 0x5d, 0xc422651cc0, 0x2, 0x2)
[23:34:06]W:	 [Step 3/4] 	/go/src/github.com/cockroachdb/cockroach/pkg/storage/raft.go:115 +0x1dc
[23:34:06]W:	 [Step 3/4] github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft.(*raftLog).commitTo(0xc420d31ce0, 0x1c)
[23:34:06]W:	 [Step 3/4] github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft.(*raft).handleHeartbeat(0xc4210485a0, 0x8, 0x2, 0x1, 0x6, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
[23:34:06]W:	 [Step 3/4] 	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft/raft.go:1139 +0x54
[23:34:06]W:	 [Step 3/4] github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft.stepFollower(0xc4210485a0, 0x8, 0x2, 0x1, 0x6, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
[23:34:06]W:	 [Step 3/4] 	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft/raft.go:1085 +0x45e
[23:34:06]W:	 [Step 3/4] github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft.(*raft).Step(0xc4210485a0, 0x8, 0x2, 0x1, 0x6, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
[23:34:06]W:	 [Step 3/4] 	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft/raft.go:817 +0x106d
[23:34:06]W:	 [Step 3/4] github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft.(*RawNode).Step(0xc421231b40, 0x8, 0x2, 0x1, 0x6, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
[23:34:06]W:	 [Step 3/4] 	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft/rawnode.go:195 +0xc9
[23:34:06]W:	 [Step 3/4] github.com/cockroachdb/cockroach/pkg/storage.(*Replica).stepRaftGroup.func1(0xc421231b40, 0x0, 0x0, 0x0)
[23:34:06]W:	 [Step 3/4] 	/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:3279 +0xd4
[23:34:06]W:	 [Step 3/4] github.com/cockroachdb/cockroach/pkg/storage.(*Replica).withRaftGroupLocked(0xc42103cc00, 0x1ef6300, 0xc4208f3b70, 0x7be5c7, 0xc4211f8390)
[23:34:06]W:	 [Step 3/4] 	/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:603 +0x10b
[23:34:06]W:	 [Step 3/4] github.com/cockroachdb/cockroach/pkg/storage.(*Replica).withRaftGroup(0xc42103cc00, 0xc4208f3b70, 0x0, 0x0)
[23:34:06]W:	 [Step 3/4] 	/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:620 +0x92
[23:34:06]W:	 [Step 3/4] github.com/cockroachdb/cockroach/pkg/storage.(*Replica).stepRaftGroup(0xc42103cc00, 0xc422872480, 0x8d4158, 0xc421231900)
[23:34:06]W:	 [Step 3/4] 	/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:3270 +0x50
[23:34:06]W:	 [Step 3/4] github.com/cockroachdb/cockroach/pkg/storage.(*Store).processRaftRequestWithReplica(0xc4211e3400, 0x2cc9620, 0xc4211f8390, 0xc42103cc00, 0xc422872480, 0xc4211986e0)
[23:34:06]W:	 [Step 3/4] 	/go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:3009 +0xeb
[23:34:06]W:	 [Step 3/4] github.com/cockroachdb/cockroach/pkg/storage.(*Store).processRequestQueue.func1(0x2cc9620, 0xc4211f8390, 0xc42103cc00, 0x2cc9620)
[23:34:06]W:	 [Step 3/4] 	/go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:3609 +0x97
[23:34:06]W:	 [Step 3/4] github.com/cockroachdb/cockroach/pkg/storage.(*Store).withReplicaForRequest(0xc4211e3400, 0x2cc9620, 0xc4211f8390, 0xc422872480, 0xc4208f3ec8, 0x0)
[23:34:06]W:	 [Step 3/4] 	/go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:2955 +0x135
[23:34:06]W:	 [Step 3/4] github.com/cockroachdb/cockroach/pkg/storage.(*Store).processRequestQueue(0xc4211e3400, 0x2cc9620, 0xc4211f8bd0, 0x3)
...
@bdarnell (Contributor):

Hmm. Coalesced heartbeats sound like a reasonable suspicion. Another shot in the dark is that it has something to do with the way the test servers are started on port 0 and get assigned their real port after starting. Could that be confusing node IDs somehow?

The one unusual thing about this test is the very short ScanInterval. This must be making the replication queue race with something. I would say it's racing with another ChangeReplicas, but I think that would make more noise in the logs.

@a-robinson (Contributor):

This smells a lot like port reuse combined with #15801 / #20537, although it'd be very odd that we haven't seen it before if that's really the root cause. If it doesn't repro locally, I'd add some extra logging and have TeamCity stress it more.

@bdarnell (Contributor):

How does the automatic stress of new tests work? I can't find where it happens. Is there something different about that environment compared to our usual one?

@tbg (Member, Author) commented Dec 19, 2017

@bdarnell it's here, but it looks pretty much like what you'd expect.

@tbg (Member, Author) commented Dec 26, 2017

To dispel any doubt, this now also happened in one of our other tests: #21045

This test sets up a mixed cluster, but I'm unconvinced that that is relevant.

@mrtracy (Contributor) commented Jan 18, 2018

I got a repro of this on a GCE worker while trying to bisect TestAdminAPITableStats (it seems to suffer from a number of different flakes, although after looking through the history it appears the test was only recently re-enabled).

Full log: https://gist.github.com/mrtracy/d677275be57c2c5b2764dbc6b9e17791

@mrtracy (Contributor) commented Jan 19, 2018

Another gist with an interesting, different failure: a segfault in raft.

https://gist.github.com/mrtracy/c1d1e3ebcf70c84d0656080f62fd4d74

panic: runtime error: invalid memory address or nil pointer dereference [recovered]
        panic: runtime error: invalid memory address or nil pointer dereference [recovered]
        panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1109b07]

goroutine 3318 [running]:
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).Recover(0xc421177d40, 0x2cd7fe0, 0xc4229a4630)
        /home/matt/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:180 +0x136
panic(0x1bd25e0, 0x2c91710)
        /home/matt/go1.9/src/runtime/panic.go:491 +0x283
github.com/cockroachdb/cockroach/pkg/storage.(*Store).Send.func1(0xc4219fd678, 0xc4219fd6f0, 0x150b0b9161adadb4, 0x0, 0xc4219fd6e8, 0xc42059e000)
        /home/matt/go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:2581 +0x283
panic(0x1bd25e0, 0x2c91710)
        /home/matt/go1.9/src/runtime/panic.go:491 +0x283
github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft.(*Progress).maybeUpdate(...)
        /home/matt/go/src/github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft/progress.go:120
github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft.(*raft).appendEntry(0xc421a0e700, 0xc422bdc460, 0x1, 0x1)
        /home/matt/go/src/github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft/raft.go:598 +0xe7
github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft.(*raft).becomeLeader(0xc421a0e700)
        /home/matt/go/src/github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft/raft.go:700 +0x272
github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft.(*raft).campaign(0xc421a0e700, 0x1de783b, 0x10)
        /home/matt/go/src/github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft/raft.go:723 +0x664
github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft.(*raft).Step(0xc421a0e700, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /home/matt/go/src/github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft/raft.go:835 +0x347
github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft.(*RawNode).Campaign(0xc420c22d00, 0xc42155fce0, 0x3)
        /home/matt/go/src/github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft/rawnode.go:140 +0x61
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).withRaftGroupLocked(0xc420c54a80, 0x1, 0x1eb4028, 0xc421f18c60, 0x100000001)
        /home/matt/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:600 +0x3a2
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).maybeInitializeRaftGroup(0xc420c54a80, 0x2cd7fe0, 0xc4229a46c0)

@a-robinson (Contributor):

@bdarnell, mind taking a look at @mrtracy's segfault? It's in becomeLeader, which I know you recently changed.

@bdarnell (Contributor):

We've seen the nil pointer panic before, most recently in #20629. It may or may not be related to the tocommit issue (odds are it's the same root cause, whatever race this short ScanInterval is hitting). I'm pretty sure it's unrelated to my recent becomeLeader change.

To decipher the stack trace a bit: when you see raft.campaign() calling becomeLeader and then panicking, it means that raft believes it is the only replica of the range (so it can become leader as soon as it tries to campaign), but its own replicaID is not part of that configuration, so there is no progress entry for it to update.
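
For illustration, a rough sketch of that code path, assuming the behavior of the vendored etcd/raft at the time (types and names below are simplified and hypothetical, not the vendored ones): on winning the election, the new leader appends an entry and then updates its own progress entry, looked up by its own ID; if that ID is missing from the configuration, the lookup yields nil and the update dereferences it, matching the raft.go:598 / progress.go:120 frames in the trace above.

```go
package raftsketch

// Hypothetical, simplified stand-in for per-replica progress tracking.
type progress struct{ match uint64 }

// maybeUpdate advances the recorded match index; calling it on a nil
// *progress is the nil dereference seen in the trace.
func (pr *progress) maybeUpdate(n uint64) {
	if pr.match < n {
		pr.match = n
	}
}

// Hypothetical, simplified raft state.
type raft struct {
	id   uint64
	prs  map[uint64]*progress // one entry per replica in the current config
	last uint64               // last log index
}

// appendEntry sketches what happens right after becomeLeader: append the
// entry, then bump the leader's own progress.
func (r *raft) appendEntry() {
	r.last++
	// If r.id is not in the configuration (the group thinks its single
	// replica is some other ID), r.prs[r.id] is nil and this call panics
	// with a nil pointer dereference.
	r.prs[r.id].maybeUpdate(r.last)
}
```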

@tbg (Member, Author) commented Jan 25, 2018

I can't confirm this, but #21771 fixed some instances of panic: tocommit(XXX) is out of range [lastIndex(0)]; perhaps some of the explosions here were actually that bug rather than port reuse. (Though port reuse is something we do expect to actually happen as well.)

@bdarnell (Contributor) commented Feb 9, 2018

I was never able to reproduce this, but I'm optimistically closing this as fixed by #22518.
