roachtest: version/mixed/nodes=5 failed #38996

Closed
cockroach-teamcity opened this issue Jul 19, 2019 · 3 comments · Fixed by #39003
Labels: C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot.


@cockroach-teamcity
Member

SHA: https://github.com/cockroachdb/cockroach/commits/1ca35fc4a0e2665e7f6efd945e65a0db97984fa7

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=version/mixed/nodes=5 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1396096&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190719-1396096/version/mixed/nodes=5/run_1
	cluster.go:2090,version.go:233,version.go:246,test_runner.go:691: unexpected node event: 4: dead

cockroach-teamcity added this to the 19.2 milestone Jul 19, 2019
cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. labels Jul 19, 2019
@nvanbenschoten
Member

We see a v19.1.3 node crash with:

F190719 16:42:52.637426 530616 storage/replica.go:927  [n4,s4,r67/?:/Table/63/2/0/{49933…-57735…}] on-disk and in-memory state diverged: [TxnSpanGCThreshold: &hlc.Timestamp{} != nil]
goroutine 530616 [running]:
github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0xc000057b01, 0xc000057b60, 0x53d1900, 0x12)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:1020 +0xd4
github.com/cockroachdb/cockroach/pkg/util/log.(*loggingT).outputLogEntry(0x5b65b00, 0xc000000004, 0x53d192e, 0x12, 0x39f, 0xc00b6e17a0, 0x83)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:878 +0x93d
github.com/cockroachdb/cockroach/pkg/util/log.addStructured(0x3a49c80, 0xc0088b7d10, 0x4, 0x2, 0x0, 0x0, 0xc00f8e26e8, 0x1, 0x1)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/structured.go:85 +0x2d8
github.com/cockroachdb/cockroach/pkg/util/log.logDepth(0x3a49c80, 0xc0088b7d10, 0x1, 0x4, 0x0, 0x0, 0xc00f8e26e8, 0x1, 0x1)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:71 +0x8c
github.com/cockroachdb/cockroach/pkg/util/log.Fatal(0x3a49c80, 0xc0088b7d10, 0xc00f8e26e8, 0x1, 0x1)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:191 +0x6c
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).assertStateLocked(0xc010894500, 0x3a49c80, 0xc0088b7d10, 0x3a65a60, 0xc000446900)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:927 +0x673
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).applySnapshot(0xc010894500, 0x3a49c80, 0xc0088b7d10, 0x9644522e3b3ec7cb, 0xefd08828dfc7a8b5, 0xc0060be960, 0x4, 0x4, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_raftstorage.go:1008 +0x1293
github.com/cockroachdb/cockroach/pkg/storage.(*Store).processRaftSnapshotRequest.func1(0x3a49c80, 0xc0088b7d10, 0xc010894500, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:3491 +0x581
github.com/cockroachdb/cockroach/pkg/storage.(*Store).withReplicaForRequest(0xc000814000, 0x3a49c80, 0xc0088b7d10, 0xc00ebff908, 0xc0053237c8, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:3279 +0x135
github.com/cockroachdb/cockroach/pkg/storage.(*Store).processRaftSnapshotRequest(0xc000814000, 0x3a49c80, 0xc00a248240, 0xc00ebff8c0, 0x9644522e3b3ec7cb, 0xefd08828dfc7a8b5, 0xc0060be960, 0x4, 0x4, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:3335 +0xd5
github.com/cockroachdb/cockroach/pkg/storage.(*Store).receiveSnapshot(0xc000814000, 0x3a49c80, 0xc00a248240, 0xc00ebff8c0, 0x7f9618368d08, 0xc01256e1f0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/store_snapshot.go:666 +0x313
github.com/cockroachdb/cockroach/pkg/storage.(*Store).HandleSnapshot(0xc000814000, 0xc00ebff8c0, 0x7f9618368cd8, 0xc01256e1f0, 0xc01256e1f0, 0xc000cce5d8)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:3157 +0x203
github.com/cockroachdb/cockroach/pkg/storage.(*RaftTransport).RaftSnapshot.func1.1(0x3a6e880, 0xc01256e1f0, 0xc000784bb0, 0x3a49c80, 0xc00a2481e0, 0x734f72, 0xc0104f38b0)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/raft_transport.go:386 +0x13b
github.com/cockroachdb/cockroach/pkg/storage.(*RaftTransport).RaftSnapshot.func1(0x3a49c80, 0xc00a2481e0)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/raft_transport.go:387 +0x5d
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTask.func1(0xc000934000, 0x3a49c80, 0xc00a2481e0, 0xc00dba90c0, 0x32, 0x0, 0x0, 0xc00a248210)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:325 +0xe6
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTask
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:320 +0x134

This must have been broken by #38817. It almost certainly has to do with the nullability of txn_span_gc_threshold and its removal here.

@nvanbenschoten
Member

Oh, this is after a snapshot. We see that the on-disk state of the TxnSpanGCThreshold is an empty value but the in-memory state of the field is nil. Both the on-disk and the in-memory states of the field come directly from the snapshot, which was presumably sent by a node running master code.
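
To illustrate why an empty value and nil trip the assertion, here is a minimal, self-contained sketch. The `Timestamp` and `ReplicaState` types and the `diverged` helper are hypothetical stand-ins, not the real `hlc.Timestamp` or the code in `assertStateLocked`; the sketch assumes the comparison is a structural one, which matches the `&hlc.Timestamp{} != nil` diff in the fatal message above.

```go
package main

import (
	"fmt"
	"reflect"
)

// Timestamp is a hypothetical stand-in for hlc.Timestamp.
type Timestamp struct {
	WallTime int64
	Logical  int32
}

// ReplicaState is a hypothetical, trimmed-down replica state that keeps
// only the nullable field involved in this failure.
type ReplicaState struct {
	TxnSpanGCThreshold *Timestamp
}

// diverged reports whether the two states differ structurally.
// A pointer to a zero-valued Timestamp is NOT equal to a nil pointer.
func diverged(onDisk, inMem ReplicaState) bool {
	return !reflect.DeepEqual(onDisk, inMem)
}

func main() {
	// On disk: the key exists and decodes to an empty timestamp.
	onDisk := ReplicaState{TxnSpanGCThreshold: &Timestamp{}}
	// In memory: the snapshot header left the field unset.
	inMem := ReplicaState{TxnSpanGCThreshold: nil}

	fmt.Println(diverged(onDisk, inMem)) // true

	// If both sides carry the empty value, the states agree.
	inMem.TxnSpanGCThreshold = &Timestamp{}
	fmt.Println(diverged(onDisk, inMem)) // false
}
```

The point of the sketch is only that "empty" and "absent" are distinct states for a nullable field, so a receiver that compares them strictly will report divergence even though no meaningful data differs.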

@nvanbenschoten
Member

We can't strip the key out of the kv batch we send in snapshots because then the follower would become inconsistent and would fail consistency checks. Instead, we need to continue sending this key in the SnapshotRequest_Header.State.
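
One plausible shape for "continue sending" the field is sketched below. This is an illustration under stated assumptions, not the code from the eventual fix: `Timestamp`, `ReplicaState`, `SnapshotHeader`, and `newSnapshotHeader` are hypothetical names, and the choice to default the field to an empty timestamp is an assumption based on the on-disk value being empty.

```go
package main

import "fmt"

// Timestamp is a hypothetical stand-in for hlc.Timestamp.
type Timestamp struct {
	WallTime int64
	Logical  int32
}

// ReplicaState is a hypothetical, trimmed-down replica state.
type ReplicaState struct {
	TxnSpanGCThreshold *Timestamp
}

// SnapshotHeader is a hypothetical stand-in for the snapshot header
// that carries the sender's view of the replica state.
type SnapshotHeader struct {
	State ReplicaState
}

// newSnapshotHeader builds the header sent alongside the kv batch.
// Assumption: older receivers compare the header state against what they
// read back from disk, so the sender keeps the field non-nil even though
// it no longer maintains it.
func newSnapshotHeader(state ReplicaState) SnapshotHeader {
	if state.TxnSpanGCThreshold == nil {
		// Match the empty value that the key in the kv batch decodes to.
		state.TxnSpanGCThreshold = &Timestamp{}
	}
	return SnapshotHeader{State: state}
}

func main() {
	h := newSnapshotHeader(ReplicaState{}) // sender no longer tracks the field
	fmt.Println(h.State.TxnSpanGCThreshold != nil) // true
	fmt.Println(*h.State.TxnSpanGCThreshold == Timestamp{}) // true
}
```

Because `state` is passed by value, the sender's own copy stays untouched; only the header that goes over the wire carries the compatibility value.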

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jul 19, 2019
Fixes cockroachdb#38996.

We saw in the referenced issue that a 19.1 node crashed after being sent
a snapshot with a TxnSpanGCThresholdKey but without the corresponding
value in SnapshotRequest_Header.ReplicaState.TxnSpanGCThreshold. This
commit avoids this assertion failure by continuing to send this field
in the snapshot header, even though it is no longer maintained.

19.2 nodes will ignore the field during entry application and during
snapshot ingestion, so the change has no effect on them. However,
we can rest assured that the same assertion would fire if we messed
this up on 19.2 nodes.

Release note: None
craig bot pushed a commit that referenced this issue Jul 20, 2019
39003: storage: continue sending ReplicaState.TxnSpanGCThreshold to 19.1 nodes r=ajwerner a=nvanbenschoten

Co-authored-by: Nathan VanBenschoten <[email protected]>
craig bot closed this as completed in #39003 Jul 20, 2019