Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kv/kvserver: TestSplitTriggerMeetsUnexpectedReplicaID failed #75717

Closed
cockroach-teamcity opened this issue Jan 30, 2022 · 7 comments · Fixed by #75735 or cockroachdb/pebble#1485
Closed
Assignees
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. T-storage Storage Team

Comments

@cockroach-teamcity
Copy link
Member

kv/kvserver.TestSplitTriggerMeetsUnexpectedReplicaID failed with artifacts on master @ a5158c4c93e9dca858b38ef7e94321f0a0a2b5c5:

=== RUN   TestSplitTriggerMeetsUnexpectedReplicaID
    test_log_scope.go:79: test logs captured to: /artifacts/tmp/_tmp/751d67000aac5f3394c2369309253f02/logTestSplitTriggerMeetsUnexpectedReplicaID2403169547
    test_log_scope.go:80: use -show-logs to present logs inline
Help

See also: How To Investigate a Go Test Failure (internal)
Parameters in this failure:

  • TAGS=bazel,gss,deadlock

/cc @cockroachdb/kv

This test on roachdash | Improve this report!

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels Jan 30, 2022
@tbg
Copy link
Member

tbg commented Jan 31, 2022

#74013

@tbg tbg closed this as completed Jan 31, 2022
@tbg
Copy link
Member

tbg commented Jan 31, 2022

Nevermind, finally found the right place to check:

panic: close of closed channel [recovered]
	panic: close of closed channel [recovered]
	panic: close of closed channel

goroutine 59872246 [running]:
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).Recover(0xc0099e4630, {0x545adf8, 0xc02157ce70})
	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:251 +0x94
panic({0x3cd0980, 0x537f5b0})
	GOROOT/src/runtime/panic.go:1038 +0x215
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).Recover(0xc0099e4630, {0x545adf8, 0xc02157cf60})
	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:251 +0x94
panic({0x3cd0980, 0x537f5b0})
	GOROOT/src/runtime/panic.go:1038 +0x215
github.com/cockroachdb/pebble/sstable.(*writeQueue).finish(...)
	github.com/cockroachdb/pebble/sstable/external/com_github_cockroachdb_pebble/sstable/write_queue.go:108
github.com/cockroachdb/pebble/sstable.(*Writer).Close(0xc013f8a000)
	github.com/cockroachdb/pebble/sstable/external/com_github_cockroachdb_pebble/sstable/writer.go:1047 +0xc7
github.com/cockroachdb/cockroach/pkg/storage.(*SSTWriter).Close(...)
	github.com/cockroachdb/cockroach/pkg/storage/sst_writer.go:296
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*multiSSTWriter).Close(0xc0013cf028)
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_snapshot.go:221 +0x2a
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*kvBatchSnapshotStrategy).Receive(_, {_, _}, {_, _}, {{0x1d, 0x10, 0xc01c6c4230, 0xc0104b4fc0, 0xc01037f7f0, ...}, ...})
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_snapshot.go:283 +0xbdc
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).receiveSnapshot(0xc00e844000, {0x545adf8, 0xc02157cf60}, 0xc01d96c900, {0x7f7f425cd778, 0xc01c9cd750})
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_snapshot.go:679 +0x5c5
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).HandleSnapshot.func1({0x545adf8, 0xc02157cf60})
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go:82 +0x145
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunTaskWithErr(0xc0099e4630, {0x545adf8, 0xc02157cf60}, {0x40bb2e0, 0xc02157cea0}, 0xc000401c48)
	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:344 +0xdd
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).HandleSnapshot(0xc00e844000, {0x545adf8, 0xc02157ce70}, 0xc01d96c900, {0x7f7f425cd750, 0xc01c9cd750})
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go:72 +0xcc
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*RaftTransport).RaftSnapshot.func1.1({0x54c19d0, 0xc01c9cd750}, 0xc018879400, {0x545adf8, 0xc02157ce70})
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/raft_transport.go:434 +0x13a
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*RaftTransport).RaftSnapshot.func1({0x545adf8, 0xc02157ce70})
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/raft_transport.go:435 +0x45
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2()
	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:494 +0x16f
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx
	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:483 +0x445
I220130 11:28:49.080785 1 (gostd) testmain.go:1530  [-] 1  Test //pkg/kv/kvserver:kvserver_test exited with error code 2

Same as #75718

@tbg tbg reopened this Jan 31, 2022
@tbg
Copy link
Member

tbg commented Jan 31, 2022

I was thinking this had to be another "use after stopped" case, but HandleSnapshot has a task:

// HandleSnapshot reads an incoming streaming snapshot and applies it if
// possible.
func (s *Store) HandleSnapshot(
ctx context.Context, header *SnapshotRequest_Header, stream SnapshotResponseStream,
) error {
ctx = s.AnnotateCtx(ctx)
const name = "storage.Store: handle snapshot"
return s.stopper.RunTaskWithErr(ctx, name, func(ctx context.Context) error {
s.metrics.RaftRcvdMessages[raftpb.MsgSnap].Inc(1)
if s.IsDraining() {
return stream.Send(&SnapshotResponse{
Status: SnapshotResponse_ERROR,
Message: storeDrainingMsg,
})
}
return s.receiveSnapshot(ctx, header, stream)
})
}

@tbg
Copy link
Member

tbg commented Jan 31, 2022

The only case in which we double-close (which I thought could be triggering this issue) is in this path:

if req.Final {
// We finished receiving all batches and log entries. It's possible that
// we did not receive any key-value pairs for some of the key ranges, but
// we must still construct SSTs with range deletion tombstones to remove
// the data.
dataSize, err := msstw.Finish(ctx)
if err != nil {
return noSnap, errors.Wrapf(err, "finishing sst for raft snapshot")
}
msstw.Close()

It's unlikely that we really hit that (we don't generate broken UUIDs) but also, Close is idempotent:

// Close finishes and frees memory and other resources. Close is idempotent.
func (fw *SSTWriter) Close() {
if fw.fw == nil {
return
}
// pebble.Writer *does* return interesting errors from Close... but normally
// we already called its Close() in Finish() and we no-op here. Thus the only
// time we expect to be here is in a deferred Close(), in which case the caller
// probably is already returning some other error, so returning one from this
// method just makes for messy defers.
_ = fw.fw.Close()
fw.fw = nil
}

So at this point this looks like a problem with pebble/sstable.Writer which apparently can panic upon Close.

cc @cockroachdb/storage

@sumeerbhola
Copy link
Collaborator

@bananabrick this looks related to your recent change

@ajwerner
Copy link
Contributor

See also #75687.

craig bot pushed a commit that referenced this issue Jan 31, 2022
75735: Revert "vendor: bump Pebble to b958d9a7760b" r=nicktrav a=ajwerner

This reverts commit 51d9f70.

That pebble bump exposed flakes like #75717 and #75687. 

Fixes #75687
Fixes #75717 

Co-authored-by: Andrew Werner <[email protected]>
@bananabrick bananabrick self-assigned this Jan 31, 2022
@craig craig bot closed this as completed in 84bdba8 Jan 31, 2022
@bananabrick
Copy link
Contributor

bananabrick commented Jan 31, 2022

@tbg I'm going to update the pebble code to not panic during a double Close. However, that channel can only be closed when Writer.Close is called on the sstable writer. Which means that the test is closing the Writer twice.

I'm looking into whether pebble incorrectly causes the Writer to be closed twice.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. T-storage Storage Team
Projects
None yet
5 participants