Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: package timed out under stress #37954

Closed
cockroach-teamcity opened this issue May 31, 2019 · 1 comment · Fixed by #37986
Closed

storage: package timed out under stress #37954

cockroach-teamcity opened this issue May 31, 2019 · 1 comment · Fixed by #37986
Assignees
Labels
C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot.
Milestone

Comments

@cockroach-teamcity
Copy link
Member

SHA: https://github.com/cockroachdb/cockroach/commits/83e62d69214aaa0f7b976f764b97b0e21a41cde3

Parameters:

TAGS=
GOFLAGS=-race

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=(unknown) PKG=github.com/cockroachdb/cockroach/pkg/storage TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1318089&tab=buildLog

Slow failing tests:
TestReplicaRangefeedNudgeSlowClosedTimestamp - 1.48s

Slow passing tests:
TestMergeQueue - 178.19s
TestStoreRangeMoveDecommissioning - 27.16s
TestRemovePlaceholderRace - 18.97s
TestStoreRangeSplitBackpressureWrites - 16.07s
TestGossipHandlesReplacedNode - 13.49s
TestWedgedReplicaDetection - 11.53s
TestAllocatorFullDisks - 11.38s
TestSplitTriggerRaftSnapshotRace - 9.62s
TestNodeLivenessStatusMap - 9.60s
TestStoreSplitFailsAfterMaxRetries - 8.53s
TestConsistencyQueueRecomputeStats - 8.46s
TestRefreshPendingCommands - 8.37s
TestSnapshotRaftLogLimit - 8.34s
TestReplicaRangefeedRetryErrors - 7.99s
TestRandomConcurrentAdminChangeReplicasRequests - 7.60s
TestStoreSplitDisappearingReplicas - 7.28s
TestSplitSnapshotRace_SplitWins - 5.95s
TestLeaseInfoRequest - 5.64s
TestRaftRemoveRace - 5.60s
TestGossipFirstRange - 5.47s

@cockroach-teamcity cockroach-teamcity added this to the 19.1 milestone May 31, 2019
@cockroach-teamcity cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels May 31, 2019
@tbg
Copy link
Member

tbg commented Jun 3, 2019

goroutine 246255 [semacquire, 29 minutes]:
sync.runtime_Semacquire(0xc0002f7af8)
	/usr/local/go/src/runtime/sema.go:56 +0x39
sync.(*WaitGroup).Wait(0xc0002f7af0)
	/usr/local/go/src/sync/waitgroup.go:130 +0xb5
github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).stopServers(0xc00b286770)
	/go/src/github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:100 +0x1a0
github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).stopServers-fm()
	/go/src/github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:179 +0x42
github.com/cockroachdb/cockroach/pkg/util/stop.CloserFn.Close(0xc005b0ed50)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:92 +0x35
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).Stop(0xc0007b7540, 0x4b44580, 0xc0000d4010)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:497 +0x4cc
github.com/cockroachdb/cockroach/pkg/storage_test.TestReplicaRangefeedNudgeSlowClosedTimestamp(0xc0055be600)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_rangefeed_test.go:766 +0x8fc
testing.tRunner(0xc0055be600, 0x3e203a0)
	/usr/local/go/src/testing/testing.go:827 +0x163
created by testing.(*T).Run
	/usr/local/go/src/testing/testing.go:878 +0x65a

A stopper is stuck quiescing

goroutine 247940 [sync.Cond.Wait, 29 minutes]:
sync.runtime_notifyListWait(0xc000a17610, 0xc000000007)
	/usr/local/go/src/runtime/sema.go:510 +0xeb
sync.(*Cond).Wait(0xc000a17600)
	/usr/local/go/src/sync/cond.go:56 +0x8e
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).Quiesce(0xc0007b7900, 0x4b44580, 0xc0000d4018)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:548 +0x253
github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).stopServers.func1(0xc0002f7af0, 0xc0007b7900)
	/go/src/github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:96 +0xaa
created by github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).stopServers
	/go/src/github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:93 +0x172

Must be this, it's the only task in the whole stack trace


goroutine 246507 [select, 29 minutes]:
github.com/cockroachdb/cockroach/pkg/storage/closedts/provider.(*Provider).Subscribe(0xc00063ad80, 0x4b44600, 0xc00078fa40, 0xc00af9e780)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/closedts/provider/provider.go:321 +0xd30
github.com/cockroachdb/cockroach/pkg/storage.(*Store).startClosedTimestampRangefeedSubscriber.func1(0x4b44600, 0xc00078fa40)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:1607 +0xb2
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTask.func1(0xc0007b7900, 0x4b44600, 0xc00078fa40, 0xc00a2215c0, 0x25, 0x0, 0x0, 0xc008be7b00)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:325 +0xf4
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTask
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:320 +0x14e

So it's this code

cockroach/pkg/storage/store.go

Lines 1613 to 1619 in 009f70a

ch := make(chan ctpb.Entry, 8)
const name = "closedts-rangefeed-subscriber"
if err := s.stopper.RunAsyncTask(ctx, name, func(ctx context.Context) {
s.cfg.ClosedTimestamp.Provider.Subscribe(ctx, ch)
}); err != nil {
return
}

Hmm, there's code that shuts down the provider by calling .drain(). There's a goroutine in a weird state here:

p.mu.Unlock()
for {
p.mu.Broadcast()
p.mu.Lock()
done := true
for _, sub := range p.mu.subscribers {
done = done && sub == nil
}

You'd think it wouldn't do that for 40 minutes, it should be done eventually.

goroutine 246436 [runnable]:
internal/race.Acquire(0xc00063ad94)
	/usr/local/go/src/internal/race/race.go:16 +0x4d
sync.(*RWMutex).Lock(0xc00063ad88)
	/usr/local/go/src/sync/rwmutex.go:102 +0x85
github.com/cockroachdb/cockroach/pkg/util/syncutil.(*RWMutex).Lock(0xc00063ad88)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/syncutil/mutex_sync_race.go:65 +0x3b
github.com/cockroachdb/cockroach/pkg/storage/closedts/provider.(*Provider).drain(0xc00063ad80)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/closedts/provider/provider.go:97 +0x110
github.com/cockroachdb/cockroach/pkg/storage/closedts/provider.(*Provider).runCloser(0xc00063ad80, 0x4b44600, 0xc008adb860)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/closedts/provider/provider.go:141 +0xa8c
github.com/cockroachdb/cockroach/pkg/storage/closedts/provider.(*Provider).runCloser-fm(0x4b44600, 0xc008adb860)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/closedts/provider/provider.go:88 +0x56
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker.func1(0xc005b0e520, 0xc0007b7900, 0xc005b0e510)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:200 +0x146
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:193 +0xc4

Hmm I see what this is. This is a rare situation because the subscription is actually blocking on the channel (it's buffered, so it doesn't usually fill up). It's not realizing that the stopper is draining (which implies that the provider is draining) because that needs a bool check. Well, in this case, it seems that the problem is that the consumer just left without giving notice. I'll send a PR.

tbg added a commit to tbg/cockroach that referenced this issue Jun 4, 2019
craig bot pushed a commit that referenced this issue Jun 4, 2019
37986: storage: avoid deadlock on (*Stopper).Quiesce r=nvanbenschoten a=tbg

See
#37954 (comment).

Closes #37954.

Release note: None

Co-authored-by: Tobias Schottdorf <[email protected]>
@craig craig bot closed this as completed in #37986 Jun 4, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants