storage: TestStoreRangeMergeSlowWatcher times out under race [skipped] #37191
I reproed on dff4132 with:
|
👍 thanks for crossing the t, cc @andreimatei |
I believe this test is likely what's causing our mysterious 1h timeouts for master TC builds. I've reproed it after running for 40 min with the package in parallel with other packages (as it is on TC). |
The badness starts happening around Apr 10th. Exactly when, I can't tell yet; I've tricked myself bisecting once already because the amount of time it takes to get a failure seems to vary. In fact, I have yet to find any commit where the stress can survive 5 min. Every time it times out, this is the trace for the test's goroutine. It's blocked on a channel in a test knob the test adds.
|
Does the failure rate decrease as you increase the capacity of the channel that we're getting stuck on? (cockroach/pkg/storage/client_merge_test.go, line 2861 at 468b93b)
Perhaps the increased load from running the test in parallel is leading to lots of retries of the merge transaction. |
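For readers without the test in front of them: the knob being discussed is a channel that a store filter sends on when it sees a merge attempt, and the test goroutine receives from it. The following is a minimal, self-contained sketch (all names and counts are illustrative, not the actual code at line 2861) of why a small-capacity channel there leaves extra senders blocked once the merge transaction starts retrying:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Stand-in for the test knob's channel. The question above is whether
	// failures go away as this capacity grows.
	const capacity = 0
	mergeAttempted := make(chan struct{}, capacity)

	// Each merge attempt (including txn retries under load) runs the
	// filter, which does a bare send on the channel.
	const attempts = 10
	for i := 0; i < attempts; i++ {
		go func(i int) {
			mergeAttempted <- struct{}{} // blocks once the buffer is full
			fmt.Printf("attempt %d signaled the test\n", i)
		}(i)
	}

	// The test goroutine only expects one attempt and receives once.
	<-mergeAttempted
	time.Sleep(100 * time.Millisecond)
	fmt.Println("remaining senders are still blocked; in a real test the request would hang")
}
```

With `capacity` raised to the number of attempts, none of the sends block, which is the behavior the question above is probing.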
See cockroachdb#37191 Release note: None
As @andy-kimball pointed out, this might be good for someone who's not Nathan, Andrei, or myself to look at. @andreimatei, wanna delegate? |
(PS I just sent a PR to skip the test) |
37229: storage: skip TestRangeMergeSlowWatcher r=nvanbenschoten a=tbg See #37191 Release note: None Co-authored-by: Tobias Schottdorf <[email protected]>
@darinpp are you interested in this one? |
Yes, I'll take a look |
To repro I run something like
The failures seem to have two regimes. They happen very quickly since around |
I'm interested in taking a closer look at this one, since I have some time today. Darin's handing it over; if I can't make progress in the time I have, I can hand it back. |
I did notice that the logs had exactly 10 lines like this:
So it's very likely that it's blocking on:
Still trying to understand why we'd initiate the same merge 10 times in quick succession (roughly once a second). |
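If the merge really is being initiated ~10 times in quick succession and each attempt runs the filter, then every bare send after the one the test receives will hang that attempt. One common way such knobs avoid this class of hang is a non-blocking send; this is purely a hypothetical sketch of that pattern, not necessarily what the eventual fix does:

```go
package main

import "fmt"

// nonBlockingNotify sketches how a test knob can signal the test without
// stalling the caller: if the test hasn't drained the previous signal yet
// (e.g. because the merge txn is on its 10th retry), the send is dropped
// rather than blocking the KV request.
func nonBlockingNotify(mergeAttempted chan struct{}) {
	select {
	case mergeAttempted <- struct{}{}:
	default:
	}
}

func main() {
	mergeAttempted := make(chan struct{}, 1)
	for i := 0; i < 10; i++ { // ten retried merge attempts
		nonBlockingNotify(mergeAttempted)
	}
	<-mergeAttempted // the test still observes that a merge was attempted
	fmt.Println("observed a merge attempt; no sender ever blocked")
}
```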
Here is what I think is happening (though there are still gaps in my understanding):
|
Another thing is that I think a 1 minute timeout might give false repros, which can muddy the waters. Under heavy load, this test runs very slowly, with lots of timeouts and retries. So just because one of the parallel tests times out doesn't mean that we've reproduced the 45 minute deadlock shown in the original teamcity runs. As an illustration of this, I tried setting timeouts low and then increasing them:
Probably if I set the timeout to 1m, the test would fail after tens of minutes of heavy stress, as @andreimatei observed. This doesn't indicate it has deadlocked (i.e. a true repro), only that the timeout is too short given the load. This is an important point because the fix I'm working on still eventually fails with a timeout of 1m, but looking at the logs I believe these are "false" failures. If I increase the timeout to 5m, I don't see any failures during a 1h run I tried. |
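One way to make the "slow under load" vs. "genuinely deadlocked" distinction visible in the logs is a per-step deadline on the wait itself, so the failure message says what the test was stuck on instead of relying on the package-level timeout. A hypothetical helper along those lines (the name, the channel, and the 2-minute figure are all assumptions for illustration):

```go
package storage_test

import (
	"testing"
	"time"
)

// waitForMergeAttempt illustrates a per-step deadline: it fails fast with a
// targeted message rather than letting the package-level -timeout fire much
// later with only a goroutine dump to go on.
func waitForMergeAttempt(t *testing.T, mergeAttempted <-chan struct{}) {
	t.Helper()
	select {
	case <-mergeAttempted:
		// Progress: the merge reached the filter.
	case <-time.After(2 * time.Minute):
		t.Fatal("no merge attempt observed; likely stuck on the test knob, not just slow")
	}
}
```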
Fixed in #37477 |
https://teamcity.cockroachdb.com/viewLog.html?buildId=1265447&buildTypeId=Cockroach_UnitTests
https://teamcity.cockroachdb.com/viewLog.html?buildId=1265559&buildTypeId=Cockroach_UnitTests