storage: TestStoreRangeMergeSlowWatcher times out under race [skipped] #37191

Closed
RaduBerinde opened this issue Apr 29, 2019 · 15 comments
Labels
C-test-failure Broken test (automatically or manually discovered). E-starter Might be suitable for a starter project for new employees or team members.

Comments

@RaduBerinde
Member

https://teamcity.cockroachdb.com/viewLog.html?buildId=1265447&buildTypeId=Cockroach_UnitTests

https://teamcity.cockroachdb.com/viewLog.html?buildId=1265559&buildTypeId=Cockroach_UnitTests

E190429 15:52:09.327853 2674683 storage/queue.go:974  [replicaGC,s2,r22/2:{b-c}] operation "replicaGC queue process replica 22" timed out after 1m0s
W190429 15:52:12.717140 70760 storage/store_rebalancer.go:227  [s2,store-rebalancer] StorePool missing descriptor for local store
W190429 15:52:13.444342 70523 storage/store_rebalancer.go:227  [s1,store-rebalancer] StorePool missing descriptor for local store
W190429 15:52:23.151384 70897 storage/store_rebalancer.go:227  [s3,store-rebalancer] StorePool missing descriptor for local store
W190429 15:52:58.676598 70760 storage/store_rebalancer.go:227  [s2,store-rebalancer] StorePool missing descriptor for local store
W190429 15:53:07.155575 70523 storage/store_rebalancer.go:227  [s1,store-rebalancer] StorePool missing descriptor for local store
E190429 15:53:09.427880 2756557 storage/queue.go:974  [replicaGC,s2,r22/2:{b-c}] operation "replicaGC queue process replica 22" timed out after 1m0s
W190429 15:53:27.772187 70897 storage/store_rebalancer.go:227  [s3,store-rebalancer] StorePool missing descriptor for local store
W190429 15:54:04.524476 70760 storage/store_rebalancer.go:227  [s2,store-rebalancer] StorePool missing descriptor for local store
W190429 15:54:08.120934 70523 storage/store_rebalancer.go:227  [s1,store-rebalancer] StorePool missing descriptor for local store
E190429 15:54:09.527593 2839030 storage/queue.go:974  [replicaGC,s2,r22/2:{b-c}] operation "replicaGC queue process replica 22" timed out after 1m0s
W190429 15:54:35.671123 70897 storage/store_rebalancer.go:227  [s3,store-rebalancer] StorePool missing descriptor for local store
W190429 15:54:55.440544 70523 storage/store_rebalancer.go:227  [s1,store-rebalancer] StorePool missing descriptor for local store
W190429 15:55:03.928140 70760 storage/store_rebalancer.go:227  [s2,store-rebalancer] StorePool missing descriptor for local store
panic: test timed out after 45m0s
@RaduBerinde RaduBerinde added the C-test-failure Broken test (automatically or manually discovered). label Apr 29, 2019
@RaduBerinde
Member Author

I reproed on dff4132 with:

make stressrace PKG=./pkg/storage TESTS=TestStoreRangeMergeSlowWatcher STRESSFLAGS='-p 4' TESTTIMEOUT=2m

E190429 19:33:09.988487 2320 storage/queue.go:974  [replicaGC,s2,r22/2:{b-c}] operation "replicaGC queue process replica 22" timed out after 1m0s
W190429 19:33:10.319865 660 storage/store_rebalancer.go:227  [s3,store-rebalancer] StorePool missing descriptor for local store
W190429 19:33:52.043337 173 storage/store_rebalancer.go:227  [s1,store-rebalancer] StorePool missing descriptor for local store
W190429 19:33:58.312685 660 storage/store_rebalancer.go:227  [s3,store-rebalancer] StorePool missing descriptor for local store
W190429 19:34:03.723694 435 storage/store_rebalancer.go:227  [s2,store-rebalancer] StorePool missing descriptor for local store
panic: test timed out after 2m0s

@tbg
Member

tbg commented Apr 29, 2019

👍 thanks for crossing the t's, cc @andreimatei

@andreimatei
Contributor

I believe this test is likely what's causing our mysterious 1h timeouts for master TC builds. I've reproduced it taking 40 min when run in parallel with other packages (as it is on TC).

@andreimatei andreimatei self-assigned this Apr 30, 2019
@andreimatei
Contributor

The badness starts happening around Apr 10th, though exactly when I can't tell yet; I've already tricked myself once while bisecting because the time it takes to get a failure seems to vary. In fact, I have yet to find any commit where the stress survives 5 min with -p 32 on my machine, but it does seem to have gotten worse around Apr 10th. I'm still pretty confused.

Every time it times out, this is the trace for the test's goroutine: it's blocked on a channel send inside a testing knob that the test installs.

goroutine 22 [chan send]:
github.com/cockroachdb/cockroach/pkg/storage_test.TestStoreRangeMergeSlowWatcher.func1(0x7b, 0xa50, 0x100000001, 0x1, 0x15, 0x0, 0xc001870b00, 0x0, 0x0, 0x100000000, ...)
	/Users/andrei/work/src/github.com/cockroachdb/cockroach/pkg/storage/client_merge_test.go:2877 +0x21b
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).sendWithRangeID(0xc00147c580, 0x87ef420, 0xc001489b00, 0x15, 0x7b, 0xa50, 0x100000001, 0x1, 0x15, 0x0, ...)
	/Users/andrei/work/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:534 +0x92b
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).Send(0xc00147c580, 0x87ef420, 0xc001489b00, 0x7b, 0xa50, 0x100000001, 0x1, 0x15, 0x0, 0xc001870b00, ...)
	/Users/andrei/work/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:491 +0xd1
github.com/cockroachdb/cockroach/pkg/storage.(*Store).Send(0xc000270000, 0x87ef420, 0xc001489b00, 0x7b, 0xa50, 0x100000001, 0x1, 0x15, 0x0, 0xc001870b00, ...)
	/Users/andrei/work/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:2972 +0x802
github.com/cockroachdb/cockroach/pkg/storage.(*Stores).Send(0xc0003f2fd0, 0x87ef420, 0xc001489b00, 0x0, 0x0, 0x100000001, 0x1, 0x15, 0x0, 0xc001870b00, ...)
	/Users/andrei/work/src/github.com/cockroachdb/cockroach/pkg/storage/stores.go:185 +0x14a
github.com/cockroachdb/cockroach/pkg/storage_test.(*multiTestContextKVTransport).SendNext.func1(0x87ef420, 0xc001489b00)
	/Users/andrei/work/src/github.com/cockroachdb/cockroach/pkg/storage/client_test.go:557 +0xd7
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunTask(0xc000230140, 0x87ef420, 0xc001489b00, 0x78facce, 0x8, 0xc001111748, 0x0, 0x0)
	/Users/andrei/work/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:287 +0xfc
github.com/cockroachdb/cockroach/pkg/storage_test.(*multiTestContextKVTransport).SendNext(0xc001974cc0, 0x87ef420, 0xc001489b00, 0x0, 0x0, 0x100000001, 0x1, 0x15, 0x0, 0xc001870b00, ...)
	/Users/andrei/work/src/github.com/cockroachdb/cockroach/pkg/storage/client_test.go:556 +0x5fe
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendToReplicas(0xc0001f0a20, 0x87ef420, 0xc001489b00, 0xc0001f0a70, 0x15, 0xc0016b8a00, 0x3, 0x3, 0x0, 0x0, ...)
	/Users/andrei/work/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:1365 +0x3be
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendRPC(0xc0001f0a20, 0x87ef420, 0xc001489b00, 0x15, 0xc0016b8a00, 0x3, 0x3, 0x0, 0x0, 0x0, ...)
	/Users/andrei/work/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:416 +0x309
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendSingleRange(0xc0001f0a20, 0x87ef420, 0xc001489b00, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc001870a00, ...)
	/Users/andrei/work/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:496 +0x20d
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendPartialBatch(0xc0001f0a20, 0x87ef420, 0xc001489b00, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc001870a00, ...)
	/Users/andrei/work/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:1141 +0x423
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).divideAndSendBatchToRanges(0xc0001f0a20, 0x87ef420, 0xc001489b00, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc001870a00, ...)
	/Users/andrei/work/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:798 +0x19d1
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).Send(0xc0001f0a20, 0x87ef420, 0xc001489b00, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc001870900, ...)
	/Users/andrei/work/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:710 +0x6eb
github.com/cockroachdb/cockroach/pkg/kv.(*txnLockGatekeeper).SendLocked(0xc000b4f1a0, 0x87ef420, 0xc0014899b0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc001870900, ...)
	/Users/andrei/work/src/github.com/cockroachdb/cockroach/pkg/kv/txn_coord_sender.go:235 +0x160
github.com/cockroachdb/cockroach/pkg/kv.(*txnMetricRecorder).SendLocked(0xc000b4f168, 0x87ef420, 0xc0014899b0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc001870900, ...)
	/Users/andrei/work/src/github.com/cockroachdb/cockroach/pkg/kv/txn_interceptor_metric_recorder.go:51 +0x144
github.com/cockroachdb/cockroach/pkg/kv.(*txnCommitter).SendLocked(0xc000b4f158, 0x87ef420, 0xc0014899b0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc001870900, ...)
	/Users/andrei/work/src/github.com/cockroachdb/cockroach/pkg/kv/txn_interceptor_committer.go:64 +0x153
github.com/cockroachdb/cockroach/pkg/kv.(*txnSpanRefresher).sendLockedWithRefreshAttempts(0xc000b4f0d8, 0x87ef420, 0xc0014899b0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc001870900, ...)
	/Users/andrei/work/src/github.com/cockroachdb/cockroach/pkg/kv/txn_interceptor_span_refresher.go:144 +0xd1
github.com/cockroachdb/cockroach/pkg/kv.(*txnSpanRefresher).SendLocked(0xc000b4f0d8, 0x87ef420, 0xc0014899b0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc001870900, ...)
	/Users/andrei/work/src/github.com/cockroachdb/cockroach/pkg/kv/txn_interceptor_span_refresher.go:100 +0x191
github.com/cockroachdb/cockroach/pkg/kv.(*txnPipeliner).SendLocked(0xc000b4f050, 0x87ef420, 0xc0014899b0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc001870900, ...)
	/Users/andrei/work/src/github.com/cockroachdb/cockroach/pkg/kv/txn_interceptor_pipeliner.go:184 +0x18a
github.com/cockroachdb/cockroach/pkg/kv.(*txnIntentCollector).SendLocked(0xc000b4f010, 0x87ef420, 0xc0014899b0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc001870900, ...)
	/Users/andrei/work/src/github.com/cockroachdb/cockroach/pkg/kv/txn_interceptor_intent_collector.go:101 +0x59d
github.com/cockroachdb/cockroach/pkg/kv.(*txnSeqNumAllocator).SendLocked(0xc000b4eff8, 0x87ef420, 0xc0014899b0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc001870900, ...)
	/Users/andrei/work/src/github.com/cockroachdb/cockroach/pkg/kv/txn_interceptor_seq_num_allocator.go:93 +0x3d4
github.com/cockroachdb/cockroach/pkg/kv.(*txnHeartbeater).SendLocked(0xc000b4ef48, 0x87ef420, 0xc0014899b0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc001870900, ...)
	/Users/andrei/work/src/github.com/cockroachdb/cockroach/pkg/kv/txn_interceptor_heartbeater.go:224 +0x183
github.com/cockroachdb/cockroach/pkg/kv.(*TxnCoordSender).Send(0xc000b4ed80, 0x87ef420, 0xc0014899b0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/Users/andrei/work/src/github.com/cockroachdb/cockroach/pkg/kv/txn_coord_sender.go:796 +0x78e
github.com/cockroachdb/cockroach/pkg/internal/client.(*DB).sendUsingSender(0xc000163980, 0x87ef420, 0xc001ce5c80, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/Users/andrei/work/src/github.com/cockroachdb/cockroach/pkg/internal/client/db.go:635 +0x151
github.com/cockroachdb/cockroach/pkg/internal/client.(*Txn).Send(0xc0008a4510, 0x87ef420, 0xc001ce5c80, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/Users/andrei/work/src/github.com/cockroachdb/cockroach/pkg/internal/client/txn.go:792 +0x1d2
github.com/cockroachdb/cockroach/pkg/internal/client.(*Txn).rollback(0xc0008a4510, 0x87ef420, 0xc001ce5c80, 0xc00039c838)
	/Users/andrei/work/src/github.com/cockroachdb/cockroach/pkg/internal/client/txn.go:589 +0x520
github.com/cockroachdb/cockroach/pkg/internal/client.(*Txn).CleanupOnError(0xc0008a4510, 0x87ef420, 0xc001ce5c80, 0x87b8460, 0xc000a14000)
	/Users/andrei/work/src/github.com/cockroachdb/cockroach/pkg/internal/client/txn.go:506 +0x7c
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).AdminMerge(0xc00147c580, 0x87ef420, 0xc001ce5c80, 0xc0002aa648, 0x1, 0x8, 0x0, 0x0, 0x0, 0x0, ...)
	/Users/andrei/work/src/github.com/cockroachdb/cockroach/pkg/storage/replica_command.go:523 +0x705
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).executeAdminBatch(0xc00147c580, 0x87ef420, 0xc001ce5c80, 0x7b, 0x338, 0x100000001, 0x1, 0x15, 0x0, 0x0, ...)
	/Users/andrei/work/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:1259 +0x103c
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).sendWithRangeID(0xc00147c580, 0x87ef420, 0xc001ce5c80, 0x15, 0x7b, 0x338, 0x100000001, 0x1, 0x15, 0x0, ...)
	/Users/andrei/work/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:549 +0x81a
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).Send(0xc00147c580, 0x87ef420, 0xc001ce5c20, 0x7b, 0x338, 0x100000001, 0x1, 0x15, 0x0, 0x0, ...)
	/Users/andrei/work/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:491 +0xd1
github.com/cockroachdb/cockroach/pkg/storage.(*Store).Send(0xc000270000, 0x87ef420, 0xc001ce5c20, 0x7b, 0x338, 0x100000001, 0x1, 0x15, 0x0, 0x0, ...)
	/Users/andrei/work/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:2972 +0x802
github.com/cockroachdb/cockroach/pkg/storage.(*Stores).Send(0xc0003f2fd0, 0x87ef420, 0xc001ce5b30, 0x0, 0x0, 0x100000001, 0x1, 0x15, 0x0, 0x0, ...)
	/Users/andrei/work/src/github.com/cockroachdb/cockroach/pkg/storage/stores.go:185 +0x14a
github.com/cockroachdb/cockroach/pkg/storage_test.(*multiTestContextKVTransport).SendNext.func1(0x87ef420, 0xc001ce5b30)
	/Users/andrei/work/src/github.com/cockroachdb/cockroach/pkg/storage/client_test.go:557 +0xd7
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunTask(0xc000230140, 0x87ef420, 0xc001ce5b30, 0x78facce, 0x8, 0xc001847278, 0x0, 0x0)
	/Users/andrei/work/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:287 +0xfc
github.com/cockroachdb/cockroach/pkg/storage_test.(*multiTestContextKVTransport).SendNext(0xc000c15700, 0x87ef420, 0xc001ce5b30, 0x0, 0x0, 0x100000001, 0x1, 0x15, 0x0, 0x0, ...)
	/Users/andrei/work/src/github.com/cockroachdb/cockroach/pkg/storage/client_test.go:556 +0x5fe
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendToReplicas(0xc0001f0a20, 0x87ef420, 0xc001ce5b30, 0xc0001f0a70, 0x15, 0xc000697090, 0x3, 0x3, 0x0, 0x0, ...)
	/Users/andrei/work/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:1365 +0x3be
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendRPC(0xc0001f0a20, 0x87ef420, 0xc001ce5b30, 0x15, 0xc000697090, 0x3, 0x3, 0x0, 0x0, 0x0, ...)
	/Users/andrei/work/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:416 +0x309
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendSingleRange(0xc0001f0a20, 0x87ef420, 0xc001ce5b30, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/Users/andrei/work/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:496 +0x20d
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendPartialBatch(0xc0001f0a20, 0x87ef420, 0xc001ce5b30, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/Users/andrei/work/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:1141 +0x423
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).divideAndSendBatchToRanges(0xc0001f0a20, 0x87ef420, 0xc001ce5b30, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/Users/andrei/work/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:798 +0x19d1
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).Send(0xc0001f0a20, 0x87ef420, 0xc001ce5b30, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/Users/andrei/work/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:710 +0x6eb
github.com/cockroachdb/cockroach/pkg/internal/client.SendWrappedWith(0x87ef3a0, 0xc0000e0008, 0x87b8000, 0xc0001f0a20, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/Users/andrei/work/src/github.com/cockroachdb/cockroach/pkg/internal/client/sender.go:435 +0x156
github.com/cockroachdb/cockroach/pkg/internal/client.SendWrapped(0x87ef3a0, 0xc0000e0008, 0x87b8000, 0xc0001f0a20, 0x88208c0, 0xc0012736c0, 0xc0002aa690, 0x1, 0x8)
	/Users/andrei/work/src/github.com/cockroachdb/cockroach/pkg/internal/client/sender.go:452 +0xdd
github.com/cockroachdb/cockroach/pkg/storage_test.TestStoreRangeMergeSlowWatcher(0xc0001f8f00)
	/Users/andrei/work/src/github.com/cockroachdb/cockroach/pkg/storage/client_merge_test.go:2948 +0x1049
testing.tRunner(0xc0001f8f00, 0x7a4e9c8)
	/Users/andrei/go/src/testing/testing.go:827 +0x163
created by testing.(*T).Run
	/Users/andrei/go/src/testing/testing.go:878 +0x65a

@nvanbenschoten
Member

Does the failure rate decrease as you increase the capacity of the channel that we're getting stuck on?

allowPushTxn := make(chan struct{}, 10) // buffered for headroom if the merge txn restarts

Perhaps the increased load from running the test in parallel is leading to lots of retries of the merge transaction.
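For what it's worth, here's a minimal, self-contained Go sketch of the mechanic in question (the standalone shape and every name other than allowPushTxn are mine, not the test's): each intercepted PushTxn, one per merge-txn attempt, sends on the 10-slot channel, and once that headroom is used up the real filter's plain send would block rather than fall through to a default case.

package main

import "fmt"

func main() {
	// Buffered like the test's knob: sized for a handful of merge-txn restarts.
	allowPushTxn := make(chan struct{}, 10)

	// Hypothetical stand-in for the request filter: one send per intercepted
	// PushTxn. The select/default exists only so this demo terminates; the
	// real filter does a plain blocking send, which is where the hang occurs.
	interceptPushTxn := func(attempt int) {
		select {
		case allowPushTxn <- struct{}{}:
			fmt.Printf("attempt %d: queued\n", attempt)
		default:
			fmt.Printf("attempt %d: channel full; a plain send would block here\n", attempt)
		}
	}

	// Under heavy stress the merge txn can restart more than 10 times,
	// exhausting the headroom.
	for attempt := 1; attempt <= 12; attempt++ {
		interceptPushTxn(attempt)
	}
}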

tbg added a commit to tbg/cockroach that referenced this issue May 1, 2019
@tbg tbg changed the title TestStoreRangeMergeSlowWatcher times out under race TestStoreRangeMergeSlowWatcher times out under race [skipped] May 1, 2019
@tbg
Member

tbg commented May 1, 2019

As @andy-kimball pointed out, this might be good for someone who's not Nathan, Andrei, or myself to look at. @andreimatei, wanna delegate?

@tbg
Member

tbg commented May 1, 2019

(PS I just sent a PR to skip the test)

craig bot pushed a commit that referenced this issue May 1, 2019
37229: storage: skip TestRangeMergeSlowWatcher r=nvanbenschoten a=tbg

See #37191

Release note: None

Co-authored-by: Tobias Schottdorf <[email protected]>
@andreimatei
Contributor

@darinpp are you interested in this one?

@darinpp
Contributor

darinpp commented May 1, 2019

Yes, I'll take a look

@darinpp darinpp assigned darinpp and unassigned andreimatei May 1, 2019
@andreimatei
Contributor

To repro I run something like

make stressrace PKG=./pkg/storage TESTS=TestStoreRangeMergeSlowWatcher STRESSFLAGS="-p 32" TESTTIMEOUT=1m

The failures seem to have two regimes. They happen very quickly since around 5251f6a5c1187c92d4dd4b94e5aa35133231e94a, and maybe more slowly/rarely just before that? But I haven't done enough runs to say anything with confidence.

@andy-kimball
Contributor

I'm interested in taking a closer look at this one, since I have some time today. Darin's handing it over; if I can't make progress in the time I have, I can hand it back.

@andy-kimball
Contributor

I did notice that the logs had exactly 10 lines like this:

I190501 15:30:29.482424 11 storage/replica_command.go:417  [s1,r21/1:{a-c}] initiating a merge of r23:{c-/Table/SystemConfigSpan/Start} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=3] into this range (manual)

So it's very likely that it's blocking on:

allowPushTxn := make(chan struct{}, 10)

Still trying to understand why we'd initiate the same merge 10 times in quick succession (roughly once a second).

@nvanbenschoten nvanbenschoten added the E-starter Might be suitable for a starter project for new employees or team members. label May 1, 2019
@tbg tbg changed the title TestStoreRangeMergeSlowWatcher times out under race [skipped] storage: TestStoreRangeMergeSlowWatcher times out under race [skipped] May 7, 2019
@andy-kimball andy-kimball assigned andy-kimball and unassigned darinpp May 9, 2019
@andy-kimball
Contributor

Here is what I think is happening (though there are still gaps in my understanding):

  1. A<-B merge starts and Subsume request locks down B.
  2. Watcher on B sends PushTxn request, which is intercepted by the TestingRequestFilter in the test.
  3. The merge txn aborts due to interference with the replica GC, since it's concurrently reading range descriptors.
  4. The merge txn is retried, but the Watcher on B is locking the range, so it aborts again.
  5. Step 4 repeats until the allowPushTxn channel fills up (it has a capacity of 10). This causes a deadlock because the merge txn can't continue. Meanwhile, the watcher is blocked waiting for the result of the PushTxn request, which in turn is blocked waiting for the merge txn (see the sketch below).
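To make the circular wait in step 5 concrete, here is a tiny standalone Go model (all names except allowPushTxn are hypothetical; the real channels and goroutines live in client_merge_test.go). Running it trips Go's built-in deadlock detector; in the real test other goroutines keep running, so nothing fires and the test just sits there until the 45m timeout.

package main

import "fmt"

func main() {
	// Models the moment the allowPushTxn headroom is gone: the next send
	// blocks because nothing will ever drain the channel again.
	allowPushTxn := make(chan struct{})
	mergeDone := make(chan struct{})

	// Watcher: its PushTxn can only resolve once the merge txn finishes,
	// so it parks waiting for mergeDone.
	go func() {
		<-mergeDone
		fmt.Println("watcher: PushTxn resolved")
	}()

	// Merge txn (main goroutine): the retry's send blocks forever, so
	// mergeDone is never closed and the two goroutines wait on each other.
	allowPushTxn <- struct{}{}
	close(mergeDone)
}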

@andy-kimball
Contributor

andy-kimball commented May 11, 2019

Another thing: I think a 1 minute timeout might give false repros, which can muddy the waters. Under heavy load, this test runs very slowly, with lots of timeouts and retries. So just because one of the parallel tests times out doesn't mean that we've reproduced the 45 minute deadlock shown in the original TeamCity runs. As an illustration, I tried setting the timeout low and then increasing it:

If timeout is 30s, test fails after ~30s
If timeout is 40s, test fails after ~1m30s
If timeout is 50s, test fails after ~3m30s

If I set the timeout to 1m, the test would probably fail after tens of minutes of heavy stress, as @andreimatei observed. This doesn't indicate that it has deadlocked (i.e. a true repro), only that the timeout is too short given the load.

This is an important point because the fix I'm working on still eventually fails with a timeout of 1m, but looking at the logs I believe these are "false" failures. If I increase the timeout to 5m, I don't see any failures during a 1h run I tried.

@tbg
Member

tbg commented May 16, 2019

Fixed in #37477

@tbg tbg closed this as completed May 16, 2019