Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: schemachange/index/tpcc/w=800 failed [deadlock in kvflowhandle] #106078

Closed
cockroach-teamcity opened this issue Jul 3, 2023 · 8 comments · Fixed by #106411
Closed
Assignees
Labels
A-kv Anything in KV that doesn't belong in a more specific category. branch-master Failures and bugs on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-kv KV Team
Milestone

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented Jul 3, 2023

roachtest.schemachange/index/tpcc/w=800 failed with artifacts on master @ aacba20d325e5702836e9a76be646b5f1bd922af:

(test_runner.go:1075).runTest: test timed out (6h0m0s)
(schemachange.go:500).runAndLogStmts: dial tcp 34.73.203.138:26257: connect: connection refused
(monitor.go:137).Wait: monitor failure: monitor task failed: t.Fatal() was called
test artifacts and logs in: /artifacts/schemachange/index/tpcc/w=800/run_1

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=gce , ROACHTEST_cpu=16 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/sql-foundations

This test on roachdash | Improve this report!

Jira issue: CRDB-29388

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) labels Jul 3, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.2 milestone Jul 3, 2023
@rafiss
Copy link
Collaborator

rafiss commented Jul 6, 2023

Requesting KV help on this. There are approx ~11,000 messages like this in node 3's logs:

W230703 20:31:02.731020 369 kv/kvserver/split_trigger_helper.go:146 â‹® [T1,n3,s3,r385/3:{-},raft] 107150  would have dropped incoming MsgApp to wait for split trigger, but allowing because uninitialized replica was created 3h27m14.92854264s (>20s) ago
W230703 20:31:02.731127 413 kv/kvserver/split_trigger_helper.go:146 â‹® [T1,n3,s3,r372/2:{-},raft] 107151  would have dropped incoming MsgApp to wait for split trigger, but allowing because uninitialized replica was created 5h52m21.800404629s (>20s) ago
W230703 20:31:02.731296 342 kv/kvserver/split_trigger_helper.go:146 â‹® [T1,n3,s3,r384/3:{-},raft] 107152  would have dropped incoming MsgApp to wait for split trigger, but allowing because uninitialized replica was created 5h51m27.058127428s (>20s) ago
W230703 20:31:02.731870 439 kv/kvserver/split_trigger_helper.go:146 â‹® [T1,n3,s3,r373/3:{-},raft] 107153  would have dropped incoming MsgApp to wait for split trigger, but allowing because uninitialized replica was created 5h52m16.215725182s (>20s) ago
W230703 20:31:03.229971 489 kv/kvserver/raft_transport.go:926 â‹® [T1,n3] 107154  raft send queue to n1 is full

@rafiss rafiss added T-kv KV Team and removed T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) labels Jul 6, 2023
@andyyang890
Copy link
Collaborator

Seems to be related to #106285?

@shralex
Copy link
Contributor

shralex commented Jul 7, 2023

@irfansharif can you take a look ?

@irfansharif
Copy link
Contributor

Saw similar logging in #106140, I wonder if this is related to the investigations over at #106108 now that #106285 was addressed (just today, we haven't had a nightly run yet).

@tbg
Copy link
Member

tbg commented Jul 7, 2023

From n3 stacks

google.golang.org/grpc/internal/transport.(*writeQuota).get(...) [select, 354 minutes]
* goroutine 619 [select, 354 minutes]:
* google.golang.org/grpc/internal/transport.(*writeQuota).get(...)
* 	google.golang.org/grpc/internal/transport/external/org_golang_google_grpc/internal/transport/flowcontrol.go:59
* google.golang.org/grpc/internal/transport.(*http2Client).Write(0xc002370240, 0xc0024d5320, {0xc0b7ac8418, 0x5, 0x5}, {0xc0368ff500, 0xa26, 0xa80}, 0xc0b7ac841d)
* 	google.golang.org/grpc/internal/transport/external/org_golang_google_grpc/internal/transport/http2_client.go:1051 +0x228
* google.golang.org/grpc.(*csAttempt).sendMsg(0xc0023b05b0, {0x5e6b1a0?, 0xc001340660}, {0xc0b7ac8418, 0x5, 0x5}, {0xc0368ff500, 0xa26, 0xa80}, {0xc038c66000, ...})
* 	google.golang.org/grpc/external/org_golang_google_grpc/stream.go:1020 +0x2f3
* google.golang.org/grpc.(*clientStream).SendMsg.func2(0x50?)
* 	google.golang.org/grpc/external/org_golang_google_grpc/stream.go:875 +0x53
* google.golang.org/grpc.(*clientStream).withRetry(0xc0024d4ea0, 0xc03e57f980, 0xc0bf577598)
* 	google.golang.org/grpc/external/org_golang_google_grpc/stream.go:735 +0x3be
* google.golang.org/grpc.(*clientStream).SendMsg(0xc0024d4ea0, {0x5e6b1a0?, 0xc001340660})
* 	google.golang.org/grpc/external/org_golang_google_grpc/stream.go:877 +0x325
* github.com/cockroachdb/cockroach/pkg/util/tracing/grpcinterceptor.(*tracingClientStream).SendMsg(0xc000e04fa0, {0x5e6b1a0?, 0xc001340660?})
* 	github.com/cockroachdb/cockroach/pkg/util/tracing/grpcinterceptor/grpc_interceptor.go:385 +0x33
* github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*multiRaftRaftMessageBatchClient).Send(0xc000234690?, 0x5e6b060?)
* 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/bazel-out/k8-opt/bin/pkg/kv/kvserver/kvserver_go_proto_/github.com/cockroachdb/cockroach/pkg/kv/kvserver/storage_services.pb.go:124 +0x2b
* github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*RaftTransport).processQueue(0xc00125ca80, 0xc00149c8e8, {0x780b1c8, 0xc007f43490}, 0x0)
* 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/raft_transport.go:799 +0xc50
* github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*RaftTransport).startProcessNewQueue.func2({0x77c54b0, 0xc001340150})
* 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/raft_transport.go:995 +0x405
* runtime/pprof.Do({0x77c54b0?, 0xc0013400f0?}, {{0xc000e04e40?, 0xc0020f06b8?, 0xc88786?}}, 0xc000e04de0)
* 	GOROOT/src/runtime/pprof/runtime.go:40 +0xa3
* github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*RaftTransport).startProcessNewQueue.func3({0x77c54b0, 0xc0013400f0})
* 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/raft_transport.go:1001 +0x1c8
* github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2()
* 	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:484 +0x146
* created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx
* 	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:475 +0x43b

I am guessing that some peer has stopped consuming messages.

On n1,

quotapool.(*AbstractPool).Acquire() [select, 233 minutes]
* goroutine 2709337 [select, 233 minutes]:
* github.com/cockroachdb/cockroach/pkg/util/quotapool.(*AbstractPool).Acquire(0xc0560898c0, {0x77c54b0, 0xc098b78cf0}, {0x7796b18, 0xc09247d410})
* 	github.com/cockroachdb/cockroach/pkg/util/quotapool/quotapool.go:281 +0x75c
* github.com/cockroachdb/cockroach/pkg/util/quotapool.(*IntPool).acquireMaybeWait(0xc03b100630, {0x77c54b0, 0xc098b78cf0}, 0x9b, 0x1)
* 	github.com/cockroachdb/cockroach/pkg/util/quotapool/intpool.go:178 +0x13f
* github.com/cockroachdb/cockroach/pkg/util/quotapool.(*IntPool).Acquire(...)
* 	github.com/cockroachdb/cockroach/pkg/util/quotapool/intpool.go:147
* github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).maybeAcquireProposalQuota(0xc0429be000, {0x77c54b0, 0xc098b78cf0}, 0x77c54b0?, 0xc098b78cf0?)
* 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_proposal_quota.go:70 +0x24d
* github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).evalAndPropose(0xc0429be000, {0x77c54b0, 0xc098b78cf0}, 0xc085794480, 0xc0c3eab2c0, 0xc085e84100, {{0x0, 0x0, 0x0}, {0x0, ...}}, ...)
* 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:279 +0xf98
* github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).executeWriteBatch(0xc0429be000, {0x77c54b0, 0xc098b78cf0}, 0xc085794480, 0xc0c3eab2c0)
* 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_write.go:175 +0x7ca
* github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).executeBatchWithConcurrencyRetries(0xc0429be000, {0x77c54b0, 0xc098b78cf0}, 0xc085794480, 0x64e3df8)
* 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_send.go:508 +0x383
* github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).SendWithWriteBytes(0xc0429be000, {0x77c54b0?, 0xc098b78cc0?}, 0xc085794480)
*

I found what looks like a deadlock in kvflowcontrol.Handle in n1s logs, which would explain why n3's raft transport to n1 is perma-stalled:

Details
I230703 20:33:13.526324 98 util/log/clog.go:410 ⋮ [-] 11615 +‹goroutine 3497 [semacquire, 354 minutes]:›
I230703 20:33:13.526324 98 util/log/clog.go:410 ⋮ [-] 11615 +‹sync.runtime_SemacquireMutex(0x252?, 0x18?, 0x0?)›
I230703 20:33:13.526324 98 util/log/clog.go:410 ⋮ [-] 11615 +‹	GOROOT/src/runtime/sema.go:77 +0x25›
I230703 20:33:13.526324 98 util/log/clog.go:410 ⋮ [-] 11615 +‹sync.(*Mutex).lockSlow(0xc0c41d3610)›
I230703 20:33:13.526324 98 util/log/clog.go:410 ⋮ [-] 11615 +‹	GOROOT/src/sync/mutex.go:171 +0x165›
I230703 20:33:13.526324 98 util/log/clog.go:410 ⋮ [-] 11615 +‹sync.(*Mutex).Lock(...)›
I230703 20:33:13.526324 98 util/log/clog.go:410 ⋮ [-] 11615 +‹	GOROOT/src/sync/mutex.go:90›
I230703 20:33:13.526324 98 util/log/clog.go:410 ⋮ [-] 11615 +‹github.com/cockroachdb/cockroach/pkg/kv/kvserver/kvflowcontrol/kvflowhandle.(*Handle).ReturnTokensUpto(0xc0c41d35e0, {0x77c54b0, 0xc005bf1ce0}, 0x88?, {0x472079?, 0x5e6b1a0?}, {{0x0?}, 0x0?})›
I230703 20:33:13.526324 98 util/log/clog.go:410 ⋮ [-] 11615 +‹	github.com/cockroachdb/cockroach/pkg/kv/kvserver/kvflowcontrol/kvflowhandle/kvflowhandle.go:199 +0xb8›
I230703 20:33:13.526324 98 util/log/clog.go:410 ⋮ [-] 11615 +‹github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*RaftTransport).handleRaftRequest(0xc00115a900, {0x77c54b0, 0xc005bf1ce0}, 0xc0f0ffe160, {0x7772820, 0xc00635a210})›
I230703 20:33:13.526324 98 util/log/clog.go:410 ⋮ [-] 11615 +‹	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/raft_transport.go:428 +0x153›
I230703 20:33:13.526324 98 util/log/clog.go:410 ⋮ [-] 11615 +‹github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*RaftTransport).RaftMessageBatch.func1.1(0xc00115a900, {0x780b220?, 0xc0064fc8a0}, {0x77c54b0, 0xc005bf1ce0})›
I230703 20:33:13.526324 98 util/log/clog.go:410 ⋮ [-] 11615 +‹	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/raft_transport.go:523 +0x348›
I230703 20:33:13.526324 98 util/log/clog.go:410 ⋮ [-] 11615 +‹github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*RaftTransport).RaftMessageBatch.func1({0x77c54b0?, 0xc005bf1ce0?})›
I230703 20:33:13.526324 98 util/log/clog.go:410 ⋮ [-] 11615 +‹	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/raft_transport.go:531 +0x45›
I230703 20:33:13.526324 98 util/log/clog.go:410 ⋮ [-] 11615 +‹github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2()›
I230703 20:33:13.526324 98 util/log/clog.go:410 ⋮ [-] 11615 +‹	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:484 +0x146›
I230703 20:33:13.526324 98 util/log/clog.go:410 ⋮ [-] 11615 +‹created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx›
I230703 20:33:13.526324 98 util/log/clog.go:410 ⋮ [-] 11615 +‹	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:475 +0x43b›

There are a few goroutines trying to get that mutex. There's also this log msg:

E230703 14:38:36.145049 2293094 kv/kvserver/kvflowcontrol/kvflowhandle/kvflowhandle.go:100 ⋮ [T1,n1] 1033 operating on a closed handle

That error message actually straight up leaks the mutex:

h.mu.Lock()
if h.mu.closed {
log.Errorf(ctx, "operating on a closed handle")
return nil
}

@irfansharif I assume you'll take it from here.

@tbg tbg changed the title roachtest: schemachange/index/tpcc/w=800 failed roachtest: schemachange/index/tpcc/w=800 failed [deadlock in kvflowhandle] Jul 7, 2023
@chengxiong-ruan
Copy link
Contributor

This is the same problem as #106349 I assume?

@rafiss
Copy link
Collaborator

rafiss commented Jul 7, 2023

Perhaps #106369, #106387, #106353, #106331, and #106330 too

@irfansharif
Copy link
Contributor

Bor-sed a fix, I'll comb through these open CI issues. I don't understand why this tripped up now though, things have been enabled for a few weeks. Sorry for the noise!

craig bot pushed a commit that referenced this issue Jul 7, 2023
106206: prereqs: delete tests r=rail a=rickystewart

These tests have always been skipped under Bazel because the implementation doesn't work in a Bazel world due to the dependency on `"golang.org/x/tools/go/packages"`. Since the command is only useful/ used in `make`, which is going to be deleted shortly, just delete the tests rather than waste time getting it working.

Also this is the last `broken_in_bazel` test, so rip out all the corresponding
logic too.

Epic: none
Release note: None
Closes: #61924
Closes: #92814

106343: ci: don't do retries on tests in CI on master, release branches r=rail a=rickystewart

First of all, test retries don't even have the correct behavior: #103042
This means that a successfully-retried test tramples the logs of
previously-failed tests, which is very confusing and erases your ability
to debug the test.

Also, we are focusing on quality and wiping out flaky and skipped tests.
This to me suggests we should not be retrying tests to let already-flaky
tests through. Rather, we should be surfacing real failures immediately.

For both of these reasons I turn off test retries for unit tests on
`master` and release branches.

We keep it for `staging` so `bors` is unaffected.

Epic: none
Release note: None

106411: kvflowhandle: fix mutex leak r=irfansharif a=irfansharif

Fixes #106078.

We were forgetting to unlock the mutex on the error path.

Release note: None

Co-authored-by: Ricky Stewart <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
@craig craig bot closed this as completed in c7f3f8f Jul 7, 2023
@andrewbaptist andrewbaptist added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-kv Anything in KV that doesn't belong in a more specific category. labels Jul 19, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv Anything in KV that doesn't belong in a more specific category. branch-master Failures and bugs on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-kv KV Team
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants