roachtest: failover/chaos/read-only failed #123736

Closed
cockroach-teamcity opened this issue May 7, 2024 · 1 comment · Fixed by #123820
Assignees
Labels
A-testing Testing tools and infrastructure C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-kv KV Team
Milestone

Comments

@cockroach-teamcity
Member

cockroach-teamcity commented May 7, 2024

roachtest.failover/chaos/read-only failed with artifacts on release-24.1 @ b083c4a9b946e2a7d6a79b46eef77410f0b742ce:

(assertions.go:333).Fail: 
	Error Trace:	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/failover.go:1457
	            				github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/failover.go:335
	            				main/pkg/cmd/roachtest/monitor.go:120
	            				golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:78
	            				src/runtime/asm_amd64.s:1695
	Error:      	Received unexpected error:
	            	pq: internal error while retrieving user account memberships: operation "get-user-session" timed out after 10s (given timeout 10s): internal error while retrieving user account: get auth info error: interrupted during singleflight load-value:authinfo-roachprod-2-2: context deadline exceeded
	Test:       	failover/chaos/read-only
(require.go:1360).NoError: FailNow called
(monitor.go:154).Wait: monitor failure: monitor user task failed: t.Fatal() was called
(cluster.go:2349).Run: context canceled
(cluster.go:2349).Run: context canceled
(cluster.go:2349).Run: context canceled
(cluster.go:2349).Run: context canceled
test artifacts and logs in: /artifacts/failover/chaos/read-only/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=2
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=false
  • ROACHTEST_metamorphicBuild=false
  • ROACHTEST_ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

Jira issue: CRDB-38514

@cockroach-teamcity cockroach-teamcity added branch-release-24.1 Used to mark GA and release blockers, technical advisories, and bugs for 24.1 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-kv KV Team labels May 7, 2024
@cockroach-teamcity cockroach-teamcity added this to the 24.1 milestone May 7, 2024
@nvanbenschoten nvanbenschoten added GA-blocker and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels May 8, 2024
@arulajmani
Collaborator

We see that n7 was unable to make any outgoing connections around the time of the failure. From the logs:

teamcity-15141330-1715060991-94-n10cpu2-0007> W240507 14:21:05.656082 211 kv/kvserver/liveness/liveness.go:752 ⋮ [T1,Vsystem,n7,liveness-hb] 2647  slow heartbeat took 3.001031147s; err=aborted in DistSender: result is ambiguous: context deadline exceeded
teamcity-15141330-1715060991-94-n10cpu2-0007> W240507 14:21:05.656240 211 kv/kvserver/liveness/liveness.go:668 ⋮ [T1,Vsystem,n7,liveness-hb] 2648  failed node liveness heartbeat: operation ‹"node liveness heartbeat"› timed out after 3.001s (given timeout 3s): aborted in DistSender: result is ambiguous: context deadline exceeded
teamcity-15141330-1715060991-94-n10cpu2-0007> W240507 14:21:05.656240 211 kv/kvserver/liveness/liveness.go:668 ⋮ [T1,Vsystem,n7,liveness-hb] 2648 +
teamcity-15141330-1715060991-94-n10cpu2-0007> W240507 14:21:05.656240 211 kv/kvserver/liveness/liveness.go:668 ⋮ [T1,Vsystem,n7,liveness-hb] 2648 +An inability to maintain liveness will prevent a node from participating in a
teamcity-15141330-1715060991-94-n10cpu2-0007> W240507 14:21:05.656240 211 kv/kvserver/liveness/liveness.go:668 ⋮ [T1,Vsystem,n7,liveness-hb] 2648 +cluster. If this problem persists, it may be a sign of resource starvation or
teamcity-15141330-1715060991-94-n10cpu2-0007> W240507 14:21:05.656240 211 kv/kvserver/liveness/liveness.go:668 ⋮ [T1,Vsystem,n7,liveness-hb] 2648 +of network connectivity problems. For help troubleshooting, visit:
teamcity-15141330-1715060991-94-n10cpu2-0007> W240507 14:21:05.656240 211 kv/kvserver/liveness/liveness.go:668 ⋮ [T1,Vsystem,n7,liveness-hb] 2648 +
teamcity-15141330-1715060991-94-n10cpu2-0007> W240507 14:21:05.656240 211 kv/kvserver/liveness/liveness.go:668 ⋮ [T1,Vsystem,n7,liveness-hb] 2648 +    https://www.cockroachlabs.com/docs/stable/cluster-setup-troubleshooting.html#node-liveness-issues
teamcity-15141330-1715060991-94-n10cpu2-0004> I240507 14:21:05.670554 292977 kv/kvserver/liveness/liveness.go:1025 ⋮ [T1,Vsystem,n4,s4,r899/8:‹/Table/106/1/63{11698…-30126…}›] 5195  incremented n7 liveness epoch to 2
teamcity-15141330-1715060991-94-n10cpu2-0005> I240507 14:21:05.691984 278216 rpc/context.go:2291 ⋮ [T1,rnode=7,raddr=‹10.142.0.224:29000›,class=system,rpc,dialback] 2585  blocking dialback connection failed to ‹10.142.0.224:29000›, n7, ‹connection error: desc = "transport: error while dialing: dial tcp 10.142.0.224:29000: i/o timeout"›
teamcity-15141330-1715060991-94-n10cpu2-0005> I240507 14:21:05.692075 278216 rpc/heartbeat.go:174 ⋮ [T1,-] 2586  failing ping request from node n7
teamcity-15141330-1715060991-94-n10cpu2-0007> E240507 14:21:05.692653 3281 2@rpc/peer.go:598 ⋮ [T1,Vsystem,n7,rnode=5,raddr=‹10.142.0.202:29000›,class=rangefeed,rpc] 268  failed connection attempt‹ (last connected 4.006s ago)›: grpc: ‹connection error: desc = "transport: error while dialing: dial tcp 10.142.0.224:29000: i/o timeout"› [code 2/Unknown]
teamcity-15141330-1715060991-94-n10cpu2-0007> E240507 14:21:05.692653 3281 2@rpc/peer.go:598 ⋮ [T1,Vsystem,n7,rnode=5,raddr=‹10.142.0.202:29000›,class=rangefeed,rpc] 2649  failed connection attempt‹ (last connected 4.006s ago)›: grpc: ‹connection error: desc = "transport: error while dialing: dial tcp 10.142.0.224:29000: i/o timeout"› [code 2/Unknown]
teamcity-15141330-1715060991-94-n10cpu2-0003> I240507 14:21:05.696597 258558 rpc/context.go:2291 ⋮ [T1,rnode=7,raddr=‹10.142.0.224:29000›,class=system,rpc,dialback] 4298  blocking dialback connection failed to ‹10.142.0.224:29000›, n7, ‹connection error: desc = "transport: error while dialing: dial tcp 10.142.0.224:29000: i/o timeout"›
teamcity-15141330-1715060991-94-n10cpu2-0003> I240507 14:21:05.696697 258558 rpc/heartbeat.go:174 ⋮ [T1,-] 4299  failing ping request from node n7
teamcity-15141330-1715060991-94-n10cpu2-0007> E240507 14:21:05.697138 3035 2@rpc/peer.go:598 ⋮ [T1,Vsystem,n7,rnode=3,raddr=‹10.142.1.29:29000›,class=raft,rpc] 269  failed connection attempt‹ (last connected 4.006s ago)›: grpc: ‹connection error: desc = "transport: error while dialing: dial tcp 10.142.0.224:29000: i/o timeout"› [code 2/Unknown]
teamcity-15141330-1715060991-94-n10cpu2-0007> E240507 14:21:05.697138 3035 2@rpc/peer.go:598 ⋮ [T1,Vsystem,n7,rnode=3,raddr=‹10.142.1.29:29000›,class=raft,rpc] 2650  failed connection attempt‹ (last connected 4.006s ago)›: grpc: ‹connection error: desc = "transport: error while dialing: dial tcp 10.142.0.224:29000: i/o timeout"› [code 2/Unknown]

This is induced by the test:

14:20:54 failover.go:334: failing n7 (blackhole-recv)
14:20:54 cluster.go:2369: running cmd `sudo iptables -A INPUT -p t...` on nodes [:7]; details in run_142054.844353714_n7_sudo-iptables-A-INPU.log

With DistSender circuit breakers disabled, it makes sense that the request from n9 got stuck and eventually timed out. I'll remove the blocker label. However, before we close this out, I think we should turn on DistSender circuit breakers for these failover chaos tests.

@arulajmani arulajmani removed GA-blocker branch-release-24.1 Used to mark GA and release blockers, technical advisories, and bugs for 24.1 labels May 8, 2024
@arulajmani arulajmani self-assigned this May 8, 2024
arulajmani added a commit to arulajmani/cockroach that referenced this issue May 8, 2024
Failover chaos tests create asymmetric partitions, where DistSender
circuit breakers are useful. They prevent failure modes such as
cockroachdb#123736 (comment).

Fixes cockroachdb#123736

Release note: None
craig bot pushed a commit that referenced this issue May 10, 2024
123820: roachtest: turn on DistSender circuit breakers for failover chaos tests r=nicktrav,andrewbaptist a=arulajmani

Failover chaos tests create asymmetric partitions, where DistSender circuit breakers are useful. They prevent failure modes such as #123736 (comment).

Fixes #123736

Release note: None

Co-authored-by: Arul Ajmani <[email protected]>
@craig craig bot closed this as completed in b66922d May 10, 2024
blathers-crl bot pushed a commit that referenced this issue May 10, 2024
Failover chaos tests create asymmetric partitions, where DistSender
circuit breakers are useful. They prevent failure modes such as
#123736 (comment).

Fixes #123736

Release note: None
@nvanbenschoten nvanbenschoten added A-testing Testing tools and infrastructure C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. and removed C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) labels May 10, 2024
@github-project-automation github-project-automation bot moved this to roachtest/unit test backlog in KV Aug 28, 2024