
roachtest: backup/2TB/n10cpu4 failed #81634

Closed
cockroach-teamcity opened this issue May 22, 2022 · 4 comments
Assignees: adityamaru

Labels:
A-kv: Anything in KV that doesn't belong in a more specific category.
branch-release-22.1: Used to mark GA and release blockers, technical advisories, and bugs for 22.1.
C-test-failure: Broken test (automatically or manually discovered).
O-roachtest
O-robot: Originated from a bot.
release-blocker: Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.
S-3: Medium-low impact: incurs increased costs for some users (incl lower avail, recoverable bad data).
T-disaster-recovery
T-kv: KV Team

Comments

@cockroach-teamcity (Member) commented on May 22, 2022

roachtest.backup/2TB/n10cpu4 failed with artifacts on release-22.1 @ 7f8c647f1a51e0fe50271337ef4c6bba898c6a8d:

		  | golang.org/x/sync/errgroup.(*Group).Go.func1
		  | 	golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:57
		  | runtime.goexit
		  | 	GOROOT/src/runtime/asm_amd64.s:1581
		Wraps: (2) output in run_063231.213657496_n1_cockroach_sql
		Wraps: (3) ./cockroach sql --insecure -e "
		  | 				BACKUP bank.bank TO 'gs://cockroachdb-backup-testing/teamcity-5230940-1653196817-22-n10cpu4?AUTH=implicit'" returned
		  | stderr:
		  | ERROR: failed to run backup: exporting 1126 ranges: export request timeout: operation "ExportRequest for span /Table/106/1/1302{2606-5212}" timed out after 5m0.001s (given timeout 5m0s): failed to send RPC: sending to all replicas failed; last error: [NotLeaseHolderError] lease held by different store; r52: replica (n5,s5):8 not lease holder; current lease is repl=(n10,s10):7 seq=6 start=1653199241.651628855,0 epo=1 pro=1653199241.651541384,0
		  | Failed running "sql"
		  |
		  | stdout:
		Wraps: (4) COMMAND_PROBLEM
		Wraps: (5) Node 1. Command with error:
		  | ``````
		  | ./cockroach sql --insecure -e "
		  | 				BACKUP bank.bank TO 'gs://cockroachdb-backup-testing/teamcity-5230940-1653196817-22-n10cpu4?AUTH=implicit'"
		  | ``````
		Wraps: (6) exit status 1
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *cluster.WithCommandDetails (4) errors.Cmd (5) *hintdetail.withDetail (6) *exec.ExitError
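The "timed out after 5m0.001s (given timeout 5m0s)" wording above comes from running the ExportRequest under a per-operation timeout. Below is a minimal Go sketch of that pattern, purely illustrative (the `runWithTimeout` helper and its signature are assumptions, not the actual CockroachDB code; the operation string and the 5m value come from the error above):

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// runWithTimeout is a stand-in for a per-operation timeout wrapper: it runs fn
// under a derived context and, when the deadline trips, reports both the
// elapsed time and the configured timeout, matching the wording in the error.
func runWithTimeout(ctx context.Context, op string, timeout time.Duration, fn func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()
	start := time.Now()
	if err := fn(ctx); err != nil {
		if ctx.Err() == context.DeadlineExceeded {
			return fmt.Errorf("operation %q timed out after %s (given timeout %s): %w",
				op, time.Since(start).Round(time.Millisecond), timeout, err)
		}
		return err
	}
	return nil
}

func main() {
	// 100ms here stands in for the 5m0s export request timeout in the test.
	err := runWithTimeout(context.Background(),
		"ExportRequest for span /Table/106/1/1302{2606-5212}", 100*time.Millisecond,
		func(ctx context.Context) error {
			<-ctx.Done() // stand-in for an ExportRequest that keeps being redirected
			return ctx.Err()
		})
	fmt.Println(err)
}
```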

	monitor.go:127,backup.go:638,test_runner.go:875: monitor failure: monitor task failed: t.Fatal() was called
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).WaitE
		  | 	main/pkg/cmd/roachtest/monitor.go:115
		  | main.(*monitorImpl).Wait
		  | 	main/pkg/cmd/roachtest/monitor.go:123
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerBackup.func1
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/backup.go:638
		  | main.(*testRunner).runTest.func2
		  | 	main/pkg/cmd/roachtest/test_runner.go:875
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).wait.func2
		  | 	main/pkg/cmd/roachtest/monitor.go:171
		Wraps: (4) monitor task failed
		Wraps: (5) attached stack trace
		  -- stack trace:
		  | main.init
		  | 	main/pkg/cmd/roachtest/monitor.go:80
		  | runtime.doInit
		  | 	GOROOT/src/runtime/proc.go:6498
		  | runtime.main
		  | 	GOROOT/src/runtime/proc.go:238
		  | runtime.goexit
		  | 	GOROOT/src/runtime/asm_amd64.s:1581
		Wraps: (6) t.Fatal() was called
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError
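The second trace is the roachtest side of the failure: the backup command runs as a monitor task in an errgroup goroutine, Wait returns the first task error, and the test turns it into t.Fatal(). A rough sketch of that pattern, assuming a much-simplified monitor (illustrative only, not the actual monitor.go implementation):

```go
package main

import (
	"context"
	"fmt"

	"golang.org/x/sync/errgroup"
)

// monitor is a toy version of the roachtest monitor: node-level tasks run in
// an errgroup, and WaitE surfaces the first failure.
type monitor struct {
	g *errgroup.Group
}

func newMonitor(ctx context.Context) *monitor {
	g, _ := errgroup.WithContext(ctx)
	return &monitor{g: g}
}

// Go starts a task, e.g. the `./cockroach sql -e "BACKUP ..."` invocation.
func (m *monitor) Go(task func() error) {
	m.g.Go(task)
}

// WaitE blocks until all tasks finish and wraps the first error the same way
// the failure above is wrapped ("monitor failure: monitor task failed").
func (m *monitor) WaitE() error {
	if err := m.g.Wait(); err != nil {
		return fmt.Errorf("monitor failure: monitor task failed: %w", err)
	}
	return nil
}

func main() {
	m := newMonitor(context.Background())
	m.Go(func() error {
		// In the real test this runs the BACKUP statement; here it just fails,
		// standing in for the exit status 1 from the cockroach sql command.
		return fmt.Errorf("t.Fatal() was called")
	})
	if err := m.WaitE(); err != nil {
		fmt.Println(err)
	}
}
```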
Help

See: roachtest README

See: How To Investigate (internal)

Same failure on other branches

/cc @cockroachdb/bulk-io

This test on roachdash | Improve this report!

Jira issue: CRDB-15997

Epic CRDB-13428

@cockroach-teamcity added the branch-release-22.1, C-test-failure, O-roachtest, O-robot, and release-blocker labels on May 22, 2022
@adityamaru (Contributor) commented:

exporting 1126 ranges: export request timeout: operation "ExportRequest for span /Table/106/1/1302{2606-5212}" timed out after 5m0.001s (given timeout 5m0s): failed to send RPC: sending to all replicas failed; last error: [NotLeaseHolderError] lease held by different store; r52: replica (n5,s5):8 not lease holder; current lease is repl=(n10,s10):7 seq=6 start=1653199241.651628855,0 epo=1 pro=1653199241.651541384,0

@adityamaru self-assigned this on May 24, 2022
@adityamaru (Contributor) commented:

220522 06:10:07.584045 44279 kv/kvserver/pkg/kv/kvserver/queue.go:1096 ⋮ [n10,replicate,s10,r52/7:‹/Table/106/1/1302{2606-3909}›] 1721  ‹operation "replicate queue process replica 52" timed out after 1m0s (given timeout 1m0s)›: ‹failed to send RPC: sending to all replicas failed; last error: [NotLeaseHolderError] lease held by different store; r52: replica (n7,s7):5 not lease holder; current lease is repl=(n10,s10):7 seq=6 start=1653199241.651628855,0 epo=1 pro=1653199241.651541384,0›
220522 06:10:57.550718 39499 kv/kvclient/kvcoord/dist_sender.go:1615 ⋮ [n9,merge,s9,r51/4:‹/Table/106/1/1302{1303-2606}›] 1694  slow range RPC: have been waiting 60.01s (56 attempts) for RPC RangeStats [‹/Table/106/1/13022606›,‹/Min›) to r52:‹/Table/106/1/1302{2606-3909}› [(n5,s5):8, (n10,s10):7VOTER_DEMOTING_LEARNER, (n7,s7):5, (n8,s8):9VOTER_INCOMING, next=10, gen=51, sticky=1653202783.753058328,0]; resp: aborted during DistSender.Send: context deadline exceeded
220522 06:12:47.940051 54090 kv/kvserver/pkg/kv/kvserver/queue.go:1096 ⋮ [n10,replicate,s10,r52/7:‹/Table/106/1/1302{2606-3909}›] 2230  ‹operation "replicate queue process replica 52" timed out after 1m0.008s (given timeout 1m0s)›: ‹failed to send RPC: sending to all replicas failed; last error: [NotLeaseHolderError] lease held by different store; r52: replica (n5,s5):8 not lease holder; current lease is repl=(n10,s10):7 seq=6 start=1653199241.651628855,0 epo=1 pro=1653199241.651541384,0›
220522 06:18:17.432746 74862 kv/kvserver/pkg/kv/kvserver/queue.go:1096 ⋮ [n10,replicate,s10,r52/7:‹/Table/106/1/1302{2606-3909}›] 3042  ‹operation "replicate queue process replica 52" timed out after 1m0.001s (given timeout 1m0s)›: ‹failed to send RPC: sending to all replicas failed; last error: [NotLeaseHolderError] lease held by different store; r52: replica (n8,s8):9VOTER_INCOMING not lease holder; current lease is repl=(n10,s10):7 seq=6 start=1653199241.651628855,0 epo=1 pro=1653199241.651541384,0›
220522 06:23:14.257594 96541 kv/kvserver/pkg/kv/kvserver/queue.go:1096 ⋮ [n10,replicate,s10,r52/7:‹/Table/106/1/1302{2606-3909}›] 3862  ‹operation "replicate queue process replica 52" timed out after 1m0.001s (given timeout 1m0s)›: ‹failed to send RPC: sending to all replicas failed; last error: [NotLeaseHolderError] lease held by different store; r52: replica (n5,s5):8 not lease holder; current lease is repl=(n10,s10):7 seq=6 start=1653199241.651628855,0 epo=1 pro=1653199241.651541384,0›
220522 06:27:34.851676 118347 kv/kvserver/pkg/kv/kvserver/queue.go:1096 ⋮ [n10,replicate,s10,r52/7:‹/Table/106/1/1302{2606-3909}›] 4777  ‹operation "replicate queue process replica 52" timed out after 1m0s (given timeout 1m0s)›: ‹failed to send RPC: sending to all replicas failed; last error: [NotLeaseHolderError] lease held by different store; r52: replica (n8,s8):9VOTER_INCOMING not lease holder; current lease is repl=(n10,s10):7 seq=6 start=1653199241.651628855,0 epo=1 pro=1653199241.651541384,0›
220522 06:30:20.126339 131356 kv/kvserver/pkg/kv/kvserver/queue.go:1096 ⋮ [n10,replicate,s10,r52/7:‹/Table/106/1/1302{2606-3909}›] 5133  ‹operation "replicate queue process replica 52" timed out after 1m0.001s (given timeout 1m0s)›: ‹failed to send RPC: sending to all replicas failed; last error: [NotLeaseHolderError] lease held by different store; r52: replica (n7,s7):5 not lease holder; current lease is repl=(n10,s10):7 seq=6 start=1653199241.651628855,0 epo=1 pro=1653199241.651541384,0›
220522 06:33:35.738027 123181 kv/kvclient/kvcoord/dist_sender.go:1615 ⋮ [n7,f‹016a8a5e›,job=763922850892546049] 5247  slow range RPC: have been waiting 60.15s (56 attempts) for RPC Export [‹/Table/106/1/13022606›,‹/Table/106/1/13023909›), [wait-policy: Error] to r52:‹/Table/106/1/1302{2606-3909}› [(n5,s5):8, (n10,s10):7VOTER_DEMOTING_LEARNER, (n7,s7):5, (n8,s8):9VOTER_INCOMING, next=10, gen=51, sticky=1653202783.753058328,0]; resp: ‹failed to send RPC: sending to all replicas failed; last error: [NotLeaseHolderError] lease held by different store; r52: replica (n8,s8):9VOTER_INCOMING not lease holder; current lease is repl=(n10,s10):7 seq=6 start=1653199241.651628855,0 epo=1 pro=1653199241.651541384,0›
220522 06:37:37.628849 162077 jobs/registry.go:1134 ⋮ [n1] 9104  BACKUP job 763922850892546049: stepping through state reverting with error: failed to run backup: exporting 1126 ranges: export request timeout: ‹operation "ExportRequest for span /Table/106/1/1302{2606-5212}" timed out after 5m0.001s (given timeout 5m0s)›: ‹failed to send RPC: sending to all replicas failed; last error: [NotLeaseHolderError] lease held by different store; r52: replica (n5,s5):8 not lease holder; current lease is repl=(n10,s10):7 seq=6 start=1653199241.651628855,0 epo=1 pro=1653199241.651541384,0›

Prior to the ExportRequest timing out, I see the replicate and merge queues timing out on the same replica, r52, with NotLeaseHolderErrors. This is interesting because the error indicates that n10, the node on which these timeouts occur, is actually the leaseholder of the range.
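For illustration, here is a much-simplified sketch of the retry loop behind these messages (this is not CockroachDB's DistSender; the types and behavior are assumptions): the sender follows each NotLeaseHolderError redirect, and if the replica the error points at never ends up serving the request, the loop only stops when the caller's timeout fires, which matches the "sending to all replicas failed; last error: [NotLeaseHolderError] ..." shape in the logs above.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// notLeaseHolderErr mimics the redirect carried by a NotLeaseHolderError.
type notLeaseHolderErr struct{ leaseholder string }

func (e *notLeaseHolderErr) Error() string {
	return fmt.Sprintf("lease held by different store; current lease is on %s", e.leaseholder)
}

// sendToReplica stands in for one RPC to a replica of r52. In this sketch every
// attempt is redirected to n10, mirroring the confusing situation above where
// even operations running on n10 were told that n10 holds the lease.
func sendToReplica(_ context.Context, replica string) error {
	return &notLeaseHolderErr{leaseholder: "n10"}
}

// sendWithRetries keeps following NotLeaseHolderError redirects until the
// caller's context (5m for the ExportRequest, 1m for the replicate queue)
// expires, then reports the last error it saw.
func sendWithRetries(ctx context.Context, target string) error {
	var lastErr error
	for ctx.Err() == nil {
		lastErr = sendToReplica(ctx, target)
		var nlhe *notLeaseHolderErr
		if !errors.As(lastErr, &nlhe) {
			return lastErr // some other error (or success); stop retrying
		}
		target = nlhe.leaseholder // follow the redirect and try again
		time.Sleep(10 * time.Millisecond)
	}
	return fmt.Errorf("failed to send RPC: sending to all replicas failed; last error: %w", lastErr)
}

func main() {
	// 200ms stands in for the 5m0s / 1m0s operation timeouts in the logs.
	ctx, cancel := context.WithTimeout(context.Background(), 200*time.Millisecond)
	defer cancel()
	fmt.Println(sendWithRetries(ctx, "n5"))
}
```

Running this prints the same wrapped error shape once the deadline expires, with the final redirect as the leaf error, which is why the reported "last error" can name a leaseholder that never actually handled the request.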

@adityamaru (Contributor) commented:

Another stray line indicating something about r52 is blocking operations:

W220522 06:01:01.669665 10973 kv/kvserver/intentresolver/intent_resolver.go:793 ⋮ [-] 381  failed to gc transaction record: ‹operation "cleanup txn record" timed out after 20.001s (given timeout 20s)›: could not GC completed transaction anchored at ‹/Local/Range/Table/106/1/13022606/RangeDescriptor›: context deadline exceeded

The debug zip doesn't indicate any major intent buildup either. At this point I think I need some KV guidance to take this forward, so I'm tagging them in.

@adityamaru added the A-kv and T-kv labels on May 24, 2022
@adityamaru (Contributor) commented:

I'm going to let KV close this one out in case they want to dig more.

@tbg added the S-3 label on May 30, 2022
@tbg closed this as completed on Jun 1, 2022