roachtest: restore2TB/nodes=10 failed #53197

Closed
cockroach-teamcity opened this issue Aug 21, 2020 · 13 comments · Fixed by #54468
Assignees
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.
Milestone

Comments

@cockroach-teamcity
Member

(roachtest).restore2TB/nodes=10 failed on master@bbbedabbf6ea0b1ff6fc799a0c04a75295a9f4c2:

	cluster.go:2167,restore.go:255,cluster.go:2575,errgroup.go:57: output in run_080728.793_n1_cockroach_sql: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-2206091-1597991489-21-n10cpu4:1 -- ./cockroach sql --insecure -e ": context canceled
		(1) attached stack trace
		  | main.(*cluster).RunE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2245
		  | main.(*cluster).Run
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2165
		  | main.registerRestore.func1.1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/restore.go:255
		  | main.(*monitor).Go.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2575
		  | golang.org/x/sync/errgroup.(*Group).Go.func1
		  | 	/home/agent/work/.go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:57
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1357
		Wraps: (2) 2 safe details enclosed
		Wraps: (3) output in run_080728.793_n1_cockroach_sql
		Wraps: (4) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-2206091-1597991489-21-n10cpu4:1 -- ./cockroach sql --insecure -e "
		  | 				RESTORE csv.bank FROM
		  | 				'gs://cockroach-fixtures/workload/bank/version=1.0.0,payload-bytes=10240,ranges=0,rows=65104166,seed=1/bank'
		  | 				WITH into_db = 'restore2tb'" returned
		  | stderr:
		  |
		  | stdout:
		Wraps: (5) secondary error attachment
		  | signal: killed
		  | (1) signal: killed
		  | Error types: (1) *exec.ExitError
		Wraps: (6) context canceled
		Error types: (1) *withstack.withStack (2) *safedetails.withSafeDetails (3) *errutil.withMessage (4) *main.withCommandDetails (5) *secondary.withSecondaryError (6) *errors.errorString

	cluster.go:2597,restore.go:262,test_runner.go:754: monitor failure: monitor task failed: context canceled
		(1) attached stack trace
		  | main.(*monitor).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2585
		  | main.(*monitor).Wait
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2593
		  | main.registerRestore.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/restore.go:262
		  | [...repeated from below...]
		Wraps: (2) 1 safe detail enclosed
		Wraps: (3) monitor failure
		Wraps: (4) attached stack trace
		  | main.(*monitor).wait.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2641
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1357
		Wraps: (5) 1 safe detail enclosed
		Wraps: (6) monitor task failed
		Wraps: (7) context canceled
		Error types: (1) *withstack.withStack (2) *safedetails.withSafeDetails (3) *errutil.withMessage (4) *withstack.withStack (5) *safedetails.withSafeDetails (6) *errutil.withMessage (7) *errors.errorString


Artifacts: /restore2TB/nodes=10
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Aug 21, 2020
@cockroach-teamcity cockroach-teamcity added this to the 20.2 milestone Aug 21, 2020
@pbardea pbardea self-assigned this Aug 21, 2020
@cockroach-teamcity
Member Author

(roachtest).restore2TB/nodes=10 failed on master@ef55609797c92e46eb9efd08facc9db49558291d:

		  | 				'gs://cockroach-fixtures/workload/bank/version=1.0.0,payload-bytes=10240,ranges=0,rows=65104166,seed=1/bank'
		  | 				WITH into_db = 'restore2tb'" returned
		  | stderr:
		  | nd.go:914 +0x2fb fp=0xc0004bfe70 sp=0xc0004bfd98 pc=0x393474b
		  | github.com/spf13/cobra.(*Command).Execute(...)
		  | 	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:864
		  | github.com/cockroachdb/cockroach/pkg/cli.Run(...)
		  | 	/go/src/github.com/cockroachdb/cockroach/pkg/cli/cli.go:225
		  | github.com/cockroachdb/cockroach/pkg/cli.Main()
		  | 	/go/src/github.com/cockroachdb/cockroach/pkg/cli/cli.go:67 +0x272 fp=0xc0004bff50 sp=0xc0004bfe70 pc=0x3c95692
		  | main.main()
		  | 	/go/src/github.com/cockroachdb/cockroach/pkg/cmd/cockroach/main.go:26 +0x20 fp=0xc0004bff60 sp=0xc0004bff50 pc=0x3dea690
		  | runtime.main()
		  | 	/usr/local/go/src/runtime/proc.go:203 +0x206 fp=0xc0004bffe0 sp=0xc0004bff60 pc=0x7b28f6
		  | runtime.goexit()
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1357 +0x1 fp=0xc0004bffe8 sp=0xc0004bffe0 pc=0x7e1391
		  |
		  | goroutine 2 [force gc (idle), 2 minutes]:
		  | runtime.gopark(0x56284b8, 0x9371890, 0x1411, 0x1)
		  | 	/usr/local/go/src/runtime/proc.go:304 +0xe0 fp=0xc000090fb0 sp=0xc000090f90 pc=0x7b2cb0
		  | runtime.goparkunlock(...)
		  |
		  | stdout:
		Wraps: (4) secondary error attachment
		  | signal: killed
		  | (1) signal: killed
		  | Error types: (1) *exec.ExitError
		Wraps: (5) context canceled
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *main.withCommandDetails (4) *secondary.withSecondaryError (5) *errors.errorString

	cluster.go:2597,restore.go:262,test_runner.go:754: monitor failure: monitor task failed: context canceled
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitor).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2585
		  | main.(*monitor).Wait
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2593
		  | main.registerRestore.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/restore.go:262
		  | [...repeated from below...]
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitor).wait.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2641
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1357
		Wraps: (4) monitor task failed
		Wraps: (5) context canceled
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *errors.errorString


Artifacts: /restore2TB/nodes=10
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@pbardea
Contributor

pbardea commented Aug 26, 2020

Looks like the first failure timed out: test_runner.go:801: test timed out (6h0m0s). Still investigating.

@cockroach-teamcity
Member Author

(roachtest).restore2TB/nodes=10 failed on master@cd8267ca84f6a0267a898220cfaa6630188b45c5:

		  | stderr:
		  | Flag --insecure has been deprecated, it will be removed in a subsequent release.
		  | For details, see: https://go.crdb.dev/issue/53404
		  | ERROR: importing span /Table/51/1/4153{1654/0-3324}: fetching "3-2526.sst": Get https://storage.googleapis.com/cockroach-fixtures/workload/bank/version=1.0.0,payload-bytes=10240,ranges=0,rows=65104166,seed=1/bank/3-2526.sst: status code 500 trying to fetch http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/token
		  | Failed running "sql"
		  | Error: COMMAND_PROBLEM: exit status 1
		  | (1) COMMAND_PROBLEM
		  | Wraps: (2) Node 1. Command with error:
		  |   | ```
		  |   | ./cockroach sql --insecure -e "
		  |   | 				RESTORE csv.bank FROM
		  |   | 				'gs://cockroach-fixtures/workload/bank/version=1.0.0,payload-bytes=10240,ranges=0,rows=65104166,seed=1/bank'
		  |   | 				WITH into_db = 'restore2tb'"
		  |   | ```
		  | Wraps: (3) exit status 1
		  | Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
		  |
		  | stdout:
		Wraps: (4) exit status 20
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *main.withCommandDetails (4) *exec.ExitError

	cluster.go:2613,restore.go:262,test_runner.go:754: monitor failure: monitor task failed: t.Fatal() was called
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitor).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2601
		  | main.(*monitor).Wait
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2609
		  | main.registerRestore.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/restore.go:262
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:754
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitor).wait.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2657
		Wraps: (4) monitor task failed
		Wraps: (5) attached stack trace
		  -- stack trace:
		  | main.init
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2571
		  | runtime.doInit
		  | 	/usr/local/go/src/runtime/proc.go:5228
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:190
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1357
		Wraps: (6) t.Fatal() was called
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError


Artifacts: /restore2TB/nodes=10
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@pbardea
Contributor

pbardea commented Aug 31, 2020

Most Recent Failure
The most recent failure looks like GCS wasn't happy that day:

		  | ERROR: importing span /Table/51/1/4153{1654/0-3324}: fetching "3-2526.sst": Get https://storage.googleapis.com/cockroach-fixtures/workload/bank/version=1.0.0,payload-bytes=10240,ranges=0,rows=65104166,seed=1/bank/3-2526.sst: status code 500 trying to fetch http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/token

I guess the question here is whether we should be tolerant of such transient failures.
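For reference, a minimal sketch of what tolerating a transient 5xx on an SST fetch could look like. This is purely illustrative, assuming nothing about CockroachDB's actual cloud-storage code: the function name, attempt limit, and backoff are all made up.

```go
// Illustrative only: fetchWithRetry, the attempt limit, and the backoff are
// assumptions, not CockroachDB's external-storage implementation.
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

func fetchWithRetry(url string, maxAttempts int) ([]byte, error) {
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		resp, err := http.Get(url)
		if err != nil {
			lastErr = err // network-level error: retry
		} else {
			body, readErr := io.ReadAll(resp.Body)
			resp.Body.Close()
			switch {
			case readErr != nil:
				lastErr = readErr // retry truncated reads
			case resp.StatusCode == http.StatusOK:
				return body, nil
			case resp.StatusCode >= 500:
				// The failure above was a 500 from the metadata/token
				// endpoint; treat 5xx as transient and retry.
				lastErr = fmt.Errorf("transient status %d", resp.StatusCode)
			default:
				return nil, fmt.Errorf("permanent status %d", resp.StatusCode)
			}
		}
		time.Sleep(time.Duration(attempt) * time.Second) // crude linear backoff
	}
	return nil, fmt.Errorf("giving up after %d attempts: %w", maxAttempts, lastErr)
}

func main() {
	// Hypothetical usage; the URL is just a placeholder.
	if _, err := fetchWithRetry("https://storage.googleapis.com/some-bucket/3-2526.sst", 5); err != nil {
		fmt.Println("fetch failed:", err)
	}
}
```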


2nd most recent failure
Looking at the failure before the latest, it looks like the restore timed out. Based on the stacks, an import request appears to be stuck retrying in the DistSender:

goroutine 913610 [select]:
github.com/cockroachdb/cockroach/pkg/util/retry.(*Retry).Next(0xc007f3d8d0, 0xc004789b00)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/retry/retry.go:128 +0x185
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).sendPartialBatch(0xc000a9c800, 0x60c5840, 0xc013d01710, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:1438 +0x23a
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).divideAndSendBatchToRanges(0xc000a9c800, 0x60c5840, 0xc013d01710, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:1127 +0x16b2
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).Send(0xc000a9c800, 0x60c5840, 0xc013d01710, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:766 +0x7a6
github.com/cockroachdb/cockroach/pkg/kv.(*CrossRangeTxnWrapperSender).Send(0xc00078e9e0, 0x60c5840, 0xc013d01710, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:219 +0x9f
github.com/cockroachdb/cockroach/pkg/kv.(*DB).sendUsingSender(0xc00078e980, 0x60c5840, 0xc013d01710, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:744 +0x13c
github.com/cockroachdb/cockroach/pkg/kv.(*DB).send(...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:726
github.com/cockroachdb/cockroach/pkg/kv.sendAndFill(0x60c5840, 0xc013d01710, 0xc007f3eb20, 0xc003b9c500, 0xc002a5b810, 0xc002a5b7e0)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:654 +0x104
github.com/cockroachdb/cockroach/pkg/kv.(*DB).Run(0xc00078e980, 0x60c5840, 0xc013d01710, 0xc003b9c500, 0xc00ac4eb00, 0xc02e4c8000)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:677 +0x9c
github.com/cockroachdb/cockroach/pkg/kv.(*DB).AddSSTable(0xc00078e980, 0x60c5840, 0xc013d01710, 0x4d30f80, 0xc00ac4eae0, 0x4d30f80, 0xc00ac4eb00, 0xc02e4c8000, 0xcc1658, 0x118bf8e, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:639 +0xf3
github.com/cockroachdb/cockroach/pkg/kv/bulk.AddSSTable.func1(0x60c5840, 0xc013d01710, 0xc011d1add0, 0xc0003ad500, 0x606f780, 0xc00078e980, 0x611d760, 0xc00ebe3cb0, 0x2fe06a1f, 0xed6d80e40, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/bulk/sst_batcher.go:424 +0x331
github.com/cockroachdb/cockroach/pkg/kv/bulk.AddSSTable(0x60c5840, 0xc013d01710, 0x606f780, 0xc00078e980, 0xc002a5b7e0, 0x8, 0x8, 0xc002a5b810, 0x9, 0x9, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/bulk/sst_batcher.go:460 +0x590
github.com/cockroachdb/cockroach/pkg/kv/bulk.(*SSTBatcher).doFlush(0xc000c93e00, 0x60c5840, 0xc013d01710, 0x1, 0xc00c985dc8, 0x8, 0x8, 0xc004466780, 0xc007f3f8d8)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/bulk/sst_batcher.go:299 +0x3db
github.com/cockroachdb/cockroach/pkg/kv/bulk.(*SSTBatcher).flushIfNeeded(0xc000c93e00, 0x60c5840, 0xc013d01710, 0xc00c985dc8, 0x8, 0x8, 0x2, 0xc003935500)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/bulk/sst_batcher.go:234 +0x176
github.com/cockroachdb/cockroach/pkg/kv/bulk.(*SSTBatcher).AddMVCCKey(0xc000c93e00, 0x60c5840, 0xc013d01710, 0xc00c985dc8, 0x8, 0x8, 0x151c2d1ed07de6c1, 0x0, 0xc003935500, 0x2808, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/bulk/sst_batcher.go:150 +0x33b
github.com/cockroachdb/cockroach/pkg/ccl/storageccl.evalImport(0x60c5840, 0xc013d01710, 0x623ac80, 0xc006fd2000, 0x162ec2b2e68eb182, 0x0, 0x600000006, 0x6, 0x0, 0x3d81, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/ccl/storageccl/import.go:265 +0x124e
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).executeAdminBatch(0xc006fd2000, 0x60c5840, 0xc013d01710, 0xc00f7f53b0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_send.go:616 +0xfd6
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).sendWithRangeID(0xc006fd2000, 0x60c5840, 0xc013d016e0, 0x3d81, 0xc00f7f53b0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_send.go:105 +0x6d8
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).Send(0xc006fd2000, 0x60c5840, 0xc013d016e0, 0x162ec2b2e68eb182, 0x0, 0x600000006, 0x6, 0x0, 0x3d81, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_send.go:37 +0x93
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).Send(0xc00060d800, 0x60c5840, 0xc013d009f0, 0x162ec2b2e68eb182, 0x0, 0x600000006, 0x6, 0x0, 0x3d81, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/store_send.go:194 +0x5af
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Stores).Send(0xc00078ea00, 0x60c5840, 0xc013d009f0, 0x0, 0x0, 0x600000006, 0x6, 0x0, 0x3d81, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/stores.go:177 +0x108
github.com/cockroachdb/cockroach/pkg/server.(*Node).batchInternal.func1(0x60c5840, 0xc013d00870, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:847 +0x220
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunTaskWithErr(0xc000b167e0, 0x60c5840, 0xc013d00870, 0x50484c2, 0x10, 0xc00693b600, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:326 +0x140
github.com/cockroachdb/cockroach/pkg/server.(*Node).batchInternal(0xc000311600, 0x60c5840, 0xc013d00870, 0xc00f7f5320, 0xc013d00870, 0xc00004e000, 0x4ae8fe0)
	/go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:835 +0x197
github.com/cockroachdb/cockroach/pkg/server.(*Node).Batch(0xc000311600, 0x60c5840, 0xc013d00840, 0xc00f7f5320, 0xc000311600, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:873 +0x9f
github.com/cockroachdb/cockroach/pkg/roachpb._Internal_Batch_Handler.func1(0x60c5840, 0xc013d00840, 0x4ff2a00, 0xc00f7f5320, 0x4ff2a00, 0xc00f7f5320, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/roachpb/api.pb.go:10575 +0x86
github.com/grpc-ecosystem/grpc-opentracing/go/otgrpc.OpenTracingServerInterceptor.func1(0x60c5840, 0xc013d00840, 0x4ff2a00, 0xc00f7f5320, 0xc0055365c0, 0xc0055365e0, 0x0, 0x0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/grpc-ecosystem/grpc-opentracing/go/otgrpc/server.go:44 +0x9ef
google.golang.org/grpc.getChainUnaryHandler.func1(0x60c5840, 0xc013d00840, 0x4ff2a00, 0xc00f7f5320, 0xc00693ba80, 0xf910e8, 0x4d56360, 0xc00683d680)
	/go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/server.go:921 +0xe7
github.com/cockroachdb/cockroach/pkg/rpc.NewServer.func1(0x60c5840, 0xc013d00840, 0x4ff2a00, 0xc00f7f5320, 0xc0055365c0, 0xc00683d680, 0xc00683d680, 0x20, 0x4afd5e0, 0xc00ded8001)
	/go/src/github.com/cockroachdb/cockroach/pkg/rpc/context.go:199 +0xa8
google.golang.org/grpc.chainUnaryServerInterceptors.func1(0x60c5840, 0xc013d00840, 0x4ff2a00, 0xc00f7f5320, 0xc0055365c0, 0xc0055365e0, 0xc001e5db58, 0x977468, 0x4d72420, 0xc013d00840)
	/go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/server.go:907 +0xca
github.com/cockroachdb/cockroach/pkg/roachpb._Internal_Batch_Handler(0x4f9f360, 0xc000311600, 0x60c5840, 0xc013d00840, 0xc00ded80c0, 0xc000fe61e0, 0x60c5840, 0xc013d00840, 0xc006f55400, 0x27a)
	/go/src/github.com/cockroachdb/cockroach/pkg/roachpb/api.pb.go:10577 +0x14b
google.golang.org/grpc.(*Server).processUnaryRPC(0xc0007a2b60, 0x61331e0, 0xc001992a80, 0xc001dbf600, 0xc001269710, 0x89fcdc0, 0x0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/server.go:1082 +0x4fd
google.golang.org/grpc.(*Server).handleStream(0xc0007a2b60, 0x61331e0, 0xc001992a80, 0xc001dbf600, 0x0)
	/go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/server.go:1405 +0xd25
google.golang.org/grpc.(*Server).serveStreams.func1.1(0xc000a87e50, 0xc0007a2b60, 0x61331e0, 0xc001992a80, 0xc001dbf600)
	/go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/server.go:746 +0xbb
created by google.golang.org/grpc.(*Server).serveStreams.func1
	/go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/server.go:744 +0xa1

Which reminds me of #52468 (comment). @adityamaru has been trying to reproduce this stuck behaviour while running restore2TB without much luck.

I tried looking through the debug.zip to see if there were any logs that might help us uncover the root cause but my searching didn't turn up anything. (Going to assign to him for now since he's been digging into it.)

cc @andreimatei, since the symptoms look similar to #52468 (comment), in case you know of either:

  1. anywhere in the debug.zip that might help us see whether the DistSender was actually stuck (and why, if it was),
    or
  2. anything in particular we should collect, in addition to the DistSender's logs, if we do manage to repro a stuck restore.

andreimatei added a commit to andreimatei/cockroach that referenced this issue Aug 31, 2020
Release justification: small change improving logging, hoping to help
debug cockroachdb#53197.

This patch moves the logging that the DistSender does when it detects an
RPC that is slow such that it kicks in in more cases:
1) if the RPC was successful (but slow)
2) if the lower layer in the DistSender failed to contact any replica
   and returned a sendError{}. In this case the range descriptor will be
   refreshed and the RPC will be retried, similar to the case in which a
   replica returned RangeKeyMismatchError. The RKME case was already
   covered by the logging, but the sendError case wasn't.

Touches cockroachdb#53197

Release note: None
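Roughly, the shape of the change described above is: time each low-level send and emit the slow-RPC warning in every outcome, not only on the mismatch-and-retry path. A hedged sketch follows; sendToReplicas, slowRPCThreshold, and the message format are invented stand-ins, not the real DistSender code.

```go
// Illustrative sketch only: names and log format are stand-ins for the real
// DistSender, which logs "slow range RPC" warnings like those quoted later.
package main

import (
	"errors"
	"fmt"
	"log"
	"time"
)

const slowRPCThreshold = 60 * time.Second

var errSend = errors.New("sendError: failed to contact any replica")

// sendToReplicas stands in for the lower DistSender layer that tries each
// replica listed in the cached descriptor.
func sendToReplicas() error {
	time.Sleep(10 * time.Millisecond) // pretend work
	return errSend
}

// sendWithSlowLogging wraps one attempt and warns if it was slow, whether the
// attempt succeeded, needed a descriptor refresh, or failed with a sendError.
func sendWithSlowLogging(attempt int) error {
	start := time.Now()
	err := sendToReplicas()
	if dur := time.Since(start); dur > slowRPCThreshold {
		log.Printf("slow range RPC: have been waiting %.2fs (%d attempts); resp: %v",
			dur.Seconds(), attempt, err)
	}
	return err
}

func main() {
	for attempt := 1; attempt <= 3; attempt++ {
		if err := sendWithSlowLogging(attempt); err != nil {
			fmt.Printf("attempt %d failed: %v; refreshing descriptor and retrying\n", attempt, err)
			continue
		}
		break
	}
}
```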
@andreimatei
Contributor

I'm not sure what to say. This looks different from #52468 because here, if goroutine 913610 is indeed looping constantly, it's doing so at a higher level than in #52468. The DistSender has some logging around there for requests that are constantly retried, except it wasn't put in a very good place, so it was missing some cases. I'm trying to improve that in the PR above.
If you catch this problem in the act, the best thing to do is to logspy the goroutine id for a few seconds.
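For anyone who does catch it live, something like the following could capture a few seconds of logs filtered to the stuck goroutine. This assumes the node serves CockroachDB's /debug/logspy endpoint with duration and grep query parameters; treat the exact URL and parameters as an assumption to verify against the version under test.

```go
// Sketch of capturing a few seconds of logs for one goroutine ID.
// Assumption: /debug/logspy exists and accepts duration and grep parameters;
// adjust host, port, and TLS settings for the actual cluster.
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

func main() {
	goroutineID := "913610" // the goroutine from the stacks above
	url := fmt.Sprintf("http://localhost:8080/debug/logspy?duration=10s&grep=%s", goroutineID)

	resp, err := http.Get(url)
	if err != nil {
		fmt.Fprintln(os.Stderr, "logspy request failed:", err)
		os.Exit(1)
	}
	defer resp.Body.Close()

	// Stream whatever the endpoint returns to stdout for later inspection.
	if _, err := io.Copy(os.Stdout, resp.Body); err != nil {
		fmt.Fprintln(os.Stderr, "copy failed:", err)
	}
}
```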

craig bot pushed a commit that referenced this issue Sep 1, 2020
53705: kvclient: improve logging of slow RPCs in DistSender r=andreimatei a=andreimatei

Release justification: small change improving logging, hoping to help
debug #53197.

This patch moves the logging that the DistSender does when it detects an
RPC that is slow such that it kicks in in more cases:
1) if the RPC was successful (but slow)
2) if the lower layer in the DistSender failed to contact any replica
   and returned a sendError{}. In this case the range descriptor will be
   refreshed and the RPC will be retried, similar to the case in which a
   replica returned RangeKeyMismatchError. The RKME case was already
   covered by the logging, but the sendError case wasn't.

Touches #53197

Release note: None

53721: geo: rebrand geospatial to spatial with minor refactors r=sumeerbhola a=otan

* Rename geospatial to spatial (for outwards facing docs).
* Replace all (args[d].(*tree.DType)) to tree.MustBeDType(args[d])
* Add tests to ensure all geo builtin info has a capital letter and
  full stop.

Release justification: low risk update to new functionality

Release note: None



53778: backupccl: fixes cluster backup error string r=pbardea,dt a=adityamaru

Fixes: #52018

Release justification: low risk, high benefit changes to existing functionality

53785: opt: remove xfunc references and clean up EnsureKey r=RaduBerinde a=mgartner

#### norm: clean up EnsureKey decorrelation function

This commit replaces a single-case type switch statement in EnsureKey
with an `if` with a type assertion. It also adds clarifying comments.

Release justification: This is a low-risk change to new and minor
changes to existing functionality.

Release note: None

#### opt: fix comment in xform custom_funcs.go

This commit removes references to the non-existent `xfunc` package.

Release justification: This is a non-production code-change.

Release note: None


53793: sql: skip TestCancelQueryPermissions r=solongordon a=solongordon

This test is exposing more data races. Skipping until we can fix them.

Refs: #53791

Release justification: non-production code changes

Release note: None

Co-authored-by: Andrei Matei <[email protected]>
Co-authored-by: Oliver Tan <[email protected]>
Co-authored-by: Aditya Maru <[email protected]>
Co-authored-by: Marcus Gartner <[email protected]>
Co-authored-by: Solon Gordon <[email protected]>
@cockroach-teamcity
Member Author

(roachtest).restore2TB/nodes=10 failed on master@9fae2abb48a00c26ac0344f35cd100df7faa2aec:

		  -- stack trace:
		  | main.(*cluster).RunE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2284
		  | main.(*cluster).Run
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2204
		  | main.registerRestore.func1.1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/restore.go:255
		  | main.(*monitor).Go.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2626
		  | golang.org/x/sync/errgroup.(*Group).Go.func1
		  | 	/home/agent/work/.go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:57
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1357
		Wraps: (2) output in run_074331.472_n1_cockroach_sql
		Wraps: (3) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-2247288-1599199488-21-n10cpu4:1 -- ./cockroach sql --insecure -e "
		  | 				RESTORE csv.bank FROM
		  | 				'gs://cockroach-fixtures/workload/bank/version=1.0.0,payload-bytes=10240,ranges=0,rows=65104166,seed=1/bank'
		  | 				WITH into_db = 'restore2tb'" returned
		  | stderr:
		  | Flag --insecure has been deprecated, it will be removed in a subsequent release.
		  | For details, see: https://go.crdb.dev/issue-v/53404/v20.2
		  |
		  | stdout:
		Wraps: (4) secondary error attachment
		  | signal: killed
		  | (1) signal: killed
		  | Error types: (1) *exec.ExitError
		Wraps: (5) context canceled
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *main.withCommandDetails (4) *secondary.withSecondaryError (5) *errors.errorString

	cluster.go:2648,restore.go:262,test_runner.go:754: monitor failure: monitor task failed: context canceled
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitor).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2636
		  | main.(*monitor).Wait
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2644
		  | main.registerRestore.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/restore.go:262
		  | [...repeated from below...]
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitor).wait.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2692
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1357
		Wraps: (4) monitor task failed
		Wraps: (5) context canceled
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *errors.errorString


Artifacts: /restore2TB/nodes=10

See this test on roachdash
powered by pkg/cmd/internal/issues

@pbardea
Contributor

pbardea commented Sep 4, 2020

Looks like another timeout; we got a bit more in the logs this time.

Hanging AdminSplit request stack:

goroutine 1075307 [select]:
github.com/cockroachdb/cockroach/pkg/util/retry.(*Retry).Next(0xc001c50d40, 0xc00452a480)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/retry/retry.go:128 +0x185
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).sendPartialBatch(0xc0004e4400, 0x6152d60, 0xc012eaa6f0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:1450 +0x273
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).divideAndSendBatchToRanges(0xc0004e4400, 0x6152d60, 0xc012eaa6f0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:1127 +0x16b2
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).Send(0xc0004e4400, 0x6152d60, 0xc012eaa6f0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:766 +0x7a6
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnLockGatekeeper).SendLocked(0xc001cf1ac0, 0x6152d60, 0xc012eaa6f0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_lock_gatekeeper.go:86 +0x118
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnMetricRecorder).SendLocked(0xc001cf1a88, 0x6152d60, 0xc012eaa6f0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_metric_recorder.go:46 +0x9a
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnCommitter).SendLocked(0xc001cf1a58, 0x6152d60, 0xc012eaa6f0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_committer.go:126 +0x8b5
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnSpanRefresher).sendLockedWithRefreshAttempts(0xc001cf19b8, 0x6152d60, 0xc012eaa6f0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:267 +0xa8
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnSpanRefresher).SendLocked(0xc001cf19b8, 0x6152d60, 0xc012eaa6f0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:202 +0x28a
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnPipeliner).SendLocked(0xc001cf18f8, 0x6152d60, 0xc012eaa6f0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_pipeliner.go:252 +0x1a0
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnSeqNumAllocator).SendLocked(0xc001cf18d8, 0x6152d60, 0xc012eaa6f0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_seq_num_allocator.go:105 +0x219
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnHeartbeater).SendLocked(0xc001cf1838, 0x6152d60, 0xc012eaa6f0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_heartbeater.go:172 +0x1bd
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*TxnCoordSender).Send(0xc001cf1680, 0x6152d60, 0xc012eaa6f0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_coord_sender.go:500 +0x413
github.com/cockroachdb/cockroach/pkg/kv.(*DB).sendUsingSender(0xc0004a7500, 0x6152d60, 0xc012eaa690, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:744 +0x13c
github.com/cockroachdb/cockroach/pkg/kv.(*Txn).Send(0xc0122f4ab0, 0x6152d60, 0xc012eaa690, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn.go:919 +0x12b
github.com/cockroachdb/cockroach/pkg/kv.sendAndFill(0x6152d60, 0xc012eaa690, 0xc001c53058, 0xc001796000, 0x1, 0x1)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:654 +0x104
github.com/cockroachdb/cockroach/pkg/kv.(*Txn).Run(0xc0122f4ab0, 0x6152d60, 0xc012eaa690, 0xc001796000, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn.go:590 +0xfa
github.com/cockroachdb/cockroach/pkg/kv.(*Txn).Get(0xc0122f4ab0, 0x6152d60, 0xc012eaa690, 0x4d8e1e0, 0xc00504ef20, 0x4, 0x0, 0x0, 0x0, 0xc010ac4940, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn.go:369 +0xbc
github.com/cockroachdb/cockroach/pkg/kv/kvserver.conditionalGetDescValueFromDB(0x6152d60, 0xc012eaa690, 0xc0122f4ab0, 0xc0153dc3d8, 0x6, 0x8, 0xc0046c5860, 0x9789cc, 0xc009d67980, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_command.go:2018 +0xf2
github.com/cockroachdb/cockroach/pkg/kv/kvserver.splitTxnStickyUpdateAttempt(0x6152d60, 0xc012eaa690, 0xc0122f4ab0, 0xc00e21bb90, 0x1631897873e4c96e, 0xc000000000, 0x9789cc, 0xc012eaa4e0)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_command.go:262 +0x98
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).adminSplitWithDescriptor.func1(0x6152d60, 0xc012eaa690, 0xc0122f4ab0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_command.go:385 +0x58
github.com/cockroachdb/cockroach/pkg/kv.(*DB).Txn.func1(0x6152d60, 0xc012eaa690, 0xc0122f4ab0, 0x3ff0000000000000, 0x62c9580)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:707 +0x43
github.com/cockroachdb/cockroach/pkg/kv.(*Txn).exec(0xc0122f4ab0, 0x6152d60, 0xc012eaa690, 0xc001c53730, 0xc0122f4ab0, 0xc000000000)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn.go:811 +0xd9
github.com/cockroachdb/cockroach/pkg/kv.(*DB).Txn(0xc0004a7500, 0x6152d60, 0xc012eaa690, 0xc001c539d8, 0x6, 0x8)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:706 +0xfa
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).adminSplitWithDescriptor(0xc00f0dd000, 0x6152d60, 0xc012eaa690, 0xc013230228, 0x6, 0x8, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_command.go:384 +0x535
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).AdminSplit.func1(0xc00e21bb90, 0xc00e21bb90, 0xc012eaa690)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_command.go:58 +0xf5
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).executeAdminCommandWithDescriptor(0xc00f0dd000, 0x6152d60, 0xc012eaa690, 0xc001c53e88, 0x4b9b180)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_command.go:549 +0x20a
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).AdminSplit(0xc00f0dd000, 0x6152d60, 0xc012eaa690, 0xc013230228, 0x6, 0x8, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_command.go:56 +0x1a6

Some of the logs:

Grepping the stuck goroutine above:
$ cat nodes/*/logs/* | grep 1075307

W200904 08:09:13.549124 1075307 kv/kvclient/kvcoord/dist_sender.go:1493 ⋮ [n1,s1,r3715/5:‹/Table/53/1/2588030{-/0}›,txn=‹c7bd557d›] slow range RPC: ‹have been waiting 60.42s (65 attempts) for RPC [txn: c7bd557d], [can-forward-ts], Get [/Local/Range/Table/53/1/2588030/RangeDescriptor,/Min) to r247:/Table/53/1/2{499238/0-799217/0} [(n1,s1):1, (n9,s9):2, (n3,s3):3, next=4, gen=44, sticky=1599209017.179184168,1]; resp: failed to send RPC: sending to all replicas failed; last error: r247 was not found on s1›

And all slow range logs:
$ cat nodes/*/logs/* | grep "slow range RPC"

W200904 08:02:29.548152 6070 kv/kvclient/kvcoord/dist_sender.go:1493 ⋮ [n1,job=586989617193713665,restore-distsql] slow range RPC: ‹have been waiting 759.19s (1 attempts) for RPC Import [/Table/53/1/6651521,/Min) to r2261:/Table/53/1/6651521{-/0} [(n1,s1):1, (n10,s10):2, (n8,s8):3, next=4, gen=152, sticky=1599209029.129901140,0]; resp: (err: <nil>), *roachpb.ImportResponse›
W200904 08:09:13.549124 1075307 kv/kvclient/kvcoord/dist_sender.go:1493 ⋮ [n1,s1,r3715/5:‹/Table/53/1/2588030{-/0}›,txn=‹c7bd557d›] slow range RPC: ‹have been waiting 60.42s (65 attempts) for RPC [txn: c7bd557d], [can-forward-ts], Get [/Local/Range/Table/53/1/2588030/RangeDescriptor,/Min) to r247:/Table/53/1/2{499238/0-799217/0} [(n1,s1):1, (n9,s9):2, (n3,s3):3, next=4, gen=44, sticky=1599209017.179184168,1]; resp: failed to send RPC: sending to all replicas failed; last error: r247 was not found on s1›
W200904 08:02:21.588439 514550 kv/kvclient/kvcoord/dist_sender.go:1493 ⋮ [n10,s10,r13040/4:‹/Table/53/1/233{48614…-51886}›] slow range RPC: ‹have been waiting 811.02s (1 attempts) for RPC AdminSplit [/Table/53/1/23348614,/Min) to r17156:/Table/53/1/23{040148-348614/0} [(n4,s4):1, (n10,s10):2, (n7,s7):3, next=4, gen=473, sticky=1599209160.852120104,0]; resp: (err: key range /Table/53/1/23348614-/Table/53/1/23348614 outside of bounds of range /Table/53/1/23040148-/Table/53/1/23040148/0; suggested ranges: [desc: r17156:/Table/53/1/23040148{-/0} [(n4,s4):1, (n10,s10):2, (n7,s7):3, next=4, gen=474, sticky=1599209188.849271700,0] lease: repl=(n10,s10):2 seq=18 start=1599205566.151951687,0 epo=1 pro=1599205566.151954298,0])›
W200904 08:02:27.103194 609 kv/kvclient/kvcoord/dist_sender.go:1493 ⋮ [n10] slow range RPC: ‹have been waiting 817.21s (1 attempts) for RPC Import [/Table/53/1/8140764/0,/Min) to r5474:/Table/53/1/814{0764/0-4037} [(n8,s8):1, (n10,s10):2, (n9,s9):3, next=4, gen=105, sticky=1599209068.781542347,1]; resp: (err: <nil>), *roachpb.ImportResponse›
W200904 08:02:23.280058 1348 kv/kvclient/kvcoord/dist_sender.go:1493 ⋮ [n2] slow range RPC: ‹have been waiting 647.37s (1 attempts) for RPC Import [/Table/53/1/4906141/0,/Min) to r2472:/Table/53/1/490{6141/0-9414} [(n1,s1):4, (n5,s5):2, (n10,s10):3, next=5, gen=274, sticky=1599209033.406791799,0]; resp: (err: <nil>), *roachpb.ImportResponse›
W200904 08:05:28.509396 295036 kv/kvclient/kvcoord/dist_sender.go:1493 ⋮ [n2] slow range RPC: ‹have been waiting 1115.81s (1 attempts) for RPC GC [/Table/53/1/33529706/0,/Table/53/1/33529706/0/NULL) to r21657:/Table/53/1/335{29706/0-32978} [(n2,s2):1, (n6,s6):2, (n1,s1):3, next=4, gen=270, sticky=1599209211.637839257,0]; resp: (err: <nil>), *roachpb.GCResponse›
W200904 08:02:29.550307 1370 kv/kvclient/kvcoord/dist_sender.go:1493 ⋮ [n3] slow range RPC: ‹have been waiting 809.96s (1 attempts) for RPC Import [/Table/53/1/1798878/0,/Min) to r2806:/Table/53/1/1{798878/0-802151} [(n10,s10):1, (n9,s9):2, (n5,s5):3, next=4, gen=243, sticky=1599209037.588061728,0]; resp: (err: <nil>), *roachpb.ImportResponse›
W200904 08:02:26.019307 1035 kv/kvclient/kvcoord/dist_sender.go:1493 ⋮ [n4] slow range RPC: ‹have been waiting 816.54s (1 attempts) for RPC Import [/Table/53/1/1653249/0,/Min) to r2525:/Table/53/1/165{3249/0-6522} [(n10,s10):1, (n3,s3):2, (n5,s5):3, next=4, gen=117, sticky=1599209034.107904073,1]; resp: (err: <nil>), *roachpb.ImportResponse›
W200904 10:07:26.599479 3300777 kv/kvclient/kvcoord/dist_sender.go:1493 ⋮ [n4,merge,s4,r3711/2:‹/Table/53/1/258{4757-8030}›] slow range RPC: ‹have been waiting 60.00s (1 attempts) for RPC AdminRelocateRange [/Table/53/1/2588030,/Min) to r3715:/Table/53/1/2588030{-/0} [(n10,s10):4, (n1,s1):5, (n7,s7):3, next=6, gen=146, sticky=1599209049.308945756,0]; resp: aborted during DistSender.Send: context deadline exceeded›
W200904 08:02:21.522178 1069 kv/kvclient/kvcoord/dist_sender.go:1493 ⋮ [n5] slow range RPC: ‹have been waiting 952.67s (1 attempts) for RPC AdminSplit [/Table/53/1/23040148,/Min) to r17156:/Table/53/1/23{040148-348614/0} [(n4,s4):1, (n10,s10):2, (n7,s7):3, next=4, gen=473, sticky=1599209160.852120104,0]; resp: (err: <nil>), *roachpb.AdminSplitResponse›
W200904 08:02:23.280485 989 kv/kvclient/kvcoord/dist_sender.go:1493 ⋮ [n5] slow range RPC: ‹have been waiting 810.92s (1 attempts) for RPC Import [/Table/53/1/18382918,/Min) to r10224:/Table/53/1/18382918{-/0} [(n10,s10):1, (n9,s9):4, (n5,s5):3, next=5, gen=105, sticky=1599209109.510530053,0]; resp: (err: <nil>), *roachpb.ImportResponse›
W200904 08:02:28.193744 928 kv/kvclient/kvcoord/dist_sender.go:1493 ⋮ [n7] slow range RPC: ‹have been waiting 815.21s (1 attempts) for RPC Import [/Table/53/1/32345180,/Min) to r28742:/Table/53/1/32345180{-/0} [(n4,s4):1, (n1,s1):2, (n10,s10):3, next=4, gen=405, sticky=1599209281.460985107,1]; resp: (err: <nil>), *roachpb.ImportResponse›
W200904 08:02:28.198013 760 kv/kvclient/kvcoord/dist_sender.go:1493 ⋮ [n8] slow range RPC: ‹have been waiting 813.46s (1 attempts) for RPC Import [/Table/53/1/32217250/0,/Min) to r18252:/Table/53/1/322{17250/0-20522} [(n3,s3):1, (n7,s7):2, (n10,s10):3, next=4, gen=312, sticky=1599209241.422306607,0]; resp: (err: <nil>), *roachpb.ImportResponse›
W200904 08:02:28.194162 785 kv/kvclient/kvcoord/dist_sender.go:1493 ⋮ [n9] slow range RPC: ‹have been waiting 806.42s (1 attempts) for RPC Import [/Table/53/1/6012683,/Min) to r562:/Table/53/1/6012683{-/0} [(n9,s9):1, (n10,s10):4, (n2,s2):3, next=5, gen=133, sticky=1599209019.787172750,0]; resp: (err: <nil>), *roachpb.ImportResponse›

@dt
Member

dt commented Sep 8, 2020

@pbardea If these are still hanging AdminSplit, should we move this over to KV's project board for now?

@pbardea
Contributor

pbardea commented Sep 8, 2020

I think that makes sense -- I'm not sure if there's more investigation that we can do on the Bulk IO side. Happy to help if there's anything that Bulk IO can help with.

@dt dt assigned andreimatei and unassigned pbardea and adityamaru Sep 10, 2020
@andreimatei
Contributor

This reproduced for me in 1/20 runs.
What I saw was a node processing an AdminSplit; while processing the split, it was trying to read its range descriptor. That read was retrying over and over because a stale descriptor was being used - the descriptor didn't include the current leaseholder, which was the node sending the request. Each time, the DistSender would run through the replicas in the descriptor it had, then evict the descriptor from the cache and read a fresh one from meta2. For some reason, every read seemed to return a stale descriptor. Unfortunately the node crashed because of #54226 while I was trying to get more info on the meta2 reads, but what I noticed is that the range in question seems to be in some sort of messy state.

r17261 Range _ Debug _ Cockroach Console.pdf

@andreimatei
Contributor

andreimatei commented Sep 15, 2020

Finally figured it out. Will have a fix tomorrow.

andreimatei added a commit to andreimatei/cockroach that referenced this issue Sep 16, 2020
This patch fixes a bug in the DistSender's interaction with the range
cache. The DistSender interacts with the cache through an EvictionToken;
this token is used to update the lease information for a cache entry and
to evict the entry completely if the descriptor is found to be too stale
to be useful. The problem was that the eviction does not work as intended
if a copy of the token had been used to perform a lease update prior.
Operations that update the cache take a token and return an updated
token, and only the most up to date such token can be used in order to
perform subsequent cache updates (in particular, to evict the cache
entry). The bug was that the DistSender sometimes lost track of this
most up to date token, and tried to evict using an older copy - which
eviction was a no-op.

Specifically, the bad scenario was the following:
- DistSender.sendPartialBatch made a copy of the token and handed it to
  sendToReplicas.
- sendToReplicas updated the cached lease through this token (through
  tok.UpdateLease). The updated token was further used by sendToReplicas
  but not returned to sendPartialBatch.
- No replica can serve the request cause the descriptor is stale, and
  control flow gets back to sendPartialBatch.
- sendPartialBatch uses its own, stale, copy of the token to attempt a
  cache eviction. The cache ignores the eviction request, claiming that
  it has a more up to date entry than the one that sendPartialBatch is
  trying to evict (which is true).
- sendPartialBatch then asks the cache for a new descriptor, and the
  cache returns the same one as before. It also returns the lease that
  we have added before, but that doesn't help very much.
- All this can happen over and over again in some case that I don't
  fully understand. In the restore2TB test, what happens is that the
  node that's supposed to be the leaseholder (say, n1), perpetually
  doesn't know about the range, and another node perpetually thinks that
  n1 is the leaseholder. I'm not sure why the latter happens, but this
  range is probably in an unusual state because the request that's
  spinning in this way is an ImportRequest.

This patch fixes the problem by changing the token's interface - now
different token methods take the receiver by pointer and mutate it, and
the DistSender methods share a single token instance between them.

Fixes cockroachdb#54118
Fixes cockroachdb#53197

Release note: None
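To make the scenario concrete, here is a minimal, self-contained sketch of the value-copy pitfall: the helper updates its copy of the token, the caller's copy goes stale, and the caller's subsequent eviction is silently ignored. All types and method names here are invented for illustration; this is not the RangeCache/DistSender code.

```go
// Illustrative sketch only: cache, evictionToken, updateLease, and evict are
// made-up stand-ins for the RangeCache/EvictionToken interaction described
// in the commit message above.
package main

import "fmt"

type cache struct {
	generation int // bumped on every update to the entry
	entry      string
}

type evictionToken struct {
	c          *cache
	generation int // the cache generation this token knows about
}

// updateLease mutates the cache and returns an updated token. A caller that
// keeps using its old copy instead of the return value now holds a stale token.
func (t evictionToken) updateLease(newEntry string) evictionToken {
	t.c.generation++
	t.c.entry = newEntry
	t.generation = t.c.generation
	return t
}

// evict is ignored when the token is older than the cache, mirroring the
// "cache ignores the eviction request" step in the scenario above.
func (t evictionToken) evict() {
	if t.generation < t.c.generation {
		fmt.Println("evict ignored: token is stale")
		return
	}
	t.c.entry = ""
	fmt.Println("entry evicted")
}

func main() {
	c := &cache{generation: 1, entry: "desc v1"}
	tok := evictionToken{c: c, generation: 1}

	// The sendToReplicas-like helper receives a copy and updates the lease,
	// but the updated token is never handed back to the caller.
	_ = tok.updateLease("desc v1 + new lease")

	// The caller (sendPartialBatch in the scenario) now tries to evict with
	// its stale copy: the eviction is a no-op and the loop repeats.
	tok.evict() // prints "evict ignored: token is stale"

	// The fix described above is, roughly, to stop relying on token identity:
	// either mutate one shared token through pointer receivers, or compare
	// descriptor values instead of token generations when evicting.
}
```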
andreimatei added a commit to andreimatei/cockroach that referenced this issue Sep 17, 2020
This patch fixes a bug in the DistSender's interaction with the range
cache. The DistSender interacts with the cache through an EvictionToken;
this token is used to update the lease information for a cache entry and
to evict the entry completely if the descriptor is found to be too stale
to be useful. The problem was that the eviction does not work as intended
if a copy of the token had been used to perform a lease update prior.
Operations that update the cache take a token and return an updated
token, and only the most up to date such token can be used in order to
perform subsequent cache updates (in particular, to evict the cache
entry). The bug was that the DistSender sometimes lost track of this
most up to date token, and tried to evict using an older copy - which
eviction was a no-op.

Specifically, the bad scenario was the following:
- DistSender.sendPartialBatch made a copy of the token and handed it to
  sendToReplicas.
- sendToReplicas updated the cached lease through this token (through
  tok.UpdateLease). The updated token was further used by sendToReplicas
  but not returned to sendPartialBatch.
- No replica can serve the request cause the descriptor is stale, and
  control flow gets back to sendPartialBatch.
- sendPartialBatch uses its own, stale, copy of the token to attempt a
  cache eviction. The cache ignores the eviction request, claiming that
  it has a more up to date entry than the one that sendPartialBatch is
  trying to evict (which is true).
- sendPartialBatch then asks the cache for a new descriptor, and the
  cache returns the same one as before. It also returns the lease that
  we have added before, but that doesn't help very much.
- All this can happen over and over again in some case that I don't
  fully understand. In the restore2TB test, what happens is that the
  node that's supposed to be the leaseholder (say, n1), perpetually
  doesn't know about the range, and another node perpetually thinks that
  n1 is the leaseholder. I'm not sure why the latter happens, but this
  range is probably in an unusual state because the request that's
  spinning in this way is an ImportRequest.

This patch fixes the problem by changing how descriptor evictions work:
instead of comparing a token's pointer to the current cache entry and
refusing to do anything if they don't match, now evicting a descriptor
compares the descriptor's value with what's in the cache.
The cache code generally moves away from comparing pointers anywhere,
and an EvictionToken stops storing pointers to the cache's insides for
clarity.

Fixes cockroachdb#54118
Fixes cockroachdb#53197

Release note: None
craig bot pushed a commit that referenced this issue Sep 17, 2020
54468: kvclient: fix a request routing bug r=andreimatei a=andreimatei

Co-authored-by: Andrei Matei <[email protected]>
craig bot closed this as completed in 84fa010 on Sep 17, 2020
@andreimatei
Contributor

reopening until backported

andreimatei reopened this Sep 18, 2020
@andreimatei
Contributor

now backported

arulajmani pushed a commit that referenced this issue Sep 18, 2020