
roachtest: tpcc/multiregion/survive=region/chaos=true failed #85711

Closed
cockroach-teamcity opened this issue Aug 7, 2022 · 22 comments · Fixed by #88307
Labels: branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. sync-me T-sql-queries SQL Queries Team
@cockroach-teamcity
Member

cockroach-teamcity commented Aug 7, 2022

roachtest.tpcc/multiregion/survive=region/chaos=true failed with artifacts on master @ a7c91f06d8ee0fa2096bcd626f689009024947bb:

test artifacts and logs in: /artifacts/tpcc/multiregion/survive=region/chaos=true/run_1
	monitor.go:127,tpcc.go:257,tpcc.go:588,test_runner.go:896: monitor failure: monitor command failure: unexpected node event: 6: dead (exit status 7)
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).WaitE
		  | 	main/pkg/cmd/roachtest/monitor.go:115
		  | main.(*monitorImpl).Wait
		  | 	main/pkg/cmd/roachtest/monitor.go:123
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.runTPCC
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpcc.go:257
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerTPCC.func9
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpcc.go:588
		  | [...repeated from below...]
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).wait.func3
		  | 	main/pkg/cmd/roachtest/monitor.go:202
		  | runtime.goexit
		  | 	GOROOT/src/runtime/asm_amd64.s:1571
		Wraps: (4) monitor command failure
		Wraps: (5) unexpected node event: 6: dead (exit status 7)
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *errors.errorString

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_ssd=0

/cc @cockroachdb/multiregion

Jira issue: CRDB-18395

Epic CRDB-19172

@cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Aug 7, 2022
@cockroach-teamcity added this to the 22.2 milestone Aug 7, 2022
@cockroach-teamcity
Member Author

roachtest.tpcc/multiregion/survive=region/chaos=true failed with artifacts on master @ 524fd14da3fefcd849f44a835cc5f88f5dbdadcc:

test artifacts and logs in: /artifacts/tpcc/multiregion/survive=region/chaos=true/run_1
	cluster.go:1930,tpcc.go:169,tpcc.go:174,tpcc.go:220,tpcc.go:587,test_runner.go:896: output in run_141437.675935955_n10_workload_init_tpcc: ./workload init tpcc --warehouses=60 --survival-goal=region --regions=us-east1,us-west1,europe-west2 --partitions=3 {pgurl:1} returned: COMMAND_PROBLEM: exit status 1
		(1) attached stack trace
		  -- stack trace:
		  | main.(*clusterImpl).RunE
		  | 	main/pkg/cmd/roachtest/cluster.go:1971
		  | main.(*clusterImpl).Run
		  | 	main/pkg/cmd/roachtest/cluster.go:1928
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.setupTPCC.func2
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpcc.go:169
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.setupTPCC
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpcc.go:174
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.runTPCC
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpcc.go:220
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerTPCC.func9
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpcc.go:587
		  | main.(*testRunner).runTest.func2
		  | 	main/pkg/cmd/roachtest/test_runner.go:896
		  | runtime.goexit
		  | 	GOROOT/src/runtime/asm_amd64.s:1571
		Wraps: (2) output in run_141437.675935955_n10_workload_init_tpcc
		Wraps: (3) ./workload init tpcc --warehouses=60 --survival-goal=region --regions=us-east1,us-west1,europe-west2 --partitions=3 {pgurl:1} returned
		  | stderr:
		  | I220825 14:14:50.903472 1 ccl/workloadccl/fixture.go:318  [-] 1  starting import of 9 tables
		  | I220825 14:15:24.615581 11 ccl/workloadccl/fixture.go:481  [-] 2  imported 3.3 KiB in warehouse table (60 rows, 0 index entries, took 16.452358462s, 0.00 MiB/s)
		  | I220825 14:15:25.559706 12 ccl/workloadccl/fixture.go:481  [-] 3  imported 62 KiB in district table (600 rows, 0 index entries, took 17.396976519s, 0.00 MiB/s)
		  | I220825 14:15:26.128098 16 ccl/workloadccl/fixture.go:481  [-] 4  imported 9.3 MiB in new_order table (540000 rows, 0 index entries, took 17.964922088s, 0.52 MiB/s)
		  | I220825 14:15:34.517644 66 ccl/workloadccl/fixture.go:481  [-] 5  imported 7.9 MiB in item table (100000 rows, 0 index entries, took 26.35460564s, 0.30 MiB/s)
		  | I220825 14:15:47.033588 14 ccl/workloadccl/fixture.go:481  [-] 6  imported 137 MiB in history table (1800000 rows, 0 index entries, took 38.870797343s, 3.53 MiB/s)
		  | I220825 14:15:54.948904 13 ccl/workloadccl/fixture.go:481  [-] 7  imported 1.0 GiB in customer table (1800000 rows, 1800000 index entries, took 46.786114715s, 22.89 MiB/s)
		  | I220825 14:15:56.908969 15 ccl/workloadccl/fixture.go:481  [-] 8  imported 107 MiB in order table (1800000 rows, 1800000 index entries, took 48.7460691s, 2.20 MiB/s)
		  | I220825 14:17:23.653164 67 ccl/workloadccl/fixture.go:481  [-] 9  imported 1.8 GiB in stock table (6000000 rows, 0 index entries, took 2m15.490116183s, 13.76 MiB/s)
		  | Error: importing fixture: importing table order_line: pq: for table order_line: validate unique constraint: no inbound stream connection
		  |
		  | stdout:
		Wraps: (4) COMMAND_PROBLEM
		Wraps: (5) Node 10. Command with error:
		  | ``````
		  | ./workload init tpcc --warehouses=60 --survival-goal=region --regions=us-east1,us-west1,europe-west2 --partitions=3 {pgurl:1}
		  | ``````
		Wraps: (6) exit status 1
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *cluster.WithCommandDetails (4) errors.Cmd (5) *hintdetail.withDetail (6) *exec.ExitError

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_ssd=0


@cockroach-teamcity
Member Author

roachtest.tpcc/multiregion/survive=region/chaos=true failed with artifacts on master @ f59620ec646d1181d358d0dc41ab60815ecf59c9:

test artifacts and logs in: /artifacts/tpcc/multiregion/survive=region/chaos=true/run_1
	monitor.go:127,tpcc.go:256,tpcc.go:587,test_runner.go:897: monitor failure: monitor command failure: unexpected node event: 5: dead (exit status 7)
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).WaitE
		  | 	main/pkg/cmd/roachtest/monitor.go:115
		  | main.(*monitorImpl).Wait
		  | 	main/pkg/cmd/roachtest/monitor.go:123
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.runTPCC
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpcc.go:256
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerTPCC.func9
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpcc.go:587
		  | [...repeated from below...]
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).wait.func3
		  | 	main/pkg/cmd/roachtest/monitor.go:202
		  | runtime.goexit
		  | 	GOROOT/src/runtime/asm_amd64.s:1571
		Wraps: (4) monitor command failure
		Wraps: (5) unexpected node event: 5: dead (exit status 7)
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *errors.errorString

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_ssd=0


@nvanbenschoten
Member

In the last failure, a node crashed with the fatal error:

F220828 15:44:06.613596 1947 kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:222 ⋮ [n5,client=10.142.0.45:35178,user=root] 2635  unexpected WriteTooOld request. ba: ‹EndTxn(abort) [/Min], [txn: eab4e265]› (txn: ‹"sql txn" meta={id=eab4e265 key=/Table/111/1/"\xc0"/49/10/0 pri=0.04708306 epo=0 ts=1661701444.082250035,1 min=1661701442.786643568,0 seq=19} lock=true stat=PENDING rts=1661701442.786643568,0 wto=true gul=1661701443.286643568,0›)
F220828 15:44:06.613596 1947 kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:222 ⋮ [n5,client=10.142.0.45:35178,user=root] 2635 !goroutine 1947 [running]:
F220828 15:44:06.613596 1947 kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:222 ⋮ [n5,client=10.142.0.45:35178,user=root] 2635 !github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0x1)
F220828 15:44:06.613596 1947 kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:222 ⋮ [n5,client=10.142.0.45:35178,user=root] 2635 !	github.com/cockroachdb/cockroach/pkg/util/log/get_stacks.go:25 +0x8a
F220828 15:44:06.613596 1947 kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:222 ⋮ [n5,client=10.142.0.45:35178,user=root] 2635 !github.com/cockroachdb/cockroach/pkg/util/log.(*loggerT).outputLogEntry(0xc00216a340, {{{0xc00803d380, 0x24}, {0x512a970, 0x1}, {0x0, 0x0}, {0x0, 0x0}}, 0x170f8ca96027f8ee, ...})
F220828 15:44:06.613596 1947 kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:222 ⋮ [n5,client=10.142.0.45:35178,user=root] 2635 !	github.com/cockroachdb/cockroach/pkg/util/log/clog.go:239 +0x97
F220828 15:44:06.613596 1947 kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:222 ⋮ [n5,client=10.142.0.45:35178,user=root] 2635 !github.com/cockroachdb/cockroach/pkg/util/log.logfDepthInternal({0x5ef0a28, 0xc014ab8d50}, 0x2, 0x4, 0x0, 0x0?, {0x51953d5, 0x30}, {0xc01b5e3ca0, 0x2, ...})
F220828 15:44:06.613596 1947 kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:222 ⋮ [n5,client=10.142.0.45:35178,user=root] 2635 !	github.com/cockroachdb/cockroach/pkg/util/log/channels.go:106 +0x645
F220828 15:44:06.613596 1947 kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:222 ⋮ [n5,client=10.142.0.45:35178,user=root] 2635 !github.com/cockroachdb/cockroach/pkg/util/log.logfDepth(...)
F220828 15:44:06.613596 1947 kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:222 ⋮ [n5,client=10.142.0.45:35178,user=root] 2635 !	github.com/cockroachdb/cockroach/pkg/util/log/channels.go:39
F220828 15:44:06.613596 1947 kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:222 ⋮ [n5,client=10.142.0.45:35178,user=root] 2635 !github.com/cockroachdb/cockroach/pkg/util/log.Fatalf(...)
F220828 15:44:06.613596 1947 kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:222 ⋮ [n5,client=10.142.0.45:35178,user=root] 2635 !	github.com/cockroachdb/cockroach/bazel-out/k8-opt/bin/pkg/util/log/log_channels_generated.go:834
F220828 15:44:06.613596 1947 kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:222 ⋮ [n5,client=10.142.0.45:35178,user=root] 2635 !github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnSpanRefresher).sendLockedWithRefreshAttempts(_, {_, _}, {{{0x0, 0x0, 0x0}, 0x0, {0x0, 0x0, 0x0}, ...}, ...}, ...)
F220828 15:44:06.613596 1947 kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:222 ⋮ [n5,client=10.142.0.45:35178,user=root] 2635 !	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:222 +0x225
F220828 15:44:06.613596 1947 kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:222 ⋮ [n5,client=10.142.0.45:35178,user=root] 2635 !github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnSpanRefresher).SendLocked(_, {_, _}, {{{0x0, 0x0, 0x0}, 0x0, {0x0, 0x0, 0x0}, ...}, ...})
F220828 15:44:06.613596 1947 kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:222 ⋮ [n5,client=10.142.0.45:35178,user=root] 2635 !	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:153 +0x1cb
F220828 15:44:06.613596 1947 kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:222 ⋮ [n5,client=10.142.0.45:35178,user=root] 2635 !github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnPipeliner).SendLocked(_, {_, _}, {{{0x0, 0x0, 0x0}, 0x0, {0x0, 0x0, 0x0}, ...}, ...})
F220828 15:44:06.613596 1947 kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:222 ⋮ [n5,client=10.142.0.45:35178,user=root] 2635 !	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_pipeliner.go:290 +0x2ee
F220828 15:44:06.613596 1947 kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:222 ⋮ [n5,client=10.142.0.45:35178,user=root] 2635 !github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnSeqNumAllocator).SendLocked(_, {_, _}, {{{0x0, 0x0, 0x0}, 0x0, {0x0, 0x0, 0x0}, ...}, ...})
F220828 15:44:06.613596 1947 kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:222 ⋮ [n5,client=10.142.0.45:35178,user=root] 2635 !	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_seq_num_allocator.go:105 +0xb5
F220828 15:44:06.613596 1947 kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:222 ⋮ [n5,client=10.142.0.45:35178,user=root] 2635 !github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnHeartbeater).SendLocked(_, {_, _}, {{{0x0, 0x0, 0x0}, 0x0, {0x0, 0x0, 0x0}, ...}, ...})
F220828 15:44:06.613596 1947 kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:222 ⋮ [n5,client=10.142.0.45:35178,user=root] 2635 !	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_heartbeater.go:232 +0x4ea
F220828 15:44:06.613596 1947 kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:222 ⋮ [n5,client=10.142.0.45:35178,user=root] 2635 !github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*TxnCoordSender).Send(_, {_, _}, {{{0x0, 0x0, 0x0}, 0x0, {0x0, 0x0, 0x0}, ...}, ...})
F220828 15:44:06.613596 1947 kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:222 ⋮ [n5,client=10.142.0.45:35178,user=root] 2635 !	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_coord_sender.go:525 +0x585
F220828 15:44:06.613596 1947 kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:222 ⋮ [n5,client=10.142.0.45:35178,user=root] 2635 !github.com/cockroachdb/cockroach/pkg/kv.(*DB).sendUsingSender(_, {_, _}, {{{0x0, 0x0, 0x0}, 0x0, {0x0, 0x0, 0x0}, ...}, ...}, ...)
F220828 15:44:06.613596 1947 kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:222 ⋮ [n5,client=10.142.0.45:35178,user=root] 2635 !	github.com/cockroachdb/cockroach/pkg/kv/db.go:999 +0x156
F220828 15:44:06.613596 1947 kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:222 ⋮ [n5,client=10.142.0.45:35178,user=root] 2635 !github.com/cockroachdb/cockroach/pkg/kv.(*Txn).Send(_, {_, _}, {{{0x0, 0x0, 0x0}, 0x0, {0x0, 0x0, 0x0}, ...}, ...})
F220828 15:44:06.613596 1947 kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:222 ⋮ [n5,client=10.142.0.45:35178,user=root] 2635 !	github.com/cockroachdb/cockroach/pkg/kv/txn.go:1091 +0x225
F220828 15:44:06.613596 1947 kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:222 ⋮ [n5,client=10.142.0.45:35178,user=root] 2635 !github.com/cockroachdb/cockroach/pkg/kv.(*Txn).rollback(0xc00bb4fb80, {0x5ef0a28, 0xc00f79ae70})
F220828 15:44:06.613596 1947 kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:222 ⋮ [n5,client=10.142.0.45:35178,user=root] 2635 !	github.com/cockroachdb/cockroach/pkg/kv/txn.go:856 +0x159
F220828 15:44:06.613596 1947 kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:222 ⋮ [n5,client=10.142.0.45:35178,user=root] 2635 !github.com/cockroachdb/cockroach/pkg/kv.(*Txn).Rollback(0x3231588?, {0x5ef0a28?, 0xc00f79ae70?})
F220828 15:44:06.613596 1947 kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:222 ⋮ [n5,client=10.142.0.45:35178,user=root] 2635 !	github.com/cockroachdb/cockroach/pkg/kv/txn.go:840 +0x6a
F220828 15:44:06.613596 1947 kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:222 ⋮ [n5,client=10.142.0.45:35178,user=root] 2635 !github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).rollbackSQLTransaction(0xc017a5d900, {0x5ef0a28, 0xc00f79ae70}, {0x5f187d8, 0x975c988})
F220828 15:44:06.613596 1947 kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:222 ⋮ [n5,client=10.142.0.45:35178,user=root] 2635 !	github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:1004 +0x4f
F220828 15:44:06.613596 1947 kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:222 ⋮ [n5,client=10.142.0.45:35178,user=root] 2635 !github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execStmtInAbortedState(0xc017a5d900, {0x5ef0a28, 0xc00f79ae70}, {0x5f187d8?, 0x975c988?}, {0x7f0197a6afd8, 0xc017a63bc0})
F220828 15:44:06.613596 1947 kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:222 ⋮ [n5,client=10.142.0.45:35178,user=root] 2635 !	github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:1704 +0x48e
F220828 15:44:06.613596 1947 kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:222 ⋮ [n5,client=10.142.0.45:35178,user=root] 2635 !github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execStmt(0xc017a5d900, {0x5ef0a28, 0xc00f79ae70}, {{0x5f187d8, 0x975c988}, {0xc021f4c7aa, 0x8}, 0x0, 0x0}, 0x0, ...)
F220828 15:44:06.613596 1947 kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:222 ⋮ [n5,client=10.142.0.45:35178,user=root] 2635 !	github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:137 +0x4fd
F220828 15:44:06.613596 1947 kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:222 ⋮ [n5,client=10.142.0.45:35178,user=root] 2635 !github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execCmd.func1({{{0x5f187d8, 0x975c988}, {0xc021f4c7aa, 0x8}, 0x0, 0x0}, {0xc0bb0131a48f9649, 0xfbda964e0b, 0x0}, {0xc0bb0131a48f9dd0, ...}, ...}, ...)
F220828 15:44:06.613596 1947 kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:222 ⋮ [n5,client=10.142.0.45:35178,user=root] 2635 !	github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:1905 +0x305
F220828 15:44:06.613596 1947 kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:222 ⋮ [n5,client=10.142.0.45:35178,user=root] 2635 !github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execCmd(0xc017a5d900)
F220828 15:44:06.613596 1947 kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:222 ⋮ [n5,client=10.142.0.45:35178,user=root] 2635 !	github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:1909 +0xb88
F220828 15:44:06.613596 1947 kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:222 ⋮ [n5,client=10.142.0.45:35178,user=root] 2635 !github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).run(0xc017a5d900, {0x5ef0980, 0xc017851b80}, 0xc000934f00?, 0x0?, 0xc00a0e2630?)
F220828 15:44:06.613596 1947 kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:222 ⋮ [n5,client=10.142.0.45:35178,user=root] 2635 !	github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:1831 +0x208
F220828 15:44:06.613596 1947 kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:222 ⋮ [n5,client=10.142.0.45:35178,user=root] 2635 !github.com/cockroachdb/cockroach/pkg/sql.(*Server).ServeConn(0xc000935280?, {0x5ef0980?, 0xc017851b80?}, {0xc00807c700?}, 0x3?, 0xc017a2e690?)
F220828 15:44:06.613596 1947 kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:222 ⋮ [n5,client=10.142.0.45:35178,user=root] 2635 !	github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:824 +0xe6
F220828 15:44:06.613596 1947 kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:222 ⋮ [n5,client=10.142.0.45:35178,user=root] 2635 !github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*conn).processCommandsAsync.func1()
F220828 15:44:06.613596 1947 kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:222 ⋮ [n5,client=10.142.0.45:35178,user=root] 2635 !	github.com/cockroachdb/cockroach/pkg/sql/pgwire/conn.go:728 +0x3fe
F220828 15:44:06.613596 1947 kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:222 ⋮ [n5,client=10.142.0.45:35178,user=root] 2635 !created by github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*conn).processCommandsAsync
F220828 15:44:06.613596 1947 kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:222 ⋮ [n5,client=10.142.0.45:35178,user=root] 2635 !	github.com/cockroachdb/cockroach/pkg/sql/pgwire/conn.go:639 +0x22a
F220828 15:44:06.613596 1947 kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:222 ⋮ [n5,client=10.142.0.45:35178,user=root] 2635 !
F220828 15:44:06.613596 1947 kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:222 ⋮ [n5,client=10.142.0.45:35178,user=root] 2635 !goroutine 1 [runnable]:
F220828 15:44:06.613596 1947 kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:222 ⋮ [n5,client=10.142.0.45:35178,user=root] 2635 !github.com/cockroachdb/cockroach/pkg/cli.waitForShutdown(0xc0005e2610, 0xc000ec0cf0, 0xc0014ed2c0, 0xc0014ec540, 0xc000122940)
F220828 15:44:06.613596 1947 kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:222 ⋮ [n5,client=10.142.0.45:35178,user=root] 2635 !	github.com/cockroachdb/cockroach/pkg/cli/start.go:724 +0x1b8
F220828 15:44:06.613596 1947 kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:222 ⋮ [n5,client=10.142.0.45:35178,user=root] 2635 !github.com/cockroachdb/cockroach/pkg/cli.runStart(0x8b24b40, {0x467f34?, 0xc00013fe10?, 0xc00013fd50?}, 0x0)
F220828 15:44:06.613596 1947 kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:222 ⋮ [n5,client=10.142.0.45:35178,user=root] 2635 !	github.com/cockroachdb/cockroach/pkg/cli/start.go:672 +0x858
F220828 15:44:06.613596 1947 kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:222 ⋮ [n5,client=10.142.0.45:35178,user=root] 2635 !github.com/cockroachdb/cockroach/pkg/cli.runStartJoin(0xc0005ef960?, {0xc001bec8c0?, 0xc000c0bfa0?, 0x4?})
F220828 15:44:06.613596 1947 kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:222 ⋮ [n5,client=10.142.0.45:35178,user=root] 2635 !	github.com/cockroachdb/cockroach/pkg/cli/start.go:341 +0x25
F220828 15:44:06.613596 1947 kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:222 ⋮ [n5,client=10.142.0.45:35178,user=root] 2635

We'll want KV to investigate, so moving this to KV.

@blathers-crl bot added the T-kv KV Team label Aug 29, 2022
@irfansharif self-assigned this Aug 30, 2022
@cockroach-teamcity
Member Author

roachtest.tpcc/multiregion/survive=region/chaos=true failed with artifacts on master @ 4dcb32c0346e20a95847763f89b9b0796d9ed4dc:

test artifacts and logs in: /artifacts/tpcc/multiregion/survive=region/chaos=true/run_1
	cluster.go:1940,tpcc.go:169,tpcc.go:174,tpcc.go:220,tpcc.go:587,test_runner.go:897: output in run_141254.953750325_n10_workload_init_tpcc: ./workload init tpcc --warehouses=60 --survival-goal=region --regions=us-east1,us-west1,europe-west2 --partitions=3 {pgurl:1} returned: COMMAND_PROBLEM: exit status 1
		(1) attached stack trace
		  -- stack trace:
		  | main.(*clusterImpl).RunE
		  | 	main/pkg/cmd/roachtest/cluster.go:1981
		  | main.(*clusterImpl).Run
		  | 	main/pkg/cmd/roachtest/cluster.go:1938
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.setupTPCC.func2
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpcc.go:169
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.setupTPCC
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpcc.go:174
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.runTPCC
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpcc.go:220
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerTPCC.func9
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpcc.go:587
		  | main.(*testRunner).runTest.func2
		  | 	main/pkg/cmd/roachtest/test_runner.go:897
		  | runtime.goexit
		  | 	GOROOT/src/runtime/asm_amd64.s:1571
		Wraps: (2) output in run_141254.953750325_n10_workload_init_tpcc
		Wraps: (3) ./workload init tpcc --warehouses=60 --survival-goal=region --regions=us-east1,us-west1,europe-west2 --partitions=3 {pgurl:1} returned
		  | stderr:
		  | I220903 14:13:08.167928 1 ccl/workloadccl/fixture.go:318  [-] 1  starting import of 9 tables
		  | I220903 14:13:38.429988 62 ccl/workloadccl/fixture.go:481  [-] 2  imported 7.9 MiB in item table (100000 rows, 0 index entries, took 8.005666814s, 0.98 MiB/s)
		  | I220903 14:13:50.623488 57 ccl/workloadccl/fixture.go:481  [-] 3  imported 62 KiB in district table (600 rows, 0 index entries, took 20.19939162s, 0.00 MiB/s)
		  | I220903 14:13:52.293290 61 ccl/workloadccl/fixture.go:481  [-] 4  imported 9.3 MiB in new_order table (540000 rows, 0 index entries, took 21.868975727s, 0.42 MiB/s)
		  | I220903 14:13:54.635864 56 ccl/workloadccl/fixture.go:481  [-] 5  imported 3.3 KiB in warehouse table (60 rows, 0 index entries, took 24.21180266s, 0.00 MiB/s)
		  | I220903 14:14:18.254727 59 ccl/workloadccl/fixture.go:481  [-] 6  imported 137 MiB in history table (1800000 rows, 0 index entries, took 47.830509204s, 2.87 MiB/s)
		  | I220903 14:14:21.135491 60 ccl/workloadccl/fixture.go:481  [-] 7  imported 107 MiB in order table (1800000 rows, 1800000 index entries, took 50.71117399s, 2.12 MiB/s)
		  | I220903 14:14:25.943483 58 ccl/workloadccl/fixture.go:481  [-] 8  imported 1.0 GiB in customer table (1800000 rows, 1800000 index entries, took 55.519369356s, 19.29 MiB/s)
		  | I220903 14:15:18.625457 64 ccl/workloadccl/fixture.go:481  [-] 9  imported 1.0 GiB in order_line table (18003235 rows, 0 index entries, took 1m48.201102787s, 9.94 MiB/s)
		  | Error: importing fixture: importing table stock: pq: for table stock: validate unique constraint: no inbound stream connection
		  |
		  | stdout:
		Wraps: (4) COMMAND_PROBLEM
		Wraps: (5) Node 10. Command with error:
		  | ``````
		  | ./workload init tpcc --warehouses=60 --survival-goal=region --regions=us-east1,us-west1,europe-west2 --partitions=3 {pgurl:1}
		  | ``````
		Wraps: (6) exit status 1
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *cluster.WithCommandDetails (4) errors.Cmd (5) *hintdetail.withDetail (6) *exec.ExitError

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_ssd=0


@irfansharif
Contributor

Looking.

@irfansharif
Contributor

irfansharif commented Sep 7, 2022

There are two failure modes in this roachtest.

  • One where the import step fails partway through due to "no inbound stream connection" errors, subject to the 10s timeout here -- an error we don't check for explicitly here, so it fails the entire import step. Should we? We seem to do so for schema change ops here. I'm not looking at this failure mode, though I think it's the more frequent one. I sanity checked that we're not running chaos events during the import step. Perhaps it's worth investigating why we're unable to set up the inbound stream within 10s, though naively retrying on retryable errors instead of failing the entire import sounds reasonable.
  • The other is the following fatal, which I'm still trying to trace through. There's a lot of mutation happening in this code, and the stack trace above indicates a rollback is in flight, so I'm wondering if we're somehow setting the WriteTooOld bit on the txnCoordSender's embedded txn here and later using that txn as part of batch requests here (see the sketch after the snippet below). This is speculative; I haven't tried repro-ing or staring at logs yet, so I'll do that next.

if ba.Txn.WriteTooOld {
	// The WriteTooOld flag is not supposed to be set on requests. It's only set
	// by the server and it's terminated by this interceptor on the client.
	log.Fatalf(ctx, "unexpected WriteTooOld request. ba: %s (txn: %s)",
		ba.String(), ba.Txn.String())
}
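
To make the speculation in the second bullet concrete, here is a minimal, self-contained sketch of why a stray WriteTooOld bit on the coordinator's embedded txn proto would trip the fatal above on the very next request (for example the EndTxn(abort) issued by a rollback). The types here are simplified stand-ins for illustration, not the real roachpb/kvcoord ones:

package main

import "fmt"

// txnProto and batchRequest are simplified stand-ins for roachpb.Transaction
// and roachpb.BatchRequest.
type txnProto struct {
	ID          string
	WriteTooOld bool
}

type batchRequest struct {
	Txn txnProto
}

// coordSender mimics the relevant slice of TxnCoordSender behavior: it keeps
// an embedded txn proto and stamps a copy of it onto every outgoing batch.
type coordSender struct {
	txn txnProto
}

func (tc *coordSender) send(ba batchRequest) error {
	ba.Txn = tc.txn // clone of the embedded proto, as in Send()
	// The span refresher's invariant: WriteTooOld is only ever set by the
	// server on responses and must be terminated client-side before the next
	// request goes out; in the real code this is a log.Fatalf.
	if ba.Txn.WriteTooOld {
		return fmt.Errorf("unexpected WriteTooOld request (txn: %s)", ba.Txn.ID)
	}
	return nil
}

func main() {
	tc := &coordSender{txn: txnProto{ID: "eab4e265"}}

	// Hypothesized bad step: some error-handling path copies a server
	// response's txn (with WriteTooOld=true) back into the embedded proto
	// without stripping the flag.
	tc.txn.WriteTooOld = true

	// The next request -- e.g. the rollback's EndTxn(abort) -- then trips
	// the assertion.
	if err := tc.send(batchRequest{}); err != nil {
		fmt.Println(err) // unexpected WriteTooOld request (txn: eab4e265)
	}
}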

@irfansharif
Contributor

irfansharif commented Sep 7, 2022

Logs weren't informative. Kicking off a few runs with the following additional logs:

diff --git c/pkg/kv/kvclient/kvcoord/txn_coord_sender.go i/pkg/kv/kvclient/kvcoord/txn_coord_sender.go
index 7f3156a332..10f801d0bf 100644
--- c/pkg/kv/kvclient/kvcoord/txn_coord_sender.go
+++ i/pkg/kv/kvclient/kvcoord/txn_coord_sender.go
@@ -130,7 +130,7 @@ type TxnCoordSender struct {
                closed bool

                // txn is the Transaction proto attached to all the requests and updated on
-               // all the responses.
+               // all the responses. // XXX: Is there a response this is updated on?
                txn roachpb.Transaction

                // userPriority is the txn's priority. Used when restarting the transaction.
@@ -519,7 +519,13 @@ func (tc *TxnCoordSender) Send(

        // Clone the Txn's Proto so that future modifications can be made without
        // worrying about synchronization.
+       // XXX: Here? There's code elsewhere that mutates the embedded txn on WTOE
+       // retry. There's a rollback attempt as well. So we could've mucked with
+       // this state on an error.
        ba.Txn = tc.mu.txn.Clone()
+       if ba.Txn.WriteTooOld {
+               log.Infof(ctx, "xxx: set wto bit on batch request's txn")
+       }

        // Send the command through the txnInterceptor stack.
        br, pErr := tc.interceptorStack[0].SendLocked(ctx, ba)
@@ -770,7 +776,7 @@ func (tc *TxnCoordSender) UpdateStateOnRemoteRetryableErr(
 // not be usable afterwards (in case of TransactionAbortedError). The caller is
 // expected to check the ID of the resulting transaction. If the TxnCoordSender
 // can still be used, it will have been prepared for a new epoch.
-func (tc *TxnCoordSender) handleRetryableErrLocked(
+func (tc *TxnCoordSender) handleRetryableErrLocked( // XXX: Latest. Look at callstacks.
        ctx context.Context, pErr *roachpb.Error,
 ) *roachpb.TransactionRetryWithProtoRefreshError {
        // If the error is a transaction retry error, update metrics to
@@ -808,7 +814,7 @@ func (tc *TxnCoordSender) handleRetryableErrLocked(
                tc.metrics.RestartsUnknown.Inc()
        }
        errTxnID := pErr.GetTxn().ID
-       newTxn := roachpb.PrepareTransactionForRetry(ctx, pErr, tc.mu.userPriority, tc.clock)
+       newTxn := roachpb.PrepareTransactionForRetry(ctx, pErr, tc.mu.userPriority, tc.clock) // XXX: Here? We update the embedded txn

        // We'll pass a TransactionRetryWithProtoRefreshError up to the next layer.
        retErr := roachpb.NewTransactionRetryWithProtoRefreshError(
@@ -837,6 +843,9 @@ func (tc *TxnCoordSender) handleRetryableErrLocked(

        // This is where we get a new epoch.
        tc.mu.txn.Update(&newTxn)
+       if tc.mu.txn.WriteTooOld {
+               log.Infof(ctx, "xxx: set wto bit on embedded txn")
+       }

Using:

bin/roachtest run tpcc/multiregion/survive=region/chaos=true --cockroach=./cockroach --debug --count 5

@irfansharif
Contributor

Can't read much from the test history: this first failed around August 7th (when this issue was filed) and has failed sporadically since. There's been a spate of 22.1 failures, but that was something else: #78619 (comment).

@irfansharif
Contributor

irfansharif commented Sep 7, 2022

Still speculative since my repros are still running (these tests are really long-running and expensive -- 10 nodes!), but my money's on #85101 (+cc @yuzefovich).

@irfansharif
Contributor

irfansharif commented Sep 8, 2022

Six runs in parallel, taking ~2 hours each, didn't turn up anything. I'll try reproducing it more directly next, maybe by looking at which codepaths #85101 changed. This feels like a valid release blocker.

@irfansharif
Contributor

The code we deleted claims that only 19.2 code "might give us an error with the WriteTooOld flag set":

https://github.com/cockroachdb/cockroach/pull/85101/files#diff-6e1fd44143e2a6ac6ea3984fe9e6e92fbf30da390e3f22dc652fe6bf3b31429cL223-L229

But isn't it possible in master, due to the following sequence:

baHeader.Txn.WriteTooOld = true

pErr := roachpb.NewErrorWithTxn(err, baHeader.Txn)

I'm wholly unfamiliar with this code and am grasping at straws. @yuzefovich, do you mind taking a quick pass to see whether the above looks sound to you? We may want to bring back that code in light of this, though you've said that code has data-race implications with the work in #84946.
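
A minimal sketch of that hypothesized sequence, using simplified stand-in types rather than the real roachpb ones (the follow-up comments below examine whether the server strips the flag off errors before returning them):

package main

import "fmt"

// txnProto and pError are simplified stand-ins for roachpb.Transaction and
// roachpb.Error.
type txnProto struct {
	WriteTooOld bool
}

type pError struct {
	msg string
	txn *txnProto
}

// newErrorWithTxn mirrors the shape of roachpb.NewErrorWithTxn: the error
// captures the transaction proto it is handed.
func newErrorWithTxn(msg string, txn *txnProto) *pError {
	return &pError{msg: msg, txn: txn}
}

func main() {
	// Step 1 (first snippet above): evaluation sets the flag on the batch
	// header's txn after a write-too-old condition.
	baTxn := &txnProto{}
	baTxn.WriteTooOld = true

	// Step 2 (second snippet above): a later failure in the same batch
	// builds an error from that same txn proto.
	pErr := newErrorWithTxn("condition failed", baTxn)

	// Unless something strips the flag before the error is returned, the
	// client receives an error whose txn has WriteTooOld set.
	fmt.Println(pErr.txn.WriteTooOld) // true
}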

@nvanbenschoten
Member

But isn't it possible in master, due to the following sequence:

Nice debugging. This sounds correct to me. Would it be worthwhile to try to write a test that hits that sequence and returns an error where pErr.GetTxn().WriteTooOld == true to the client? It would require a write that hits a WTO error and then a write that hits a hard error (e.g. ConditionFailed) in the same batch, both within a larger transaction.

If we demonstrate that this is possible, I don't see why we need to revert all of #85101. Isn't reverting the change to txn_interceptor_span_refresher.go sufficient? My reading of that PR (@yuzefovich, please correct me) is that the data race implications were related to the change to newLeafTxnCoordSender, which we wouldn't need to revert.

@yuzefovich
Member

My reading of that PR (@yuzefovich, please correct me) is that the data race implications were related to the change to newLeafTxnCoordSender, which we wouldn't need to revert.

Yes, this sounds right to me.

@cockroach-teamcity
Member Author

roachtest.tpcc/multiregion/survive=region/chaos=true failed with artifacts on master @ 95677eb5f8d006629b16024fb7d87d55344c1470:

test artifacts and logs in: /artifacts/tpcc/multiregion/survive=region/chaos=true/run_1
	monitor.go:127,tpcc.go:256,tpcc.go:587,test_runner.go:917: monitor failure: monitor command failure: unexpected node event: 7: dead (exit status 7)
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).WaitE
		  | 	main/pkg/cmd/roachtest/monitor.go:115
		  | main.(*monitorImpl).Wait
		  | 	main/pkg/cmd/roachtest/monitor.go:123
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.runTPCC
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpcc.go:256
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerTPCC.func9
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpcc.go:587
		  | [...repeated from below...]
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).wait.func3
		  | 	main/pkg/cmd/roachtest/monitor.go:202
		  | runtime.goexit
		  | 	GOROOT/src/runtime/asm_amd64.s:1594
		Wraps: (4) monitor command failure
		Wraps: (5) unexpected node event: 7: dead (exit status 7)
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *errors.errorString

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_ssd=0


@yuzefovich
Member

@irfansharif looks like this is the only beta blocker - AFAIU #87739 should fix it, so it'd be good to merge that change.

@irfansharif
Contributor

I haven't written a test or repro for it, but I'm happy to merge it to unblock the beta. Want to LGTM?

@irfansharif
Contributor

irfansharif commented Sep 19, 2022

#85711 (comment) was mistaken; I had missed this defer clause, which unsets the WriteTooOld bit on the server side for errors:

defer func() {
	// Ensure that errors don't carry the WriteTooOld flag set. The client
	// handles non-error responses with the WriteTooOld flag set, and errors
	// with this flag set confuse it.
	if retErr != nil && retErr.GetTxn() != nil {
		retErr.GetTxn().WriteTooOld = false
	}
}()

I'm back to being confused about what's happening here.

@irfansharif
Contributor

So, given that the WTO bit is being set on the batch response and not on the error (see the defer clause above), the part of #85101 we'd need to revert is the changes around newLeafTxnCoordSender, which we needed for #84946.

@nvanbenschoten
Member

nvanbenschoten commented Sep 19, 2022

We should be stripping the WriteTooOld flag off of any successful BatchResponse here:

@irfansharif
Contributor

NVM. Nathan pointed out that we combine WTO bits set on specific BatchResponses with errors from others:

// Update the error's transaction with any new information from
// the batch response. This may contain interesting updates if
// the batch was parallelized and part of it succeeded.
pErr.UpdateTxn(br.Txn)

So it's still possible for a pErr with the WTO bit set to bubble up to the client.
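
A minimal sketch of that combination, again with simplified stand-in types rather than the real roachpb ones: a partially successful batch response carrying the WTO bit is folded into the error produced by another request in the same batch, so the error's txn reaches the client with the flag still set.

package main

import "fmt"

// txnProto, batchResponse, and pError are simplified stand-ins for the
// corresponding roachpb types.
type txnProto struct {
	WriteTooOld bool
}

type batchResponse struct {
	Txn txnProto
}

type pError struct {
	Txn txnProto
}

// updateTxn mirrors the shape of pErr.UpdateTxn(br.Txn): the error adopts
// newer transaction state observed in the (partially successful) response,
// including the WriteTooOld flag.
func (e *pError) updateTxn(t txnProto) {
	if t.WriteTooOld {
		e.Txn.WriteTooOld = true
	}
}

func main() {
	// One request in the batch succeeded but observed a write-too-old
	// condition; another request in the same batch failed outright.
	br := batchResponse{Txn: txnProto{WriteTooOld: true}}
	pErr := &pError{} // e.g. from a ConditionFailedError elsewhere in the batch

	pErr.updateTxn(br.Txn)
	fmt.Println(pErr.Txn.WriteTooOld) // true: the error now carries the WTO bit
}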

irfansharif added a commit to irfansharif/cockroach that referenced this issue Sep 20, 2022
Touches cockroachdb#85711 fixing one of the failure modes. In cockroachdb#85101 we deleted
code in the span refresher interceptor that terminated WriteTooOld
flags. We did so assuming these flags were only set in 19.2 servers, but
that's not the case -- TestWTOBitTerminatedOnErrorResponses demonstrates
that it's possible for the server to return error responses with the bit
set if a response is combined with an error from another request in the
same batch request.

Since we were no longer terminating the flag, it was possible to update
the TxnCoordSender's embedded txn with this bit, and then use it when
issuing subsequent batch requests -- something we were asserting
against.

Release note: None
Release justification: Bug fix
craig bot pushed a commit that referenced this issue Sep 20, 2022
87739: kvcoord: (partially) de-flake tpcc/multiregion r=irfansharif a=irfansharif

Touches #85711 fixing one of the failure modes. In #85101 we deleted
code in the span refresher interceptor that terminated WriteTooOld
flags. We did so assuming these flags were only set in 19.2 servers, but
that's not the case -- TestWTOBitTerminatedOnErrorResponses demonstrates
that it's possible for the server to return error responses with the bit
set if a response is combined with an error from another request in the
same batch request.

Since we were no longer terminating the flag, it was possible to update
the TxnCoordSender's embedded txn with this bit, and then use it when
issuing subsequent batch requests -- something we were asserting
against.

Release note: None
Release justification: Bug fix

88174: rowenc: fix needed column families computation for secondary indexes r=yuzefovich a=yuzefovich

Previously, when determining the "minimal set of column families" required to retrieve all of the needed columns for a scan operation, we could incorrectly omit the special zeroth family from the set. The KV for the zeroth column family is always present, so it might need to be fetched even when it's not explicitly needed, namely when all of the "needed" column families are nullable. Before this patch, the code for determining whether all of the needed column families are nullable incorrectly assumed that all columns in a family are stored, but this is only true for primary indexes - for secondary indexes, only the columns mentioned in the `STORING` clause are actually stored (apart from the indexed and PK columns). As a result we could incorrectly fail to fetch a row if:
- the unique secondary index is used
- the needed column has a NULL value
- all non-nullable columns from the same column family as the needed column are not stored in the index
- other column families are not fetched.

This is now fixed by considering only the set of stored columns.

The bug seems relatively minor since it requires a multitude of conditions to be met, so I don't think it warrants a technical advisory (TA).

Fixes: #88110.

Release note (bug fix): Previously, CockroachDB could incorrectly not fetch rows with NULL values when reading from the unique secondary index when multiple column families are defined for the table and the index doesn't store some of the NOT NULL columns.

88182: sql: fix relocate commands with NULLs r=yuzefovich a=yuzefovich

Previously, we would crash when evaluating `EXPERIMENTAL_RELOCATE` commands when some of the values involved were NULL; this is now fixed. There is no release note since the commands are "experimental" after all.

Fixes: #87371.

Release note: None

88187: util/growstack: increase stack growth for 1.19 r=nvanbenschoten a=ajwerner

Now that we've adopted Go 1.19, we've noticed that performance is much worse (~8%) than what we observed with Go 1.18. Interestingly, profiles show that we spend a lot more time increasing our stack size underneath request evaluation. This implied to me that some part of the regression is probably due to the runtime's new stack growth behavior. Perhaps what is going on is that the initial stacks are now smaller than they used to be, so when we grow them, we grow them by less than we need to. I ran a benchmark that seems to confirm this theory. I'd like to merge this to master and then backport it after we collect some more data.

We never released this, so no note.

Touches #88038

Release note: None

88195: colbuilder: don't use optimized IN operator for empty tuple r=yuzefovich a=yuzefovich

This commit makes it so that we don't use the optimized IN operator for empty tuples, since the optimized operator handles NULLs incorrectly in that case. This was already supposed to be prevented by 9b590d3, but there we only looked at the type and not at the actual datum. This is not a production bug since the optimizer normalizes such expressions away.

Fixes: #88141.

Release note: None

Co-authored-by: irfan sharif <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Andrew Werner <[email protected]>
blathers-crl bot pushed a commit that referenced this issue Sep 20, 2022
Touches #85711 fixing one of the failure modes. In #85101 we deleted
code in the span refresher interceptor that terminated WriteTooOld
flags. We did so assuming these flags were only set in 19.2 servers, but
that's not the case -- TestWTOBitTerminatedOnErrorResponses demonstrates
that it's possible for the server to return error responses with the bit
set if a response is combined with an error from another request in the
same batch request.

Since we were no longer terminating the flag, it was possible to update
the TxnCoordSender's embedded txn with this bit, and then use it when
issuing subsequent batch requests -- something we were asserting
against.

Release note: None
Release justification: Bug fix
@irfansharif removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. blocks-22.2.0-beta.1 labels Sep 20, 2022
@irfansharif
Contributor

The remaining failure mode is the one described in #85711 (comment). Leaving this issue open to track it.

@irfansharif removed their assignment Sep 20, 2022
@yuzefovich
Member

Regarding the "no inbound stream" error when validating unique constraints at the end of the import: this looks somewhat similar to #87104. Nodes are being randomly restarted (due to the chaos), so we are likely shutting down a node that is participating in the distributed query that validates the unique constraints. I agree with Irfan that we should be more resilient here - it's a shame to effectively complete the import and then fail it altogether due to a transient error while validating the constraints.
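
A sketch of the kind of retry wrapper being suggested, wrapping only the validation step rather than the whole import. The helper names and the string-based error classification are illustrative, not the workload's actual code; a real implementation would match structured error codes rather than substrings:

package main

import (
	"context"
	"errors"
	"fmt"
	"strings"
	"time"
)

// isTransientDistSQLErr treats the "no inbound stream connection" failure
// seen in the logs above as retryable. Illustrative only.
func isTransientDistSQLErr(err error) bool {
	return err != nil && strings.Contains(err.Error(), "no inbound stream connection")
}

// validateWithRetry retries a constraint-validation step a few times with
// exponential backoff instead of failing the whole import on the first
// transient error.
func validateWithRetry(ctx context.Context, validate func(context.Context) error) error {
	const maxAttempts = 5
	backoff := time.Second
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = validate(ctx); err == nil || !isTransientDistSQLErr(err) {
			return err
		}
		fmt.Printf("validation attempt %d hit a transient error: %v; retrying\n", attempt, err)
		select {
		case <-time.After(backoff):
			backoff *= 2
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return err
}

func main() {
	// Toy validation step that fails transiently twice before succeeding.
	calls := 0
	validate := func(context.Context) error {
		calls++
		if calls < 3 {
			return errors.New("validate unique constraint: no inbound stream connection")
		}
		return nil
	}
	fmt.Println(validateWithRetry(context.Background(), validate)) // <nil>
}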
