roachtest: tpcc/multiregion/survive=region/chaos=true failed #85711
roachtest.tpcc/multiregion/survive=region/chaos=true failed with artifacts on master @ 524fd14da3fefcd849f44a835cc5f88f5dbdadcc:
Parameters: Same failure on other branches
|
roachtest.tpcc/multiregion/survive=region/chaos=true failed with artifacts on master @ f59620ec646d1181d358d0dc41ab60815ecf59c9:
Parameters: Same failure on other branches
|
In the last failure, a node crashed with the fatal error:
We'll want KV to investigate, so moving this to KV. |
roachtest.tpcc/multiregion/survive=region/chaos=true failed with artifacts on master @ 4dcb32c0346e20a95847763f89b9b0796d9ed4dc:
Parameters: Same failure on other branches
|
Looking. |
There are two failure modes in this roachtest.
cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_span_refresher.go Lines 219 to 224 in 4e8b0bc
|
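For context, the span refresher code referenced above is tied to an invariant on the client's request path: a batch issued by the TxnCoordSender should not carry a txn with the WriteTooOld bit already set. The following is a rough illustration only, with stand-in types rather than the actual roachpb/kvcoord API:

```go
package main

import "fmt"

// Stand-in types; the real roachpb.Transaction and BatchRequest carry
// many more fields.
type Transaction struct{ WriteTooOld bool }
type BatchRequest struct{ Txn *Transaction }

// checkBatch sketches the kind of invariant the client stack asserts on
// the request path: an outgoing batch should not already carry a txn
// with the WriteTooOld flag set.
func checkBatch(ba *BatchRequest) error {
	if ba.Txn != nil && ba.Txn.WriteTooOld {
		return fmt.Errorf("unexpected WriteTooOld bit on outgoing batch")
	}
	return nil
}

func main() {
	// A batch whose txn has the bit set trips the assertion.
	fmt.Println(checkBatch(&BatchRequest{Txn: &Transaction{WriteTooOld: true}}))
}
```

The debugging below is about how that bit could end up on the embedded txn in the first place.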
Logs weren't informative. Kicking off a few runs with the following additional logs:

diff --git c/pkg/kv/kvclient/kvcoord/txn_coord_sender.go i/pkg/kv/kvclient/kvcoord/txn_coord_sender.go
index 7f3156a332..10f801d0bf 100644
--- c/pkg/kv/kvclient/kvcoord/txn_coord_sender.go
+++ i/pkg/kv/kvclient/kvcoord/txn_coord_sender.go
@@ -130,7 +130,7 @@ type TxnCoordSender struct {
closed bool
// txn is the Transaction proto attached to all the requests and updated on
- // all the responses.
+ // all the responses. // XXX: Is there a response this is updated on?
txn roachpb.Transaction
// userPriority is the txn's priority. Used when restarting the transaction.
@@ -519,7 +519,13 @@ func (tc *TxnCoordSender) Send(
// Clone the Txn's Proto so that future modifications can be made without
// worrying about synchronization.
+ // XXX: Here? There's code elsewhere that mutates the embedded txn on WTOE
+ // retry. There's a rollback attempt as well. So we could've mucked with
+ // this state on an error.
ba.Txn = tc.mu.txn.Clone()
+ if ba.Txn.WriteTooOld {
+ log.Infof(ctx, "xxx: set wto bit on batch request's txn")
+ }
// Send the command through the txnInterceptor stack.
br, pErr := tc.interceptorStack[0].SendLocked(ctx, ba)
@@ -770,7 +776,7 @@ func (tc *TxnCoordSender) UpdateStateOnRemoteRetryableErr(
// not be usable afterwards (in case of TransactionAbortedError). The caller is
// expected to check the ID of the resulting transaction. If the TxnCoordSender
// can still be used, it will have been prepared for a new epoch.
-func (tc *TxnCoordSender) handleRetryableErrLocked(
+func (tc *TxnCoordSender) handleRetryableErrLocked( // XXX: Latest. Look at callstacks.
ctx context.Context, pErr *roachpb.Error,
) *roachpb.TransactionRetryWithProtoRefreshError {
// If the error is a transaction retry error, update metrics to
@@ -808,7 +814,7 @@ func (tc *TxnCoordSender) handleRetryableErrLocked(
tc.metrics.RestartsUnknown.Inc()
}
errTxnID := pErr.GetTxn().ID
- newTxn := roachpb.PrepareTransactionForRetry(ctx, pErr, tc.mu.userPriority, tc.clock)
+ newTxn := roachpb.PrepareTransactionForRetry(ctx, pErr, tc.mu.userPriority, tc.clock) // XXX: Here? We update the embedded txn
// We'll pass a TransactionRetryWithProtoRefreshError up to the next layer.
retErr := roachpb.NewTransactionRetryWithProtoRefreshError(
@@ -837,6 +843,9 @@ func (tc *TxnCoordSender) handleRetryableErrLocked(
// This is where we get a new epoch.
tc.mu.txn.Update(&newTxn)
+ if tc.mu.txn.WriteTooOld {
+ log.Infof(ctx, "xxx: set wto bit on embedded txn")
+ }

Using:
|
Can't read much from the test history; it first failed around August 7th (when this issue was filed) and has failed sporadically since. We've been spamming 22.1 failures, but that was something else: #78619 (comment). |
Still speculative since my repros are still running (these are really long running tests and expensive -- 10 nodes!), but my money's on #85101 (+cc @yuzefovich). |
6 runs in parallel, taking 2 hrs each, didn't turn up anything. I'll try just reproducing it more directly next, maybe seeing what codepaths #85101 changed. This feels like a valid release blocker. |
This code we deleted claims that only 19.2 code "might give us an error with the WriteTooOld flag set". But isn't it possible on master, due to the following sequence: cockroach/pkg/kv/kvserver/replica_evaluate.go Line 364 in ad17678
cockroach/pkg/kv/kvserver/replica_evaluate.go Line 390 in ad17678
I'm wholly unfamiliar with this code and am grasping at straws. @yuzefovich, do you mind taking a quick pass to see whether the above looks sound to you? We may want to bring back that code in light of this, which you say has data-race implications with the work in #84946. |
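A heavily simplified sketch of what the deleted termination code plausibly did. Stand-in types and function name; the real interceptor operates on roachpb protos and surfaces a retry/refresh error rather than a bool:

```go
package main

import "fmt"

// Stand-ins for roachpb types; only the field relevant here is modeled.
type Transaction struct{ WriteTooOld bool }
type BatchResponse struct{ Txn *Transaction }

// terminateWTO strips the WriteTooOld flag from a response before it can
// be folded into the TxnCoordSender's embedded txn, and reports that the
// client needs to refresh its read spans and retry instead.
func terminateWTO(br *BatchResponse) (needsRefresh bool) {
	if br.Txn != nil && br.Txn.WriteTooOld {
		br.Txn.WriteTooOld = false
		return true
	}
	return false
}

func main() {
	br := &BatchResponse{Txn: &Transaction{WriteTooOld: true}}
	fmt.Println(terminateWTO(br), br.Txn.WriteTooOld) // flag stripped, refresh signalled
}
```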
Nice debugging. This sounds correct to me. Would it be worthwhile to try to write a test that hits that sequence and returns an error with the WriteTooOld flag set? If we demonstrate that this is possible, I don't see why we need to revert all of #85101. Isn't reverting the change to |
Yes, this sounds right to me. |
roachtest.tpcc/multiregion/survive=region/chaos=true failed with artifacts on master @ 95677eb5f8d006629b16024fb7d87d55344c1470:
Parameters: Same failure on other branches
|
@irfansharif looks like this is the only beta blocker - AFAIU #87739 should fix it, so it'd be good to merge that change. |
I haven’t written a test or repro for it, but I'm happy to merge to unblock the beta. Want to LGTM? |
#85711 (comment) was mistaken; I missed this defer clause, which unsets the WriteTooOld bit on the server side for errors. cockroach/pkg/kv/kvserver/replica_evaluate.go Lines 157 to 164 in d06a355
I'm back to being confused about what's happening here. |
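A minimal Go sketch of the defer pattern being described, with hypothetical types and a hypothetical function, not the actual replica_evaluate.go code: when evaluation ends in an error, the deferred closure unsets the WriteTooOld bit on the outgoing txn, so a plain error response should not carry the flag.

```go
package main

import (
	"errors"
	"fmt"
)

// Stand-in for roachpb.Transaction; only the relevant field is modeled.
type Transaction struct{ WriteTooOld bool }

// evaluate sketches the server-side defer: on the error path the
// WriteTooOld bit is stripped from the outgoing txn before returning.
func evaluate(fail bool) (outTxn *Transaction, err error) {
	outTxn = &Transaction{}
	defer func() {
		if err != nil {
			outTxn.WriteTooOld = false
		}
	}()
	outTxn.WriteTooOld = true // some write in the batch ran into a newer value
	if fail {
		return outTxn, errors.New("a later request in the batch failed")
	}
	return outTxn, nil
}

func main() {
	txn, err := evaluate(true)
	fmt.Println(err != nil, txn.WriteTooOld) // error path: flag stripped
	txn, _ = evaluate(false)
	fmt.Println(txn.WriteTooOld) // success path: flag survives
}
```

Note the use of named result parameters, which is what lets the deferred closure observe the error being returned.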
We should be stripping the
|
NVM. Nathan pointed out that we combine WTO bits set on specific BatchResponses with errors from others: cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go Lines 1289 to 1292 in c8c29f7
So it's still possible to bubble up a pErr with the WTO bit set up to the client. |
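A toy illustration of that combining behavior, using stand-in types (the real DistSender merges full Transaction protos via their Update method): one sub-batch succeeds with the WriteTooOld flag set, another fails, and folding the successful response's txn into the error leaves the pErr carrying the bit.

```go
package main

import "fmt"

// Stand-ins for roachpb.Transaction and roachpb.Error.
type Transaction struct{ WriteTooOld bool }
type Error struct{ Txn *Transaction }

// combine sketches the DistSender behavior pointed out above: the error's
// txn is updated from a successful sub-batch response's txn, and the
// WriteTooOld flag is sticky under that merge.
func combine(pErr *Error, respTxn *Transaction) *Error {
	if pErr.Txn == nil {
		pErr.Txn = &Transaction{}
	}
	pErr.Txn.WriteTooOld = pErr.Txn.WriteTooOld || respTxn.WriteTooOld
	return pErr
}

func main() {
	// One sub-batch errored (pErr), another succeeded with WTO set.
	pErr := combine(&Error{}, &Transaction{WriteTooOld: true})
	fmt.Println(pErr.Txn.WriteTooOld) // the error now carries the bit
}
```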
Touches cockroachdb#85711 fixing one of the failure modes. In cockroachdb#85101 we deleted code in the span refresher interceptor that terminated WriteTooOld flags. We did so assuming these flags were only set in 19.2 servers, but that's not the case -- TestWTOBitTerminatedOnErrorResponses demonstrates that it's possible for the server to return error responses with the bit set if a response is combined with an error from another request in the same batch request. Since we were no longer terminating the flag, it was possible to update the TxnCoordSender's embedded txn with this bit, and then use it when issuing subsequent batch requests -- something we were asserting against. Release note: None Release justification: Bug fix
87739: kvcoord: (partially) de-flake tpcc/multiregion r=irfansharif a=irfansharif
The remaining failure mode is listed in #85711 (comment). Leaving this issue open to track that. |
Regarding the "no inbound stream" error when validating unique constraints at the end of the import: this looks somewhat similar to #87104, since nodes are being randomly restarted (due to the chaos) and we are likely shutting down a node that is part of the distributed query that validates the unique constraints. I agree with Irfan that we should be more resilient here - it's a shame to effectively complete the import and then fail it altogether due to a transient error when validating the constraints. |
roachtest.tpcc/multiregion/survive=region/chaos=true failed with artifacts on master @ a7c91f06d8ee0fa2096bcd626f689009024947bb:
Parameters: ROACHTEST_cloud=gce, ROACHTEST_cpu=4, ROACHTEST_ssd=0
Jira issue: CRDB-18395
Epic CRDB-19172