Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: tpch_concurrency failed #79870

Closed
cockroach-teamcity opened this issue Apr 12, 2022 · 20 comments · Fixed by #82330
Closed

roachtest: tpch_concurrency failed #79870

cockroach-teamcity opened this issue Apr 12, 2022 · 20 comments · Fixed by #82330
Assignees
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-sql-queries SQL Queries Team

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented Apr 12, 2022

roachtest.tpch_concurrency failed with artifacts on master @ 9b80166bc5f3d3d1212c44a9e1ac219d458af9e0:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /artifacts/tpch_concurrency/run_1
	test_runner.go:1006,test_runner.go:905: test timed out (18h0m0s)
Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/sql-queries

This test on roachdash | Improve this report!

Jira issue: CRDB-15885

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Apr 12, 2022
@blathers-crl blathers-crl bot added the T-sql-queries SQL Queries Team label Apr 12, 2022
@yuzefovich
Copy link
Member

Seems like #79716 didn't fix everything, taking a look, but I don't think it's a release blocker since we reverted the original change on all branches.

@yuzefovich yuzefovich self-assigned this Apr 13, 2022
@yuzefovich yuzefovich removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Apr 13, 2022
@yuzefovich
Copy link
Member

Unfortunately, so far I have been unable to reproduce this, neither on a roachprod cluster nor locally. For now I'll make a slight change to the test to print out the DistSQL diagrams so that at least we now the distribution of the relevant ranges which should make reproducing this easier later.

craig bot pushed a commit that referenced this issue Apr 18, 2022
79935: streamingccl: fix nil linter check error and stream_ingestion_test error r=gh-casper a=gh-casper

Previously, the stream ingestion test has no error since no
duplicate event is generated. This PR fixes this and nil linter
check error at the same time.

Release note: None
Closes: #79926

Related to #79929 which 
disables random_stream_client nil linter check. Will re-enable it after it's 
merged.

79980: roachtests: print DistSQL diagram in tpch_concurrency r=yuzefovich a=yuzefovich

This will help with reproducing failures.

Informs: #79870.

Release note: None

80007: kv/kvprober: use single batch in write probe r=nvanbenschoten a=nvanbenschoten

This commit adjusts `proberOpsImpl.Write` to issue its `Put`, `Del`, and
`EndTxn` in a single `Batch`. This has a few positive effects:
1. it avoids writing intents
2. it requires a single pass through Raft
3. it avoids a pipeline stall after the Put goes through async consensus
4. it avoids async cleanup work after the txn commits

It's not strictly necessary that we Put before we Del the key, because a
Del is blind and leaves a tombstone even when the key is not live, but
this is not guaranteed by the API, so we avoid depending on a subtlety
of the implementation.

80116: ui: update statement key on ui r=maryliag a=maryliag

Previously, we were displaying statement separated
by the aggreation timestamp. Now that we don't show the
aggregation timestamp column, we want to group all
statements with same fingerprint id into a single row,
no longer using the aggregation as part of a key.
Aligning with the results we show for Statement Details
page.

Release note (ui change): Statements are not longer separated
by aggregation interval on Statement Page. Now all statements
with same fingerprint show as a single row.

Co-authored-by: Casper <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Marylia Gutierrez <[email protected]>
@cockroach-teamcity

This comment was marked as resolved.

@cockroach-teamcity

This comment was marked as resolved.

craig bot pushed a commit that referenced this issue Apr 27, 2022
79292: randgen: add support for PARTITION BY to table generation r=rharding6373 a=rharding6373

This testing-only change adds support for `PARTITION BY LIST` on the
prefix of a secondary index for randomly generated tables used for
testing. We apply partitions to about ~10% of secondary indexes when
there is more than one column in the index prefix, and add up to 10
partitions using random expressions.

Fixes: #73920

Release note: None

80172: rowexec: use opaque keys to identify looked-up rows in lookup joins r=yuzefovich a=yuzefovich

**sql: add span IDs to fetchers**

This commit implements the idea of Jordan's
78223b9 to add opaque keys to fetchers.
This allows the fetchers' users to identify the input `roachpb.Span`
that a particular kv was fetched for. I decided to call these keys as
"span IDs" since it seems less ambiguous in general, and especially so
in the streamer world where too many "keys" are flying around already.

Some additional improvements:
- extract the return arguments of a function call into a helper struct
- remove an allocation for `GetResponse`s in the non-streamer code path.

Release note: None

**rowexec: use opaque keys to identify looked-up rows in lookup joins**

This commit refactors the span generators of the lookup joins to use the
span IDs ("opaque keys") for matching between the looked-up rows and the
input rows that correspond to them. Previously, we were maintaining
a map from the span key (bytes value) and had to construct a partial key
for each looked up row to get the matching input row indices. This
commit makes it so that we now assign each span to lookup a unique span
ID that is later used as an index into a two-dimensional slice of ints
to get the matching indices. We still keep a map from the span key for
the de-duplication purposes.

This change is beneficial in several ways:
- removes the need to construct the partial key for each looked up row
- replaces a map lookup based on the partial key into a slice access
- removes the need for the binary search among lookup spans in
range-based lookup joins.

This results in noticeable performance improvements in
[microbenchmarks](https://gist.github.com/yuzefovich/3e5316e0a5ff32fd1b591b0583d78ced).

Here is the difference on all TPCH queries (average over 20 runs of
`tpchvec/perf` roachtest, negative number indicates reduction in
latency):
```
Q2:   0.20s  0.20s  0.49%
Q3:   1.96s  1.66s -14.86%
Q4:   1.67s  1.62s -3.14%
Q5:   2.82s  2.55s -9.55%
Q7:   6.48s  6.93s  6.82%
Q8:   1.08s  0.99s -8.34%
Q9:   7.08s  6.36s -10.13%
Q10:  2.16s  1.94s -9.99%
Q11:  0.55s  0.51s -6.81%
Q12:  8.78s  8.59s -2.13%
Q16:  0.84s  0.82s -2.68%
Q17:  0.49s  0.46s -6.80%
Q20: 14.34s 13.43s -6.36%
Q21:  9.95s  9.28s -6.76%
Q22:  0.58s  0.54s -6.93%
```
(Queries that don't use lookup joins were ignored.)

However, the main motivation behind this change was unblocking (or at
least making it easier) to support lookup joins by the streamer. There,
we want the streamer to operate with these "opaque keys" as well as take
over the span de-duplication logic.

Release note: None

80332: security: move logic in `security/password.go` to `security/password` r=ZhouXing19 a=ZhouXing19

This commit is to host the password logic in a separate package,
`security/password/`.

Release note: None

80416: evalctx: remove kv.DB from EvalContext r=RichardJCai a=RichardJCai

Release note: None

80505: ccl/sqlproxyccl: update load balancing algorithm to use leastconns r=JeffSwenson a=jaylim-crl

This commit replaces the existing load balancing algorithm within sqlproxy
(i.e. weighted CPU load) with a leastconns approach as discussed. The idea
behind this replacement is that CPU metrics are often stale, and we should be
able to reasonably make routing decisions based on local data.

Release note: None

80562: roachtest: fix a typo in a cluster setting name in tpch_concurrency r=yuzefovich a=yuzefovich

Addresses: #79870 (comment).

Release note: None

80591: rangefeed,storage: use time-bound iterator for with-diff catchup scan r=stevendanna,miretskiy a=sumeerbhola

The MVCCIncrementalIterator already has support for selectively
seeing versions beyond the time bounds, via the NextIgnoringTime
method. This method is sufficient for with-diff catchup scans -- the
changes to MVCCIncrementalIterator here are limited to comment
improvements.

To utilize NextIgnoringTime in the CatchUpIterator we need to
distinguish between the reasons that a catchup scan wants to
iterate to an older version of a key. These are now handled by
different methods via the simpleCatchupIter interface. There
is a simpleCatchupIterAdapter to provide a trivial implementation
when the underlying iterator is a SimpleMVCCIterator.

There is more testing of MVCCIncrementalIterator's NextIgnoringTime,
and there is now a TestCatchupScan that tests with and without diff.
BenchmarkCatchUpScan now also benchmarks with-diff. Results for
linear-keys, where time-bound iteration shows an improvement:

```
BenchmarkCatchUpScan/linear-keys/useTBI=true/withDiff=true/perc=0.00-16         	       3	 397238811 ns/op	253609173 B/op	 3006262 allocs/op
BenchmarkCatchUpScan/linear-keys/useTBI=true/withDiff=true/perc=50.00-16        	       6	 192314659 ns/op	126847284 B/op	 1503167 allocs/op
BenchmarkCatchUpScan/linear-keys/useTBI=true/withDiff=true/perc=75.00-16        	      12	  93243340 ns/op	63433929 B/op	  751606 allocs/op
BenchmarkCatchUpScan/linear-keys/useTBI=true/withDiff=true/perc=95.00-16        	      66	  17636059 ns/op	12691501 B/op	  150356 allocs/op
BenchmarkCatchUpScan/linear-keys/useTBI=true/withDiff=true/perc=99.00-16        	     296	   3566355 ns/op	 2560979 B/op	   30126 allocs/op
BenchmarkCatchUpScan/linear-keys/useTBI=true/withDiff=false/perc=0.00-16        	       2	 587807376 ns/op	253689672 B/op	 3006770 allocs/op
BenchmarkCatchUpScan/linear-keys/useTBI=true/withDiff=false/perc=50.00-16       	       4	 286110763 ns/op	126893106 B/op	 1503413 allocs/op
BenchmarkCatchUpScan/linear-keys/useTBI=true/withDiff=false/perc=75.00-16       	       8	 140012384 ns/op	63458643 B/op	  751740 allocs/op
BenchmarkCatchUpScan/linear-keys/useTBI=true/withDiff=false/perc=95.00-16       	      43	  26210520 ns/op	12695466 B/op	  150380 allocs/op
BenchmarkCatchUpScan/linear-keys/useTBI=true/withDiff=false/perc=99.00-16       	     200	   5381206 ns/op	 2561444 B/op	   30133 allocs/op
BenchmarkCatchUpScan/linear-keys/useTBI=false/withDiff=true/perc=0.00-16        	       3	 379387200 ns/op	253577896 B/op	 3006228 allocs/op
BenchmarkCatchUpScan/linear-keys/useTBI=false/withDiff=true/perc=50.00-16       	       4	 294533788 ns/op	130799366 B/op	 1503677 allocs/op
BenchmarkCatchUpScan/linear-keys/useTBI=false/withDiff=true/perc=75.00-16       	       4	 254943713 ns/op	69418826 B/op	  752487 allocs/op
BenchmarkCatchUpScan/linear-keys/useTBI=false/withDiff=true/perc=95.00-16       	       5	 220269848 ns/op	20291392 B/op	  151411 allocs/op
BenchmarkCatchUpScan/linear-keys/useTBI=false/withDiff=true/perc=99.00-16       	       5	 211813333 ns/op	10473760 B/op	   31231 allocs/op
BenchmarkCatchUpScan/linear-keys/useTBI=false/withDiff=false/perc=0.00-16       	       3	 379575618 ns/op	253591349 B/op	 3006275 allocs/op
BenchmarkCatchUpScan/linear-keys/useTBI=false/withDiff=false/perc=50.00-16      	       4	 282490494 ns/op	126852270 B/op	 1503388 allocs/op
BenchmarkCatchUpScan/linear-keys/useTBI=false/withDiff=false/perc=75.00-16      	       5	 241938049 ns/op	63463176 B/op	  751985 allocs/op
BenchmarkCatchUpScan/linear-keys/useTBI=false/withDiff=false/perc=95.00-16      	       5	 210848534 ns/op	12741640 B/op	  150944 allocs/op
BenchmarkCatchUpScan/linear-keys/useTBI=false/withDiff=false/perc=99.00-16      	       6	 196175587 ns/op	 2602912 B/op	   30652 allocs/op
```

Fixes #78974

Release note: None

80650: testutils/buildutil: add disallowed_prefixes to disallowed_imports_test r=ajwerner a=ajwerner

This is useful if you want to make sure that some part of the tree doesn't
depend (transitively) on any other part of the tree.

Release note: None

80656: bazel: in `bazel-generate.sh`, prune obviously irrelevant files r=irfansharif a=rickystewart

Here we adapt a trick from the `Makefile`: we can use the `-prune`
argument to `find` to filter out obviously irrelevant files from
`node_modules` when traversing `pkg`. On my machine this saves about 3
seconds on no-op `dev generate bazel` invocations.

Release note: None

Co-authored-by: rharding6373 <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Jane Xing <[email protected]>
Co-authored-by: richardjcai <[email protected]>
Co-authored-by: Jay <[email protected]>
Co-authored-by: sumeerbhola <[email protected]>
Co-authored-by: Andrew Werner <[email protected]>
Co-authored-by: Ricky Stewart <[email protected]>
@cockroach-teamcity
Copy link
Member Author

roachtest.tpch_concurrency failed with artifacts on master @ a2e1910f51593bd2ef72e1d7c615e08f95791186:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /artifacts/tpch_concurrency/run_1
	tpch_concurrency.go:114,monitor.go:105,errgroup.go:57: dial tcp 34.148.161.250:26257: connect: connection refused

	cluster.go:1855,tpch_concurrency.go:68,tpch_concurrency.go:79,tpch_concurrency.go:199,tpch_concurrency.go:224,test_runner.go:876: cluster.StartE: context canceled
		(1) attached stack trace
		  -- stack trace:
		  | main.(*clusterImpl).StartE
		  | 	main/pkg/cmd/roachtest/cluster.go:1787
		  | main.(*clusterImpl).Start
		  | 	main/pkg/cmd/roachtest/cluster.go:1854
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerTPCHConcurrency.func2
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpch_concurrency.go:68
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerTPCHConcurrency.func3
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpch_concurrency.go:79
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerTPCHConcurrency.func4
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpch_concurrency.go:199
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerTPCHConcurrency.func5
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpch_concurrency.go:224
		  | main.(*testRunner).runTest.func2
		  | 	main/pkg/cmd/roachtest/test_runner.go:876
		  | runtime.goexit
		  | 	GOROOT/src/runtime/asm_amd64.s:1581
		Wraps: (2) cluster.StartE
		Wraps: (3) context canceled
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *errors.errorString
Help

See: roachtest README

See: How To Investigate (internal)

Same failure on other branches

This test on roachdash | Improve this report!

@cockroach-teamcity
Copy link
Member Author

roachtest.tpch_concurrency failed with artifacts on master @ c9e0194b19a03d55c6be92572aad3fbafc256334:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /artifacts/tpch_concurrency/run_1
	tpch_concurrency.go:114,monitor.go:105,errgroup.go:57: dial tcp 34.74.127.69:26257: connect: connection refused

	cluster.go:1855,tpch_concurrency.go:68,tpch_concurrency.go:79,tpch_concurrency.go:199,tpch_concurrency.go:224,test_runner.go:876: cluster.StartE: context canceled
		(1) attached stack trace
		  -- stack trace:
		  | main.(*clusterImpl).StartE
		  | 	main/pkg/cmd/roachtest/cluster.go:1787
		  | main.(*clusterImpl).Start
		  | 	main/pkg/cmd/roachtest/cluster.go:1854
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerTPCHConcurrency.func2
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpch_concurrency.go:68
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerTPCHConcurrency.func3
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpch_concurrency.go:79
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerTPCHConcurrency.func4
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpch_concurrency.go:199
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerTPCHConcurrency.func5
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpch_concurrency.go:224
		  | main.(*testRunner).runTest.func2
		  | 	main/pkg/cmd/roachtest/test_runner.go:876
		  | runtime.goexit
		  | 	GOROOT/src/runtime/asm_amd64.s:1581
		Wraps: (2) cluster.StartE
		Wraps: (3) context canceled
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *errors.errorString
Help

See: roachtest README

See: How To Investigate (internal)

Same failure on other branches

This test on roachdash | Improve this report!

@cockroach-teamcity
Copy link
Member Author

roachtest.tpch_concurrency failed with artifacts on master @ 7a9eb906ce86e2f75db637e29d46cd6604fca7b4:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /artifacts/tpch_concurrency/run_1
	test_runner.go:1007,test_runner.go:906: test timed out (12h0m0s)
Help

See: roachtest README

See: How To Investigate (internal)

Same failure on other branches

This test on roachdash | Improve this report!

@yuzefovich
Copy link
Member

Hm, the timeout seems to be just Q15 taking really long time. Q15 creates and drops a view, so it could be schema-related.

Here is a condensed timeline:

  • concurrency: 104, all Q15 took about 12 minutes, crash on Q18
  • concurrency: 76, all Q15 took about 11 minutes, crash on Q20
  • concurrency: 62, all Q15 took about 12 minutes, success
  • concurrency: 69, all Q15 took about 27 minutes, success
  • concurrency: 72, all Q15 took about 65 minutes, crash on Q20
  • concurrency: 70, all Q15 took about 76 minutes, success
  • concurrency: 71, the test timed out on Q15 after 17 minutes.

Note that a "crash" means that at least one node in the cluster crashed (probably due to an OOM) and a "success" means that all queries ran and no nodes crashed. The cluster is restarted for each concurrency run.

@cockroach-teamcity

This comment was marked as duplicate.

craig bot pushed a commit that referenced this issue May 4, 2022
80967: sql: disable streamer for lookup joins for now r=yuzefovich a=yuzefovich

There is some fallout in the roachtests from using the streamer for the
lookup join without ordering, so let's disable it for now while we're
figuring things out.

Informs: #80823.
Informs: #79870 (comment).

Release note: None

Co-authored-by: Yahor Yuzefovich <[email protected]>
@cockroach-teamcity

This comment was marked as duplicate.

@jlinder jlinder added sync-me and removed sync-me labels May 20, 2022
@cockroach-teamcity
Copy link
Member Author

roachtest.tpch_concurrency failed with artifacts on master @ 7dd1c51f6b5802e32bafd82e46747f349836592f:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /artifacts/tpch_concurrency/run_1
	tpch_concurrency.go:114,monitor.go:105,errgroup.go:57: dial tcp 35.231.212.140:26257: connect: connection refused

	cluster.go:1840,tpch_concurrency.go:68,tpch_concurrency.go:79,tpch_concurrency.go:199,tpch_concurrency.go:224,test_runner.go:884: cluster.StartE: context canceled
		(1) attached stack trace
		  -- stack trace:
		  | main.(*clusterImpl).StartE
		  | 	main/pkg/cmd/roachtest/cluster.go:1772
		  | main.(*clusterImpl).Start
		  | 	main/pkg/cmd/roachtest/cluster.go:1839
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerTPCHConcurrency.func2
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpch_concurrency.go:68
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerTPCHConcurrency.func3
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpch_concurrency.go:79
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerTPCHConcurrency.func4
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpch_concurrency.go:199
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerTPCHConcurrency.func5
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpch_concurrency.go:224
		  | main.(*testRunner).runTest.func2
		  | 	main/pkg/cmd/roachtest/test_runner.go:884
		  | runtime.goexit
		  | 	GOROOT/src/runtime/asm_amd64.s:1581
		Wraps: (2) cluster.StartE
		Wraps: (3) context canceled
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *errors.errorString
Help

See: roachtest README

See: How To Investigate (internal)

Same failure on other branches

This test on roachdash | Improve this report!

@cockroach-teamcity
Copy link
Member Author

roachtest.tpch_concurrency failed with artifacts on master @ d9fe5f67c75b1b59fc297bf4509a139c640b835b:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /artifacts/tpch_concurrency/run_1
	test_runner.go:1015,test_runner.go:914: test timed out (12h0m0s)
Help

See: roachtest README

See: How To Investigate (internal)

Same failure on other branches

This test on roachdash | Improve this report!

@cockroach-teamcity
Copy link
Member Author

roachtest.tpch_concurrency failed with artifacts on master @ 1e2cc61b58dc14386bb68dca59814874648931c2:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /artifacts/tpch_concurrency/run_1
	test_runner.go:1015,test_runner.go:914: test timed out (12h0m0s)
Help

See: roachtest README

See: How To Investigate (internal)

Same failure on other branches

This test on roachdash | Improve this report!

@cockroach-teamcity
Copy link
Member Author

roachtest.tpch_concurrency failed with artifacts on master @ e6815947a050e32f21e983aa30dc74ab2a247af3:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /artifacts/tpch_concurrency/run_1
	test_runner.go:1015,test_runner.go:914: test timed out (12h0m0s)
Help

See: roachtest README

See: How To Investigate (internal)

Same failure on other branches

This test on roachdash | Improve this report!

@cockroach-teamcity
Copy link
Member Author

roachtest.tpch_concurrency failed with artifacts on master @ e6815947a050e32f21e983aa30dc74ab2a247af3:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /artifacts/tpch_concurrency/run_1
	test_runner.go:1015,test_runner.go:914: test timed out (12h0m0s)
Help

See: roachtest README

See: How To Investigate (internal)

Same failure on other branches

This test on roachdash | Improve this report!

@cockroach-teamcity
Copy link
Member Author

roachtest.tpch_concurrency failed with artifacts on master @ 662cc5c3070e6d64d155a9cc9f33253ee5d99ee9:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /artifacts/tpch_concurrency/run_1
	test_runner.go:1015,test_runner.go:914: test timed out (12h0m0s)
Help

See: roachtest README

See: How To Investigate (internal)

Same failure on other branches

This test on roachdash | Improve this report!

@cockroach-teamcity
Copy link
Member Author

roachtest.tpch_concurrency failed with artifacts on master @ 1cea73c8a18623949b81705eb5f75179e6cd8d86:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /artifacts/tpch_concurrency/run_1
	test_runner.go:1015,test_runner.go:914: test timed out (12h0m0s)
Help

See: roachtest README

See: How To Investigate (internal)

Same failure on other branches

This test on roachdash | Improve this report!

@yuzefovich
Copy link
Member

I think I have explanation for these timeouts. They are due to automatic retries that are encountered when executing Q15. That query consists of three statements: creating a view, a SELECT stmt, dropping that view. On master we have enable_implicit_transaction_for_batch_statements session variable set to true, so all three statements are executed as part of a single txn (on 22.1 branch the default is false, so we're not hitting the timeouts). All concurrent txns are contending with each other (on creation and dropping of the view), and usually one connection is making progress while all others "blocked". With the streamer enabled it seems that the performance of creating the view and the SELECT query is worse, so each connection takes longer to process the query, meaning that the "queue" of connections is stalled for longer periods of time. I think the fix is to disable the implicit txn for multi-statement queries since it's not really the goal of this test to exercise DDL with high concurrency.

@ajwerner
Copy link
Contributor

ajwerner commented Jun 2, 2022

The TCP-H spec seems to indicate that the views should be assigned a StreamID suffix so that different threads don't interfere. i don't think it'd be a huge refactor to give each worker its own copy of the queries with a StreamID templated in. That'd also fix it and allow us to get rid of the IF NOT EXISTS clauses we've added.

@yuzefovich
Copy link
Member

Indeed, nice find Andrew, thanks! I'll implement that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-sql-queries SQL Queries Team
Projects
Archived in project
4 participants