-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
roachtest: tpch_concurrency failed #79870
Comments
Seems like #79716 didn't fix everything, taking a look, but I don't think it's a release blocker since we reverted the original change on all branches. |
Unfortunately, so far I have been unable to reproduce this, neither on a roachprod cluster nor locally. For now I'll make a slight change to the test to print out the DistSQL diagrams so that at least we now the distribution of the relevant ranges which should make reproducing this easier later. |
79935: streamingccl: fix nil linter check error and stream_ingestion_test error r=gh-casper a=gh-casper Previously, the stream ingestion test has no error since no duplicate event is generated. This PR fixes this and nil linter check error at the same time. Release note: None Closes: #79926 Related to #79929 which disables random_stream_client nil linter check. Will re-enable it after it's merged. 79980: roachtests: print DistSQL diagram in tpch_concurrency r=yuzefovich a=yuzefovich This will help with reproducing failures. Informs: #79870. Release note: None 80007: kv/kvprober: use single batch in write probe r=nvanbenschoten a=nvanbenschoten This commit adjusts `proberOpsImpl.Write` to issue its `Put`, `Del`, and `EndTxn` in a single `Batch`. This has a few positive effects: 1. it avoids writing intents 2. it requires a single pass through Raft 3. it avoids a pipeline stall after the Put goes through async consensus 4. it avoids async cleanup work after the txn commits It's not strictly necessary that we Put before we Del the key, because a Del is blind and leaves a tombstone even when the key is not live, but this is not guaranteed by the API, so we avoid depending on a subtlety of the implementation. 80116: ui: update statement key on ui r=maryliag a=maryliag Previously, we were displaying statement separated by the aggreation timestamp. Now that we don't show the aggregation timestamp column, we want to group all statements with same fingerprint id into a single row, no longer using the aggregation as part of a key. Aligning with the results we show for Statement Details page. Release note (ui change): Statements are not longer separated by aggregation interval on Statement Page. Now all statements with same fingerprint show as a single row. Co-authored-by: Casper <[email protected]> Co-authored-by: Yahor Yuzefovich <[email protected]> Co-authored-by: Nathan VanBenschoten <[email protected]> Co-authored-by: Marylia Gutierrez <[email protected]>
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
79292: randgen: add support for PARTITION BY to table generation r=rharding6373 a=rharding6373 This testing-only change adds support for `PARTITION BY LIST` on the prefix of a secondary index for randomly generated tables used for testing. We apply partitions to about ~10% of secondary indexes when there is more than one column in the index prefix, and add up to 10 partitions using random expressions. Fixes: #73920 Release note: None 80172: rowexec: use opaque keys to identify looked-up rows in lookup joins r=yuzefovich a=yuzefovich **sql: add span IDs to fetchers** This commit implements the idea of Jordan's 78223b9 to add opaque keys to fetchers. This allows the fetchers' users to identify the input `roachpb.Span` that a particular kv was fetched for. I decided to call these keys as "span IDs" since it seems less ambiguous in general, and especially so in the streamer world where too many "keys" are flying around already. Some additional improvements: - extract the return arguments of a function call into a helper struct - remove an allocation for `GetResponse`s in the non-streamer code path. Release note: None **rowexec: use opaque keys to identify looked-up rows in lookup joins** This commit refactors the span generators of the lookup joins to use the span IDs ("opaque keys") for matching between the looked-up rows and the input rows that correspond to them. Previously, we were maintaining a map from the span key (bytes value) and had to construct a partial key for each looked up row to get the matching input row indices. This commit makes it so that we now assign each span to lookup a unique span ID that is later used as an index into a two-dimensional slice of ints to get the matching indices. We still keep a map from the span key for the de-duplication purposes. This change is beneficial in several ways: - removes the need to construct the partial key for each looked up row - replaces a map lookup based on the partial key into a slice access - removes the need for the binary search among lookup spans in range-based lookup joins. This results in noticeable performance improvements in [microbenchmarks](https://gist.github.com/yuzefovich/3e5316e0a5ff32fd1b591b0583d78ced). Here is the difference on all TPCH queries (average over 20 runs of `tpchvec/perf` roachtest, negative number indicates reduction in latency): ``` Q2: 0.20s 0.20s 0.49% Q3: 1.96s 1.66s -14.86% Q4: 1.67s 1.62s -3.14% Q5: 2.82s 2.55s -9.55% Q7: 6.48s 6.93s 6.82% Q8: 1.08s 0.99s -8.34% Q9: 7.08s 6.36s -10.13% Q10: 2.16s 1.94s -9.99% Q11: 0.55s 0.51s -6.81% Q12: 8.78s 8.59s -2.13% Q16: 0.84s 0.82s -2.68% Q17: 0.49s 0.46s -6.80% Q20: 14.34s 13.43s -6.36% Q21: 9.95s 9.28s -6.76% Q22: 0.58s 0.54s -6.93% ``` (Queries that don't use lookup joins were ignored.) However, the main motivation behind this change was unblocking (or at least making it easier) to support lookup joins by the streamer. There, we want the streamer to operate with these "opaque keys" as well as take over the span de-duplication logic. Release note: None 80332: security: move logic in `security/password.go` to `security/password` r=ZhouXing19 a=ZhouXing19 This commit is to host the password logic in a separate package, `security/password/`. Release note: None 80416: evalctx: remove kv.DB from EvalContext r=RichardJCai a=RichardJCai Release note: None 80505: ccl/sqlproxyccl: update load balancing algorithm to use leastconns r=JeffSwenson a=jaylim-crl This commit replaces the existing load balancing algorithm within sqlproxy (i.e. weighted CPU load) with a leastconns approach as discussed. The idea behind this replacement is that CPU metrics are often stale, and we should be able to reasonably make routing decisions based on local data. Release note: None 80562: roachtest: fix a typo in a cluster setting name in tpch_concurrency r=yuzefovich a=yuzefovich Addresses: #79870 (comment). Release note: None 80591: rangefeed,storage: use time-bound iterator for with-diff catchup scan r=stevendanna,miretskiy a=sumeerbhola The MVCCIncrementalIterator already has support for selectively seeing versions beyond the time bounds, via the NextIgnoringTime method. This method is sufficient for with-diff catchup scans -- the changes to MVCCIncrementalIterator here are limited to comment improvements. To utilize NextIgnoringTime in the CatchUpIterator we need to distinguish between the reasons that a catchup scan wants to iterate to an older version of a key. These are now handled by different methods via the simpleCatchupIter interface. There is a simpleCatchupIterAdapter to provide a trivial implementation when the underlying iterator is a SimpleMVCCIterator. There is more testing of MVCCIncrementalIterator's NextIgnoringTime, and there is now a TestCatchupScan that tests with and without diff. BenchmarkCatchUpScan now also benchmarks with-diff. Results for linear-keys, where time-bound iteration shows an improvement: ``` BenchmarkCatchUpScan/linear-keys/useTBI=true/withDiff=true/perc=0.00-16 3 397238811 ns/op 253609173 B/op 3006262 allocs/op BenchmarkCatchUpScan/linear-keys/useTBI=true/withDiff=true/perc=50.00-16 6 192314659 ns/op 126847284 B/op 1503167 allocs/op BenchmarkCatchUpScan/linear-keys/useTBI=true/withDiff=true/perc=75.00-16 12 93243340 ns/op 63433929 B/op 751606 allocs/op BenchmarkCatchUpScan/linear-keys/useTBI=true/withDiff=true/perc=95.00-16 66 17636059 ns/op 12691501 B/op 150356 allocs/op BenchmarkCatchUpScan/linear-keys/useTBI=true/withDiff=true/perc=99.00-16 296 3566355 ns/op 2560979 B/op 30126 allocs/op BenchmarkCatchUpScan/linear-keys/useTBI=true/withDiff=false/perc=0.00-16 2 587807376 ns/op 253689672 B/op 3006770 allocs/op BenchmarkCatchUpScan/linear-keys/useTBI=true/withDiff=false/perc=50.00-16 4 286110763 ns/op 126893106 B/op 1503413 allocs/op BenchmarkCatchUpScan/linear-keys/useTBI=true/withDiff=false/perc=75.00-16 8 140012384 ns/op 63458643 B/op 751740 allocs/op BenchmarkCatchUpScan/linear-keys/useTBI=true/withDiff=false/perc=95.00-16 43 26210520 ns/op 12695466 B/op 150380 allocs/op BenchmarkCatchUpScan/linear-keys/useTBI=true/withDiff=false/perc=99.00-16 200 5381206 ns/op 2561444 B/op 30133 allocs/op BenchmarkCatchUpScan/linear-keys/useTBI=false/withDiff=true/perc=0.00-16 3 379387200 ns/op 253577896 B/op 3006228 allocs/op BenchmarkCatchUpScan/linear-keys/useTBI=false/withDiff=true/perc=50.00-16 4 294533788 ns/op 130799366 B/op 1503677 allocs/op BenchmarkCatchUpScan/linear-keys/useTBI=false/withDiff=true/perc=75.00-16 4 254943713 ns/op 69418826 B/op 752487 allocs/op BenchmarkCatchUpScan/linear-keys/useTBI=false/withDiff=true/perc=95.00-16 5 220269848 ns/op 20291392 B/op 151411 allocs/op BenchmarkCatchUpScan/linear-keys/useTBI=false/withDiff=true/perc=99.00-16 5 211813333 ns/op 10473760 B/op 31231 allocs/op BenchmarkCatchUpScan/linear-keys/useTBI=false/withDiff=false/perc=0.00-16 3 379575618 ns/op 253591349 B/op 3006275 allocs/op BenchmarkCatchUpScan/linear-keys/useTBI=false/withDiff=false/perc=50.00-16 4 282490494 ns/op 126852270 B/op 1503388 allocs/op BenchmarkCatchUpScan/linear-keys/useTBI=false/withDiff=false/perc=75.00-16 5 241938049 ns/op 63463176 B/op 751985 allocs/op BenchmarkCatchUpScan/linear-keys/useTBI=false/withDiff=false/perc=95.00-16 5 210848534 ns/op 12741640 B/op 150944 allocs/op BenchmarkCatchUpScan/linear-keys/useTBI=false/withDiff=false/perc=99.00-16 6 196175587 ns/op 2602912 B/op 30652 allocs/op ``` Fixes #78974 Release note: None 80650: testutils/buildutil: add disallowed_prefixes to disallowed_imports_test r=ajwerner a=ajwerner This is useful if you want to make sure that some part of the tree doesn't depend (transitively) on any other part of the tree. Release note: None 80656: bazel: in `bazel-generate.sh`, prune obviously irrelevant files r=irfansharif a=rickystewart Here we adapt a trick from the `Makefile`: we can use the `-prune` argument to `find` to filter out obviously irrelevant files from `node_modules` when traversing `pkg`. On my machine this saves about 3 seconds on no-op `dev generate bazel` invocations. Release note: None Co-authored-by: rharding6373 <[email protected]> Co-authored-by: Yahor Yuzefovich <[email protected]> Co-authored-by: Jane Xing <[email protected]> Co-authored-by: richardjcai <[email protected]> Co-authored-by: Jay <[email protected]> Co-authored-by: sumeerbhola <[email protected]> Co-authored-by: Andrew Werner <[email protected]> Co-authored-by: Ricky Stewart <[email protected]>
roachtest.tpch_concurrency failed with artifacts on master @ a2e1910f51593bd2ef72e1d7c615e08f95791186:
Same failure on other branches
|
roachtest.tpch_concurrency failed with artifacts on master @ c9e0194b19a03d55c6be92572aad3fbafc256334:
Same failure on other branches
|
roachtest.tpch_concurrency failed with artifacts on master @ 7a9eb906ce86e2f75db637e29d46cd6604fca7b4:
Same failure on other branches
|
Hm, the timeout seems to be just Q15 taking really long time. Q15 creates and drops a view, so it could be schema-related. Here is a condensed timeline:
Note that a "crash" means that at least one node in the cluster crashed (probably due to an OOM) and a "success" means that all queries ran and no nodes crashed. The cluster is restarted for each concurrency run. |
This comment was marked as duplicate.
This comment was marked as duplicate.
80967: sql: disable streamer for lookup joins for now r=yuzefovich a=yuzefovich There is some fallout in the roachtests from using the streamer for the lookup join without ordering, so let's disable it for now while we're figuring things out. Informs: #80823. Informs: #79870 (comment). Release note: None Co-authored-by: Yahor Yuzefovich <[email protected]>
This comment was marked as duplicate.
This comment was marked as duplicate.
roachtest.tpch_concurrency failed with artifacts on master @ 7dd1c51f6b5802e32bafd82e46747f349836592f:
Same failure on other branches
|
roachtest.tpch_concurrency failed with artifacts on master @ d9fe5f67c75b1b59fc297bf4509a139c640b835b:
Same failure on other branches
|
roachtest.tpch_concurrency failed with artifacts on master @ 1e2cc61b58dc14386bb68dca59814874648931c2:
Same failure on other branches
|
roachtest.tpch_concurrency failed with artifacts on master @ e6815947a050e32f21e983aa30dc74ab2a247af3:
Same failure on other branches
|
roachtest.tpch_concurrency failed with artifacts on master @ e6815947a050e32f21e983aa30dc74ab2a247af3:
Same failure on other branches
|
roachtest.tpch_concurrency failed with artifacts on master @ 662cc5c3070e6d64d155a9cc9f33253ee5d99ee9:
Same failure on other branches
|
roachtest.tpch_concurrency failed with artifacts on master @ 1cea73c8a18623949b81705eb5f75179e6cd8d86:
Same failure on other branches
|
I think I have explanation for these timeouts. They are due to automatic retries that are encountered when executing Q15. That query consists of three statements: creating a view, a SELECT stmt, dropping that view. On master we have |
The TCP-H spec seems to indicate that the views should be assigned a |
Indeed, nice find Andrew, thanks! I'll implement that. |
roachtest.tpch_concurrency failed with artifacts on master @ 9b80166bc5f3d3d1212c44a9e1ac219d458af9e0:
Help
See: roachtest README
See: How To Investigate (internal)
This test on roachdash | Improve this report!
Jira issue: CRDB-15885
The text was updated successfully, but these errors were encountered: