
roachperf: 22.1 regression around 2022-07-08 #88122

Closed
erikgrinaker opened this issue Sep 19, 2022 · 8 comments
Labels
A-roachperf-investigation Issue opened as a result of roachperf triage C-performance Perf of queries or internals. Solution not expected to change functional behavior.

Comments

@erikgrinaker
Contributor

erikgrinaker commented Sep 19, 2022

The release-22.1 branch saw a regression around July 8th, e.g. on kv95/enc=false/nodes=3/cpu=32:

[Screenshot: roachperf throughput graph for kv95/enc=false/nodes=3/cpu=32, captured 2022-09-19]

This is likely present on master as well.

Jira issue: CRDB-19698

@erikgrinaker erikgrinaker added C-performance Perf of queries or internals. Solution not expected to change functional behavior. A-roachperf-investigation Issue opened as a result of roachperf triage T-testeng TestEng Team labels Sep 19, 2022
@blathers-crl

blathers-crl bot commented Sep 19, 2022

cc @cockroachdb/test-eng

@erikgrinaker erikgrinaker changed the title roachperf: 22.1 regression around July 8th roachperf: 22.1 regression around 2022-07-08 Sep 19, 2022
@srosenberg
Member

@erikgrinaker We did get an alert on Jul 6 [1]. The error bars on this benchmark seem fairly high (+/- 10k ops/s). Some recent runs on 22.2 do approach 120k, but the average is lower, albeit within the error bars. There does appear to be a minor regression, though it could just be noise. Considering the number of changes since July, is it still relevant and worth investigating?

[1] https://groups.google.com/a/cockroachlabs.com/g/test-eng/c/w6p0RgBQEt0/m/sE_TaXJGBAAJ

@erikgrinaker
Contributor Author

Just by eyeballing the graph, we were at about 122k ops/s in June-July, and are now at about 118k. That's a drop of roughly 3% (4k/122k ≈ 3.3%) in end-to-end performance, which is significant. It would be interesting to find out what caused it and whether we can easily claw it back.

@smg260
Contributor

smg260 commented Dec 23, 2022

I ran a bisection with the following output (ignore the timestamps). The numbers represent the average throughput over 4 concurrent runs.

2022-12-23T15:54:24Z   Bisecting regression in [kv95/enc=false/nodes=3/cpu=32] using commit range [80ad4861d5 (known good),e3d515357b (known bad)]
2022-12-23T15:54:27Z   Initial thresholds [good >= 128883, bad <= 124091]
2022-12-23T15:54:32Z   [9724e626f0] Average ops/s: [128847]. User marked as good. Threshold updated.
2022-12-23T15:54:33Z   [9609f46434] Average ops/s: [123678]. Auto marked as bad.
2022-12-23T15:55:57Z   [0f5936b08b] Average ops/s: [125471]. User skipped.
2022-12-23T15:55:58Z   [4a03024503] Average ops/s: [129208]. Auto marked as good.
2022-12-23T15:56:00Z   [cef41887e2] Average ops/s: [120661]. Auto marked as bad.

There are only 'skip'ped commits left to test.
The first bad commit could be any of:
0f5936b08b1ef24abaf35f1967d020f66850b250
cef41887e24cd1cbf0ad925883372fee7f4b8569

Looking at the two suspect neighbouring commits:

  • 0f5936b08b: 125471 ops/s
  • cef41887e2: 120661 ops/s

cef41887e2 is a possible culprit:

commit cef41887e24cd1cbf0ad925883372fee7f4b8569
Merge: 0f5936b08b 66a81c6b4a
Author: Xin Hao Zhang <[email protected]>
Date:   Thu Jul 7 22:43:32 2022 -0400

    Merge pull request #84020 from xinhaoz/backport22.1-83616

commit 66a81c6b4a77f07ebe1a06a4f68e8fe8473fa4fe
Author: Xin Hao Zhang <[email protected]>
Date:   Thu Jun 23 15:34:53 2022 -0400

    sql, sqlstats: create temporary stats container for all txns
    
    Fixes: #81470
...
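
For reference, a minimal sketch of how the threshold-based auto-classification in the bisection log above could work, assuming a hypothetical runBenchmark helper and driving git bisect directly; this is not the actual roachtest bisect tooling:

```go
package main

import (
	"fmt"
	"os/exec"
)

const (
	goodThreshold = 128883 // ops/s at or above this is auto-marked good
	badThreshold  = 124091 // ops/s at or below this is auto-marked bad
	numRuns       = 4      // average over several runs to reduce noise
)

// classify maps an averaged throughput measurement to a git-bisect verdict.
// Results between the thresholds are ambiguous and left to the user
// (mark manually or skip the commit).
func classify(avgOpsPerSec float64) string {
	switch {
	case avgOpsPerSec >= goodThreshold:
		return "good"
	case avgOpsPerSec <= badThreshold:
		return "bad"
	default:
		return "skip"
	}
}

// runBenchmark is a placeholder: a real driver would check out sha, build,
// run kv95/enc=false/nodes=3/cpu=32 `runs` times, and return the mean ops/s.
func runBenchmark(sha string, runs int) float64 {
	return 120661
}

func main() {
	sha := "cef41887e2"
	avg := runBenchmark(sha, numRuns)
	verdict := classify(avg)
	fmt.Printf("[%s] Average ops/s: [%d]. Auto marked as %s.\n", sha, int(avg), verdict)

	// Feed the verdict back to git bisect; skipped commits are handled manually.
	if verdict != "skip" {
		_ = exec.Command("git", "bisect", verdict).Run()
	}
}
```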

@erikgrinaker
Contributor Author

Thanks for the bisect!

@cockroachdb/sql-observability @cockroachdb/sql-sessions This is pretty old now, but it looks like we may have left about 5% of end-to-end performance on the table here -- might be easy to get that back?

@xinhaoz
Member

xinhaoz commented Jan 3, 2023

Thank you for finding this! The regression might be due to us now allocating a temporary stats container for every txn, whereas before we only did so for explicit txns. We can likely just reuse the txn stats container, or introduce a pool for the containers being created. Do you think that could be the culprit? I can file an issue for sql-obs.
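
A minimal sketch of the pooling idea with sync.Pool, using a hypothetical txnStatsContainer stand-in rather than the actual sqlstats types:

```go
package main

import (
	"fmt"
	"sync"
)

// txnStatsContainer stands in for the per-transaction stats container
// referenced above; the real type lives in the sqlstats package.
type txnStatsContainer struct {
	stmtStats map[string]int64
}

// containerPool reuses containers across transactions instead of
// allocating a fresh one for every txn.
var containerPool = sync.Pool{
	New: func() interface{} {
		return &txnStatsContainer{stmtStats: make(map[string]int64)}
	},
}

// getContainer fetches a container for the duration of a txn.
func getContainer() *txnStatsContainer {
	return containerPool.Get().(*txnStatsContainer)
}

// putContainer clears and returns a container to the pool when the txn ends,
// so the next txn can reuse the allocation.
func putContainer(c *txnStatsContainer) {
	for k := range c.stmtStats {
		delete(c.stmtStats, k)
	}
	containerPool.Put(c)
}

func main() {
	c := getContainer()
	c.stmtStats["SELECT 1"]++
	fmt.Println("recorded stats for", len(c.stmtStats), "statement(s)")
	putContainer(c)
}
```

Clearing the map before returning the container keeps reused containers from leaking stats between transactions.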

@erikgrinaker
Contributor Author

Yep, sounds plausible, appreciate you looking into this!

@srosenberg
Member

Root cause appears to be sqlstats, closing.
