
roachperf: regression around 2022-09-09 [Go 1.19] #88038

Closed
erikgrinaker opened this issue Sep 16, 2022 · 14 comments
Assignees
Labels
A-roachperf-investigation Issue opened as a result of roachperf triage branch-release-22.2 Used to mark GA and release blockers, technical advisories, and bugs for 22.2 C-performance Perf of queries or internals. Solution not expected to change functional behavior. GA-blocker

Comments

@erikgrinaker
Contributor

erikgrinaker commented Sep 16, 2022

Most pronounced on kv95/enc=false/nodes=3/cpu=96, but also shows up on many other benchmarks.

[Screenshot: roachperf throughput graph showing the regression, 2022-09-16]

Jira issue: CRDB-19655

@erikgrinaker erikgrinaker added C-performance Perf of queries or internals. Solution not expected to change functional behavior. A-roachperf-investigation Issue opened as a result of roachperf triage T-testeng TestEng Team labels Sep 16, 2022
@blathers-crl

blathers-crl bot commented Sep 16, 2022

cc @cockroachdb/test-eng

@erikgrinaker
Contributor Author

Bisected this down to the Go 1.19 upgrade in #86049. Raw data:

$ roachprod create -n 4 --gce-machine-type n2-standard-32 --gce-local-ssd-count 4 --gce-zones europe-west1-b grinaker-bench
$ ./workload run kv --init --splits 1000 --concurrency 192 --read-percent 95 --ramp 120s --duration 180s

 d68046c799 2022-09-07: Merge #87448 [craig[bot]] = 118756.4
 7c01c817e5 2022-09-09: Merge #86451 [craig[bot]] = 113676.4
 1a637b0d7b 2022-09-08: Merge #87531 [craig[bot]] = 113873.7
 c83a2e8972 2022-09-07: server: fix error format message [Yahor Yuzefovich] = 108224.2 (bad run)
 76a5f6383f 2022-09-07: Merge #87174 #87318 #87403 #87443 #87450 #87461 #87493 #87504 [craig[bot]] = 112652.3
 2675c7c048 2022-08-12: *: upgrade to go 1.19.1 [Ricky Stewart] = 112210.3
 1ac25dcebf 2022-09-07: Merge #87497 [craig[bot]] = 118695.0

@erikgrinaker erikgrinaker changed the title roachperf: regression around 2022-09-09 roachperf: regression around 2022-09-09 (Go 1.19) Sep 16, 2022
@erikgrinaker erikgrinaker changed the title roachperf: regression around 2022-09-09 (Go 1.19) roachperf: regression around 2022-09-09 [Go 1.19] Sep 16, 2022
@blathers-crl

blathers-crl bot commented Sep 16, 2022

Hi @erikgrinaker, please add branch-* labels to identify which branch(es) this release blocker affects.

🦉 Hoot! I am Blathers, a bot for CockroachDB. My owner is otan.

@erikgrinaker erikgrinaker added branch-release-22.2 Used to mark GA and release blockers, technical advisories, and bugs for 22.2 T-kv-replication labels Sep 16, 2022
@blathers-crl

blathers-crl bot commented Sep 19, 2022

cc @cockroachdb/replication

@erikgrinaker erikgrinaker removed the T-testeng TestEng Team label Sep 19, 2022
@erikgrinaker
Contributor Author

Reassigning to repl until we find an owner to investigate the Go 1.19 regression.

@ajwerner
Contributor

It'd be good to get CPU profiles before and after the upgrade for some workload to compare.

@erikgrinaker
Contributor Author

erikgrinaker commented Sep 19, 2022

Grabbed a couple of profiles at the same point in a kv95 workload; nothing stands out immediately.

The profile diff shows some changes in syscalls, stack handling, and scheduling:

$ go tool pprof -diff_base cpu-go1.18.pprof.gz cpu-go1.19.pprof.gz 
Duration: 10.23s, Total samples = 123.19s (1204.51%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top10
Showing nodes accounting for -0.64s, 0.52% of 123.19s total
Dropped 1433 nodes (cum <= 0.62s)
Showing top 10 nodes out of 2221
      flat  flat%   sum%        cum   cum%
    -7.18s  5.83%  5.83%     -0.79s  0.64%  syscall.Syscall
     6.51s  5.28%  0.54%      6.51s  5.28%  runtime/internal/syscall.Syscall6
     0.63s  0.51% 0.032%      0.62s   0.5%  runtime.adjustpointers
    -0.52s  0.42%  0.45%     -0.51s  0.41%  runtime.heapBitsSetType
    -0.50s  0.41%  0.86%     -0.67s  0.54%  github.com/cockroachdb/pebble/internal/arenaskl.(*Skiplist).findSpliceForLevel
     0.43s  0.35%  0.51%      2.09s  1.70%  runtime.adjustframe
     0.40s  0.32%  0.19%      0.74s   0.6%  runtime.getStackMap
    -0.39s  0.32%   0.5%     -0.39s  0.32%  runtime.duffcopy
    -0.38s  0.31%  0.81%     -0.38s  0.31%  sync.(*RWMutex).RLock
     0.36s  0.29%  0.52%      0.38s  0.31%  runtime.(*itabTableType).find

(pprof) top10 -cum
Showing nodes accounting for 6.73s, 5.46% of 123.19s total
Dropped 1433 nodes (cum <= 0.62s)
Showing top 10 nodes out of 2221
      flat  flat%   sum%        cum   cum%
     0.04s 0.032% 0.032%      6.55s  5.32%  syscall.RawSyscall6
     6.51s  5.28%  5.32%      6.51s  5.28%  runtime/internal/syscall.Syscall6
     0.21s  0.17%  5.49%      5.17s  4.20%  runtime.findRunnable
     0.01s 0.0081%  5.50%      4.90s  3.98%  github.com/cockroachdb/pebble.(*Iterator).constructPointIter.func1
     0.04s 0.032%  5.53%      4.72s  3.83%  github.com/cockroachdb/pebble.(*Iterator).Clone
     0.02s 0.016%  5.54%      4.70s  3.82%  github.com/cockroachdb/pebble.finishInitializingIter
    -0.05s 0.041%  5.50%      4.67s  3.79%  github.com/cockroachdb/pebble.(*Iterator).constructPointIter
         0     0%  5.50%      4.64s  3.77%  github.com/cockroachdb/cockroach/pkg/storage.newMVCCIterator
         0     0%  5.50%      4.61s  3.74%  github.com/cockroachdb/cockroach/pkg/storage.(*intentInterleavingReader).NewMVCCIterator
    -0.05s 0.041%  5.46%      4.61s  3.74%  github.com/cockroachdb/cockroach/pkg/storage.(*pebbleIterator).initReuseOrCreate

@erikgrinaker
Contributor Author

For posterity, the Go 1.19 release notes say:

The runtime will now allocate initial goroutine stacks based on the historic average stack usage of goroutines. This avoids some of the early stack growth and copying needed in the average case in exchange for at most 2x wasted space on below-average goroutines.

Disabling this behavior with GODEBUG=adaptivestackstart=0 had no effect on Go 1.19 performance.

@ajwerner
Contributor

Was this profile from GODEBUG=adaptivestackstart=0 or no?

@erikgrinaker
Contributor Author

erikgrinaker commented Sep 19, 2022

No, vanilla run. GODEBUG=adaptivestackstart=0 was a separate experiment.

@ajwerner
Contributor

This does seem to have something to do with stack growth.

[Screenshot: Go 1.19 CPU profile showing time in stack growth, 2022-09-19]

This is go1.19, and in go1.18:

[Screenshot: Go 1.18 CPU profile for comparison, 2022-09-19]

@ajwerner
Contributor

What's interesting is that the growstack code is being used in both. I don't really understand why. We should try doubling the growstack size and see what it does for us.

ajwerner added a commit to ajwerner/cockroach that referenced this issue Sep 19, 2022
Now that we've adopted Go 1.19, performance is much worse (~8%) than we observed under Go 1.18. Interestingly, profiles show that we spend a lot more time growing the stack underneath request evaluation. This implies that at least part of the regression is due to the runtime's new stack growth behavior: the initial stacks are now smaller than they used to be, so when we grow the stack, we grow it by less than we need. A benchmark I ran seems to confirm this theory. I'd like to merge this to master and then backport it after we collect some more data.

We never released this, so no note.

Touches cockroachdb#88038

Release note: None
craig bot pushed a commit that referenced this issue Sep 20, 2022
87739: kvcoord: (partially) de-flake tpcc/multiregion r=irfansharif a=irfansharif

Touches #85711 fixing one of the failure modes. In #85101 we deleted
code in the span refresher interceptor that terminated WriteTooOld
flags. We did so assuming these flags were only set in 19.2 servers, but
that's not the case -- TestWTOBitTerminatedOnErrorResponses demonstrates
that it's possible for the server to return error responses with the bit
set if a response is combined with an error from another request in the
same batch request.

Since we were no longer terminating the flag, it was possible to update
the TxnCoordSender's embedded txn with this bit, and then use it when
issuing subsequent batch requests -- something we were asserting
against.

Release note: None
Release justification: Bug fix

88174: rowenc: fix needed column families computation for secondary indexes r=yuzefovich a=yuzefovich

Previously, when determining the "minimal set of column families" required to retrieve all of the needed columns for a scan operation, we could incorrectly omit the special zeroth family from the set. The KV for the zeroth column family is always present, so it might need to be fetched even when it isn't explicitly needed, namely when all of the "needed" column families are nullable. Before this patch, the code for determining whether all of the needed column families are nullable incorrectly assumed that all columns in a family are stored, but this is only true for primary indexes; for secondary indexes, only the columns mentioned in the `STORING` clause are actually stored (apart from the indexed and PK columns). As a result, we could incorrectly fail to fetch a row if:
- the unique secondary index is used
- the needed column has a NULL value
- all non-nullable columns from the same column family as the needed column are not stored in the index
- other column families are not fetched.

This is now fixed by considering only the set of stored columns.

The bug seems relatively minor since it requires a multitude of conditions to be met, so I don't think it's TA-worthy.

Fixes: #88110.

Release note (bug fix): Previously, CockroachDB could incorrectly not fetch rows with NULL values when reading from the unique secondary index when multiple column families are defined for the table and the index doesn't store some of the NOT NULL columns.

88182: sql: fix relocate commands with NULLs r=yuzefovich a=yuzefovich

Previously, we would crash when evaluating `EXPERIMENTAL_RELOCATE` commands when some of the values involved were NULL; this is now fixed. There is no release note since the commands are "experimental" after all.

Fixes: #87371.

Release note: None

88187: util/growstack: increase stack growth for 1.19 r=nvanbenschoten a=ajwerner

Now that we've adopted Go 1.19, performance is much worse (~8%) than we observed under Go 1.18. Interestingly, profiles show that we spend a lot more time growing the stack underneath request evaluation. This implies that at least part of the regression is due to the runtime's new stack growth behavior: the initial stacks are now smaller than they used to be, so when we grow the stack, we grow it by less than we need. A benchmark I ran seems to confirm this theory. I'd like to merge this to master and then backport it after we collect some more data.

We never released this, so no note.

Touches #88038

Release note: None

88195: colbuilder: don't use optimized IN operator for empty tuple r=yuzefovich a=yuzefovich

This commit makes it so that we don't use the optimized IN operator for empty tuples since they handle NULLs incorrectly. This wasn't supposed to happen already due to 9b590d3 but there we only looked at the type and not at the actual datum. This is not a production bug since the optimizer normalizes such expressions away.

Fixes: #88141.

Release note: None

Co-authored-by: irfan sharif <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Andrew Werner <[email protected]>
@erikgrinaker
Contributor Author

Nice catch @ajwerner. Ran with #88187 on top of 2675c7c and compared with its parent on kv95/enc=false/nodes=3/cpu=32, and this closes the gap. In fact, 1.19 was now 1% faster, but with n=1 that doesn't really say much. Thanks!

@erikgrinaker
Contributor Author

Resolved by #88187.

ajwerner added a commit that referenced this issue Sep 20, 2022