workload: jitter the teardown of connections to prevent thundering herd #98689
Conversation
8ddb59e to 3b07d59 (force-push)
This includes one notable fix for `workload` (see #98689):

- Fix MaxConnLifetimeJitter

Release note: None
7f38a8a to 1e2efd5 (force-push)
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @renatolabs, @sravotto, and @srosenberg)
pkg/workload/pgx_helpers.go
line 191 at r3 (raw file):
// MultiConnPool. The max number of connections are warmed up if numConns is
// set to 0.
func (m *MultiConnPool) WarmupConns(ctx context.Context, numConns int) error {
I refactored connection pool warming out of `NewMultiConnPool()` into `WarmupConns()` so that a future engineer can potentially disable connection warmups when the `--ramp` argument is used. Right now `--ramp`'ed workloads will have all connections established, which is still useful, but at present connections don't ramp with the workload.
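For illustration, a minimal caller-side sketch of that split, assuming the `WarmupConns` signature quoted above; the `--ramp` gating and the `rampDuration` variable are the hypothetical future behavior described here, not part of this change:

```go
// Hypothetical caller; rampDuration and the decision to skip warmup are
// illustrative only, not code in this PR.
mcp, err := workload.NewMultiConnPool(ctx, cfg, urls...)
if err != nil {
	return err
}
if rampDuration == 0 {
	// Passing 0 asks WarmupConns to warm the pool's max number of connections.
	if err := mcp.WarmupConns(ctx, 0); err != nil {
		return err
	}
}
```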
The changes seem sound to me. We should run some roachperfs before merging to reduce the risk of a regression.
Reviewed 1 of 22 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @renatolabs, @sean-, and @sravotto)
pkg/workload/pgx_helpers.go
line 120 at r4 (raw file):
logLevel = cfg.LogLevel
}
maxConnLifetime := 5 * time.Minute
Previously (in v4), this default was 1 hour. What's the rationale for reducing it down to 5 minutes?
pkg/workload/pgx_helpers.go
line 160 at r4 (raw file):
}
// Disable the automatic prepared statement cache. We've seen a lot of
Actual examples would be preferred. Is the prepared statement cache detrimental in the majority of cases? Does the cache churn have a non-negligible performance penalty for workload throughput?
pkg/workload/pgx_helpers.go
line 164 at r4 (raw file):
connCfg := poolCfg.ConnConfig
connCfg.DefaultQueryExecMode = pgx.QueryExecModeDescribeExec
connCfg.StatementCacheCapacity = 0
These two cache capacity defaults correspond to `QueryExecModeCacheStatement` and `QueryExecModeCacheDescribe`. Why are we overriding them when the default mode is `QueryExecModeDescribeExec`?
pkg/workload/pgx_helpers.go
line 237 at r4 (raw file):
var numWarmupConns int
if numConns <= 0 {
`< 0` is handled above. In fact, it feels like a precondition.
pkg/workload/sql_runner.go
line 283 at r4 (raw file):
args = append(args, simpleQuery)
copy(args[1:], args)
args[0] = simpleQuery
This looks like a typo. The first element should be of type `pgx.QueryExecMode`, not `[]pgx.QueryExecMode`.
pkg/workload/kv/kv.go
line 602 at r4 (raw file):
var tx pgx.Tx
pl := o.mcp.Get()
conn, err := pl.Acquire(ctx)
`BeginTx` does invoke `Acquire`, so why the more verbose version?
Additional testing would be welcome, but we don't see any major regressions. Thanks to @sravotto, the new binary is already being tested. Specifically, the following is a graph of the before and after, starting at ~8:25. To get this effect, both #98689 and #98701 are required. I expect we will see a further improvement in tail latency with this change, both because of the jitter and because of a change in pgx 5.1.0, where pgx updated its internal version of puddle to do a non-blocking, background acquisition of a resource:
https://github.com/jackc/pgx/blob/master/CHANGELOG.md#510-november-12-2022
https://github.com/jackc/puddle/blob/master/CHANGELOG.md#130-august-27-2022
jackc/pgx@c00fb5d
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @renatolabs, @sravotto, and @srosenberg)
pkg/workload/pgx_helpers.go
line 120 at r4 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
Previously (in v4), this default was 1 hour. What's the rationale for reducing it down to 5 minutes?
To allow for reasonable management of crdb clusters. Specifically, if a crdb decommission is kicked off, we don't want to wait 1hr for connections to drain. Instead, we can kick off a benchmark and wait only 5min for the workload driver to recycle all of its connections, which significantly shortens the amount of time an operator has to wait when performing any kind of maintenance. The cost of establishing a TLS connection isn't so great that it can't be safely amortized over a 5min period.
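As a reference sketch, here is where those values land in pgx v5's pool configuration; the field names are `pgxpool.Config`'s, but wrapping them in a helper like this is illustrative rather than the PR's exact code:

```go
import (
	"time"

	"github.com/jackc/pgx/v5/pgxpool"
)

// newPoolConfig sketches the connection-lifetime settings discussed above.
func newPoolConfig(connString string) (*pgxpool.Config, error) {
	poolCfg, err := pgxpool.ParseConfig(connString)
	if err != nil {
		return nil, err
	}
	// Recycle connections after ~5 minutes so cluster maintenance (e.g. a
	// node decommission) drains in minutes rather than an hour.
	poolCfg.MaxConnLifetime = 5 * time.Minute
	// Stagger teardown by up to 30s so the pool doesn't drop and re-dial
	// every connection at the same instant (the thundering herd).
	poolCfg.MaxConnLifetimeJitter = 30 * time.Second
	return poolCfg, nil
}
```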
pkg/workload/pgx_helpers.go
line 160 at r4 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
Actual examples would be preferred. Is prepared statement cache detrimental in a majority of the cases? Does the cache churn have a non-negligible performance penalty for workload throughput?
These comments were present before the refactoring required to update to pgx v5; therefore, I didn't revalidate them.
pkg/workload/pgx_helpers.go
line 164 at r4 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
These two cache capacity defaults correspond to `QueryExecModeCacheStatement` and `QueryExecModeCacheDescribe`. Why are we overriding them when the default mode is `QueryExecModeDescribeExec`?
Ditto. I wondered the same thing myself but didn't want to open a new investigation into a problem I wasn't attempting to solve.
pkg/workload/pgx_helpers.go
line 237 at r4 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
`< 0` is handled above. In fact, it feels like a precondition.
Updated this check to be `==` instead. The `< 0` check short-circuits the warmup entirely, whereas `== 0` auto-sizes the warmup to max conns. There is no change in behavior, but this is slightly more explicit (the previous `< 0` short-circuit didn't exist until late in this update).
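In other words, roughly this shape (a hypothetical excerpt with approximate names, not the exact diff):

```go
// Negative numConns explicitly skips warmup; zero auto-sizes to the pool max.
if numConns < 0 {
	return nil
}
numWarmupConns := numConns
if numConns == 0 {
	numWarmupConns = maxPoolConns
}
```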
pkg/workload/sql_runner.go
line 283 at r4 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
This looks like a typo. The first element should be of type `pgx.QueryExecMode`, not `[]pgx.QueryExecMode`.
This is the same `copy()` logic that was there previously and just a slice trick to prepend a value:
cockroach/pkg/workload/sql_runner.go
Lines 280 to 282 in 85c6e38
args = append(args, pgx.QuerySimpleProtocol(true))
copy(args[1:], args)
args[0] = pgx.QuerySimpleProtocol(true)
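For readers unfamiliar with the idiom, the append/copy/overwrite sequence is the standard Go way to prepend an element in place; a self-contained sketch with a stand-in value rather than pgx's type:

```go
package main

import "fmt"

func main() {
	args := []interface{}{1, "two", 3.0}
	mode := "simple" // stand-in for pgx.QuerySimpleProtocol(true)

	// Grow the slice by one, shift the existing elements right by one,
	// then overwrite index 0 with the value being prepended.
	args = append(args, mode)
	copy(args[1:], args)
	args[0] = mode

	fmt.Println(args) // [simple 1 two 3]
}
```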
pkg/workload/kv/kv.go
line 602 at r4 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
`BeginTx` does invoke `Acquire`, so why the more verbose version?

TIL. I was doing a strict refactor of what was there. I've updated this to use `BeginTx()`. Thank you!
This includes one notable fix for `workload` (see #98689): - Fix MaxConnLifetimeJitter Release note: None
Reviewed 9 of 22 files at r1, 4 of 4 files at r8, 13 of 13 files at r11, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @renatolabs, @sean-, @sravotto, and @srosenberg)
pkg/workload/pgx_helpers.go
line 273 at r11 (raw file):
}
// AddPreparedStatement adds the given sql statement to the map of
The comment on this method is off.
pkg/workload/pgx_helpers.go
line 323 at r11 (raw file):
g.Go(func() error {
	if err := sem.Acquire(warmupCtx, 1); err != nil {
		warmupConns <- struct{}{}
With an error group, once any goroutine returns an error, the context is cancelled, so you should be able to select on `<-warmupConns` / `<-warmupCtx.Done()` down below instead of doing this dance with the warmup conn accounting and sem release on the error path.
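A rough sketch of that suggestion, assuming names like `warmupCtx`, `sem`, and a buffered `warmupConns` channel from the surrounding code; this is illustrative, not the PR's final implementation:

```go
import (
	"context"

	"github.com/jackc/pgx/v5/pgxpool"
	"golang.org/x/sync/errgroup"
	"golang.org/x/sync/semaphore"
)

// warmup is an illustrative helper. Because errgroup cancels warmupCtx as
// soon as any goroutine returns an error, the error paths don't need to push
// sentinels into warmupConns or release the semaphore by hand; the waiter
// just selects on the cancelled context.
func warmup(ctx context.Context, pool *pgxpool.Pool, sem *semaphore.Weighted, n int) error {
	g, warmupCtx := errgroup.WithContext(ctx)
	warmupConns := make(chan struct{}, n) // buffered, so sends never block
	for i := 0; i < n; i++ {
		g.Go(func() error {
			if err := sem.Acquire(warmupCtx, 1); err != nil {
				return err // cancels warmupCtx for the other goroutines
			}
			defer sem.Release(1)
			conn, err := pool.Acquire(warmupCtx)
			if err != nil {
				return err
			}
			conn.Release()
			warmupConns <- struct{}{}
			return nil
		})
	}
	for i := 0; i < n; i++ {
		select {
		case <-warmupConns:
		case <-warmupCtx.Done():
			return g.Wait() // surfaces the first goroutine error
		}
	}
	return g.Wait()
}
```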
pkg/workload/sql_runner.go
line 276 at r11 (raw file):
}
// prependQuerySimpleProtocol inserts pgx.QuerySimpleProtocol(true) at the
What allows us to remove this for the `QueryExecModeSimpleProtocol` case? Are we telling pgx to use `QueryExecModeSimpleProtocol` somewhere else?
pkg/workload/indexes/indexes.go
line 162 at r11 (raw file):
}
cfg := workload.MultiConnPoolCfg{
	ConnHealthCheckPeriod: w.connFlags.ConnHealthCheckPeriod,
This looks tedious and error-prone. Should we pull out a constructor?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten, @renatolabs, @sravotto, and @srosenberg)
pkg/workload/sql_runner.go
line 276 at r11 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
What allows us to remove this for the `QueryExecModeSimpleProtocol` case? Are we telling pgx to use `QueryExecModeSimpleProtocol` somewhere else?
`Exec()` via `pgxpool` calls `Exec()` in `pgconn`, which always uses the simple query protocol.
pkg/workload/indexes/indexes.go
line 162 at r11 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
This looks tedious and error-prone. Should we pull out a constructor?
I was cargo-culting the existing style here, but agree this should be revisited.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten, @renatolabs, @sravotto, and @srosenberg)
pkg/workload/indexes/indexes.go
line 162 at r11 (raw file):
Previously, sean- (Sean Chittenden) wrote…
I was cargo-culting the existing style here, but agree this should be revisited.
Revisited and created a helper function that creates a `workload.MultiConnPoolCfg` from a `workload.ConnFlags`.
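Roughly the shape of such a helper (the function name and field names here are assumed to mirror the connection flags above; see the actual change for the exact mapping):

```go
// multiConnPoolCfgFromFlags is an illustrative sketch of the helper
// described above; field names are assumed to mirror the connection flags.
func multiConnPoolCfgFromFlags(cf *ConnFlags) MultiConnPoolCfg {
	return MultiConnPoolCfg{
		ConnHealthCheckPeriod: cf.ConnHealthCheckPeriod,
		MinConns:              cf.MinConns,
		MaxConnIdleTime:       cf.MaxConnIdleTime,
		MaxConnLifetime:       cf.MaxConnLifetime,
		MaxConnLifetimeJitter: cf.MaxConnLifetimeJitter,
	}
}
```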
The change here looks good. Thanks for making it and iterating on this.
Reviewed 1 of 2 files at r13, 1 of 1 files at r14, 6 of 6 files at r15, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @renatolabs, @sean-, @sravotto, and @srosenberg)
pkg/workload/histogram/histogram.go
line 76 at r15 (raw file):
func (w *NamedHistogram) Record(elapsed time.Duration) {
	w.prometheusHistogram.Observe(float64(elapsed.Nanoseconds()) / float64(time.Second))
	w.mu.Lock()
Nice find. What brought you here?
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @bobvawter, @renatolabs, @sravotto, and @srosenberg)
pkg/workload/histogram/histogram.go
line 76 at r15 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Nice find. What brought you here?
I started getting desperate to stamp out a race condition and ran the race detector as a quick first pass, and stumbled across this (h/t to @bobvawter for the suggestion). The bug: when running `workload` with <100% of the connections pre-warmed, a connection `Acquire()` succeeds but returns an invalid connection, which forces the operator to run the test with `--tolerate-errors`. This shouldn't be necessary, but it only happens at higher levels of concurrency and with low numbers of pre-warmed connections. The following is for posterity and not something I want this PR to block on.
I haven't been able to hunt down the bug yet. I thought I had resolved it, but by adding debug code I masked the race condition. Infuriatingly, if I inject debug code in the error branch, the error never presents itself. (Mission accomplished? 😆) Specifically, if you warm up less than 100% of the connections (this is a new feature, so it isn't a regression), then the call to `pgxpool.Acquire()` can (and frequently does, while the workload is just starting) return a nil `conn`. The `err` return value varies; the most common error is `read: connection reset by peer`. With low levels of concurrency this doesn't happen, but when using a concurrency of 300 and only warming up one connection, it happens nearly every run.
What's wild is that putting `q.Q()` (a debugging library I use) in the error path[2] seems to be all that's required to prevent the race from happening. Downgrading from `q.Q()` to `log.Warningf()` reintroduces the problem. I've been looking into `pgxpool`'s `Acquire()` and `Release()` code paths, as well as `jackc/puddle`, but haven't been able to figure out where the bug is. My current suspicion is that a `pgxpool.Conn` is being `Release()`'ed back into the pool in an invalid state. I've added `AfterRelease()` and `BeforeAcquire()` handlers to try and catch things inside of `pgxpool`, but struck out there.
I'm starting to suspect there's something subtle in the memory model that isn't being handled correctly in `jackc/puddle`'s `Acquire()` or `Release()`. I'm going to continue to pursue this in the background because my gaze is starting to move external to `workload`.
[1] Sample `workload` output w/ bug:
$ workload run kv --tolerate-errors --db movr '--concurrency=300' --min-conns 1 '--duration=3s' '--warmup-conns=1' 'postgresql://[email protected]:26257/movr?sslmode=disable&name=url1'
INFO: Build completed successfully, 32 total actions
I230329 15:37:52.285739 1 workload/cli/run.go:622 [T1] 1 random seed: -2816927724130576403
I230329 15:37:52.285813 1 workload/cli/run.go:429 [T1] 2 creating load generator...
I230329 15:37:52.292837 1 workload/cli/run.go:460 [T1] 3 creating load generator... done (took 7.022875ms)
E230329 15:37:52.321974 1 workload/cli/run.go:548 [T1] 4 read tcp 127.0.0.1:52708 -> 127.0.0.1:26257: read: connection reset by peer
_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
1.0s 96 19013.8 19155.6 14.2 23.1 75.5 96.5 write
2.0s 96 19489.6 19322.9 16.3 26.2 37.7 52.4 write
3.0s 96 19349.8 19331.8 16.8 28.3 37.7 58.7 write
Highest sequence written: 58300. Can be passed as --write-seq=R58300 to the next run.
_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
3.0s 96 58000 19330.0 15.4 15.7 26.2 39.8 96.5 write
_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result
3.0s 96 58000 19330.0 15.4 15.7 26.2 39.8 96.5
[2] Sample patch that masks the problem:
➜ git diff pkg/workload/sql_runner.go
diff --git a/pkg/workload/sql_runner.go b/pkg/workload/sql_runner.go
index 8f579a69258..6c39e657e76 100644
--- a/pkg/workload/sql_runner.go
+++ b/pkg/workload/sql_runner.go
@@ -14,6 +14,7 @@ import (
"context"
"fmt"
+ // "github.com/cockroachdb/cockroach/pkg/util/log"
+ "github.com/ryboe/q"
"github.com/cockroachdb/errors"
"github.com/jackc/pgx/v5"
"github.com/jackc/pgx/v5/pgconn"
@@ -110,8 +111,16 @@ type StmtHandle struct {
// See pgx.Conn.Exec.
func (h StmtHandle) Exec(ctx context.Context, args ...interface{}) (pgconn.CommandTag, error) {
h.check()
conn, err := h.s.sr.mcp.Get().Acquire(ctx)
if err != nil {
+ // log.Warningf(ctx, "race conditions gunna race: %v", h.s.sr.mcp)
+ q.Q("mcp", h.s.sr.mcp)
return pgconn.CommandTag{}, err
}
defer conn.Release()
bors try
tryBuild failed:
bors retry
tryBuild failed:
bdb8e3e to 072c876 (force-push)
bors retry
tryBuild succeeded:
This change upgrades workload's use of pgx from v4 to v5 in order to allow jittering the teardown of connections. This change sets a max connection age of 5min and jitters the teardown by 30s. Upgrading to pgx v5 also adds non-blocking pgxpool connection acquisition.

workload: add flags to manage the age and lifecycle of connection pool

Add flags to all workload types to specify:

* the max connection age: `--max-conn-lifetime duration`
* the max connection age jitter: `--max-conn-lifetime-jitter duration`
* the max connection idle time: `--max-conn-idle-time duration`
* the connection health check interval: `--conn-healthcheck-period duration`
* the min number of connections in the pool: `--min-conns int`

workload: add support for remaining pgx query modes

Add support for pgx.QueryExecModeCacheDescribe and pgx.QueryExecModeDescribeExec. Previously, only three of the five query modes were available.

workload: fix race condition when recording histogram data

Release note (cli change): workload jitters teardown of connections to prevent thundering herd impacting P99 latency results.

Release note (cli change): workload utility now has flags to tune the connection pool used for testing. See `--conn-healthcheck-period`, `--min-conns`, and the `--max-conn-*` flags for details.

Release note (cli change): workload now supports every [PostgreSQL query mode](https://github.com/jackc/pgx/blob/fa5fbed497bc75acee05c1667a8760ce0d634cba/conn.go#L167-L182) available via the underlying pgx driver.
072c876 to 98e324a (force-push)
bors r+
This PR was included in a batch that successfully built, but then failed to merge into master (it was a non-fast-forward update). It will be automatically retried.
Build succeeded:
workload: jitter the teardown of connections to prevent thundering herd
This change upgrades workload's use of pgx from v4 to v5 in order to allow jittering the teardown of connections. This change sets a max connection age of 5min and jitters the teardown by 30s. Upgrading to pgx v5 also adds non-blocking pgxpool connection acquisition.
workload: add flags to manage the age and lifecycle of connection pool
Add flags to all workload types to specify:

* the max connection age: `--max-conn-lifetime duration`
* the max connection age jitter: `--max-conn-lifetime-jitter duration`
* the max connection idle time: `--max-conn-idle-time duration`
* the connection health check interval: `--conn-healthcheck-period duration`
* the min number of connections in the pool: `--min-conns int`
workload: add support for remaining pgx query modes
Add support for `pgx.QueryExecModeCacheDescribe` and `pgx.QueryExecModeDescribeExec`. Previously, only three of the five query modes were available.

workload: fix race condition when recording histogram data
Release note (cli change): workload jitters teardown of connections to prevent thundering herd impacting P99 latency results.
Release note (cli change): workload utility now has flags to tune the connection pool used for testing. See `--conn-healthcheck-period`, `--min-conns`, and the `--max-conn-*` flags for details.

Release note (cli change): workload now supports every [PostgreSQL query mode](https://github.com/jackc/pgx/blob/fa5fbed497bc75acee05c1667a8760ce0d634cba/conn.go#L167-L182) available via the underlying pgx driver.
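For reference, a sketch of how pgx v5's five query exec modes could be selected from a method-style flag; the string spellings and function name are illustrative, while the mode constants are pgx's:

```go
import (
	"fmt"

	"github.com/jackc/pgx/v5"
)

// queryModeFromFlag maps a hypothetical --method value to one of pgx v5's
// five query exec modes.
func queryModeFromFlag(method string) (pgx.QueryExecMode, error) {
	switch method {
	case "cache_statement":
		return pgx.QueryExecModeCacheStatement, nil
	case "cache_describe":
		return pgx.QueryExecModeCacheDescribe, nil
	case "describe_exec":
		return pgx.QueryExecModeDescribeExec, nil
	case "exec":
		return pgx.QueryExecModeExec, nil
	case "simple_protocol":
		return pgx.QueryExecModeSimpleProtocol, nil
	default:
		return 0, fmt.Errorf("unknown query mode %q", method)
	}
}
```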