roachtests: introduce admission-control/snapshot-overload #89191
Here's a prometheus dump of an experiment run. To see it, run the contained |
I'll echo the same learnings we had a few months ago: it's difficult to get into resource starvation due to snapshots, and I've had to crank things way up to get anything measurable. Even what's measurable is tiny: a p99 elevation from 23ms to 55ms, with foreground load running against a fully loaded LSM observing a lot of incoming snapshots. It's possible this test gets tweaked slightly (using tpcc as the foreground load, a higher read rate, etc.) once we're ready to look at the snapshot overload case again, perhaps once we're looking to remove snapshot rates altogether.
Force-pushed from 89b0534 to d266bd8.
Are those high CPU and read/write bandwidth plateaus due to snapshot ingestion?
Release note: None
Force-pushed from d266bd8 to b2a1f33.
Going to take the chance to re-visualize off of the prometheus dump to plot all relevant things onto one frame, focusing on one node in particular. For read-write bandwidth, the two non-zero plateaus correspond to when we're:
For CPU, the six non-zero plateaus correspond to when:
This lets one create data dumps off of roachprod clusters that have been collecting data using `roachprod grafana-start`. To use:

    roachprod grafana-dump $CLUSTER --dump-dir ./
    ./prometheus-docker-run.sh

The latter sets up a prometheus server on localhost:9090. To point your grafana server to it, you can:

    brew install grafana
    brew services start grafana
    # navigate to localhost:3000
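Once the dump is being served on localhost:9090, one way to sanity-check it without Grafana is to hit the Prometheus HTTP query API directly. The following is a minimal Go sketch and not part of this PR: it assumes the standard `/api/v1/query` endpoint and the built-in `up` series, and the helper names `queryURL` and `parseSamples` are hypothetical. Because a dump holds historical data, an instant query evaluated at the current time may well return no samples; passing the API's `time` parameter for a timestamp inside the experiment window would likely be needed in practice.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/url"
	"time"
)

// promResponse mirrors only the parts of an instant-query response used here.
type promResponse struct {
	Status string `json:"status"`
	Data   struct {
		ResultType string `json:"resultType"`
		Result     []struct {
			Metric map[string]string `json:"metric"`
			Value  [2]interface{}    `json:"value"` // [unix timestamp, "sample value"]
		} `json:"result"`
	} `json:"data"`
}

// queryURL (hypothetical helper) builds an instant-query URL for a Prometheus server.
func queryURL(base, promQL string) string {
	return base + "/api/v1/query?query=" + url.QueryEscape(promQL)
}

// parseSamples (hypothetical helper) maps each series' "instance" label to its
// sample value, which Prometheus encodes as a string.
func parseSamples(body []byte) (map[string]string, error) {
	var r promResponse
	if err := json.Unmarshal(body, &r); err != nil {
		return nil, err
	}
	if r.Status != "success" {
		return nil, fmt.Errorf("query failed with status %q", r.Status)
	}
	out := make(map[string]string)
	for _, s := range r.Data.Result {
		if v, ok := s.Value[1].(string); ok {
			out[s.Metric["instance"]] = v
		}
	}
	return out, nil
}

func main() {
	// Assumption: the dump is served locally, as described above.
	client := &http.Client{Timeout: 2 * time.Second}
	resp, err := client.Get(queryURL("http://localhost:9090", "up"))
	if err != nil {
		fmt.Println("prometheus not reachable:", err)
		return
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		fmt.Println("read failed:", err)
		return
	}
	samples, err := parseSamples(body)
	if err != nil {
		fmt.Println("parse failed:", err)
		return
	}
	for instance, v := range samples {
		fmt.Printf("%s up=%s\n", instance, v)
	}
}
```

If the server is not running, the program reports that and exits cleanly rather than failing.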
It's possible to pre-load a roachprod grafana instance with specific dashboards using --grafana-config. Consider the following for ex.

    roachprod grafana-start $CLUSTER \
      --grafana-config http://go.crdb.dev/p/snapshot-admission-control-grafana

By default these dashboards are not editable. Make sure they are.

Release note: None
All variants set it to true so it's not teaching as much. We don't need/want to run decommissioning variants with AC being switched off; we don't foresee switching it off in production clusters either. Release note: None
Some tests have expensive initialization steps (large imports, for example). When iterating on them, it's often useful to recycle roachtest-created clusters without re-running the full gamut. Introduce this flag to let test authors define such fast paths. Release note: None
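The fast-path idea can be sketched as follows. This is a hypothetical illustration in plain Go, not the roachtest API: the `options` struct, its `skipInit` field, and the step names are all assumptions made for the example.

```go
package main

import "fmt"

// options is a hypothetical stand-in for flags handed to a test.
type options struct {
	skipInit bool // when true, reuse whatever state the cluster already has
}

// runTest returns the ordered list of steps it would execute, so the effect
// of the fast path is easy to see.
func runTest(opts options) []string {
	var steps []string
	if !opts.skipInit {
		// Expensive, one-time setup; worth skipping on a recycled cluster.
		steps = append(steps, "import large dataset")
	}
	steps = append(steps, "run foreground workload")
	return steps
}

func main() {
	fmt.Println(runTest(options{skipInit: false}))
	fmt.Println(runTest(options{skipInit: true}))
}
```

The point of the design is that the test author, not the framework, decides which steps are safe to skip on a reused cluster.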
Force-pushed from b2a1f33 to d8ada8b.
Thanks for the details.
for the admission_control* files
Reviewed 3 of 17 files at r9.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @irfansharif, @renatolabs, and @sumeerbhola)
bors r+
@renatolabs, feel free to leave more comments if you have any; I'll send a follow-up PR. Trying to get this in to unblock some other follow-on PRs.
Build succeeded:
This test sets up a 3-node CRDB cluster on 8vCPU machines, loads it up
with a large TPC-C dataset, and sets up a foreground load of kv95/1b.
The TPC-C dataset has a replication factor of 1 and is used as the
large lump we bus around through high-rate snapshots -- snapshots that
are sent to the node containing leaseholders serving kv95 traffic.
Snapshots follow where the leases travel, cycling through each node,
evaluating performance isolation in the presence of snapshots.
Informs #80607.
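The "snapshots follow where the leases travel" pattern can be pictured as a simple round-robin schedule. The sketch below is a hypothetical illustration, not the test's actual code: the `round` type, the `schedule` function, and the number of rounds are assumptions made for the example.

```go
package main

import "fmt"

// round describes one phase of the hypothetical schedule.
type round struct {
	leaseholder    int // node holding the leases that serve kv95 traffic
	snapshotTarget int // node receiving the TPC-C snapshots in this phase
}

// schedule cycles the leaseholder through nodes 1..nodes and always aims the
// snapshots at whichever node currently holds the leases.
func schedule(nodes, rounds int) []round {
	out := make([]round, 0, rounds)
	for i := 0; i < rounds; i++ {
		n := i%nodes + 1
		out = append(out, round{leaseholder: n, snapshotTarget: n})
	}
	return out
}

func main() {
	// 3 nodes as in the description; 6 rounds is an arbitrary choice.
	for i, r := range schedule(3, 6) {
		fmt.Printf("round %d: leases on n%d, snapshots sent to n%d\n",
			i+1, r.leaseholder, r.snapshotTarget)
	}
}
```

Keeping the snapshot target equal to the leaseholder is what puts the snapshot traffic on the same node as the foreground load, which is the isolation property being evaluated.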
This test comes 'batteries-included', shipping with a custom grafana
dashboard and requisite prometheus/grafana plumbing. At the end of the
test run a prometheus dump is kept around for later inspection. It also
integrates with --local and --skip-init for fast iteration. We're going
to cargo cult such things in future integration tests for admission
control.
Release note: None
There are a few other miscellaneous cleanups/conveniences around the roachtest package for @cockroachdb/test-eng to look at. First three commits are from #89027 and don't need to be looked at here.