roachtest: bump AWS provisioned write bandwidth for all restore tests #107609
Comments
cc @cockroachdb/replication
Hi @tbg, please add branch-* labels to identify which branch(es) this release-blocker affects. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
@cockroachdb/disaster-recovery @msbutler What limits the rate of writes in restore (as an alternative to increasing the throughput beyond 125 MB/s, which is not free)?
@pavelkalinnikov At the SQL level, nothing limits the write rate of restore. At the kvserver level, there are a few things that I'm less familiar with. I know Irfan recently set the concurrent AddSSTable limiter to default off on master (not on 23.1) #104861. These tests were written in part to answer the following question: "given a hardware configuration and workload, what's our restore throughput?" Since this test seems to be bottlenecked on hardware, it seems reasonable to use beefier hardware. I'm going to chat with my team about it tomorrow. On the flip side, it would be a bad look to tell customers "restore needs to be run on beefier machines or else a node could OOM".
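For concreteness, a minimal Go sketch of how one might inspect and bump an AddSSTable concurrency limit from a SQL client while experimenting. The setting name `kv.bulk_io_write.concurrent_addsstable_requests` is my assumption about the relevant knob (it is not named in this thread), and the connection string and value are placeholders.

```go
// Sketch: inspect and adjust an AddSSTable concurrency limit via SQL.
// Assumes a local insecure single-node cluster; the setting name is an
// assumption about the relevant knob, not taken from this thread.
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // CockroachDB speaks the Postgres wire protocol.
)

func main() {
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	var limit string
	if err := db.QueryRow(
		"SHOW CLUSTER SETTING kv.bulk_io_write.concurrent_addsstable_requests",
	).Scan(&limit); err != nil {
		log.Fatal(err)
	}
	fmt.Println("current AddSSTable concurrency limit:", limit)

	// Raise the limit (illustrative value only).
	if _, err := db.Exec(
		"SET CLUSTER SETTING kv.bulk_io_write.concurrent_addsstable_requests = 4",
	); err != nil {
		log.Fatal(err)
	}
}
```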
@msbutler The prototype for provisioning extra throughput is in #108427. I've tested it and compared 125 and 250 MB/s, see the above message. At 125 we're maxing out the throughput; at 250 we have some leeway, so I think we should be good with 250? Agreed with your point that we should fix the OOMs rather than require beefier machines. We are probably doing that next, but we want to avoid unnecessary test flakes in the meantime.
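For reference, a minimal sketch of how one could sample `/proc/diskstats` on a node to check whether a volume is pinned at its provisioned write throughput, under the assumption of a Linux host; the device name (`nvme1n1`) and the sampling interval are placeholders.

```go
// Sketch: sample /proc/diskstats to estimate sustained write throughput on a
// node, e.g. to check whether a volume is pinned at its provisioned limit.
package main

import (
	"fmt"
	"log"
	"os"
	"strconv"
	"strings"
	"time"
)

// sectorsWritten returns the cumulative 512-byte sectors written to dev.
func sectorsWritten(dev string) (uint64, error) {
	data, err := os.ReadFile("/proc/diskstats")
	if err != nil {
		return 0, err
	}
	for _, line := range strings.Split(string(data), "\n") {
		f := strings.Fields(line)
		if len(f) >= 10 && f[2] == dev {
			return strconv.ParseUint(f[9], 10, 64) // field 10: sectors written
		}
	}
	return 0, fmt.Errorf("device %q not found", dev)
}

func main() {
	const dev = "nvme1n1" // placeholder: the EBS data volume on an AWS node
	const interval = 10 * time.Second

	before, err := sectorsWritten(dev)
	if err != nil {
		log.Fatal(err)
	}
	time.Sleep(interval)
	after, err := sectorsWritten(dev)
	if err != nil {
		log.Fatal(err)
	}
	mb := float64(after-before) * 512 / 1e6
	fmt.Printf("%s: %.1f MB/s written over %s\n", dev, mb/interval.Seconds(), interval)
}
```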
@pavelkalinnikov thanks for experimenting with this here! Here's what the DR team thinks:
Some follow-up questions:
Thanks @msbutler. The plan makes sense.
The OOMs in this test manifest in the interaction between raft nodes, e.g. when many leaders (on the same node) queue up too many / too large log entries going to followers [#73376], or when a follower is slow and receives and buffers too many / too large incoming updates. I think it's best to repro with 3 nodes/replicas; see some ideas that @tbg noted here.
I think we will likely be considering it for 24.1.
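For context, a minimal sketch of the etcd/raft flow-control knobs that bound how much log data a leader keeps queued or in flight per follower; CockroachDB builds on this library, but the values below are illustrative, not its actual defaults.

```go
// Sketch: etcd/raft flow-control knobs that cap per-follower in-flight log
// data and the leader's uncommitted log growth. Values are illustrative.
package main

import (
	"log"

	"go.etcd.io/raft/v3"
)

func main() {
	cfg := &raft.Config{
		ID:            1,
		ElectionTick:  10,
		HeartbeatTick: 1,
		Storage:       raft.NewMemoryStorage(),

		// Cap the size of a single MsgApp and the number of in-flight
		// MsgApps per follower, which together bound the leader-side
		// memory spent on entries destined for a slow follower.
		MaxSizePerMsg:   1 << 20, // 1 MiB
		MaxInflightMsgs: 128,

		// Bound the uncommitted portion of the leader's log, limiting how
		// far proposals can outrun quorum replication.
		MaxUncommittedEntriesSize: 1 << 30, // 1 GiB
	}
	if _, err := raft.NewRawNode(cfg); err != nil {
		log.Fatalf("invalid raft config: %v", err)
	}
}
```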
Reopening this issue to bump throughput in the other restore tests too. Currently, other AWS restore tests still run with the default 125 MB/s. On GCE, a same-scale test provisions more throughput and doesn't max it out. @msbutler I think we should bump throughput to 250 MB/s on all AWS restore tests.
109221: roachtest: provision 250 MB/s for restore tests on AWS r=pavelkalinnikov a=pavelkalinnikov

The `restore/tpce/*` family of tests on AWS max out the default 125 MB/s EBS throughput. In contrast, similar tests in GCE provision for more throughput and [don't max it out](#107609 (comment)).

This commit bumps the provisioned throughput from 125 MB/s to 250 MB/s in all `restore` tests on AWS, so that the tests don't run at the edge of overload. This both brings some parity between testing on GCE and AWS, and reduces the likelihood of raft OOMs (which manifest more often when the disk is overloaded).

Fixes #107609
Touches #106248

Epic: none
Release note: none

Co-authored-by: Pavel Kalinnikov <[email protected]>
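As a rough illustration of the "central code path" idea mentioned below, here is a sketch of how a shared hardware-spec helper could default AWS restore tests to 250 MB/s; all type and field names are hypothetical, not the actual roachtest API.

```go
// Sketch of the kind of central code path the restore roachtests could share
// for provisioning disk throughput. All names here are hypothetical.
package main

import "fmt"

type cloud string

const (
	aws cloud = "aws"
	gce cloud = "gce"
)

// hardwareSpecs is a stand-in for the restore tests' shared hardware config.
type hardwareSpecs struct {
	nodes             int
	volumeSizeGB      int
	ebsThroughputMBps int // AWS only; 0 means the gp3 default (125 MB/s)
}

// withDefaults applies cloud-specific defaults in one place, so every restore
// test picks up the bumped throughput without per-test changes.
func (h hardwareSpecs) withDefaults(c cloud) hardwareSpecs {
	if c == aws && h.ebsThroughputMBps == 0 {
		h.ebsThroughputMBps = 250 // double the gp3 default of 125 MB/s
	}
	return h
}

func main() {
	specs := hardwareSpecs{nodes: 4, volumeSizeGB: 1000}.withDefaults(aws)
	fmt.Printf("provisioning %d MB/s of EBS throughput per node\n", specs.ebsThroughputMBps)
}
```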
Describe the problem
A number of restore tests are failing on AWS with OOMs in the replication layer.
While we improve the memory accounting in the replication layer, it's not useful for these tests to keep failing.
We are fairly confident that the OOMs can be traced to disk overload. Default AWS gp3 volumes are provisioned with a combined read+write throughput of 125 MB/s, which is routinely maxed out in these tests. Combined with load imbalances in the cluster, this can lead to OOMs in the replication layer today.
We should double the provisioned write bandwidth and make sure that this is the default on AWS for the restore tests. This should be easy, as these tests have their own little framework (i.e. all tests go through a central code path).
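For experimentation outside the test framework, a hedged sketch of bumping a gp3 volume's throughput with the AWS SDK for Go v2; the volume ID is a placeholder, and in roachtest this would normally be requested at cluster-creation time rather than via ModifyVolume.

```go
// Sketch: raise a gp3 volume's provisioned throughput out of band.
package main

import (
	"context"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/ec2"
)

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	client := ec2.NewFromConfig(cfg)

	// Double the gp3 default of 125 MB/s to 250 MB/s.
	_, err = client.ModifyVolume(ctx, &ec2.ModifyVolumeInput{
		VolumeId:   aws.String("vol-0123456789abcdef0"), // placeholder volume ID
		Throughput: aws.Int32(250),                      // MB/s
	})
	if err != nil {
		log.Fatal(err)
	}
	log.Println("requested 250 MB/s provisioned throughput")
}
```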
Holistically addressing the OOMs is tracked in CRDB-25503.
Jira issue: CRDB-30126