
roachtest: bump AWS provisioned write bandwidth for all restore tests #107609

Closed
tbg opened this issue Jul 26, 2023 · 9 comments · Fixed by #108427 or #109221
Labels
branch-master Failures and bugs on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.

Comments

tbg (Member) commented Jul 26, 2023

Describe the problem

A number of restore tests are failing on AWS with OOMs in the replication layer.

While we improve the memory accounting in the replication layer, it's not useful for these tests to keep failing.

We are fairly confident that the OOMs can be traced to disk overload. AWS gp3 volumes are provisioned by default with a combined read+write throughput of 125 MB/s, which is routinely maxed out in these tests. Combined with load imbalances in the cluster, this can lead to OOMs in the replication layer today.

We should double the provisioned write bandwidth and make sure that this is the default on AWS for the restore tests. This should be easy, as these tests have their own little framework (i.e. all tests go through a central code path).
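For context, "bump the provisioned write bandwidth" means raising the gp3 volume's provisioned throughput above its 125 MB/s default when the volume is created. Below is a minimal sketch of what that looks like at the EC2 API level, using the AWS SDK for Go v2. This is not the roachtest code path, and the size, IOPS, and availability zone are placeholder values:

```go
package main

import (
	"context"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/ec2"
	"github.com/aws/aws-sdk-go-v2/service/ec2/types"
)

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	client := ec2.NewFromConfig(cfg)

	// gp3 volumes come with 125 MB/s of throughput regardless of size;
	// Throughput provisions extra bandwidth independently of capacity.
	out, err := client.CreateVolume(ctx, &ec2.CreateVolumeInput{
		AvailabilityZone: aws.String("us-east-2a"), // placeholder
		VolumeType:       types.VolumeTypeGp3,
		Size:             aws.Int32(500),  // GiB, placeholder
		Iops:             aws.Int32(3000), // gp3 baseline
		Throughput:       aws.Int32(250),  // MB/s, double the 125 MB/s default
	})
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("created volume %s with 250 MB/s provisioned throughput",
		aws.ToString(out.VolumeId))
}
```

In the roachtest framework the same number would be threaded through the restore tests' shared cluster-spec code path, so every restore test on AWS picks it up.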

Holistically addressing the OOMs is tracked in CRDB-25503.

Jira issue: CRDB-30126

@tbg tbg added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-kv-replication labels Jul 26, 2023
blathers-crl bot commented Jul 26, 2023

cc @cockroachdb/replication

blathers-crl bot commented Jul 26, 2023

Hi @tbg, please add branch-* labels to identify which branch(es) this release-blocker affects.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@tbg tbg added the branch-master Failures and bugs on the master branch. label Jul 26, 2023
pav-kv (Collaborator) commented Aug 8, 2023

@cockroachdb/disaster-recovery @msbutler What limits the rate of writes in restore tests? Can we reduce it a bit so that we don't eat up all the disk throughput as on this graph? Or are we interested in running these tests at the edge of capacity for max performance tracking purposes?

(as an alternative to increasing the throughput beyond 125 MB/s, which is not free)

msbutler (Collaborator) commented Aug 8, 2023

@pavelkalinnikov At the SQL level, nothing limits the write rate of restore. At the kvserver level, there are a few things that I'm less familiar with. I know Irfan recently set the concurrentAddSStable limiter to default off on master (not on 23.1) in #104861.

These tests were written in part to answer the following question: "given a hardware configuration and workload, what's our restore throughput?" Since this test seems to be bottlenecked on hardware, it seems reasonable to use beefier hardware. I'm going to chat with my team about it tomorrow.

On the flip side, it would be a bad look to tell customers "restore needs to be run on beefier machines or else a node could OOM".

pav-kv (Collaborator) commented Aug 9, 2023

Comparing performance of the restore/tpce/8TB/aws/nodes=10/cpus=8 test with 125 MB/s and 250 MB/s disks:

|  | 125 MB/s | 250 MB/s |
| --- | --- | --- |
| CPU | ~30-50% | ~60% |
| Mem | ~6 GB / 16 GB | ~6 GB / 16 GB |
| Read | <5 MB/s | 2-10 MB/s |
| Write | <125 MB/s | 70-170 MB/s |
| Time | 3h40m / 5h | 3h10m / 5h |
With provisioned 125 MB/s: [4 screenshots of cluster metrics]
With provisioned 250 MB/s: [4 screenshots of cluster metrics]

pav-kv (Collaborator) commented Aug 9, 2023

@msbutler The prototype for provisioning extra throughput is in #108427. I've tested it and compared 125 MB/s and 250 MB/s; see the message above. At 125 MB/s we're maxing out the throughput, at 250 MB/s we have some leeway, so I think we should be good with 250.

Agreed with your point that we should fix the OOMs rather than require beefier machines. That work is likely next on our list, but we want to avoid unnecessary test flakes in the meantime.

msbutler (Collaborator) commented Aug 9, 2023

@pavelkalinnikov thanks for experimenting with this here! Here's what the DR team thinks:

  1. Since this 8TB test is clearly disk-bandwidth constrained, and since these tests attempt to find software bottlenecks rather than hardware bottlenecks, we think it makes sense to bump disk bandwidth on this test for now. I believe bumping disk bandwidth will only cost us an extra $10 a month, assuming the same test runtime (pessimistic), but do check my math (see the back-of-the-envelope sketch after this list). I don't think it makes sense to bump the machine size as well.
  2. Once your prototype lands, I can open a new restore/400GB test that also bumps the disk bandwidth to 250 MB/s, and keep the existing tests as-is. This OOM only surfaces about once a month, so I don't think there's much investigative cost (open a heap profile, see that raft OOMed).
  3. I will design a restore roachtest (which can stay skipped) that attempts to saturate disk bandwidth and reliably produces OOMs. My plan is to run a 400GB restore on a single-node cluster. This test could help us tune various knobs to avoid OOMs while we wait for raft memory monitoring to land. I misspoke about SQL-level knobs earlier: restore can limit the number of workers per node that send AddSSTable requests.
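A back-of-the-envelope check of the "$10 a month" figure, assuming the published us-east-1 gp3 rate of roughly $0.04 per provisioned MB/s-month above the 125 MB/s baseline, ten volumes for the 10-node test, and about five hours of cluster time per nightly run (all of these are assumptions, not quoted prices):

```latex
% All inputs are assumptions; gp3 pricing varies by region and over time.
\begin{align*}
\text{extra throughput} &= (250 - 125)\ \text{MB/s} \times 10\ \text{volumes} = 1250\ \text{MB/s} \\
\text{cost if left running 24/7} &\approx 1250 \times \$0.04 / (\text{MB/s} \cdot \text{month}) = \$50 / \text{month} \\
\text{prorated to } \sim 5\,\text{h nightly runs} &\approx \$50 \times \tfrac{5 \times 30}{720} \approx \$10 / \text{month}
\end{align*}
```

Under those assumptions, the roughly $10/month estimate checks out.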

Some follow up questions:

  • When do you expect raft memory monitoring to land?

pav-kv (Collaborator) commented Aug 9, 2023

Thanks @msbutler. The plan makes sense.

> I will design a restore roachtest (which can stay skipped), which attempts to saturate disk bandwidth and reliably produces OOMs. My plan is to run a 400GB restore on a single node cluster.

The OOMs in this test manifest in the interaction between raft nodes, e.g. when many leaders (on the same node) queue up too many or too-large log entries destined for followers [#73376], or when a slow follower receives and buffers too many or too-large incoming updates. I think it's best to repro with 3 nodes/replicas; see the ideas that @tbg noted here.
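To make the sender-side part of this concrete, below is an illustrative sketch of the etcd/raft config knobs that bound how much log data a single leader keeps in flight per follower. The values are made-up examples, not CockroachDB's settings, and none of these bound the follower-side buffering mentioned above:

```go
package main

import (
	"fmt"

	"go.etcd.io/raft/v3"
)

func main() {
	cfg := &raft.Config{
		ID:            1,
		ElectionTick:  10,
		HeartbeatTick: 1,
		Storage:       raft.NewMemoryStorage(),

		// Cap the size of a single MsgApp and the number of such messages a
		// leader may have outstanding per follower. Together they bound the
		// in-flight memory for one leader->follower stream; with many leaders
		// colocated on one node, the sum across ranges can still be large,
		// which is the imbalance scenario described above.
		MaxSizePerMsg:   1 << 20, // 1 MiB per message (example value)
		MaxInflightMsgs: 128,     // example value

		// Additional bounds on uncommitted proposals and on entries handed to
		// the application per Ready, also relevant to memory growth.
		MaxUncommittedEntriesSize: 16 << 20,
		MaxCommittedSizePerReady:  64 << 20,
	}
	fmt.Printf("per-follower in-flight bound ≈ %d bytes\n",
		cfg.MaxInflightMsgs*int(cfg.MaxSizePerMsg))
}
```

The trouble is that these per-follower bounds multiply across the many ranges led by the same node, which is why holistic memory accounting above raft (CRDB-25503) is the longer-term fix.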

> When do you expect raft memory monitoring to land?

I think we will likely be considering it for 24.1.

pav-kv (Collaborator) commented Aug 22, 2023

Reopening this issue to bump throughput in the other tests too.

Currently, other restore tests (even small 400GB ones) max out the 125 MB/s throughput, e.g. see restore/tpce/400GB/aws/nodes=4/cpus=8 at #106248 (comment) and restore/tpce/8TB/aws/nodes=10/cpus=8 at #107609 (comment).

On GCE, the same-scale test restore/tpce/400GB/gce/nodes=4/cpus=8 gets disks with higher throughput, which is not usually maxed out:

[screenshot: disk throughput graph]

@msbutler I think we should bump throughput to 250 MB/s on all AWS restore tests, as this issue originally suggested. This would both reduce the likelihood of OOMs and bring some parity between the tests.

@pav-kv pav-kv reopened this Aug 22, 2023
craig bot pushed a commit that referenced this issue Aug 22, 2023
109221: roachtest: provision 250 MB/s for restore tests on AWS r=pavelkalinnikov a=pavelkalinnikov

The `restore/tpce/*` family of tests on AWS maxes out the default 125 MB/s EBS throughput. In contrast, similar tests on GCE provision for more throughput and [don't max it out](#107609 (comment)).

This commit bumps the provisioned throughput from 125 MB/s to 250 MB/s in all `restore` tests on AWS, so that the tests don't run at the edge of overload.

This both brings some parity between testing on GCE and AWS, and reduces the likelihood of raft OOMs (which manifest more often when the disk is overloaded).

Fixes #107609
Touches #106248
Epic: none
Release note: none

Co-authored-by: Pavel Kalinnikov <[email protected]>
@craig craig bot closed this as completed in 0c4a65b Aug 22, 2023