
roachtest: bump AWS provisioned write bandwidth for all restore tests #107609

Closed
tbg opened this issue Jul 26, 2023 · 9 comments · Fixed by #108427 or #109221
Labels
branch-master Failures and bugs on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.

Comments

tbg (Member) commented Jul 26, 2023

Describe the problem

A number of restore tests are failing on AWS with OOMs in the replication layer.

While we improve the memory accounting in the replication layer, it's not useful for these tests to keep failing.

We are fairly confident that the OOMs can be traced to disk overload. AWS gp3 volumes are provisioned by default with a combined read+write throughput of 125 MB/s, which is routinely maxed out in these tests. Combined with load imbalances in the cluster, this can lead to OOMs in the replication layer today.

We should double the provisioned write bandwidth and make sure that this is the default on AWS for the restore tests. This should be easy, as these tests have their own little framework (i.e. all tests go through a central code path).
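For context, "bump the provisioned write bandwidth" means raising the gp3 volume's provisioned throughput above its 125 MB/s default when the volume is created. Below is a minimal sketch of what that looks like at the EC2 API level, using the AWS SDK for Go v2. This is not the roachtest code path, and the size, IOPS, and availability zone are placeholder values:

```go
package main

import (
	"context"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/ec2"
	"github.com/aws/aws-sdk-go-v2/service/ec2/types"
)

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	client := ec2.NewFromConfig(cfg)

	// gp3 volumes come with 125 MB/s of throughput regardless of size;
	// Throughput provisions extra bandwidth independently of capacity.
	out, err := client.CreateVolume(ctx, &ec2.CreateVolumeInput{
		AvailabilityZone: aws.String("us-east-2a"), // placeholder
		VolumeType:       types.VolumeTypeGp3,
		Size:             aws.Int32(500),  // GiB, placeholder
		Iops:             aws.Int32(3000), // gp3 baseline
		Throughput:       aws.Int32(250),  // MB/s, double the 125 MB/s default
	})
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("created volume %s with 250 MB/s provisioned throughput",
		aws.ToString(out.VolumeId))
}
```

In the roachtest framework the same number would be threaded through the restore tests' shared cluster-spec code path, so every restore test on AWS picks it up.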

Holistically addressing the OOMs is tracked in CRDB-25503.

Jira issue: CRDB-30126

@tbg tbg added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-kv-replication labels Jul 26, 2023
blathers-crl bot commented Jul 26, 2023

cc @cockroachdb/replication

blathers-crl bot commented Jul 26, 2023

Hi @tbg, please add branch-* labels to identify which branch(es) this release-blocker affects.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@tbg tbg added the branch-master Failures and bugs on the master branch. label Jul 26, 2023
pav-kv (Collaborator) commented Aug 8, 2023

@cockroachdb/disaster-recovery @msbutler What limits the rate of writes in restore tests? Can we reduce it a bit so that we don't eat up all the disk throughput as on this graph? Or are we interested in running these tests at the edge of capacity for max performance tracking purposes?

(as an alternative to increasing the throughput beyond 125 MB/s, which is not free)

msbutler (Collaborator) commented Aug 8, 2023

@pavelkalinnikov At the SQL level, nothing limits the write rate of restore. At the kvserver level, there are a few things that I'm less familiar with. I know Irfan recently set the concurrentAddSStable limiter to default off on master (not on 23.1) in #104861.

These tests were written in part to answer the following question: "given a hardware configuration and workload, what's our restore throughput?" Since this test seems to be bottlenecked on hardware, it seems reasonable to use beefier hardware. I'm going to chat with my team about it tomorrow.

On the flip side, it would be a bad look to tell customers "restore needs to be run on beefier machines or else a node could OOM".

pav-kv (Collaborator) commented Aug 9, 2023

Comparing performance of the restore/tpce/8TB/aws/nodes=10/cpus=8 test with 125 MB/s and 250 MB/s disks:

|  | 125 MB/s | 250 MB/s |
| --- | --- | --- |
| CPU | ~30-50% | ~60% |
| Mem | ~6 GB / 16 GB | ~6 GB / 16 GB |
| Read | <5 MB/s | 2-10 MB/s |
| Write | <125 MB/s | 70-170 MB/s |
| Time | 3h40m / 5h | 3h10m / 5h |
With provisioned 125 MB/s: [4 screenshots of cluster metrics]
With provisioned 250 MB/s: [4 screenshots of cluster metrics]

pav-kv (Collaborator) commented Aug 9, 2023

@msbutler The prototype for provisioning extra throughput is in #108427. I've tested it and compared 125 MB/s and 250 MB/s; see the message above. At 125 MB/s we're maxing out the throughput, at 250 MB/s we have some leeway, so I think we should be good with 250.

Agreed with your point that we should fix the OOMs rather than require beefier machines. That work is likely next on our list, but we want to avoid unnecessary test flakes in the meantime.

msbutler (Collaborator) commented Aug 9, 2023

@pavelkalinnikov thanks for experimenting with this here! Here's what the DR team thinks:

  1. Since this 8TB test is clearly disk-bandwidth constrained, and since these tests attempt to find software bottlenecks rather than hardware bottlenecks, we think it makes sense to bump disk bandwidth on this test for now. I believe bumping disk bandwidth will only cost us an extra $10 a month, assuming the same test runtime (pessimistic), but do check my math (see the back-of-the-envelope sketch after this list). I don't think it makes sense to bump the machine size as well.
  2. Once your prototype lands, I can open a new restore/400GB test that also bumps the disk bandwidth to 250 MB/s, and keep the existing tests as-is. This OOM only surfaces about once a month, so I don't think there's much investigative cost (open a heap profile, see that raft OOMed).
  3. I will design a restore roachtest (which can stay skipped) that attempts to saturate disk bandwidth and reliably produces OOMs. My plan is to run a 400GB restore on a single-node cluster. This test could help us tune various knobs to avoid OOMs while we wait for raft memory monitoring to land. I misspoke about SQL-level knobs earlier: restore can limit the number of workers per node that send AddSSTable requests.
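A back-of-the-envelope check of the "$10 a month" figure, assuming the published us-east-1 gp3 rate of roughly $0.04 per provisioned MB/s-month above the 125 MB/s baseline, ten volumes for the 10-node test, and about five hours of cluster time per nightly run (all of these are assumptions, not quoted prices):

```latex
% All inputs are assumptions; gp3 pricing varies by region and over time.
\begin{align*}
\text{extra throughput} &= (250 - 125)\ \text{MB/s} \times 10\ \text{volumes} = 1250\ \text{MB/s} \\
\text{cost if left running 24/7} &\approx 1250 \times \$0.04 / (\text{MB/s} \cdot \text{month}) = \$50 / \text{month} \\
\text{prorated to } \sim 5\,\text{h nightly runs} &\approx \$50 \times \tfrac{5 \times 30}{720} \approx \$10 / \text{month}
\end{align*}
```

Under those assumptions, the roughly $10/month estimate checks out.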

Some follow up questions:

  • When do you expect raft memory monitoring to land?

pav-kv (Collaborator) commented Aug 9, 2023

Thanks @msbutler. The plan makes sense.

> I will design a restore roachtest (which can stay skipped), which attempts to saturate disk bandwidth and reliably produces OOMs. My plan is to run a 400GB restore on a single node cluster.

The OOMs in this test manifest in the interaction between raft nodes, e.g. when many leaders (on the same node) queue up too many or too-large log entries destined for followers [#73376], or when a slow follower receives and buffers too many or too-large incoming updates. I think it's best to repro with 3 nodes/replicas; see the ideas that @tbg noted here.
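To make the sender-side part of this concrete, below is an illustrative sketch of the etcd/raft config knobs that bound how much log data a single leader keeps in flight per follower. The values are made-up examples, not CockroachDB's settings, and none of these bound the follower-side buffering mentioned above:

```go
package main

import (
	"fmt"

	"go.etcd.io/raft/v3"
)

func main() {
	cfg := &raft.Config{
		ID:            1,
		ElectionTick:  10,
		HeartbeatTick: 1,
		Storage:       raft.NewMemoryStorage(),

		// Cap the size of a single MsgApp and the number of such messages a
		// leader may have outstanding per follower. Together they bound the
		// in-flight memory for one leader->follower stream; with many leaders
		// colocated on one node, the sum across ranges can still be large,
		// which is the imbalance scenario described above.
		MaxSizePerMsg:   1 << 20, // 1 MiB per message (example value)
		MaxInflightMsgs: 128,     // example value

		// Additional bounds on uncommitted proposals and on entries handed to
		// the application per Ready, also relevant to memory growth.
		MaxUncommittedEntriesSize: 16 << 20,
		MaxCommittedSizePerReady:  64 << 20,
	}
	fmt.Printf("per-follower in-flight bound ≈ %d bytes\n",
		cfg.MaxInflightMsgs*int(cfg.MaxSizePerMsg))
}
```

The trouble is that these per-follower bounds multiply across the many ranges led by the same node, which is why holistic memory accounting above raft (CRDB-25503) is the longer-term fix.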

> When do you expect raft memory monitoring to land?

I think we will likely be considering it for 24.1.

pav-kv (Collaborator) commented Aug 22, 2023

Reopening this issue to bump throughput in the other tests too.

Currently, other restore tests (even small 400GB ones) max out the 125 MB/s throughput, e.g. see restore/tpce/400GB/aws/nodes=4/cpus=8 at #106248 (comment) and restore/tpce/8TB/aws/nodes=10/cpus=8 at #107609 (comment).

On GCE, the same-scale test restore/tpce/400GB/gce/nodes=4/cpus=8 gets disks with higher throughput, which is not usually maxed out:

[screenshot: disk throughput graph]

@msbutler I think we should bump throughput to 250 MB/s on all AWS restore tests, as this issue originally suggested. This would both reduce the likelihood of OOMs and bring some parity between the tests.

@pav-kv pav-kv reopened this Aug 22, 2023
craig bot pushed a commit that referenced this issue Aug 22, 2023
109221: roachtest: provision 250 MB/s for restore tests on AWS r=pavelkalinnikov a=pavelkalinnikov

The `restore/tpce/*` family of tests on AWS maxes out the default 125 MB/s EBS throughput. In contrast, similar tests on GCE provision for more throughput and [don't max it out](#107609 (comment)).

This commit bumps the provisioned throughput from 125 MB/s to 250 MB/s in all `restore` tests on AWS, so that the tests don't run at the edge of overload.

This both brings some parity between testing on GCE and AWS, and reduces the likelihood of raft OOMs (which manifest more often when the disk is overloaded).

Fixes #107609
Touches #106248
Epic: none
Release note: none

Co-authored-by: Pavel Kalinnikov <[email protected]>
@craig craig bot closed this as completed in 0c4a65b Aug 22, 2023