roachtest: use c5, not c5d, for restore 8tb test #98767

Closed
wants to merge 1 commit

Conversation

@tbg (Member) commented Mar 16, 2023

This works around #98783:

Instance type
c5.2xlarge
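
(Aside: the practical difference is that c5d instances ship with local NVMe instance storage while plain c5 instances are EBS-only. This can be double-checked with the AWS CLI, assuming it is installed and configured; a minimal sketch:)

```
# Does the instance family come with local instance storage?
# (Output formatting may vary by CLI version.)
aws ec2 describe-instance-types \
  --instance-types c5.2xlarge c5d.2xlarge \
  --query 'InstanceTypes[].[InstanceType,InstanceStorageSupported]' \
  --output table
```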

Now the roachtest runs on standard EBS volumes (provisioned to 125MB/s,
i.e. pretty weak ones):

$ df -h /mnt/data1/
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme1n1    2.0T  4.0G  2.0T   1% /mnt/data1
$ sudo nvme list | grep nvme1n1
/dev/nvme1n1     vol065ed9110066bb362 Amazon Elastic Block Store               1           2.15  TB /   2.15  TB    512   B +  0 B   1.0
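
(For reference, the provisioned throughput of the gp3 volume can be double-checked from the AWS CLI; a minimal sketch, with a placeholder volume ID:)

```
# gp3 volumes default to 125 MiB/s throughput and 3000 IOPS unless
# provisioned higher. The volume ID below is a placeholder.
aws ec2 describe-volumes \
  --volume-ids vol-0123456789abcdef0 \
  --query 'Volumes[0].[VolumeType,Size,Iops,Throughput]' \
  --output table
```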

Let's see how this fares.

The theory is that the test previously failed due to RAID0
because some nodes would unpredictably be slower than others (depending
on the striping, etc., across the RAIDed inhomogeneous volumes), which we
don't handle well. Now there's symmetry, and hopefully things will be
slower (since we only have 125MB/s per volume now) but functional, i.e.
no more OOMs.
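
(To sanity-check that a node really is on a single EBS volume rather than a RAID0 over mixed devices, something like the following works; the md device name is only an example:)

```
# Block device layout under the data directory.
lsblk -o NAME,SIZE,TYPE,MOUNTPOINT
# Any software RAID would show up here...
cat /proc/mdstat
# ...and its member devices can be listed explicitly (device name is an example).
sudo mdadm --detail /dev/md0
```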

I verified this via

./pkg/cmd/roachtest/roachstress.sh -c 10 restore/tpce/8TB/aws/nodes=10/cpus=8 -- --cloud aws --parallelism 1

Closes #97019.

Epic: CRDB-25503
Release note: None

@cockroach-teamcity (Member)

This change is Reviewable

tbg added a commit to tbg/cockroach that referenced this pull request on Mar 16, 2023:
This is a WIP because the behavior when machine types with local SSD are
used is unclear. For example, on AWS, roachtest prefers the c5d family,
all of which come with local SSD storage. But looking into
`awsStartupScriptTemplate`, it seems unclear how to make sure that the
EBS disk(s) get mounted as /mnt/data1 (which is probably what the
default should be).

We could also entertain straight-up preventing combinations that would
lead to an inhomogeneous RAID0. I imagine we'd have to take a round of
failures to find all of the places in which it happens, but perhaps
a "snitch" can be inserted instead so that we can detect all such
callers and fix them up before arming the check.

By the way, EBS disks on AWS come with a default of 125MB/s, which is
less than this RAID0 gets "most of the time" - so we can expect some
tests to behave differently after this change. I still believe this
is worth it - debugging is so much harder when you're on top of
storage that's hard to predict and doesn't resemble any production
deployment.

----

I wasted weeks of my life on this before, and it almost happened again!
When you run a roachtest that asks for an AWS cXd machine (i.e. compute
optimized with NVMe local disk), and you specify a VolumeSize, you also
get an EBS volume. Prior to this commit, these would be RAID0'ed
together.

This isn't something sane - the resulting gp3 EBS volume is very
different from the local NVMe volume in every way, and it led to
hard-to-understand write throughput behavior.

This commit defaults to *not* using RAID0.

Touches cockroachdb#98767.
Touches cockroachdb#98576.
Touches cockroachdb#97019.

Epic: none
Release note: None
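
For context, the inhomogeneous RAID0 described above is roughly the result of striping the local NVMe instance disk together with the attached EBS volume. The sketch below is illustrative only, with made-up device names; it is not a copy of what `awsStartupScriptTemplate` does.

```
# Illustrative only: stripe a local NVMe instance disk and an EBS volume
# into a single RAID0 array. Device names differ per instance.
sudo mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/nvme1n1 /dev/nvme2n1
sudo mkfs.ext4 -F /dev/md0
sudo mount /dev/md0 /mnt/data1
```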
@msbutler (Collaborator)

Huh, given this finding, do you think we should recommend that customers running on AWS with EBS and more than a couple of TBs of data use more than 8 vCPUs per node?

@tbg (Member, Author) commented Mar 20, 2023

I'm not sure what to recommend yet, since the test was also running on a zombie RAID0 that striped over EBS and local NVMe; see #98782.
I think that RAID0 introduced asymmetry, i.e. some nodes falling behind in throughput, and this is ultimately what's causing the problems.
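
(One way to see this kind of asymmetry while a test runs is to compare per-device write throughput across nodes; a minimal sketch, assuming the data disk shows up as nvme1n1 on each node and the sysstat package is installed:)

```
# Extended I/O stats in MB/s every 5 seconds; compare wMB/s across nodes.
iostat -xm nvme1n1 5
```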

@tbg force-pushed the restore-8tb-more-ebs-bandwidth branch 5 times, most recently from 13ca12f to e8d7477 on March 20, 2023 20:36
@tbg force-pushed the restore-8tb-more-ebs-bandwidth branch from e8d7477 to abd6420 on March 20, 2023 20:43
@tbg changed the title from "roachtest: use m2d.4xlarge for 8tb restore test" to "roachtest: use m5, not m5d, for restore 8tb test" on Mar 21, 2023
@tbg changed the title from "roachtest: use m5, not m5d, for restore 8tb test" to "roachtest: use c5, not c5d, for restore 8tb test" on Mar 21, 2023
@tbg (Member, Author) commented Mar 22, 2023

Updated this PR, @msbutler; it's different from what it was before (the machine type bump wasn't actually changing anything, since the problem was RAID0 and not VM-to-EBS bandwidth).

I'm happy to merge this if the DR team would like me to. It could take a little while until @srosenberg has the proper fix in (i.e. until #98783 is closed), but then again the test doesn't fail too often, so you might just want to sit it out.
