roachtest: use c5, not c5d, for restore 8tb test #98767
Conversation
This is a WIP because the behavior when machine types with local SSD are used is unclear. For example, on AWS, roachtest prefers the c5d family, which all come with local SSD storage. But looking into `awsStartupScriptTemplate`, it is unclear how to make sure that the EBS disk(s) get mounted as /mnt/data1 (which is probably what the default should be).

We could also entertain straight-up preventing combinations that would lead to an inhomogeneous RAID0. I imagine we'd have to take a round of failures to find all of the places in which it happens, but perhaps a "snitch" can be inserted instead so that we can detect all such callers and fix them up before arming the check.

By the way, EBS disks on AWS come with a default of 125 MB/s, which is less than this RAID0 gets "most of the time", so we can expect some tests to behave differently after this change. I still believe this is worth it: debugging is so much harder when you're on top of storage that's hard to predict and doesn't resemble any production deployment.

----

I wasted weeks of my life on this before, and it almost happened again!

When you run a roachtest that asks for an AWS cXd machine (i.e. compute optimized with NVMe local disk) and you also specify a VolumeSize, you additionally get an EBS volume. Prior to this commit, the two would be RAID0'ed together. This isn't sane: the resulting gp3 EBS volume is very different from the local NVMe volume in every way, and it led to hard-to-understand write throughput behavior.

This commit defaults to *not* using RAID0.

Touches cockroachdb#98767.
Touches cockroachdb#98576.
Touches cockroachdb#97019.

Epic: none
Release note: None
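For anyone who wants to check whether an existing cluster is sitting on one of these inhomogeneous RAID0s, here is a minimal sketch using only generic Linux tooling (findmnt, /proc/mdstat, lsblk), not anything from roachprod itself; it assumes the data directory is mounted at /mnt/data1, and the device names in the comments are illustrative.

```
# Sketch: check whether /mnt/data1 is backed by a software RAID (md) device,
# and if so, list its members so a mix of local NVMe instance storage and
# EBS volumes is easy to spot. Device names below are examples.
dev=$(findmnt -n -o SOURCE /mnt/data1)   # e.g. /dev/md0 or /dev/nvme1n1
echo "backing device: ${dev}"

if [[ "${dev}" == /dev/md* ]]; then
  cat /proc/mdstat                       # shows the md array and its member devices
  # EBS volumes report the model "Amazon Elastic Block Store"; local
  # instance-store NVMe reports an "Instance Storage" model string.
  lsblk -o NAME,MODEL,SIZE,TYPE
else
  echo "/mnt/data1 is not on an md RAID"
fi
```

On a machine affected by the mix described above, the lsblk output would show both an Elastic Block Store device and an instance-store device underneath the same md array.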
Huh, given this finding, do you think we should recommend that customers running on AWS with EBS and more than a couple of TBs of data use more than 8 vCPUs per node?
I'm not sure what to recommend yet, since the test was also running on a zombie RAID0 that striped over EBS and local NVMe; see #98782.
Force-pushed from 13ca12f to e8d7477.
This works around cockroachdb#98783:

```
Instance type c5.2xlarge
```

Now the roachtest runs on standard EBS volumes (provisioned to 125 MB/s, i.e. pretty weak ones):

```
$ df -h /mnt/data1/
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme1n1    2.0T  4.0G  2.0T   1% /mnt/data1

$ sudo nvme list | grep nvme1n1
/dev/nvme1n1   vol065ed9110066bb362   Amazon Elastic Block Store   1   2.15 TB / 2.15 TB   512 B + 0 B   1.0
```

Let's see how this fares. The theory is that the test previously failed due to RAID0 because some nodes would unpredictably be slower than others (depending on the striping, etc., across the raided inhomogeneous volumes), which we don't handle well. Now there's symmetry, and hopefully things will be slower (since we only have 125 MB/s per volume now) but functional, i.e. no more OOMs.

I verified this via

```
./pkg/cmd/roachtest/roachstress.sh -c 10 restore/tpce/8TB/aws/nodes=10/cpus=8 -- --cloud aws --parallelism 1
```

Closes #97019.

Epic: CRDB-25503
Release note: None
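As a side note on that 125 MB/s figure: gp3 volumes default to 125 MiB/s of provisioned throughput, and the AWS CLI can confirm (or raise) what a given volume is provisioned for. A rough sketch, assuming the volume ID from the `nvme list` output above (the NVMe serial is the EBS volume ID with the hyphen dropped); the throughput number in the second command is an example, not a value from this test.

```
# Inspect provisioned IOPS/throughput for the gp3 volume backing /mnt/data1.
aws ec2 describe-volumes \
  --volume-ids vol-065ed9110066bb362 \
  --query 'Volumes[].{Type:VolumeType,Iops:Iops,ThroughputMiBps:Throughput}'

# For an experiment, throughput can be raised in place (gp3 defaults to 125 MiB/s).
aws ec2 modify-volume --volume-id vol-065ed9110066bb362 --throughput 250
```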
Force-pushed from e8d7477 to abd6420.
Updated this PR @msbutler; it's different from what it was before (the machine type bump wasn't actually changing anything, since the problem was RAID0 and not VM-to-EBS bandwidth). I'm happy to merge this if the DR team would like me to. It could take a little while until @srosenberg has the proper fix in (i.e. #98783 is closed), but then again the test doesn't fail too often, so you might just want to sit it out.