Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: add disk-stall variant of failover tests #94240

Merged
merged 2 commits into from
Dec 28, 2022

Conversation

erikgrinaker
Copy link
Contributor

@erikgrinaker erikgrinaker commented Dec 23, 2022

roachtest: reorganize failer interface

This patch reorganizes the failer interface with Setup, Ready, and Cleanup hooks, providing it with access to the monitor. It also passes the test and cluster references via the constructor.

Epic: none
Release note: None

roachtest: add disk-stall variant of failover tests

This patch adds disk-stall variants of the failover tests, which benchmarks the pMax unavailability when a leaseholder experiences a persistent disk stall.

Resolves #94231.
Touches #81100.

Epic: none
Release note: None

@erikgrinaker erikgrinaker requested review from tbg and a team December 23, 2022 15:12
@erikgrinaker erikgrinaker requested a review from a team as a code owner December 23, 2022 15:12
@erikgrinaker erikgrinaker self-assigned this Dec 23, 2022
@erikgrinaker erikgrinaker requested review from herkolategan and srosenberg and removed request for a team December 23, 2022 15:12
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@erikgrinaker erikgrinaker changed the title Roachtest failover disk stall roachtest: add disk-stall variant of failover tests Dec 23, 2022
@erikgrinaker erikgrinaker requested a review from a team December 23, 2022 15:13
@erikgrinaker erikgrinaker force-pushed the roachtest-failover-disk-stall branch from e8f0963 to f726494 Compare December 23, 2022 15:19
@erikgrinaker
Copy link
Contributor Author

erikgrinaker commented Dec 23, 2022

@jbowens @nicktrav This one's interesting: even though the disk stalls and the node fails to heartbeat, Pebble's disk stall detector isn't firing. Any thoughts on why? Repro is trivial:

$ roachtest run failover/non-system/disk-stall --debug-always

n4 will be stalled first. If you look at its logs, you'll notice that the stall detector doesn't fire, even though heartbeats fail due to stalled disk writes. The test will eventually stop and restart the node, so be sure to look at the first log.

This simply uses dmsetup suspend to stall the disk. May be something about how it's stalling the disk, although I can confirm that reads and writes are in fact stalled, and I'd expect Pebble to handle all kinds of stalls.

@erikgrinaker erikgrinaker force-pushed the roachtest-failover-disk-stall branch from f726494 to ec9dd5b Compare December 23, 2022 15:28
startSettings install.ClusterSettings
}

func (f *diskStallFailer) Setup(ctx context.Context) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a comment here?

Does this work on AWS and GCE alike? (No need to check, just eyeball it - for now this runs under GCE anyway)

Copy link
Contributor Author

@erikgrinaker erikgrinaker Dec 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll have to check that the AWS images contain dmsetup. If they do, it should work, but I'll verify.

Copy link
Contributor Author

@erikgrinaker erikgrinaker Dec 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AWS uses a different device path, but it works once that was fixed.

func (f *diskStallFailer) Cleanup(ctx context.Context) {
f.c.Run(ctx, f.c.All(), `sudo dmsetup resume data1`)
// We have to stop the cluster to remount /mnt/data1.
f.m.ExpectDeaths(int32(f.c.Spec().NodeCount))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're probably getting away with it now, but a test that has a dedicated workload node will have trouble here because you'll be killing the workload too (?) but also expecting one more death than you're seeing (which I think can mask later crashes).

You could consider letting the failer operate on a subset of nodes, or just add a comment for now that whoever needs this in the future (if it ever happens) needs to make some changes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This happens during the cleanup stage after the test is done (either successfully or failed). At this point we don't care what happens to the cluster or workload.

This patch reorganizes the `failer` interface with `Setup`, `Ready`, and
`Cleanup` hooks, providing it with access to the monitor. It also passes
the test and cluster references via the constructor.

Epic: none
Release note: None
@erikgrinaker erikgrinaker force-pushed the roachtest-failover-disk-stall branch from ec9dd5b to 948da9e Compare December 28, 2022 13:18
This patch adds `disk-stall` variants of the `failover` tests, which
benchmarks the pMax unavailability when a leaseholder experiences a
persistent disk stall.

Epic: none
Release note: None
@erikgrinaker
Copy link
Contributor Author

Opened #94373 to track the disk stall detector not triggering here. Going to merge this, since I think the benchmark itself does what we want.

bors r+

@craig
Copy link
Contributor

craig bot commented Dec 28, 2022

Build succeeded:

@craig craig bot merged commit 6f675b0 into cockroachdb:master Dec 28, 2022
@erikgrinaker erikgrinaker deleted the roachtest-failover-disk-stall branch January 3, 2023 13:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

roachtest: add failover benchmark variant with disk stall
3 participants