roachtest: add `disk-stall` variant of `failover` tests #94240
@jbowens @nicktrav This one's interesting: even though the disk stalls and the node fails to heartbeat, Pebble's disk stall detector isn't firing. Any thoughts on why? Repro is trivial:
n4 will be stalled first. If you look at its logs, you'll notice that the stall detector doesn't fire, even though heartbeats fail due to stalled disk writes. The test will eventually stop and restart the node, so be sure to look at the first log. This simply uses
	startSettings install.ClusterSettings
}

func (f *diskStallFailer) Setup(ctx context.Context) {
Could you add a comment here?
Does this work on AWS and GCE alike? (No need to check, just eyeball it - for now this runs under GCE anyway)
I'll have to check that the AWS images contain `dmsetup`. If they do, it should work, but I'll verify.
AWS uses a different device path, but it works now that the path has been fixed.
func (f *diskStallFailer) Cleanup(ctx context.Context) {
	f.c.Run(ctx, f.c.All(), `sudo dmsetup resume data1`)
	// We have to stop the cluster to remount /mnt/data1.
	f.m.ExpectDeaths(int32(f.c.Spec().NodeCount))
You're probably getting away with it now, but a test that has a dedicated workload node will have trouble here because you'll be killing the workload too (?) but also expecting one more death than you're seeing (which I think can mask later crashes).
You could consider letting the failer operate on a subset of nodes, or just add a comment for now that whoever needs this in the future (if it ever happens) needs to make some changes.
This happens during the cleanup stage after the test is done (either successfully or failed). At this point we don't care what happens to the cluster or workload.
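For readers unfamiliar with the mechanism behind this thread, here is a minimal sketch of the device-mapper commands a disk-stall failer could issue. Only `sudo dmsetup resume data1` appears in the diff above. The suspend command, its `--noflush --nolockfs` flags, and the helper names `stallCmd` and `resumeCmd` are assumptions for illustration, not code from this PR.

```go
package main

import "fmt"

// stallCmd returns a command that suspends a device-mapper device so that
// I/O issued to it blocks until the device is resumed.
// ASSUMPTION: the PR's actual stall command is not shown in this thread.
// --noflush skips flushing outstanding I/O on suspend, and --nolockfs skips
// synchronizing the filesystem first.
func stallCmd(device string) string {
	return fmt.Sprintf("sudo dmsetup suspend --noflush --nolockfs %s", device)
}

// resumeCmd returns the command that lets blocked I/O proceed again.
// This mirrors the command used in Cleanup in the diff.
func resumeCmd(device string) string {
	return fmt.Sprintf("sudo dmsetup resume %s", device)
}

func main() {
	// "data1" is the device-mapper name that appears in the diff.
	fmt.Println(stallCmd("data1"))
	fmt.Println(resumeCmd("data1"))
}
```

Keeping the commands in small helpers like this would make it easy to swap the device name if a cloud uses a different device path, as noted for AWS above.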
This patch reorganizes the `failer` interface with `Setup`, `Ready`, and `Cleanup` hooks, providing it with access to the monitor. It also passes the test and cluster references via the constructor.

Epic: none
Release note: None
This patch adds `disk-stall` variants of the `failover` tests, which benchmark the pMax unavailability when a leaseholder experiences a persistent disk stall.

Epic: none
Release note: None
Opened #94373 to track the disk stall detector not triggering here. Going to merge this, since I think the benchmark itself does what we want.

bors r+
Build succeeded.
roachtest: reorganize failer interface

This patch reorganizes the `failer` interface with `Setup`, `Ready`, and `Cleanup` hooks, providing it with access to the monitor. It also passes the test and cluster references via the constructor.

Epic: none
Release note: None
roachtest: add `disk-stall` variant of `failover` tests

This patch adds `disk-stall` variants of the `failover` tests, which benchmark the pMax unavailability when a leaseholder experiences a persistent disk stall.

Resolves #94231.
Touches #81100.

Epic: none
Release note: None