-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cmd/roachtest: adapt disk-stall detection roachtest #95865
Conversation
Note that the |
d4a4610
to
fdca80d
Compare
Ack, I'll split it into a separate roachtest. |
fdca80d
to
0730e10
Compare
76931c3
to
ff2708c
Compare
ff2708c
to
92dced7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 2 files at r1, 1 of 1 files at r2, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @erikgrinaker, @renatolabs, @smg260, and @sumeerbhola)
The `fuse` variant of the `disk-stalled` roachtest was skipped in \cockroachdb#95865. Re-enable the skipped variant, updating it to make use of our forked version of `charybdefs`. This fork includes a patch that allows for specifying a delay time for syscalls, making it possible to simulate a complete disk stall. Previously, delay times were limited to 50ms, which meant that the detection time had to be even lower (e.g. 40ms), which was not representative of how Cockroach is configured in practice. Fixes cockroachdb#95874. Release note: None.
92dced7
to
8fc769e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r3, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @erikgrinaker, @renatolabs, @smg260, and @sumeerbhola)
a96b188
to
df272f3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some thoughts on test coverage, but I'm fine with merging this as-is and expanding coverage later.
t test.Test | ||
c cluster.Cluster | ||
readOrWrite []string | ||
logsToo bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should have tests where we stall the log device independently of the data device.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, but it's a little tricky given the volume-level stalling. The FUSE staller makes it a little easier and does implement stalling of the log device independently. @nicktrav will merge his change to fixup and unskip them after this goes in.
c.Run(ctx, c.All(), "mkdir -p logs-external") | ||
c.Run(ctx, c.All(), "ln -s /home/ubuntu/logs-external/cockroach.stdout.log logs/cockroach.stdout.log || true") | ||
c.Run(ctx, c.All(), "ln -s /home/ubuntu/logs-external/cockroach.stderr.log logs/cockroach.stderr.log || true") | ||
c.Run(ctx, c.All(), "ln -s /home/ubuntu/logs-external/roachprod.log logs/roachprod.log || true") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is symlinking specific files thorough enough? We may want to place all of the log files on a separate device that can be stalled entirely, to ensure that all log files and log rotation logic and such is also tested properly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is actually exempting these log files from the stalling. I'll remove this; it was a change made while we suspected that the reason the process wouldn't terminate was that cockroach.sh
redirected stdout and stderr to the stalled disk. This turned out to not be the reason, and we've realized there's no way to reliably terminate the process.
return &cgroupDiskStaller{t: t, c: c, readOrWrite: []string{"write", "read"}} | ||
}, | ||
"cgroup/read-write/logs-too=true": func(t test.Test, c cluster.Cluster) diskStaller { | ||
return &cgroupDiskStaller{t: t, c: c, readOrWrite: []string{"write", "read"}, logsToo: true} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it worth testing all combinations of stallers and log/data devices? Or do we feel like a single staller can reliably test all failure modes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or do we feel like a single staller can reliably test all failure modes?
Not really, at the moment. Maybe down the line we can consolidate, but we want to be able to exercise exactly the test that a customer performs, regardless of how they perform it.
9639199
to
219ed2d
Compare
Move the existing disk-stall/* roachtests under disk-stall/fuse/* (for the FUSE filesystem approach to stalling) and skip them for now. Currently, they're not capable of stalling the disk longer 50us (see cockroachdb#95886), which makes them unreliable at exercising stalls. Add two new roachtests, disk-stall/dmsetup and disk-stall/cgroup that use dmsetup and cgroup bandwidth restrctions respectively to reliably induce a write stall for an indefinite duration. Informs cockroachdb#94373. Epic: None Release note: None
219ed2d
to
2e8f273
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is ready now, if you want to take a look at the latest changes.
There's still one bit that I'd like to look at, but it can happen separately: When n1
fails, the workload
doesn't shift the connections it had open to n1
over to n2
and n3
. This results in reduced qps for the rest of the test, because it's running with reduced concurrency. I'm pretty sure it's not because the connections are still open to n1
, given I've verified all sockets are closed on n1
.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @erikgrinaker, @nicktrav, @renatolabs, @smg260, and @sumeerbhola)
TFTRs! bors r=nicktrav,erikgrinaker |
95622: backupccl,storage: add logs around manifest handling and ExportRequest pagination r=stevendanna a=adityamaru backupccl: add logging to backup manifest handling Release note: None storage: log the ExportRequest pagination reason Release note: None Epic: None 95865: cmd/roachtest: adapt disk-stall detection roachtest r=nicktrav,erikgrinaker a=jbowens Move the existing disk-stall/* roachtests under disk-stall/fuse/* (for the FUSE filesystem approach to stalling) and skip them for now. Currently, they're not capable of stalling the disk longer 50us (see #95886), which makes them unreliable at exercising stalls. Add two new roachtests, disk-stall/dmsetup and disk-stall/cgroup that use dmsetup and cgroup bandwidth restrctions respectively to reliably induce a write stall for an indefinite duration. Informs #94373. Epic: None Release note: None 95999: multitenant: add multitenant/shared-process/basic roachtest r=stevendanna a=msbutler This patch introduces a simple roachtest that runs in a shared-process tenant. This test imports a 500 tpcc workload (about 30 GB of replicated data), and runs the workload for 10 minutes. The test is run on a 4 node, 4vcpu cluster with local ssds. A future patch could complicate the test by running schema changes or other bulk operations. Fixes #95990 Release note: None 96115: schemachanger: Implement `DROP CONSTRAINT` in declarative schema changer r=Xiang-Gu a=Xiang-Gu This PR implements `ALTER TABLE t DROP CONSTRAINT cons_name` in declarative schema changer. Supported constraints include Checks, FK, and UniqueWithoutIndex. Dropping PK or Unique constraints will fall back to legacy schema changer, which in turn spits out an "not supported yet" error. Epic: None 96202: opt: inverted-index accelerate filters of the form j->0 @> '{"b": "c"} r=Shivs11 a=Shivs11 Previously, the optimizer did not plan inverted index scans for filters having an integer as the index for the fetch value in a filter alongside the "contains" or the "contained by" operator. To address this, we now build JSON arrays from fetch value expressions with integer indexes. From these JSON arrays, inverted spans are built for constraining scans over inverted indexes. With these changes chains of both integer and string fetch value operators are now supported alongside the "contains" and the "contained by" operators. (e.g., j->0 `@>` '{"b": "c"}' and j->0 <@ '{"b": "c"}'). Epic: [CRDB-3301](https://cockroachlabs.atlassian.net/browse/CRDB-3301) Fixes: #94667 Release note (performance improvement): The optimizer now plans inverted index scans for queries that filter by JSON fetch value operators (->) with integer indices alongside the "contains" or the "contained by" operators, e.g, json_col->0 `@>` '{"b": "c"}' or json_col->0 <@ '{"b": "c"}' 96235: sem/tree: add support for producing vectorized data from strings r=cucaroach a=cucaroach tree.ValueHandler exposes raw machine type hooks that are used by vec_handler to build coldata.Vec's. Epic: CRDB-18892 Informs: #91831 Release note: None 96328: udf: allow strict UDF with no arguments r=DrewKimball a=DrewKimball This patch fixes the case when a strict UDF (returns null on null input) has no arguments. Previously, attempting to call such a function would result in `ERROR: reflect: call of reflect.Value.Pointer on zero Value`. Fixes #96326 Release note: None 96366: release: skip nil GitHub events r=celiala a=rail Previously, we referenced `*event.Event`, but in some cases the event objects are `nil`. This PR skips the nil GitHub event objects. Epic: none Release note: None Co-authored-by: adityamaru <[email protected]> Co-authored-by: Jackson Owens <[email protected]> Co-authored-by: Michael Butler <[email protected]> Co-authored-by: Xiang Gu <[email protected]> Co-authored-by: Shivam Saraf <[email protected]> Co-authored-by: Tommy Reilly <[email protected]> Co-authored-by: Drew Kimball <[email protected]> Co-authored-by: Rail Aliiev <[email protected]>
Build failed (retrying...): |
Build succeeded: |
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool. error creating merge commit from 2e8f273 to blathers/backport-release-22.2-95865: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict [] you may need to manually resolve merge conflicts with the backport tool. Backport to branch 22.2.x failed. See errors above. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
The `fuse` variant of the `disk-stalled` roachtest was skipped in \cockroachdb#95865. Re-enable the skipped variant, updating it to make use of our forked version of `charybdefs`. This fork includes a patch that allows for specifying a delay time for syscalls, making it possible to simulate a complete disk stall. Previously, delay times were limited to 50ms, which meant that the detection time had to be even lower (e.g. 40ms), which was not representative of how Cockroach is configured in practice. Fixes cockroachdb#95874. Release note: None.
The `fuse` variant of the `disk-stalled` roachtest was skipped in \cockroachdb#95865. Re-enable the skipped variant, updating it to make use of our forked version of `charybdefs`. This fork includes a patch that allows for specifying a delay time for syscalls, making it possible to simulate a complete disk stall. Previously, delay times were limited to 50ms, which meant that the detection time had to be even lower (e.g. 40ms), which was not representative of how Cockroach is configured in practice. Allow the roachprod infrastructure to interpolate strings such as `{store-dir}`, etc. when provided as `ExtraArgs` or the `KeyCmd`. Fix by expanding expanding all args, rather than just `ExtraArgs`. Fixes cockroachdb#95874. Release note: None.
96094: roachtest: unskip fuse disk-stall roachtest variant r=jbowens a=nicktrav **roachtest: unskip fuse disk-stall roachtest variant** The `fuse` variant of the `disk-stalled` roachtest was skipped in #95865. Reenable the skipped variant, updating it to make use of our forked version of `charybdefs`. This fork includes a patch that allows for specifying a delay time for syscalls, making it possible to simulate a complete disk stall. Previously, delay times were limited to 50ms, which meant that the detection time had to be even lower (e.g. 40ms), which was not representative of how Cockroach is configured in practice. Fixes #95874. Fixes #95815. Fixes #96410. Fixes #95886. Release note: None. 96130: pkg/compose: bootstrap CCL for TestComposeCompare r=ZhouXing19,rail a=srosenberg The nightly TestComposeCompare integration test uses a randomized SQL workload (a dialect denoted by sqlsmith.PostgresMode) to compare the results against Postgres. Additionally, it runs a second instance of CockroachDB using a series of mutators, e.g., randgen.StatisticsMutator, intended to alter table statistics, indexes, etc., but not the actual table data. Thus, all three data sources are expected to return the same result, for a given SQL query, modulo ignored SQL errors. Some of the generated SQL may require CCL as can be seen from a number of previous test failures. This change bootstraps an environment variable (COCKROACH_DEV_LICENSE), similarly to roachprod and roachtest. The environment variable is passed via CI; it's also in our dev. environment, including gceworker. The test aborts in case COCKROACH_DEV_LICENSE is empty. Epic: none Informs: #82867 Release note: None 96378: builtins: check_consistency now requires admin role r=rafiss a=e-mbrown Resolve #88224 `crdb_internal.check_consistency` is an expensive operation that previously could be invoked by regular users. Now admin role is required. Release note: None Co-authored-by: Nick Travers <[email protected]> Co-authored-by: Stan Rosenberg <[email protected]> Co-authored-by: e-mbrown <[email protected]>
The `fuse` variant of the `disk-stalled` roachtest was skipped in \#95865. Re-enable the skipped variant, updating it to make use of our forked version of `charybdefs`. This fork includes a patch that allows for specifying a delay time for syscalls, making it possible to simulate a complete disk stall. Previously, delay times were limited to 50ms, which meant that the detection time had to be even lower (e.g. 40ms), which was not representative of how Cockroach is configured in practice. Allow the roachprod infrastructure to interpolate strings such as `{store-dir}`, etc. when provided as `ExtraArgs` or the `KeyCmd`. Fix by expanding expanding all args, rather than just `ExtraArgs`. Fixes #95874. Release note: None.
The `fuse` variant of the `disk-stalled` roachtest was skipped in \cockroachdb#95865. Re-enable the skipped variant, updating it to make use of our forked version of `charybdefs`. This fork includes a patch that allows for specifying a delay time for syscalls, making it possible to simulate a complete disk stall. Previously, delay times were limited to 50ms, which meant that the detection time had to be even lower (e.g. 40ms), which was not representative of how Cockroach is configured in practice. Allow the roachprod infrastructure to interpolate strings such as `{store-dir}`, etc. when provided as `ExtraArgs` or the `KeyCmd`. Fix by expanding expanding all args, rather than just `ExtraArgs`. Fixes cockroachdb#95874. Release note: None.
Move the existing disk-stall/* roachtests under disk-stall/fuse/* (for the FUSE
filesystem approach to stalling) and skip them for now. Currently, they're not
capable of stalling the disk longer 50us (see #95886), which makes them
unreliable at exercising stalls.
Add two new roachtests, disk-stall/dmsetup and disk-stall/cgroup that use
dmsetup and cgroup bandwidth restrctions respectively to reliably induce a
write stall for an indefinite duration.
Informs #94373.
Epic: None
Release note: None