Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/roachtest: adapt disk-stall detection roachtest #95865

Merged
merged 1 commit into from
Feb 2, 2023

Conversation

jbowens
Copy link
Collaborator

@jbowens jbowens commented Jan 25, 2023

Move the existing disk-stall/* roachtests under disk-stall/fuse/* (for the FUSE
filesystem approach to stalling) and skip them for now. Currently, they're not
capable of stalling the disk longer 50us (see #95886), which makes them
unreliable at exercising stalls.

Add two new roachtests, disk-stall/dmsetup and disk-stall/cgroup that use
dmsetup and cgroup bandwidth restrctions respectively to reliably induce a
write stall for an indefinite duration.

Informs #94373.
Epic: None
Release note: None

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@erikgrinaker
Copy link
Contributor

erikgrinaker commented Jan 25, 2023

Note that the failover tests are meant to be benchmarks that measure lease recovery times under various failure modes, not pass/fail tests, as they're graphed nightly and we want to track lease recovery performance over time. Any failing tests would probably prevent the data point from being recorded (?). If we want a pass/fail test, we should probably consider adding a separate roachtest for it?

@jbowens jbowens force-pushed the diskstall-roachtest branch from d4a4610 to fdca80d Compare January 25, 2023 21:29
@jbowens
Copy link
Collaborator Author

jbowens commented Jan 25, 2023

Ack, I'll split it into a separate roachtest.

@jbowens jbowens force-pushed the diskstall-roachtest branch from fdca80d to 0730e10 Compare January 26, 2023 02:15
@jbowens jbowens changed the title roachtest: require disk stall to terminate node cmd/roachtest: adapt disk-stall detection roachtest Jan 26, 2023
@jbowens jbowens force-pushed the diskstall-roachtest branch 5 times, most recently from 76931c3 to ff2708c Compare January 26, 2023 20:59
@jbowens jbowens marked this pull request as ready for review January 26, 2023 21:00
@jbowens jbowens requested a review from a team as a code owner January 26, 2023 21:00
@jbowens jbowens requested review from smg260, renatolabs, a team, sumeerbhola and erikgrinaker and removed request for a team January 26, 2023 21:00
@jbowens jbowens force-pushed the diskstall-roachtest branch from ff2708c to 92dced7 Compare January 26, 2023 21:22
Copy link
Collaborator

@nicktrav nicktrav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice! :lgtm:

Reviewed 1 of 2 files at r1, 1 of 1 files at r2, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @erikgrinaker, @renatolabs, @smg260, and @sumeerbhola)

nicktrav added a commit to nicktrav/cockroach that referenced this pull request Jan 27, 2023
The `fuse` variant of the `disk-stalled` roachtest was skipped in
\cockroachdb#95865.

Re-enable the skipped variant, updating it to make use of our forked
version of `charybdefs`. This fork includes a patch that allows for
specifying a delay time for syscalls, making it possible to simulate a
complete disk stall. Previously, delay times were limited to 50ms, which
meant that the detection time had to be even lower (e.g. 40ms), which
was not representative of how Cockroach is configured in practice.

Fixes cockroachdb#95874.

Release note: None.
@jbowens jbowens force-pushed the diskstall-roachtest branch from 92dced7 to 8fc769e Compare January 27, 2023 15:50
Copy link
Collaborator

@nicktrav nicktrav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r3, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @erikgrinaker, @renatolabs, @smg260, and @sumeerbhola)

@jbowens jbowens force-pushed the diskstall-roachtest branch 2 times, most recently from a96b188 to df272f3 Compare January 27, 2023 17:32
Copy link
Contributor

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some thoughts on test coverage, but I'm fine with merging this as-is and expanding coverage later.

t test.Test
c cluster.Cluster
readOrWrite []string
logsToo bool
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should have tests where we stall the log device independently of the data device.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, but it's a little tricky given the volume-level stalling. The FUSE staller makes it a little easier and does implement stalling of the log device independently. @nicktrav will merge his change to fixup and unskip them after this goes in.

c.Run(ctx, c.All(), "mkdir -p logs-external")
c.Run(ctx, c.All(), "ln -s /home/ubuntu/logs-external/cockroach.stdout.log logs/cockroach.stdout.log || true")
c.Run(ctx, c.All(), "ln -s /home/ubuntu/logs-external/cockroach.stderr.log logs/cockroach.stderr.log || true")
c.Run(ctx, c.All(), "ln -s /home/ubuntu/logs-external/roachprod.log logs/roachprod.log || true")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is symlinking specific files thorough enough? We may want to place all of the log files on a separate device that can be stalled entirely, to ensure that all log files and log rotation logic and such is also tested properly.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually exempting these log files from the stalling. I'll remove this; it was a change made while we suspected that the reason the process wouldn't terminate was that cockroach.sh redirected stdout and stderr to the stalled disk. This turned out to not be the reason, and we've realized there's no way to reliably terminate the process.

return &cgroupDiskStaller{t: t, c: c, readOrWrite: []string{"write", "read"}}
},
"cgroup/read-write/logs-too=true": func(t test.Test, c cluster.Cluster) diskStaller {
return &cgroupDiskStaller{t: t, c: c, readOrWrite: []string{"write", "read"}, logsToo: true}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth testing all combinations of stallers and log/data devices? Or do we feel like a single staller can reliably test all failure modes?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or do we feel like a single staller can reliably test all failure modes?

Not really, at the moment. Maybe down the line we can consolidate, but we want to be able to exercise exactly the test that a customer performs, regardless of how they perform it.

@jbowens jbowens force-pushed the diskstall-roachtest branch 8 times, most recently from 9639199 to 219ed2d Compare January 31, 2023 17:57
Move the existing disk-stall/* roachtests under disk-stall/fuse/* (for the FUSE
filesystem approach to stalling) and skip them for now. Currently, they're not
capable of stalling the disk longer 50us (see cockroachdb#95886), which makes them
unreliable at exercising stalls.

Add two new roachtests, disk-stall/dmsetup and disk-stall/cgroup that use
dmsetup and cgroup bandwidth restrctions respectively to reliably induce a
write stall for an indefinite duration.

Informs cockroachdb#94373.
Epic: None
Release note: None
@jbowens jbowens force-pushed the diskstall-roachtest branch from 219ed2d to 2e8f273 Compare January 31, 2023 20:24
Copy link
Collaborator Author

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is ready now, if you want to take a look at the latest changes.

There's still one bit that I'd like to look at, but it can happen separately: When n1 fails, the workload doesn't shift the connections it had open to n1 over to n2 and n3. This results in reduced qps for the rest of the test, because it's running with reduced concurrency. I'm pretty sure it's not because the connections are still open to n1, given I've verified all sockets are closed on n1.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @erikgrinaker, @nicktrav, @renatolabs, @smg260, and @sumeerbhola)

@jbowens
Copy link
Collaborator Author

jbowens commented Feb 1, 2023

TFTRs!

bors r=nicktrav,erikgrinaker

craig bot pushed a commit that referenced this pull request Feb 1, 2023
95622: backupccl,storage: add logs around manifest handling and ExportRequest pagination r=stevendanna a=adityamaru

backupccl: add logging to backup manifest handling

Release note: None

storage: log the ExportRequest pagination reason

Release note: None

Epic: None

95865: cmd/roachtest: adapt disk-stall detection roachtest r=nicktrav,erikgrinaker a=jbowens

Move the existing disk-stall/* roachtests under disk-stall/fuse/* (for the FUSE
filesystem approach to stalling) and skip them for now. Currently, they're not
capable of stalling the disk longer 50us (see #95886), which makes them
unreliable at exercising stalls.

Add two new roachtests, disk-stall/dmsetup and disk-stall/cgroup that use
dmsetup and cgroup bandwidth restrctions respectively to reliably induce a
write stall for an indefinite duration.

Informs #94373.
Epic: None
Release note: None

95999: multitenant: add multitenant/shared-process/basic roachtest r=stevendanna a=msbutler

This patch introduces a simple roachtest that runs in a shared-process tenant.
This test imports a 500 tpcc workload (about 30 GB of replicated data), and
runs the workload for 10 minutes. The test is run on a 4 node, 4vcpu cluster
with local ssds.

A future patch could complicate the test by running schema changes or other
bulk operations.

Fixes #95990

Release note: None

96115: schemachanger: Implement `DROP CONSTRAINT` in declarative schema changer r=Xiang-Gu a=Xiang-Gu

This PR implements `ALTER TABLE t DROP CONSTRAINT cons_name` in declarative schema changer.

Supported constraints include Checks, FK, and UniqueWithoutIndex.

Dropping PK or Unique constraints will fall back to legacy schema changer, which in turn spits out an "not supported yet" error.

Epic: None

96202: opt: inverted-index accelerate filters of the form j->0 @> '{"b": "c"} r=Shivs11 a=Shivs11

Previously, the optimizer did not plan inverted index scans for filters
having an integer as the index for the fetch value in a filter alongside
the "contains" or the "contained by" operator.

To address this, we now build JSON arrays from fetch value expressions
with integer indexes. From these JSON arrays, inverted spans are built
for constraining scans over inverted indexes. With these changes chains
of both integer and string fetch value operators are now supported
alongside the "contains" and the "contained by" operators.
(e.g., j->0 `@>` '{"b": "c"}' and j->0 <@ '{"b": "c"}').

Epic: [CRDB-3301](https://cockroachlabs.atlassian.net/browse/CRDB-3301)
Fixes: #94667

Release note (performance improvement): The optimizer now plans
inverted index scans for queries that filter by JSON fetch value
operators (->) with integer indices alongside the "contains" or
the "contained by" operators, e.g, json_col->0 `@>` '{"b": "c"}'
or json_col->0 <@ '{"b": "c"}'

96235: sem/tree: add support for producing vectorized data from strings r=cucaroach a=cucaroach

tree.ValueHandler exposes raw machine type hooks that are used by
vec_handler to build coldata.Vec's.

Epic: CRDB-18892
Informs: #91831
Release note: None


96328: udf: allow strict UDF with no arguments r=DrewKimball a=DrewKimball

This patch fixes the case when a strict UDF (returns null on null input) has no arguments. Previously, attempting to call such a function would result in `ERROR: reflect: call of reflect.Value.Pointer on zero Value`.

Fixes #96326

Release note: None

96366: release: skip nil GitHub events r=celiala a=rail

Previously, we referenced `*event.Event`, but in some cases the event objects are `nil`.

This PR skips the nil GitHub event objects.

Epic: none
Release note: None

Co-authored-by: adityamaru <[email protected]>
Co-authored-by: Jackson Owens <[email protected]>
Co-authored-by: Michael Butler <[email protected]>
Co-authored-by: Xiang Gu <[email protected]>
Co-authored-by: Shivam Saraf <[email protected]>
Co-authored-by: Tommy Reilly <[email protected]>
Co-authored-by: Drew Kimball <[email protected]>
Co-authored-by: Rail Aliiev <[email protected]>
@craig
Copy link
Contributor

craig bot commented Feb 1, 2023

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Feb 2, 2023

Build succeeded:

@craig craig bot merged commit c77104e into cockroachdb:master Feb 2, 2023
@blathers-crl
Copy link

blathers-crl bot commented Feb 2, 2023

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 2e8f273 to blathers/backport-release-22.2-95865: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.2.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

nicktrav added a commit to nicktrav/cockroach that referenced this pull request Feb 3, 2023
The `fuse` variant of the `disk-stalled` roachtest was skipped in
\cockroachdb#95865.

Re-enable the skipped variant, updating it to make use of our forked
version of `charybdefs`. This fork includes a patch that allows for
specifying a delay time for syscalls, making it possible to simulate a
complete disk stall. Previously, delay times were limited to 50ms, which
meant that the detection time had to be even lower (e.g. 40ms), which
was not representative of how Cockroach is configured in practice.

Fixes cockroachdb#95874.

Release note: None.
nicktrav added a commit to nicktrav/cockroach that referenced this pull request Feb 4, 2023
The `fuse` variant of the `disk-stalled` roachtest was skipped in
\cockroachdb#95865.

Re-enable the skipped variant, updating it to make use of our forked
version of `charybdefs`. This fork includes a patch that allows for
specifying a delay time for syscalls, making it possible to simulate a
complete disk stall. Previously, delay times were limited to 50ms, which
meant that the detection time had to be even lower (e.g. 40ms), which
was not representative of how Cockroach is configured in practice.

Allow the roachprod infrastructure to interpolate strings such as
`{store-dir}`, etc. when provided as `ExtraArgs` or the `KeyCmd`. Fix by
expanding expanding all args, rather than just `ExtraArgs`.

Fixes cockroachdb#95874.

Release note: None.
craig bot pushed a commit that referenced this pull request Feb 6, 2023
96094: roachtest: unskip fuse disk-stall roachtest variant  r=jbowens a=nicktrav

**roachtest: unskip fuse disk-stall roachtest variant**

The `fuse` variant of the `disk-stalled` roachtest was skipped in
#95865.

Reenable the skipped variant, updating it to make use of our forked
version of `charybdefs`. This fork includes a patch that allows for
specifying a delay time for syscalls, making it possible to simulate a
complete disk stall. Previously, delay times were limited to 50ms, which
meant that the detection time had to be even lower (e.g. 40ms), which
was not representative of how Cockroach is configured in practice.

Fixes #95874.

Fixes #95815.

Fixes #96410.

Fixes #95886.

Release note: None.

96130: pkg/compose: bootstrap CCL for TestComposeCompare r=ZhouXing19,rail a=srosenberg

The nightly TestComposeCompare integration test uses a randomized SQL workload (a dialect denoted by sqlsmith.PostgresMode) to compare the results against Postgres. Additionally, it runs a second instance of CockroachDB using a series of mutators, e.g., randgen.StatisticsMutator, intended to alter table statistics, indexes, etc., but not the actual table data. Thus, all three data sources are expected to return the same result, for a given SQL query, modulo ignored SQL errors.

Some of the generated SQL may require CCL as can be seen from a number of previous test failures. This change bootstraps an environment variable (COCKROACH_DEV_LICENSE), similarly to roachprod and roachtest. The environment variable is passed via CI; it's also in our dev. environment, including gceworker.
The test aborts in case COCKROACH_DEV_LICENSE is empty.

Epic: none

Informs: #82867

Release note: None

96378: builtins: check_consistency now requires admin role r=rafiss a=e-mbrown

Resolve #88224

`crdb_internal.check_consistency` is an expensive operation that previously could be invoked by regular users. Now admin role is required.

Release note: None

Co-authored-by: Nick Travers <[email protected]>
Co-authored-by: Stan Rosenberg <[email protected]>
Co-authored-by: e-mbrown <[email protected]>
blathers-crl bot pushed a commit that referenced this pull request Feb 6, 2023
The `fuse` variant of the `disk-stalled` roachtest was skipped in
\#95865.

Re-enable the skipped variant, updating it to make use of our forked
version of `charybdefs`. This fork includes a patch that allows for
specifying a delay time for syscalls, making it possible to simulate a
complete disk stall. Previously, delay times were limited to 50ms, which
meant that the detection time had to be even lower (e.g. 40ms), which
was not representative of how Cockroach is configured in practice.

Allow the roachprod infrastructure to interpolate strings such as
`{store-dir}`, etc. when provided as `ExtraArgs` or the `KeyCmd`. Fix by
expanding expanding all args, rather than just `ExtraArgs`.

Fixes #95874.

Release note: None.
nicktrav added a commit to nicktrav/cockroach that referenced this pull request Feb 10, 2023
The `fuse` variant of the `disk-stalled` roachtest was skipped in
\cockroachdb#95865.

Re-enable the skipped variant, updating it to make use of our forked
version of `charybdefs`. This fork includes a patch that allows for
specifying a delay time for syscalls, making it possible to simulate a
complete disk stall. Previously, delay times were limited to 50ms, which
meant that the detection time had to be even lower (e.g. 40ms), which
was not representative of how Cockroach is configured in practice.

Allow the roachprod infrastructure to interpolate strings such as
`{store-dir}`, etc. when provided as `ExtraArgs` or the `KeyCmd`. Fix by
expanding expanding all args, rather than just `ExtraArgs`.

Fixes cockroachdb#95874.

Release note: None.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants