Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: surface cluster config. params when filing an issue #80799

Closed
srosenberg opened this issue Apr 29, 2022 · 4 comments · Fixed by #81845
Closed

roachtest: surface cluster config. params when filing an issue #80799

srosenberg opened this issue Apr 29, 2022 · 4 comments · Fixed by #81845
Assignees
Labels
A-testing Testing tools and infrastructure C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)

Comments

@srosenberg
Copy link
Member

srosenberg commented Apr 29, 2022

A new failed roachtest will result in creating a fresh github issue. (If it's a repeat failure, then a new comment is added.)
The issue description doesn't capture essential information pertaining to the cluster configuration. This could lead to wasted (developer) cycles; e.g., see #68976 (comment)
Some tests randomly enable zfs, encryption, etc.–this context should also be propagated to the issue.

Each roachtest is associated with a TestSpec [1] which contains a ClusterSpec [2]. The latter can be used to extract the cloud provider, instance type, node count, etc., including whether or not zfs was used. (Determining whether or not encryption is enabled is a separate issue [3]; it needs some refactoring to lift it into ClusterSpec.)
The test runner's maybePostGithubIssue already receives TestSpec; the rest is a matter of choosing a compact while descriptive format to capture the additional (cluster) configuration parameters in order to make it easier for the reader.

[1] https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/test/test_interface.go#L20
[2] https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/spec/cluster_spec.go#L41
[3] #79265 (comment)

Jira issue: CRDB-15509

@blathers-crl
Copy link

blathers-crl bot commented Apr 29, 2022

Hi @srosenberg, please add a C-ategory label to your issue. Check out the label system docs.

While you're here, please consider adding an A- label to help keep our repository tidy.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@srosenberg srosenberg added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-testing Testing tools and infrastructure labels Apr 29, 2022
@srosenberg srosenberg changed the title roachtest-filed issues don't point out which cloud the failure occurred on. They should. roachtest: surface cluster config. params when filing an issue Apr 29, 2022
@tbg
Copy link
Member

tbg commented May 5, 2022

Just a heads up, the cloud has been a part of the issue for a long time:

image

However, the "message" is often a lengthy log dump and gets truncated, so the cloud is lost:

image

A goal here should be to avoid "jamming" this metadata into the "message" code block. Instead, we should pass in a tags map[string]string (or something like that) which features prominently above the code block.

We basically have this already:

// GOFLAGS=-foo TAGS=-race etc.
Parameters []string

So maybe the task here is to put more parameters into roachtest issues, and to adjust the template to print these in the right place. (They are currently tucked away under the "Help" collapsible, so they're rarely seen).

While we touch the issue template, we should also fix #80852, which is a big paper cut that we pay a tax for every day.

@renatolabs
Copy link
Contributor

Some thoughts on this:

As @tbg mentioned, the cloud and branch names are already displayed in the message block, although that information is often lost when the message is truncated. It seems to me that even if the message was not truncated, we could use some more easily-accessible cluster data to help debug issues.

One possible approach here would be to just add the cluster spec to the Parameters field that @tbg referenced. We wouldn't even have to change the formatting necessarily, and it would look something like this:

Screen Shot 2022-05-16 at 10 35 22 AM

Some pros/cons:

  • Pro: no changes to existing data structures and formatting code.
  • Con: Parameters is a somewhat abstract term (currently it seems tied to just GOFLAGS and TAGS, likely because this was the immediate need when the field was introduced). It's not obvious that the cluster spec would be a "parameter". And a lot of things could be considered "parameters".
  • Con: parameters are not grouped, so we'd have infrastructure parameters (cloud name) alongside build flags.

Another option:

Screen Shot 2022-05-16 at 10 33 09 AM

Here, "Cluster" gets its own collapsed area. I also chose to use a table here to experiment with that formatting approach, which I find more pleasant than the current bulleted list of parameters.

  • Pro: cluster data is all in one place, and devs can see the configuration the test ran in in a quick glance.
  • Con: introduction of a new collapsible area in the issue report

In both cases, the actual list of cluster parameters included in the report would include other things (like ZFS, encryption, etc). The screenshots above are only to illustrate the aproaches.

I'm slightly more inclined towards the second option.


There are other options, of course, and I'm interested to hear other ways people would prefer to see this formatted.

I'm also aware that not everything in ClusterSpec necessarily needs to be included in the report as they are declared in the test itself and never changed; however, I think including them anyway is good as it reduces friction.

Thoughts?

@tbg
Copy link
Member

tbg commented May 17, 2022

I like option two as well, though I think including all parameters is too much; we typically don't care about most of them.

Also, worth considering rotated tables:

Machine type CPUs HDD Foo Bar
asdfasdasfasf 4 500 asfasf lkjlkjj

I also still don't dislike Option 1. I think with a bit of care it would get the job done just fine without having to put in too much additional work. For example, what if we list the params (in sorted order) like below. I think (a refined version of) that would be my preference, since it just gets the job done (?).


roachtest.tpccbench/nodes=9/cpu=4/multi-region failed with artifacts on master @ 16c484aa84d3718bfc82557f8e935ab78e6753b6

Parameters: TAGS=-endtoendenv GOFLAGS=-somegoflags ROACHTEST_Cloud=gce ROACHTEST_InstanceType=n1-standard-2 ROACHTEST_NodeCount=5 ROACHTEST_CPUs=2 ROACHTEST_SSDs=5 ROACHTEST_RAID0=true ROACHTEST_VolumeSize=100GB

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /artifacts/tpccbench/nodes=9/cpu=4/multi-region/run_1
	monitor.go:127,tpcc.go:1096,tpcc.go:931,test_runner.go:876: monitor failure: monitor task failed: Non-zero exit code: 1
blah blah
Help

See: roachtest README

See: How To Investigate (internal)

Same failure on other branches

This test on roachdash | Improve this report!

@jlinder jlinder added sync-me and removed sync-me labels May 20, 2022
renatolabs added a commit to renatolabs/cockroach that referenced this issue May 25, 2022
This commit changes the 'Parameters' section of a test
failure. Previously, it would only include the contents of the `TAGS`
and `GOFLAGS` environment variables. It now adds cluster configuration
values obtained from the `ClusterSpec`. These parameters are prefixed
with `ROACHTEST` (e.g., `ROACHTEST_cloud`).

For now, just the cloud name, number of CPUs and SSDs are included. In
the future, we can expand this work to support other relevant cluster
configuration values as needed to debug test failures.

Resolves: cockroachdb#80799.

Release note: None.
renatolabs added a commit to renatolabs/cockroach that referenced this issue May 30, 2022
This commit changes the 'Parameters' section of a test
failure. Previously, it would only include the contents of the `TAGS`
and `GOFLAGS` environment variables. It now adds cluster configuration
values obtained from the `ClusterSpec`. These parameters are prefixed
with `ROACHTEST` (e.g., `ROACHTEST_cloud`).

For now, just the cloud name, number of CPUs and SSDs are included. In
the future, we can expand this work to support other relevant cluster
configuration values as needed to debug test failures.

Resolves: cockroachdb#80799.

Release note: None.
renatolabs added a commit to renatolabs/cockroach that referenced this issue May 30, 2022
This commit changes the 'Parameters' section of a test
failure. Previously, it would only include the contents of the `TAGS`
and `GOFLAGS` environment variables. It now adds cluster configuration
values obtained from the `ClusterSpec`. These parameters are prefixed
with `ROACHTEST` (e.g., `ROACHTEST_cloud`).

For now, just the cloud name, number of CPUs and SSDs are included. In
the future, we can expand this work to support other relevant cluster
configuration values as needed to debug test failures.

Resolves: cockroachdb#80799.

Release note: None.
craig bot pushed a commit that referenced this issue Jun 6, 2022
81845: roachtest: include some cluster configs in test failure reports r=srosenberg a=renatolabs

This commit changes the 'Parameters' section of a test
failure. Previously, it would only include the contents of the `TAGS`
and `GOFLAGS` environment variables. It now adds cluster configuration
values obtained from the `ClusterSpec`. These parameters are prefixed
with `ROACHTEST` (e.g., `ROACHTEST_cloud`).

For now, just the cloud name, number of CPUs and SSDs are included. In
the future, we can expand this work to support other relevant cluster
configuration values as needed to debug test failures.

Resolves: #80799.

Release note: None.

81934: kvcoord: optimize Truncate r=yuzefovich a=yuzefovich

**kvcoord: introduce benchmark for Truncate function**

Release note: None

**kvcoord: optimize Truncate**

This commit optimizes the `Truncate` function by accessing the header of
the request directly (instead of having to construct `roachpb.Span`) as
well as keeping track of whether the request has been changed (instead
of performing key comparison later).
```
name                              old time/op    new time/op    delta
Truncate/reqs=32/type=get-24        3.81µs ± 4%    3.42µs ± 1%  -10.32%  (p=0.000 n=9+9)
Truncate/reqs=32/type=scan-24       6.35µs ±11%    5.81µs ±11%   -8.56%  (p=0.019 n=10+10)
Truncate/reqs=1024/type=get-24      98.9µs ± 2%    85.8µs ± 1%  -13.24%  (p=0.000 n=10+10)
Truncate/reqs=1024/type=scan-24      176µs ± 2%     155µs ± 2%  -12.17%  (p=0.000 n=10+10)
Truncate/reqs=32768/type=get-24     3.23ms ± 1%    2.78ms ± 0%  -13.86%  (p=0.000 n=10+9)
Truncate/reqs=32768/type=scan-24    5.64ms ± 1%    5.02ms ± 0%  -11.01%  (p=0.000 n=10+9)

name                              old alloc/op   new alloc/op   delta
Truncate/reqs=32/type=get-24          744B ± 0%      744B ± 0%     ~     (all equal)
Truncate/reqs=32/type=scan-24       2.08kB ±37%    2.05kB ±36%     ~     (p=0.897 n=10+10)
Truncate/reqs=1024/type=get-24      24.6kB ± 0%    24.6kB ± 0%     ~     (all equal)
Truncate/reqs=1024/type=scan-24     73.2kB ± 2%    73.7kB ± 2%     ~     (p=0.167 n=9+9)
Truncate/reqs=32768/type=get-24     1.21MB ± 0%    1.21MB ± 0%     ~     (p=0.370 n=10+10)
Truncate/reqs=32768/type=scan-24    2.82MB ± 0%    2.84MB ± 0%   +0.48%  (p=0.000 n=10+10)

name                              old allocs/op  new allocs/op  delta
Truncate/reqs=32/type=get-24          10.0 ± 0%      10.0 ± 0%     ~     (all equal)
Truncate/reqs=32/type=scan-24         33.6 ±25%      34.8 ±26%     ~     (p=0.646 n=10+10)
Truncate/reqs=1024/type=get-24        20.0 ± 0%      20.0 ± 0%     ~     (all equal)
Truncate/reqs=1024/type=scan-24        692 ± 5%       705 ± 6%     ~     (p=0.167 n=9+9)
Truncate/reqs=32768/type=get-24       40.0 ± 0%      40.0 ± 0%     ~     (all equal)
Truncate/reqs=32768/type=scan-24     21.6k ± 2%     22.0k ± 1%   +1.73%  (p=0.000 n=10+10)
```

Addresses: #68536.

Release note: None

81938: kvcoord: don't construct roachpb.Spans needlessly in several spots r=yuzefovich a=yuzefovich

This should be a minor performance improvement, but I don't know of good
benchmarks to verify that, yet the change doesn't seem controversial
either.

Release note: None

Co-authored-by: Renato Costa <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
@craig craig bot closed this as completed in 710c879 Jun 6, 2022
smg260 pushed a commit to smg260/cockroach that referenced this issue Jan 17, 2023
This commit changes the 'Parameters' section of a test
failure. Previously, it would only include the contents of the `TAGS`
and `GOFLAGS` environment variables. It now adds cluster configuration
values obtained from the `ClusterSpec`. These parameters are prefixed
with `ROACHTEST` (e.g., `ROACHTEST_cloud`).

For now, just the cloud name, number of CPUs and SSDs are included. In
the future, we can expand this work to support other relevant cluster
configuration values as needed to debug test failures.

Resolves: cockroachdb#80799.

Release note: None.
smg260 pushed a commit to smg260/cockroach that referenced this issue Jan 18, 2023
This commit changes the 'Parameters' section of a test
failure. Previously, it would only include the contents of the `TAGS`
and `GOFLAGS` environment variables. It now adds cluster configuration
values obtained from the `ClusterSpec`. These parameters are prefixed
with `ROACHTEST` (e.g., `ROACHTEST_cloud`).

For now, just the cloud name, number of CPUs and SSDs are included. In
the future, we can expand this work to support other relevant cluster
configuration values as needed to debug test failures.

Resolves: cockroachdb#80799.

Release note: None.
smg260 pushed a commit to smg260/cockroach that referenced this issue Jan 18, 2023
This commit changes the 'Parameters' section of a test
failure. Previously, it would only include the contents of the `TAGS`
and `GOFLAGS` environment variables. It now adds cluster configuration
values obtained from the `ClusterSpec`. These parameters are prefixed
with `ROACHTEST` (e.g., `ROACHTEST_cloud`).

For now, just the cloud name, number of CPUs and SSDs are included. In
the future, we can expand this work to support other relevant cluster
configuration values as needed to debug test failures.

Resolves: cockroachdb#80799.

Release note: None.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-testing Testing tools and infrastructure C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants