roachtest: rewrite the test runner #30977

Merged
3 commits merged into cockroachdb:master from roachtest-parallel on Jul 2, 2019

Conversation

andreimatei
Contributor

@andreimatei andreimatei commented Oct 4, 2018

This patch reimplements the roachtest test runner. Some issues with the
previous version:

  • roachtest --parallelism is ignored when running a single test
  • repeated runs of that test don't reuse clusters
  • cluster reuse is achieved through an inflexible subtest mechanism
  • there's no way to control the quota used on GCE. There's only a way to
    control the number of tests that run in parallel

This patch takes a clean-slate approach to the runner. It introduces a
runner that takes a --cpu-quota (#cloud cpus) and then, building on the
cluster reuse policies introduced in the previous commit, tries to
schedule tests most efficiently and with the greatest parallelism while
staying under that quota (and also while staying under the
--parallelism, which now acts purely to protect against too many tests
thrashing the local machine).
The runner starts --parallelism workers and lets them compete on a
ResourceGovernor which manages the cloud quota.
Tests are seen abstractly as units of work. There's no more subtests,
and --count is supported naturally as more work.
The scheduling policy is not advanced: at any point in time, a worker
has a cluster and will continuously select tests that can reuse that
cluster. When no more tests can reuse it, the cluster is destroyed, new
resources acquired and a new cluster created. Within multiple tests that
can all reuse a cluster, ones that don't soil the cluster are preferred.
Within otherwise equal tests, the ones with more runs remaining are
preferred.

Release note: None
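For intuition, here is a rough sketch of the scheduling the description outlines, under heavy simplification: a fixed set of workers compete over a shared work queue and a cloud CPU quota, preferring tests that keep a cluster reusable and tests with more runs remaining. All names (testRun, the queue, the quota numbers) are illustrative and do not model per-worker cluster reuse; this is not the runner's actual code.

```go
package main

import (
	"context"
	"fmt"
	"sync"

	"golang.org/x/sync/semaphore"
)

// testRun is one unit of work: a test, the cluster CPUs it needs, and how
// many runs of it remain. Illustrative only; the real runner's types differ.
type testRun struct {
	name     string
	cpus     int64 // cloud CPUs the test's cluster uses
	soils    bool  // whether a run leaves the cluster unfit for reuse
	runsLeft int
}

func main() {
	const parallelism = 2              // --parallelism: number of workers
	quota := semaphore.NewWeighted(16) // --cpu-quota: total cloud CPUs

	queue := []*testRun{
		{name: "kv/splits", cpus: 4, runsLeft: 3},
		{name: "decommission", cpus: 4, soils: true, runsLeft: 1},
	}
	var mu sync.Mutex
	var wg sync.WaitGroup
	ctx := context.Background()

	for w := 0; w < parallelism; w++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for {
				// Pick the next test: prefer ones that keep the cluster
				// reusable, then ones with more runs remaining.
				mu.Lock()
				var pick *testRun
				for _, t := range queue {
					if t.runsLeft == 0 {
						continue
					}
					if pick == nil ||
						(!t.soils && pick.soils) ||
						(t.soils == pick.soils && t.runsLeft > pick.runsLeft) {
						pick = t
					}
				}
				if pick == nil {
					mu.Unlock()
					return // no work left
				}
				pick.runsLeft--
				mu.Unlock()

				// Stay under the cloud quota: block until enough CPUs are
				// free, "create" a cluster, run, then give the CPUs back.
				_ = quota.Acquire(ctx, pick.cpus)
				fmt.Printf("worker %d: %s on a %d-CPU cluster\n", id, pick.name, pick.cpus)
				quota.Release(pick.cpus)
			}
		}(w)
	}
	wg.Wait()
}
```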

@andreimatei
Contributor Author

The first commit is #30146.

Peter, this patch is nowhere ready, but I'd like a nod on the direction pls.

@cockroach-teamcity
Member

This change is Reviewable

@andreimatei andreimatei force-pushed the roachtest-parallel branch 2 times, most recently from 0f27000 to 77ac0ec on October 5, 2018 13:48
Collaborator

@petermattis petermattis left a comment

Before this PR, --parallelism was explicitly limited to the number of tests by this code:

	// Limit the parallelism to the number of tests. The primary effect this has
	// is that we'll log to stdout/stderr if only one test is being run.
	if parallelism > len(tests) {
		parallelism = len(tests)
	}

Perhaps we should just remove that code once your PR to place each run's output in a separate directory is merged. If you want to run a single test multiple times sequentially you can already do --count 100 --parallelism 1.

Reviewed 4 of 4 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained

Contributor Author

@andreimatei andreimatei left a comment

Yes, that code is going away.
I'm not sure what your comment says about the current PR though (if anything). Are you suggesting this PR is not necessary? (cause this PR is all about cluster reuse)

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained

Contributor Author

@andreimatei andreimatei left a comment

I meant to ask - how does one use/test the roachtest run --cluster option? Is there an easy way to create a cluster that matches a particular test's expectation? Am I supposed to cause a test to fail and use the --debug option to leave a cluster around?
Also, is the roachtest invocation supposed to take ownership of such a cluster? Like, is it supposed to destroy it at the end?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained

Contributor Author

@andreimatei andreimatei left a comment

Disregard the last comment; I meant to ask on another PR :P

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained

Collaborator

@petermattis petermattis left a comment

I'm not sure what your comment says about the current PR though (if anything). Are you suggesting this PR is not necessary? (cause this PR is all about cluster reuse)

My comment was in reference to the PR title which only talks about parallelism for repeated runs of the same test. I failed to notice that you're also reusing clusters (yes, this is mentioned in the commit message which I read too fast). Let me take a closer look.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained

Collaborator

@petermattis petermattis left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/cmd/roachtest/test.go, line 637 at r2 (raw file):

	}
	sem := make(chan struct{}, parallelism)
	clustersPerTLT := parallelism / n

What happens if parallelism == 1 and n == 2? Seems like clustersPerTLT will be 0. Isn't that problematic, or is that what the adjustment below is trying to handle?


pkg/cmd/roachtest/test.go, line 643 at r2 (raw file):

		numClusters := clustersPerTLT
		if i < parallelism%n {
			numClusters++

I don't understand what this is doing.


pkg/cmd/roachtest/test.go, line 715 at r2 (raw file):

				artifactsDir: artifactsDir,
			}
			f.Put(newCluster(ctx, t, spec.Nodes))

Won't multiple concurrent cluster creation collide on both the cluster name and the artifacts dir?

Contributor Author

@andreimatei andreimatei left a comment

You don't seem to be against anything, so I got what I wanted for now.
Will ping when it's ready.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/cmd/roachtest/test.go, line 637 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

What happens if parallelism == 1 and n == 2? Seems like clustersPerTLT will be 0. Isn't that problematic, or is that what the adjustment below is trying to handle?

It's not supposed to be problematic. The different runTopLevelTest invocations will contend on sem for creating new clusters.


pkg/cmd/roachtest/test.go, line 643 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I don't understand what this is doing.

It's giving an extra cluster to some tests when the division of parallelism by the number of tests has a remainder. So in your example (parallelism == 1 and n == 2), the first test would get a guaranteed cluster, and the second one would wait until the first one does all its runs.
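For reference, a standalone sketch of that even-split-with-remainder logic (a toy, not the actual runner code):

```go
package main

import "fmt"

func main() {
	parallelism, n := 5, 3 // e.g. 5 worker slots shared by 3 top-level tests
	for i := 0; i < n; i++ {
		numClusters := parallelism / n
		if i < parallelism%n {
			numClusters++ // the first parallelism%n tests get one extra cluster
		}
		fmt.Printf("test %d gets %d guaranteed cluster(s)\n", i, numClusters)
	}
	// With parallelism=1 and n=2: test 0 gets 1 cluster, test 1 gets 0 and
	// waits on sem until a cluster frees up.
}
```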


pkg/cmd/roachtest/test.go, line 715 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Won't multiple concurrent cluster creation collide on both the cluster name and the artifacts dir?

yes (except I think you wanted to say "cluster name and log file"; the artifacts dir isn't an issue any more, as that's now a property of a test, not of a cluster).

I told you it wasn't ready :P. I'll keep you posted.

Collaborator

@petermattis petermattis left a comment

You don't seem to be against anything, so I got what I wanted for now.
Will ping when it's ready.

Ack. The assignment of parallelism to clusters is a bit more subtle than desirable, though I suppose the same could be said of much of the roachtest code. The channel of channels is interesting. I don't have a better suggestion.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained

@andreimatei andreimatei force-pushed the roachtest-parallel branch 4 times, most recently from 929a58e to 8a35cca on October 14, 2018 03:51
@andreimatei andreimatei force-pushed the roachtest-parallel branch 2 times, most recently from b0e515a to 0d9dfb5 on October 16, 2018 20:32
@andreimatei andreimatei changed the title from "roachtest: allow parallel execution of repeated runs" to "roachtest: rewrite the test runner" on Oct 16, 2018
@andreimatei
Contributor Author

@benesch I'd appreciate a high-level look over this. There's many i's to dot still and I haven't really run the thing, but I think it's ready for a first look.

Collaborator

@petermattis petermattis left a comment

I've only skimmed parts of this and I likely won't have time to give it a thorough look until Friday. I suggest we hold off on merging this until 2.1 is out the door as I expect it to cause some short term instability. Or you have to find a way to break it into smaller pieces.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/cmd/roachtest/main.go, line 196 at r5 (raw file):

		cmd.Flags().IntVar(
			&cpuQuota, "cpu-quota", 100,
			"The number of cloud CPUs roachtest is allowed to use at any one time.")

What happens if a single test wants to use more CPU?


pkg/cmd/roachtest/test.go, line 90 at r5 (raw file):

// Any means that the cluster can be used by any other test.
var Any = ClusterReusePolicy{policy: any}

My intuition is that we should somehow fold this into the nodeSpec. The benefit of your approach is that you can upgrade an Any cluster to a tagged cluster, but I'm not sure that is worth supporting.

The nodeSpec approach fits more seamlessly into a world where "nobarrier" and "zfs" are also part of the nodeSpec. nodes(4, tag("acceptance")) could be used to specify a cluster that is shared by the acceptance tests (they would all have this same specification). nodes(6, noreuse) could be used to indicate a cluster that is not reused. I haven't looked at how you're checking to see if two cluster specs are compatible, but this seems to fit in naturally. nodes(4, nobarrier) could indicate that the disks are to be remounted with the "nobarrier" option. You could even imagine sharing a nodes(4) and a nodes(4, nobarrier) cluster by remounting the disks automatically before starting the test. I'm not suggesting you do that work in this PR, just that the nodeSpec mechanism seems like the right place to be specifying cluster reuse.
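A rough sketch of what such an option-based nodeSpec could look like (all names here — nodes, tag, noreuse, nobarrier, compatible — are hypothetical illustrations of the suggestion, not the existing roachtest API):

```go
package main

import "fmt"

// nodeSpec is a hypothetical cluster spec carrying size plus reuse/mount
// options, as suggested above.
type nodeSpec struct {
	count     int
	tag       string // non-empty: shared only by tests carrying the same tag
	noReuse   bool   // cluster is torn down after a single test
	noBarrier bool   // disks remounted with the "nobarrier" option
}

type nodeOpt func(*nodeSpec)

func tag(t string) nodeOpt { return func(s *nodeSpec) { s.tag = t } }

var (
	noreuse   nodeOpt = func(s *nodeSpec) { s.noReuse = true }
	nobarrier nodeOpt = func(s *nodeSpec) { s.noBarrier = true }
)

func nodes(count int, opts ...nodeOpt) nodeSpec {
	s := nodeSpec{count: count}
	for _, o := range opts {
		o(&s)
	}
	return s
}

// compatible is a toy reuse check: same size, same tag, neither side opted
// out of reuse, and same mount options (though, as noted above, a nobarrier
// mismatch could instead be bridged by remounting before the test).
func compatible(a, b nodeSpec) bool {
	return a.count == b.count && a.tag == b.tag &&
		!a.noReuse && !b.noReuse && a.noBarrier == b.noBarrier
}

func main() {
	acceptance := nodes(4, tag("acceptance"))
	fmt.Println(compatible(acceptance, nodes(4, tag("acceptance")))) // true: shared cluster
	fmt.Println(compatible(nodes(6, noreuse), nodes(6)))             // false: not reused
}
```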


pkg/cmd/roachtest/test.go, line 911 at r5 (raw file):

	for i := 0; i < parallelism; i++ {
		wg.Add(1)
		stopper.RunWorker(ctx, func(ctx context.Context) {

I'm not seeing what a stopper is adding here. Why not just go func() { ... } and a channel which is closed to indicate that runWorker should exit?
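The suggested pattern, as a standalone sketch (the worker body is a placeholder, not the real runWorker):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	done := make(chan struct{}) // closed to tell workers to exit
	var wg sync.WaitGroup

	runWorker := func(id int) {
		defer wg.Done()
		for {
			select {
			case <-done:
				fmt.Printf("worker %d exiting\n", id)
				return
			default:
				// ... pick up and run the next test ...
				time.Sleep(10 * time.Millisecond)
			}
		}
	}

	for i := 0; i < 4; i++ {
		wg.Add(1)
		go runWorker(i)
	}

	time.Sleep(100 * time.Millisecond)
	close(done) // a single close signals every worker to stop
	wg.Wait()
}
```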


pkg/cmd/roachtest/test.go, line 974 at r5 (raw file):

					name = fmt.Sprintf("%d", timeutil.Now().Unix())
				}
				name += "-" + t.Name

Including the test name in the cluster name is going to be weird once clusters are being reused across unassociated tests. If the cluster is tagged, I think the tag should be included instead. I'm not sure what to do about untagged clusters.
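One possible naming scheme along those lines (a purely hypothetical helper sketching the suggestion, not what the PR actually does):

```go
package main

import (
	"fmt"
	"time"
)

// clusterName picks a reuse-friendly name: tagged (shared) clusters are
// named after their tag so the name stays meaningful across unrelated
// tests; untagged clusters fall back to a timestamp plus a counter.
func clusterName(user, tag string, id int) string {
	if tag != "" {
		return fmt.Sprintf("%s-%s", user, tag)
	}
	return fmt.Sprintf("%s-%d-%d", user, time.Now().Unix(), id)
}

func main() {
	fmt.Println(clusterName("roachtest", "acceptance", 0))
	fmt.Println(clusterName("roachtest", "", 3))
}
```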

@andreimatei andreimatei force-pushed the roachtest-parallel branch 2 times, most recently from 37e5135 to a65b664 on October 18, 2018 21:48
@andreimatei andreimatei requested a review from a team October 18, 2018 21:48
@andreimatei andreimatei force-pushed the roachtest-parallel branch 3 times, most recently from ada1dfc to 26c161f on October 19, 2018 02:27
@andreimatei andreimatei force-pushed the roachtest-parallel branch 7 times, most recently from c507175 to 11841d3 on July 1, 2019 23:44
@tbg
Member

tbg commented Jul 2, 2019

nit: print the artifacts dir when a test fails

$ ./bin/roachtest run -u tobias --debug --count=16 restart/down-for
HTTP server listening on all network interfaces, port 8080.
test runner logs in: artifacts/test_runner-1562055398.log
=== RUN   restart/down-for-2m
=== RUN   restart/down-for-2m
=== RUN   restart/down-for-2m
=== RUN   restart/down-for-2m
=== RUN   restart/down-for-2m
=== RUN   restart/down-for-2m
=== RUN   restart/down-for-2m
=== RUN   restart/down-for-2m
--- FAIL: restart/down-for-2m 1070.98s
        restart.go:62,restart.go:78,test_runner.go:659: pq: no inbound stream connection

--- PASS: restart/down-for-2m 1106.76s
=== RUN   restart/down-for-2m
--- PASS: restart/down-for-2m 1107.26s
=== RUN   restart/down-for-2m
--- PASS: restart/down-for-2m 1122.41s
=== RUN   restart/down-for-2m
--- PASS: restart/down-for-2m 1090.67s
=== RUN   restart/down-for-2m
--- PASS: restart/down-for-2m 1111.64s

I basically manually have to go find the directory (the test runner log file is similarly opaque).

Contributor Author

@andreimatei andreimatei left a comment

print the artifacts dir when a test fails

done

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @andreimatei, @bobvawter, @nvanbenschoten, @petermattis, and @tbg)

@CLAassistant

CLAassistant commented Jul 2, 2019

CLA assistant check
All committers have signed the CLA.

This patch adds facilities for taking quota out of a pool - i.e. the
client informing the pool that it's never getting back quota that it
handed out and it should deal with it.
Taking out quota is done through a new IntAlloc.Freeze() method. The
effects are that ongoing Alloc() requests are trimmed to the new
capacity, and ongoing AllocFunc() requests are woken up and given a more
expansive interface into the pool for figuring out the current capacity.
Additionally, the closure passed to AllocFunc() can now return an error
to interrupt the respective request.

This is useful for roachtest - when a cluster is "saved for debugging",
its cpu quota is effectively leaked. This PR gives us a way for
roachtest to then realize that there's not enough quota left for other
tests.

Release note: None
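A toy illustration of the freeze idea (this is not the actual quotapool/IntAlloc API — the names and shapes here are assumptions — just the trimming behavior described above, in miniature):

```go
package main

import (
	"fmt"
	"sync"
)

// intPool is a toy quota pool whose capacity can shrink after the fact.
type intPool struct {
	mu       sync.Mutex
	cond     *sync.Cond
	capacity int64 // total quota; reduced by Freeze
	used     int64
}

func newIntPool(capacity int64) *intPool {
	p := &intPool{capacity: capacity}
	p.cond = sync.NewCond(&p.mu)
	return p
}

// Alloc blocks until n units are available. If the capacity has been frozen
// below n, the request is trimmed to the remaining capacity.
func (p *intPool) Alloc(n int64) int64 {
	p.mu.Lock()
	defer p.mu.Unlock()
	for {
		if n > p.capacity {
			n = p.capacity // trimmed, as the commit message describes
		}
		if p.used+n <= p.capacity {
			p.used += n
			return n
		}
		p.cond.Wait()
	}
}

func (p *intPool) Release(n int64) {
	p.mu.Lock()
	p.used -= n
	p.mu.Unlock()
	p.cond.Broadcast()
}

// Freeze tells the pool that n currently allocated units will never be
// released - e.g. a cluster saved for debugging keeps its CPUs forever -
// and wakes waiters so they re-check against the smaller capacity.
func (p *intPool) Freeze(n int64) {
	p.mu.Lock()
	p.capacity -= n
	p.used -= n
	p.mu.Unlock()
	p.cond.Broadcast()
}

func main() {
	p := newIntPool(16)
	a := p.Alloc(4)          // a cluster holding 4 CPUs...
	p.Freeze(a)              // ...is saved for debugging; its quota leaves the pool
	fmt.Println(p.Alloc(16)) // trimmed to the remaining capacity: prints 12
}
```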
The point of subtests was that all the subtests under a root share a
cluster. This patch introduces more flexible policies for sharing
clusters across tests (only the declaration of these policies is here;
the implementation is in future commits). The policies replace the need
for subtests, so they go away, simplifying the code considerably.

Release note: None
@andreimatei andreimatei force-pushed the roachtest-parallel branch 4 times, most recently from 92f7911 to e83078f on July 2, 2019 20:38
This patch reimplements the roachtest test runner. Some issues with the
previous version:
- roachtest --parallelism is ignored when running a single test
- repeated runs of that test don't reuse clusters
- cluster reuse is achieved through an inflexible subtest mechanism
- there's no way to control the quota used on GCE. There's only a way to
control the number of tests that run in parallel

This patch takes a clean-slate approach to the runner. It introduces a
runner that takes a --cpu-quota (#cloud cpus) and then, building on the
cluster reuse policies introduced in the previous commit, tries to
schedule tests most efficiently and with the greatest parallelism while
staying under that quota (and also while staying under the
--parallelism, which now acts purely to protect against too many tests
thrashing the local machine).
The runner starts --parallelism workers and lets them compete on a
QuotaPool which manages the cloud quota.
Tests are seen abstractly as units of work. There's no more subtests,
and --count is supported naturally as more work.
The scheduling policy is not advanced: at any point in time, a worker
has a cluster and will continuously select tests that can reuse that
cluster. When no more tests can reuse it, the cluster is destroyed, new
resources acquired and a new cluster created. Within multiple tests that
can all reuse a cluster, ones that don't soil the cluster are preferred.
Within otherwise equal tests, the ones with more runs remaining are
preferred.

Release note: None
@andreimatei andreimatei force-pushed the roachtest-parallel branch from e83078f to 0a776f4 on July 2, 2019 20:56
Contributor Author

@andreimatei andreimatei left a comment

bors r+

:S

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @bobvawter, @nvanbenschoten, @petermattis, and @tbg)

craig bot pushed a commit that referenced this pull request Jul 2, 2019
30977: roachtest: rewrite the test runner r=andreimatei a=andreimatei

This patch reimplements the roachtest test runner. Some issues with the
previous version:
- roachtest --parallelism is ignored when running a single test
- repeated runs of that test don't reuse clusters
- cluster reuse is achieved through an inflexible subtest mechanism
- there's no way to control the quota used on GCE. There's only a way to
control the number of tests that run in parallel

This patch takes a clean-slate approach to the runner. It introduces a
runner that takes a --cpu-quota (#cloud cpus) and then, building on the
cluster reuse policies introduced in the previous commit, tries to
schedule tests most efficiently and with the greatest parallelism while
staying under that quota (and also while staying under the
--parallelism, which now acts purely to protect against too many tests
thrashing the local machine).
The runner starts --parallelism workers and lets them compete on a
ResourceGovernor which manages the cloud quota.
Tests are seen abstractly as units of work. There's no more subtests,
and --count is supported naturally as more work.
The scheduling policy is not advanced: at any point in time, a worker
has a cluster and will continuously select tests that can reuse that
cluster. When no more tests can reuse it, the cluster is destroyed, new
resources acquired and a new cluster created. Within multiple tests that
can all reuse a cluster, ones that don't soil the cluster are preferred.
Within otherwise equal tests, the ones with more runs remaining are
preferred.

Release note: None

Co-authored-by: Andrei Matei <[email protected]>
@craig
Contributor

craig bot commented Jul 2, 2019

Build succeeded

@craig craig bot merged commit 0a776f4 into cockroachdb:master Jul 2, 2019
@andreimatei andreimatei deleted the roachtest-parallel branch July 8, 2019 14:16
ajwerner added a commit to ajwerner/cockroach that referenced this pull request Jul 15, 2019
Before the roachtest rewrite in cockroachdb#30977 we used to collect the logs directory
from remote hosts unconditionally. We also used to write stats.json histogram
files to the logs directory. This combination enabled the collection of
histogram files for roachperf. Since the new roachtest does not unconditionally
copy logs to the test runner, we have not gotten any perf artifacts since the
change.

This PR introduces a new concept, perf artifacts, which live in a different
directory, the perfArtifactsDir. If a testSpec indicates that it
HasPerfArtifacts then that directory is copied to the test runner's
artifactsDir for that test from each node in the cluster upon success.

This change also necessitates a small change to roachperf to look in this new
"perf" directory.

Fixes cockroachdb#38871.

Release note: None
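A sketch of the success-only fetch policy this describes, with illustrative names only (the scp invocation and directory layout stand in for the actual roachtest/roachprod helpers):

```go
package main

import (
	"fmt"
	"os/exec"
	"path/filepath"
)

// fetchPerfArtifacts copies the remote "perf" directory from every node into
// the test's artifacts directory, but only when the test passed; failures
// keep the usual logs-only collection.
func fetchPerfArtifacts(passed bool, hosts []string, artifactsDir string) error {
	if !passed {
		return nil
	}
	for i, host := range hosts {
		dst := filepath.Join(artifactsDir, fmt.Sprintf("%d.perf", i+1))
		// e.g. scp -r <host>:perf <artifactsDir>/<node>.perf
		if out, err := exec.Command("scp", "-r", host+":perf", dst).CombinedOutput(); err != nil {
			return fmt.Errorf("fetching perf artifacts from %s: %v: %s", host, err, out)
		}
	}
	return nil
}

func main() {
	// No-op for a failed test; on success this would pull perf/stats.json
	// from each node.
	_ = fetchPerfArtifacts(false, []string{"user@host1", "user@host2"}, "artifacts/tpcc/run_1")
}
```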
ajwerner added a commit to ajwerner/cockroach that referenced this pull request Jul 16, 2019
Before the roachtest rewrite in cockroachdb#30977 we used to collect the logs directory
from remote hosts unconditionally. We also used to write stats.json histogram
files to the logs directory. This combination enabled the collection of
histogram files for roachperf. Since the new roachtest only copies logs
to the test runner in the case of failure, we have not gotten any perf
artifacts since the change.

This log policy is sane for logs and other debug artifacts like heap profiles
that exist there but has been a bit of a bummer as we've missed out on the
dopamine rush due to seeing charts go up and to the right in bespoke D3 glory.
The truth is that we weren't all that excited about the previous behavior with
regards to perf data either. Sometimes tests would fail and their lousy results
would mess with our pretty pictures
(https://github.com/cockroachdb/roachperf/issues/37).

This PR introduces a new concept, perf artifacts, which live in a different
directory, the perfArtifactsDir ("perf"), and have the opposite fetching policy
from the logs; the "perf" directory is copied from each remote host into the
test's artifactsDir only upon successful completion of the test.

One downside of this as implemented is that it can spam the test's log (but
fortunately not roachtest's logs anymore, <3 @andreimatei). cockroachdb#38890 attempts to
address this somewhat. It shouldn't be too big of a problem because it only
happens on tests that already passed.

This change also necessitates a small change to roachperf to look in this new
"perf" directory (and respect the rewrite's test artifact directory structure).

Fixes cockroachdb#38871.

Release note: None
craig bot pushed a commit that referenced this pull request Jul 16, 2019
38848: roachpb: further clarify correctness of observed timestamps r=nvanbenschoten a=nvanbenschoten

This commit expands on the explanation added in b7cc205. Specifically,
it focuses on the inability to bound a transaction's uncertainty interval
below the starting timestamp of the lease that it is executing a request
on. This limitation was added in 85d4627. I tend to forget about it once
every few months and get very confused how this all works.

Release note: None

38889: roachtest: create perf artifacts concept and collect them r=nvanbenschoten a=ajwerner

Before the roachtest rewrite in #30977 we used to collect the logs directory
from remote hosts unconditionally. We also used to write stats.json histogram
files to the logs directory. This combination enabled the collection of
histogram files for roachperf. Since the new roachtest does not unconditionally
copy logs to the test runner, we have not gotten any perf artifacts since the
change.

This PR introduces a new concept, perf artifacts, which live in a different
directory, the perfArtifactsDir. If a testSpec indicates that it
HasPerfArtifacts then that directory is copied to the test runner's
artifactsDir for that test from each node in the cluster upon success.

This change also necessitates a small change to roachperf to look in this new
"perf" directory.

Fixes #38871.

Release note: None

38895: roachprod: disable ssh logging r=ajwerner a=tbg

This was never fully introduced and there is currently no impetus to
push it over the finish line. It would be useful to diagnose issues
that are readily reproducible, but we haven't seen any of those.

Release note: None

Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Andrew Werner <[email protected]>
Co-authored-by: Tobias Schottdorf <[email protected]>
Labels
X-noremind Bots won't notify about PRs with X-noremind