roachtest: rewrite the test runner #30977

Merged
3 commits merged into cockroachdb:master from roachtest-parallel on Jul 2, 2019

Conversation

andreimatei
Contributor

@andreimatei andreimatei commented Oct 4, 2018

This patch reimplements the roachtest test runner. Some issues with the
previous version:

  • roachtest --parallelism is ignored when running a single test
  • repeated runs of that test don't reuse clusters
  • cluster reuse is achieved through an inflexible subtest mechanism
  • there's no way to control the quota used on GCE. There's only a way to
    control the number of tests that run in parallel

This patch takes a clean-slate approach to the runner. It introduces a
runner that takes a --cpu-quota (#cloud cpus) and then, building on the
cluster reuse policies introduced in the previous commit, tries to
schedule tests most efficiently and with the greatest parallelism while
staying under that quota (and also while staying under the
--parallelism, which now acts purely to protect against too many tests
thrashing the local machine).
The runner starts --parallelism workers and lets them compete on a
ResourceGovernor which manages the cloud quota.
Tests are seen abstractly as units of work. There's no more subtests,
and --count is supported naturally as more work.
The scheduling policy is not advanced: at any point in time, a worker
has a cluster and will continuously select tests that can reuse that
cluster. When no more tests can reuse it, the cluster is destroyed, new
resources acquired and a new cluster created. Within multiple tests that
can all reuse a cluster, ones that don't soil the cluster are preferred.
Within otherwise equal tests, the ones with more runs remaining are
preferred.

Release note: None
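For intuition, here is a rough sketch of the scheduling the description outlines, under heavy simplification: a fixed set of workers compete over a shared work queue and a cloud CPU quota, preferring tests that keep a cluster reusable and tests with more runs remaining. All names (testRun, the queue, the quota numbers) are illustrative and do not model per-worker cluster reuse; this is not the runner's actual code.

```go
package main

import (
	"context"
	"fmt"
	"sync"

	"golang.org/x/sync/semaphore"
)

// testRun is one unit of work: a test, the cluster CPUs it needs, and how
// many runs of it remain. Illustrative only; the real runner's types differ.
type testRun struct {
	name     string
	cpus     int64 // cloud CPUs the test's cluster uses
	soils    bool  // whether a run leaves the cluster unfit for reuse
	runsLeft int
}

func main() {
	const parallelism = 2              // --parallelism: number of workers
	quota := semaphore.NewWeighted(16) // --cpu-quota: total cloud CPUs

	queue := []*testRun{
		{name: "kv/splits", cpus: 4, runsLeft: 3},
		{name: "decommission", cpus: 4, soils: true, runsLeft: 1},
	}
	var mu sync.Mutex
	var wg sync.WaitGroup
	ctx := context.Background()

	for w := 0; w < parallelism; w++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for {
				// Pick the next test: prefer ones that keep the cluster
				// reusable, then ones with more runs remaining.
				mu.Lock()
				var pick *testRun
				for _, t := range queue {
					if t.runsLeft == 0 {
						continue
					}
					if pick == nil ||
						(!t.soils && pick.soils) ||
						(t.soils == pick.soils && t.runsLeft > pick.runsLeft) {
						pick = t
					}
				}
				if pick == nil {
					mu.Unlock()
					return // no work left
				}
				pick.runsLeft--
				mu.Unlock()

				// Stay under the cloud quota: block until enough CPUs are
				// free, "create" a cluster, run, then give the CPUs back.
				_ = quota.Acquire(ctx, pick.cpus)
				fmt.Printf("worker %d: %s on a %d-CPU cluster\n", id, pick.name, pick.cpus)
				quota.Release(pick.cpus)
			}
		}(w)
	}
	wg.Wait()
}
```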

@andreimatei
Contributor Author

The first commit is #30146.

Peter, this patch is nowhere ready, but I'd like a nod on the direction pls.

@cockroach-teamcity
Member

This change is Reviewable

@andreimatei andreimatei force-pushed the roachtest-parallel branch 2 times, most recently from 0f27000 to 77ac0ec on October 5, 2018 13:48
Collaborator

@petermattis petermattis left a comment

Before this PR, --parallelism was explicitly limited to the number of tests by this code:

	// Limit the parallelism to the number of tests. The primary effect this has
	// is that we'll log to stdout/stderr if only one test is being run.
	if parallelism > len(tests) {
		parallelism = len(tests)
	}

Perhaps we should just remove that code once your PR to place each run's output in a separate directory is merged. If you want to run a single test multiple times sequentially you can already do --count 100 --parallelism 1.

Reviewed 4 of 4 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained

Contributor Author

@andreimatei andreimatei left a comment

Yes, that code is going away.
I'm not sure what your comment says about the current PR though (if anything). Are you suggesting this PR is not necessary? (cause this PR is all about cluster reuse)

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained

Contributor Author

@andreimatei andreimatei left a comment

I meant to ask - how does one use/test the roachtest run --cluster option? Is there an easy way to create a cluster that matches a particular test's expectation? Am I supposed to cause a test to fail and use the --debug option to leave a cluster around?
Also, is the roachtest invocation supposed to take ownership of such a cluster? Like, is it supposed to destroy it at the end?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained

Contributor Author

@andreimatei andreimatei left a comment

Disregard the last comment; I meant to ask on another PR :P

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained

Collaborator

@petermattis petermattis left a comment

I'm not sure what your comment says about the current PR though (if anything). Are you suggesting this PR is not necessary? (cause this PR is all about cluster reuse)

My comment was in reference to the PR title which only talks about parallelism for repeated runs of the same test. I failed to notice that you're also reusing clusters (yes, this is mentioned in the commit message which I read too fast). Let me take a closer look.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained

Collaborator

@petermattis petermattis left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/cmd/roachtest/test.go, line 637 at r2 (raw file):

	}
	sem := make(chan struct{}, parallelism)
	clustersPerTLT := parallelism / n

What happens if parallelism == 1 and n == 2? Seems like clustersPerTLT will be 0. Isn't that problematic, or is that what the adjustment below is trying to handle?


pkg/cmd/roachtest/test.go, line 643 at r2 (raw file):

		numClusters := clustersPerTLT
		if i < parallelism%n {
			numClusters++

I don't understand what this is doing.


pkg/cmd/roachtest/test.go, line 715 at r2 (raw file):

				artifactsDir: artifactsDir,
			}
			f.Put(newCluster(ctx, t, spec.Nodes))

Won't multiple concurrent cluster creation collide on both the cluster name and the artifacts dir?

Contributor Author

@andreimatei andreimatei left a comment

You don't seem to be against anything, so I got what I wanted for now.
Will ping when it's ready.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/cmd/roachtest/test.go, line 637 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

What happens if parallelism == 1 and n == 2? Seems like clustersPerTLT will be 0. Isn't that problematic, or is that what the adjustment below is trying to handle?

It's not supposed to be problematic. The different runTopLevelTest invocations will contend on sem for creating new clusters.


pkg/cmd/roachtest/test.go, line 643 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I don't understand what this is doing.

It's giving an extra cluster to some tests when the division of parallelism by the number of tests has a remainder. So in your example (parallelism == 1 and n == 2), the first test would get a guaranteed cluster, and the second one would wait until the first one does all its runs.
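For reference, a standalone sketch of that even-split-with-remainder logic (a toy, not the actual runner code):

```go
package main

import "fmt"

func main() {
	parallelism, n := 5, 3 // e.g. 5 worker slots shared by 3 top-level tests
	for i := 0; i < n; i++ {
		numClusters := parallelism / n
		if i < parallelism%n {
			numClusters++ // the first parallelism%n tests get one extra cluster
		}
		fmt.Printf("test %d gets %d guaranteed cluster(s)\n", i, numClusters)
	}
	// With parallelism=1 and n=2: test 0 gets 1 cluster, test 1 gets 0 and
	// waits on sem until a cluster frees up.
}
```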


pkg/cmd/roachtest/test.go, line 715 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Won't multiple concurrent cluster creation collide on both the cluster name and the artifacts dir?

yes (except I think you wanted to say "cluster name and log file"; the artifacts dir isn't an issue any more, as that's now a property of a test, not of a cluster).

I told you it wasn't ready :P. I'll keep you posted.

Collaborator

@petermattis petermattis left a comment

You don't seem to be against anything, so I got what I wanted for now.
Will ping when it's ready.

Ack. The assignment of parallelism to clusters is a bit more subtle than desirable, though I suppose the same could be said of much of the roachtest code. The channel of channels is interesting. I don't have a better suggestion.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained

@andreimatei andreimatei force-pushed the roachtest-parallel branch 4 times, most recently from 929a58e to 8a35cca on October 14, 2018 03:51
@andreimatei andreimatei force-pushed the roachtest-parallel branch 2 times, most recently from b0e515a to 0d9dfb5 on October 16, 2018 20:32
@andreimatei andreimatei changed the title from "roachtest: allow parallel execution of repeated runs" to "roachtest: rewrite the test runner" on Oct 16, 2018
@andreimatei
Contributor Author

@benesch I'd appreciate a high-level look over this. There's many i's to dot still and I haven't really run the thing, but I think it's ready for a first look.

Collaborator

@petermattis petermattis left a comment

I've only skimmed parts of this and I likely won't have time to give it a thorough look until Friday. I suggest we hold off on merging this until 2.1 is out the door as I expect it to cause some short term instability. Or you have to find a way to break it into smaller pieces.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/cmd/roachtest/main.go, line 196 at r5 (raw file):

		cmd.Flags().IntVar(
			&cpuQuota, "cpu-quota", 100,
			"The number of cloud CPUs roachtest is allowed to use at any one time.")

What happens if a single test wants to use more CPU?


pkg/cmd/roachtest/test.go, line 90 at r5 (raw file):

// Any means that the cluster can be used by any other test.
var Any = ClusterReusePolicy{policy: any}

My intuition is that we should somehow fold this into the nodeSpec. The benefit of your approach is that you can upgrade an Any cluster to a tagged cluster, but I'm not sure that is worth supporting.

The nodeSpec approach fits more seamlessly into a world where "nobarrier" and "zfs" are also part of the nodeSpec. nodes(4, tag("acceptance")) could be used to specify a cluster that is shared by the acceptance tests (they would all have this same specification). nodes(6, noreuse) could be used to indicate a cluster that is not reused. I haven't looked at how you're checking to see if two cluster specs are compatible, but this seems to fit in naturally. nodes(4, nobarrier) could indicate that the disks are to be remounted with the "nobarrier" option. You could even imagine sharing a nodes(4) and a nodes(4, nobarrier) cluster by remounting the disks automatically before starting the test. I'm not suggesting you do that work in this PR, just that the nodeSpec mechanism seems like the right place to be specifying cluster reuse.
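A rough sketch of what such an option-based nodeSpec could look like (all names here — nodes, tag, noreuse, nobarrier, compatible — are hypothetical illustrations of the suggestion, not the existing roachtest API):

```go
package main

import "fmt"

// nodeSpec is a hypothetical cluster spec carrying size plus reuse/mount
// options, as suggested above.
type nodeSpec struct {
	count     int
	tag       string // non-empty: shared only by tests carrying the same tag
	noReuse   bool   // cluster is torn down after a single test
	noBarrier bool   // disks remounted with the "nobarrier" option
}

type nodeOpt func(*nodeSpec)

func tag(t string) nodeOpt { return func(s *nodeSpec) { s.tag = t } }

var (
	noreuse   nodeOpt = func(s *nodeSpec) { s.noReuse = true }
	nobarrier nodeOpt = func(s *nodeSpec) { s.noBarrier = true }
)

func nodes(count int, opts ...nodeOpt) nodeSpec {
	s := nodeSpec{count: count}
	for _, o := range opts {
		o(&s)
	}
	return s
}

// compatible is a toy reuse check: same size, same tag, neither side opted
// out of reuse, and same mount options (though, as noted above, a nobarrier
// mismatch could instead be bridged by remounting before the test).
func compatible(a, b nodeSpec) bool {
	return a.count == b.count && a.tag == b.tag &&
		!a.noReuse && !b.noReuse && a.noBarrier == b.noBarrier
}

func main() {
	acceptance := nodes(4, tag("acceptance"))
	fmt.Println(compatible(acceptance, nodes(4, tag("acceptance")))) // true: shared cluster
	fmt.Println(compatible(nodes(6, noreuse), nodes(6)))             // false: not reused
}
```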


pkg/cmd/roachtest/test.go, line 911 at r5 (raw file):

	for i := 0; i < parallelism; i++ {
		wg.Add(1)
		stopper.RunWorker(ctx, func(ctx context.Context) {

I'm not seeing what a stopper is adding here. Why not just go func() { ... } and a channel which is closed to indicate that runWorker should exit?
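The suggested pattern, as a standalone sketch (the worker body is a placeholder, not the real runWorker):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	done := make(chan struct{}) // closed to tell workers to exit
	var wg sync.WaitGroup

	runWorker := func(id int) {
		defer wg.Done()
		for {
			select {
			case <-done:
				fmt.Printf("worker %d exiting\n", id)
				return
			default:
				// ... pick up and run the next test ...
				time.Sleep(10 * time.Millisecond)
			}
		}
	}

	for i := 0; i < 4; i++ {
		wg.Add(1)
		go runWorker(i)
	}

	time.Sleep(100 * time.Millisecond)
	close(done) // a single close signals every worker to stop
	wg.Wait()
}
```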


pkg/cmd/roachtest/test.go, line 974 at r5 (raw file):

					name = fmt.Sprintf("%d", timeutil.Now().Unix())
				}
				name += "-" + t.Name

Including the test name in the cluster name is going to be weird once clusters are being reused across unassociated tests. If the cluster is tagged, I think the tag should be included instead. I'm not sure what to do about untagged clusters.
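One possible naming scheme along those lines (a purely hypothetical helper sketching the suggestion, not what the PR actually does):

```go
package main

import (
	"fmt"
	"time"
)

// clusterName picks a reuse-friendly name: tagged (shared) clusters are
// named after their tag so the name stays meaningful across unrelated
// tests; untagged clusters fall back to a timestamp plus a counter.
func clusterName(user, tag string, id int) string {
	if tag != "" {
		return fmt.Sprintf("%s-%s", user, tag)
	}
	return fmt.Sprintf("%s-%d-%d", user, time.Now().Unix(), id)
}

func main() {
	fmt.Println(clusterName("roachtest", "acceptance", 0))
	fmt.Println(clusterName("roachtest", "", 3))
}
```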

@andreimatei andreimatei force-pushed the roachtest-parallel branch 2 times, most recently from 37e5135 to a65b664 on October 18, 2018 21:48
@andreimatei andreimatei requested a review from a team October 18, 2018 21:48
@andreimatei andreimatei force-pushed the roachtest-parallel branch 3 times, most recently from ada1dfc to 26c161f on October 19, 2018 02:27
@andreimatei andreimatei force-pushed the roachtest-parallel branch 7 times, most recently from c507175 to 11841d3 on July 1, 2019 23:44
@tbg
Member

tbg commented Jul 2, 2019

nit: print the artifacts dir when a test fails

$ ./bin/roachtest run -u tobias --debug --count=16 restart/down-for
HTTP server listening on all network interfaces, port 8080.
test runner logs in: artifacts/test_runner-1562055398.log
=== RUN   restart/down-for-2m
=== RUN   restart/down-for-2m
=== RUN   restart/down-for-2m
=== RUN   restart/down-for-2m
=== RUN   restart/down-for-2m
=== RUN   restart/down-for-2m
=== RUN   restart/down-for-2m
=== RUN   restart/down-for-2m
--- FAIL: restart/down-for-2m 1070.98s
        restart.go:62,restart.go:78,test_runner.go:659: pq: no inbound stream connection

--- PASS: restart/down-for-2m 1106.76s
=== RUN   restart/down-for-2m
--- PASS: restart/down-for-2m 1107.26s
=== RUN   restart/down-for-2m
--- PASS: restart/down-for-2m 1122.41s
=== RUN   restart/down-for-2m
--- PASS: restart/down-for-2m 1090.67s
=== RUN   restart/down-for-2m
--- PASS: restart/down-for-2m 1111.64s

I basically manually have to go find the directory (the test runner log file is similarly opaque).

Contributor Author

@andreimatei andreimatei left a comment

print the artifacts dir when a test fails

done

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @andreimatei, @bobvawter, @nvanbenschoten, @petermattis, and @tbg)

@CLAassistant

CLAassistant commented Jul 2, 2019

CLA assistant check
All committers have signed the CLA.

This patch adds facilities for taking quota out of a pool - i.e. the
client informing the pool that it's never getting back quota that it
handed out and it should deal with it.
Taking out quota is done through a new IntAlloc.Freeze() method. The
effects are that ongoing Alloc() requests are trimmed to the new
capacity, and ongoing AllocFunc() requests are woken up and given a more
expansive interface into the pool for figuring out the current capacity.
Additionally, the closure passed to AllocFunc() can now return an error
to interrupt the respective request.

This is useful for roachtest - when a cluster is "saved for debugging",
its cpu quota is effectively leaked. This PR gives us a way for
roachtest to then realize that there's not enough quota left for other
tests.

Release note: None
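A toy illustration of the freeze idea (this is not the actual quotapool/IntAlloc API — the names and shapes here are assumptions — just the trimming behavior described above, in miniature):

```go
package main

import (
	"fmt"
	"sync"
)

// intPool is a toy quota pool whose capacity can shrink after the fact.
type intPool struct {
	mu       sync.Mutex
	cond     *sync.Cond
	capacity int64 // total quota; reduced by Freeze
	used     int64
}

func newIntPool(capacity int64) *intPool {
	p := &intPool{capacity: capacity}
	p.cond = sync.NewCond(&p.mu)
	return p
}

// Alloc blocks until n units are available. If the capacity has been frozen
// below n, the request is trimmed to the remaining capacity.
func (p *intPool) Alloc(n int64) int64 {
	p.mu.Lock()
	defer p.mu.Unlock()
	for {
		if n > p.capacity {
			n = p.capacity // trimmed, as the commit message describes
		}
		if p.used+n <= p.capacity {
			p.used += n
			return n
		}
		p.cond.Wait()
	}
}

func (p *intPool) Release(n int64) {
	p.mu.Lock()
	p.used -= n
	p.mu.Unlock()
	p.cond.Broadcast()
}

// Freeze tells the pool that n currently allocated units will never be
// released - e.g. a cluster saved for debugging keeps its CPUs forever -
// and wakes waiters so they re-check against the smaller capacity.
func (p *intPool) Freeze(n int64) {
	p.mu.Lock()
	p.capacity -= n
	p.used -= n
	p.mu.Unlock()
	p.cond.Broadcast()
}

func main() {
	p := newIntPool(16)
	a := p.Alloc(4)          // a cluster holding 4 CPUs...
	p.Freeze(a)              // ...is saved for debugging; its quota leaves the pool
	fmt.Println(p.Alloc(16)) // trimmed to the remaining capacity: prints 12
}
```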
The point of subtests was that all the subtests under a root share a
cluster. This patch introduces more flexible policies for sharing
clusters across tests (only the declaration of these policies is here;
the implementation is in future commits). The policies replace the need
for subtests, so they go away, simplifying the code considerably.

Release note: None
@andreimatei andreimatei force-pushed the roachtest-parallel branch 4 times, most recently from 92f7911 to e83078f on July 2, 2019 20:38
This patch reimplements the roachtest test runner. Some issues with the
previous version:
- roachtest --parallelism is ignored when running a single test
- repeated runs of that test don't reuse clusters
- cluster reuse is achieved through an inflexible subtest mechanism
- there's no way to control the quota used on GCE. There's only a way to
control the number of tests that run in parallel

This patch takes a clean-slate approach to the runner. It introduces a
runner that takes a --cpu-quota (#cloud cpus) and then, building on the
cluster reuse policies introduced in the previous commit, tries to
schedule tests most efficiently and with the greatest parallelism while
staying under that quota (and also while staying under the
--parallelism, which now acts purely to protect against too many tests
thrashing the local machine).
The runner starts --parallelism workers and lets them compete on a
QuotaPool which manages the cloud quota.
Tests are seen abstractly as units of work. There's no more subtests,
and --count is supported naturally as more work.
The scheduling policy is not advanced: at any point in time, a worker
has a cluster and will continuously select tests that can reuse that
cluster. When no more tests can reuse it, the cluster is destroyed, new
resources acquired and a new cluster created. Within multiple tests that
can all reuse a cluster, ones that don't soil the cluster are preferred.
Within otherwise equal tests, the ones with more runs remaining are
preferred.

Release note: None
@andreimatei andreimatei force-pushed the roachtest-parallel branch from e83078f to 0a776f4 on July 2, 2019 20:56
Contributor Author

@andreimatei andreimatei left a comment

bors r+

:S

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @bobvawter, @nvanbenschoten, @petermattis, and @tbg)

craig bot pushed a commit that referenced this pull request Jul 2, 2019
30977: roachtest: rewrite the test runner r=andreimatei a=andreimatei

This patch reimplements the roachtest test runner. Some issues with the
previous version:
- roachtest --parallelism is ignored when running a single test
- repeated runs of that test don't reuse clusters
- cluster reuse is achieved through an inflexible subtest mechanism
- there's no way to control the quota used on GCE. There's only a way to
control the number of tests that run in parallel

This patch takes a clean-slate approach to the runner. It introduces a
runner that takes a --cpu-quota (#cloud cpus) and then, building on the
cluster reuse policies introduced in the previous commit, tries to
schedule tests most efficiently and with the greatest parallelism while
staying under that quota (and also while staying under the
--parallelism, which now acts purely to protect against too many tests
thrashing the local machine).
The runner starts --parallelism workers and lets them compete on a
ResourceGovernor which manages the cloud quota.
Tests are seen abstractly as units of work. There's no more subtests,
and --count is supported naturally as more work.
The scheduling policy is not advanced: at any point in time, a worker
has a cluster and will continuously select tests that can reuse that
cluster. When no more tests can reuse it, the cluster is destroyed, new
resources acquired and a new cluster created. Within multiple tests that
can all reuse a cluster, ones that don't soil the cluster are preferred.
Within otherwise equal tests, the ones with more runs remaining are
preferred.

Release note: None

Co-authored-by: Andrei Matei <[email protected]>
@craig
Contributor

craig bot commented Jul 2, 2019

Build succeeded

@craig craig bot merged commit 0a776f4 into cockroachdb:master Jul 2, 2019
@andreimatei andreimatei deleted the roachtest-parallel branch July 8, 2019 14:16
ajwerner added a commit to ajwerner/cockroach that referenced this pull request Jul 15, 2019
Before the roachtest rewrite in cockroachdb#30977 we used to collect the logs directory
from remote hosts unconditionally. We also used to write stats.json histogram
files to the logs directory. This combination enabled the collection of
histogram files for roachperf. Since the new roachtest does not unconditionally
copy logs to the test runner, we have not gotten any perf artifacts since the
change.

This PR introduces a new concept, perf artifacts, which live in a different
directory, the perfArtifactsDir. If a testSpec indicates that it
HasPerfArtifacts then that directory is copied to the test runner's
artifactsDir for that test from each node in the cluster upon success.

This change also necessitates a small change to roachperf to look in this new
"perf" directory.

Fixes cockroachdb#38871.

Release note: None
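A sketch of the success-only fetch policy this describes, with illustrative names only (the scp invocation and directory layout stand in for the actual roachtest/roachprod helpers):

```go
package main

import (
	"fmt"
	"os/exec"
	"path/filepath"
)

// fetchPerfArtifacts copies the remote "perf" directory from every node into
// the test's artifacts directory, but only when the test passed; failures
// keep the usual logs-only collection.
func fetchPerfArtifacts(passed bool, hosts []string, artifactsDir string) error {
	if !passed {
		return nil
	}
	for i, host := range hosts {
		dst := filepath.Join(artifactsDir, fmt.Sprintf("%d.perf", i+1))
		// e.g. scp -r <host>:perf <artifactsDir>/<node>.perf
		if out, err := exec.Command("scp", "-r", host+":perf", dst).CombinedOutput(); err != nil {
			return fmt.Errorf("fetching perf artifacts from %s: %v: %s", host, err, out)
		}
	}
	return nil
}

func main() {
	// No-op for a failed test; on success this would pull perf/stats.json
	// from each node.
	_ = fetchPerfArtifacts(false, []string{"user@host1", "user@host2"}, "artifacts/tpcc/run_1")
}
```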
ajwerner added a commit to ajwerner/cockroach that referenced this pull request Jul 16, 2019
Before the roachtest rewrite in cockroachdb#30977 we used to collect the logs directory
from remote hosts unconditionally. We also used to write stats.json histogram
files to the logs directory. This combination enabled the collection of
histogram files for roachperf. Since the new roachtest only copies logs
to the test runner in the case of failure, we have not gotten any perf
artifacts since the change.

This log policy is sane for logs and other debug artifacts like heap profiles
that exist there but has been a bit of a bummer as we've missed out on the
dopamine rush due to seeing charts go up and to the right in bespoke D3 glory.
The truth is that we weren't all that excited about the previous behavior with
regards to perf data either. Sometimes tests would fail and their lousy results
would mess with our pretty pictures
(https://github.com/cockroachdb/roachperf/issues/37).

This PR introduces a new concept, perf artifacts, which live in a different
directory, the perfArtifactsDir ("perf"), and have the opposite fetching policy
from the logs; the "perf" directory is copied from each remote host into the
test's artifactsDir only upon successful completion of the test.

One downside of this as implemented is that it can spam the test's log (but
fortunately not roachtest's logs anymore, <3 @andreimatei). cockroachdb#38890 attempts to
address this somewhat. It shouldn't be too big of a problem because it only
happens on tests that already passed.

This change also necessitates a small change to roachperf to look in this new
"perf" directory (and respect the rewrite's test artifact directory structure).

Fixes cockroachdb#38871.

Release note: None
craig bot pushed a commit that referenced this pull request Jul 16, 2019
38848: roachpb: further clarify correctness of observed timestamps r=nvanbenschoten a=nvanbenschoten

This commit expands on the explanation added in b7cc205. Specifically,
it focuses on the inability to bound a transaction's uncertainty interval
below the starting timestamp of the lease that it is executing a request
on. This limitation was added in 85d4627. I tend to forget about it once
every few months and get very confused how this all works.

Release note: None

38889: roachtest: create perf artifacts concept and collect them r=nvanbenschoten a=ajwerner

Before the roachtest rewrite in #30977 we used to collect the logs directory
from remote hosts unconditionally. We also used to write stats.json histogram
files to the logs directory. This combination enabled the collection of
histogram files for roachperf. Since the new roachtest does not unconditionally
copy logs to the test runner, we have not gotten any perf artifacts since the
change.

This PR introduces a new concept, perf artifacts, which live in a different
directory, the perfArtifactsDir. If a testSpec indicates that it
HasPerfArtifacts then that directory is copied to the test runner's
artifactsDir for that test from each node in the cluster upon success.

This change also necessitates a small change to roachperf to look in this new
"perf" directory.

Fixes #38871.

Release note: None

38895: roachprod: disable ssh logging r=ajwerner a=tbg

This was never fully introduced and there is currently no impetus to
push it over the finish line. It would be useful to diagnose issues
that are readily reproducible, but we haven't seen any of those.

Release note: None

Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Andrew Werner <[email protected]>
Co-authored-by: Tobias Schottdorf <[email protected]>
Labels
X-noremind Bots won't notify about PRs with X-noremind