roachtest: rewrite the test runner #30977
Conversation
The first commit is #30146. Peter, this patch is nowhere near ready, but I'd like a nod on the direction pls.
Force-pushed from 0f27000 to 77ac0ec
Before this PR, --parallelism was explicitly limited to the number of tests by this code:
// Limit the parallelism to the number of tests. The primary effect this has
// is that we'll log to stdout/stderr if only one test is being run.
if parallelism > len(tests) {
parallelism = len(tests)
}
Perhaps we should just remove that code once your PR to place each run's output in a separate directory is merged. If you want to run a single test multiple times sequentially you can already do --count 100 --parallelism 1.
Reviewed 4 of 4 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained
Yes, that code is going away.
I'm not sure what your comment says about the current PR though (if anything). Are you suggesting this PR is not necessary? (cause this PR is all about cluster reuse)
Reviewable status: complete! 0 of 0 LGTMs obtained
I meant to ask - how does one use/test the roachtest run --cluster option? Is there an easy way to create a cluster that matches a particular test's expectation? Am I supposed to cause a test to fail and use the --debug option to leave a cluster around?
Also, is the roachtest invocation supposed to take ownership of such a cluster? Like, is it supposed to destroy it at the end?
Reviewable status: complete! 0 of 0 LGTMs obtained
Disregard the last comment; I meant to ask on another PR :P
Reviewable status: complete! 0 of 0 LGTMs obtained
I'm not sure what your comment says about the current PR though (if anything). Are you suggesting this PR is not necessary? (cause this PR is all about cluster reuse)
My comment was in reference to the PR title which only talks about parallelism for repeated runs of the same test. I failed to notice that you're also reusing clusters (yes, this is mentioned in the commit message which I read too fast). Let me take a closer look.
Reviewable status: complete! 0 of 0 LGTMs obtained
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/cmd/roachtest/test.go, line 637 at r2 (raw file):
}
sem := make(chan struct{}, parallelism)
clustersPerTLT := parallelism / n
What happens if parallelism == 1 and n == 2? Seems like clustersPerTLT will be 0. Isn't that problematic, or is that what the adjustment below is trying to handle?
pkg/cmd/roachtest/test.go, line 643 at r2 (raw file):
numClusters := clustersPerTLT
if i < parallelism%n {
    numClusters++
I don't understand what this is doing.
pkg/cmd/roachtest/test.go, line 715 at r2 (raw file):
    artifactsDir: artifactsDir,
}
f.Put(newCluster(ctx, t, spec.Nodes))
Won't multiple concurrent cluster creation collide on both the cluster name and the artifacts dir?
Force-pushed from 77ac0ec to fd0503c
You don't seem to be against anything, so I got what I wanted for now.
Will ping when it's ready.
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/cmd/roachtest/test.go, line 637 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
What happens if parallelism == 1 and n == 2? Seems like clustersPerTLT will be 0. Isn't that problematic, or is that what the adjustment below is trying to handle?
It's not supposed to be problematic. The different runTopLevelTest invocations will contend on sem for creating new clusters.
pkg/cmd/roachtest/test.go, line 643 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
I don't understand what this is doing.
It's giving an extra cluster to some tests if the division of parallelism by the number of tests had a remainder. So in your example (parallelism == 1 and n == 2), the first test would get a guaranteed cluster, and the second one would wait until the first one does all its runs.
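To make the arithmetic concrete, here is a small self-contained sketch of the distribution described above (the distributeClusters helper and its packaging are made up for illustration; the real logic is inlined in test.go):

package main

import "fmt"

// distributeClusters illustrates the remainder handling discussed above:
// parallelism clusters are split across n top-level tests, and the first
// parallelism%n tests get one extra cluster. With parallelism == 1 and
// n == 2, the second test gets zero guaranteed clusters and has to wait
// on the shared semaphore.
func distributeClusters(parallelism, n int) []int {
    counts := make([]int, n)
    clustersPerTLT := parallelism / n
    for i := 0; i < n; i++ {
        numClusters := clustersPerTLT
        if i < parallelism%n {
            numClusters++
        }
        counts[i] = numClusters
    }
    return counts
}

func main() {
    fmt.Println(distributeClusters(1, 2)) // [1 0]
    fmt.Println(distributeClusters(5, 3)) // [2 2 1]
}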
pkg/cmd/roachtest/test.go, line 715 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Won't multiple concurrent cluster creation collide on both the cluster name and the artifacts dir?
Yes (except I think you wanted to say "cluster name and log file"; the artifacts dir isn't an issue any more, as that's now a property of a test, not of a cluster).
I told you it wasn't ready :P. I'll keep you posted.
You don't seem to be against anything, so I got what I wanted for now.
Will ping when it's ready.
Ack. The assignment of parallelism to clusters is a bit more subtle than desirable, though I suppose the same could be said of much of the roachtest code. The channel of channels is interesting. I don't have a better suggestion.
Reviewable status: complete! 0 of 0 LGTMs obtained
Force-pushed from 929a58e to 8a35cca
Force-pushed from b0e515a to 0d9dfb5
Force-pushed from 0d9dfb5 to bd1f3d1
@benesch I'd appreciate a high-level look over this. There are many i's to dot still and I haven't really run the thing, but I think it's ready for a first look.
I've only skimmed parts of this and I likely won't have time to give it a thorough look until Friday. I suggest we hold off on merging this until 2.1 is out the door as I expect it to cause some short term instability. Or you have to find a way to break it into smaller pieces.
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/cmd/roachtest/main.go, line 196 at r5 (raw file):
cmd.Flags().IntVar(
    &cpuQuota, "cpu-quota", 100,
    "The number of cloud CPUs roachtest is allowed to use at any one time.")
What happens if a single test wants to use more CPU?
pkg/cmd/roachtest/test.go, line 90 at r5 (raw file):
// Any means that the cluster can be used by any other test.
var Any = ClusterReusePolicy{policy: any}
My intuition is that we should somehow fold this into the nodeSpec. The benefit of your approach is that you can upgrade an Any cluster to a tagged cluster, but I'm not sure that is worth supporting.
The nodeSpec approach fits more seamlessly into a world where "nobarrier" and "zfs" are also part of the nodeSpec. nodes(4, tag("acceptance")) could be used to specify a cluster that is shared by the acceptance tests (they would all have this same specification). nodes(6, noreuse) could be used to indicate a cluster that is not reused. I haven't looked at how you're checking to see if two cluster specs are compatible, but this seems to fit in naturally. nodes(4, nobarrier) could indicate that the disks are to be remounted with the "nobarrier" option. You could even imagine sharing a nodes(4) and nodes(4, nobarrier) cluster by remounting the disks automatically before starting the test. I'm not suggesting you do that work in this PR, just that the nodeSpec mechanism seems like the right place to be specifying cluster reuse.
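As a rough sketch of that suggestion (every name below, including the nodeSpec fields, nodeOpt and compatible, is hypothetical and not code from this PR), a variadic option on the spec could carry the tag, reuse and mount settings and drive a compatibility check:

package main

import "fmt"

// Hypothetical sketch of folding tag/noreuse/nobarrier into the node spec,
// along the lines suggested above.
type nodeSpec struct {
    Count     int
    Tag       string // non-empty: only shareable with tests using the same tag
    NoReuse   bool   // the cluster is not reused after the test
    NoBarrier bool   // disks are remounted with the "nobarrier" option
}

type nodeOpt func(*nodeSpec)

func tag(t string) nodeOpt { return func(s *nodeSpec) { s.Tag = t } }

var (
    noreuse   nodeOpt = func(s *nodeSpec) { s.NoReuse = true }
    nobarrier nodeOpt = func(s *nodeSpec) { s.NoBarrier = true }
)

func nodes(count int, opts ...nodeOpt) nodeSpec {
    s := nodeSpec{Count: count}
    for _, o := range opts {
        o(&s)
    }
    return s
}

// compatible is a naive check of whether two specs could share a cluster.
func compatible(a, b nodeSpec) bool {
    return !a.NoReuse && !b.NoReuse &&
        a.Count == b.Count && a.Tag == b.Tag && a.NoBarrier == b.NoBarrier
}

func main() {
    fmt.Println(compatible(nodes(4, tag("acceptance")), nodes(4, tag("acceptance")))) // true
    fmt.Println(compatible(nodes(4), nodes(4, nobarrier)))                            // false
    fmt.Println(compatible(nodes(6, noreuse), nodes(6)))                              // false
}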
pkg/cmd/roachtest/test.go, line 911 at r5 (raw file):
for i := 0; i < parallelism; i++ {
    wg.Add(1)
    stopper.RunWorker(ctx, func(ctx context.Context) {
I'm not seeing what a stopper is adding here. Why not just go func() { ... } and a channel which is closed to indicate that runWorker should exit?
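For reference, the pattern being suggested looks roughly like this; it's a generic sketch of goroutines plus a channel that is closed to signal shutdown, not the runner's code:

package main

import (
    "fmt"
    "sync"
    "time"
)

// Sketch of the alternative suggested above: plain goroutines plus a channel
// that is closed to tell the workers to exit, instead of a stopper. The work
// itself is stubbed out; only the shutdown pattern is shown.
func main() {
    const parallelism = 3
    done := make(chan struct{}) // closed to request shutdown
    var wg sync.WaitGroup

    for i := 0; i < parallelism; i++ {
        wg.Add(1)
        go func(i int) {
            defer wg.Done()
            for {
                select {
                case <-done:
                    return // shutdown requested
                default:
                    // ... pick up and run the next test here ...
                    fmt.Println("worker", i, "looking for work")
                    time.Sleep(10 * time.Millisecond)
                }
            }
        }(i)
    }

    time.Sleep(30 * time.Millisecond)
    close(done) // signal all workers to exit
    wg.Wait()
}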
pkg/cmd/roachtest/test.go, line 974 at r5 (raw file):
    name = fmt.Sprintf("%d", timeutil.Now().Unix())
}
name += "-" + t.Name
Including the test name in the cluster name is going to be weird once clusters are being reused across unassociated tests. If the cluster is tagged, I think the tag should be included instead. I'm not sure what to do about untagged clusters.
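One possible scheme along those lines, purely hypothetical and not what the PR implements: use the tag when the cluster has one and fall back to a timestamp for untagged clusters.

package main

import (
    "fmt"
    "time"
)

// Hypothetical naming scheme following the comment above: prefer the reuse
// tag over the test name, fall back to a timestamp for untagged clusters.
func clusterName(user, tag string, now time.Time) string {
    suffix := tag
    if suffix == "" {
        suffix = fmt.Sprintf("%d", now.Unix())
    }
    return user + "-" + suffix
}

func main() {
    now := time.Unix(1538000000, 0)
    fmt.Println(clusterName("andrei", "acceptance", now)) // andrei-acceptance
    fmt.Println(clusterName("andrei", "", now))           // andrei-1538000000
}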
Force-pushed from 37e5135 to a65b664
Force-pushed from ada1dfc to 26c161f
Force-pushed from c507175 to 11841d3
nit: print the artifacts dir when a test fails
I basically have to go find the directory manually (the test runner log file is similarly opaque).
print the artifacts dir when a test fails
done
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @andreimatei, @bobvawter, @nvanbenschoten, @petermattis, and @tbg)
This patch adds facilities for taking quota out of a pool - i.e. the client informing the pool that it's never getting back quota that it handed out and that the pool should deal with it. Taking out quota is done through a new IntAlloc.Freeze() method. The effects are that ongoing Alloc() requests are trimmed to the new capacity, and ongoing AllocFunc() requests are woken up and given a more expansive interface into the pool for figuring out the current capacity. Additionally, the closure passed to AllocFunc() can now return an error to interrupt the respective request.

This is useful for roachtest - when a cluster is "saved for debugging", its cpu quota is effectively leaked. This PR gives us a way for roachtest to then realize that there's not enough quota left for other tests.

Release note: None
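A minimal sketch of that idea, assuming a much-simplified pool (this is not the actual quotapool/IntAlloc interface; only the effect of a Freeze-like operation is illustrated):

package main

import (
    "fmt"
    "sync"
)

// Simplified sketch of the commit's idea: a quota pool whose capacity can be
// permanently reduced when quota is leaked (e.g. by a cluster saved for
// debugging), with waiters trimmed to the new capacity.
type pool struct {
    mu       sync.Mutex
    cond     *sync.Cond
    capacity int64 // shrinks when quota is taken out for good
    used     int64
}

func newPool(capacity int64) *pool {
    p := &pool{capacity: capacity}
    p.cond = sync.NewCond(&p.mu)
    return p
}

// Acquire blocks until quota is available. Requests larger than the
// (possibly shrunk) capacity are trimmed to it, mirroring the "ongoing
// Alloc() requests are trimmed to the new capacity" behavior above.
func (p *pool) Acquire(n int64) int64 {
    p.mu.Lock()
    defer p.mu.Unlock()
    for {
        if n > p.capacity {
            n = p.capacity
        }
        if p.used+n <= p.capacity {
            p.used += n
            return n
        }
        p.cond.Wait()
    }
}

func (p *pool) Release(n int64) {
    p.mu.Lock()
    defer p.mu.Unlock()
    p.used -= n
    p.cond.Broadcast()
}

// Freeze tells the pool that n units it previously handed out are never
// coming back: the capacity shrinks and waiters are woken up so they can
// re-evaluate (and be trimmed) against the new capacity.
func (p *pool) Freeze(n int64) {
    p.mu.Lock()
    defer p.mu.Unlock()
    p.capacity -= n
    p.used -= n
    p.cond.Broadcast()
}

func main() {
    p := newPool(100)
    p.Acquire(40)              // a cluster saved for debugging holds 40 CPUs
    p.Freeze(40)               // that quota is never coming back
    fmt.Println(p.Acquire(80)) // trimmed to the remaining 60: prints 60
}

The key point is that freezing shrinks the capacity rather than merely never releasing, so a waiter asking for more than what's left can be trimmed or failed instead of blocking forever.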
The point of subtests was that all the subtests under a root share a cluster. This patch introduces more flexible policies for sharing clusters across tests (only the declaration of these policies is here; the implementation is in future commits). The policies replace the need for subtests, so they go away, simplifying the code considerably.

Release note: None
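For orientation, here is a sketch of what such policies could look like, extrapolated from the Any declaration quoted earlier in the review; the Tagged and NoReuse variants and the canShare check are assumptions rather than the PR's actual declarations:

package main

import "fmt"

type policyKind int

const (
    any policyKind = iota
    tagged
    noReuse
)

// ClusterReusePolicy says which other tests may reuse a test's cluster.
type ClusterReusePolicy struct {
    policy policyKind
    tag    string // only set for the tagged policy
}

// Any means that the cluster can be used by any other test.
var Any = ClusterReusePolicy{policy: any}

// Tagged means the cluster can only be shared with tests using the same tag.
func Tagged(tag string) ClusterReusePolicy {
    return ClusterReusePolicy{policy: tagged, tag: tag}
}

// NoReuse means the cluster is destroyed after a single test.
var NoReuse = ClusterReusePolicy{policy: noReuse}

// canShare is a naive compatibility check between two policies.
func canShare(a, b ClusterReusePolicy) bool {
    if a.policy == noReuse || b.policy == noReuse {
        return false
    }
    if a.policy == tagged || b.policy == tagged {
        return a.policy == b.policy && a.tag == b.tag
    }
    return true
}

func main() {
    fmt.Println(canShare(Any, Any))                                   // true
    fmt.Println(canShare(Tagged("acceptance"), Tagged("acceptance"))) // true
    fmt.Println(canShare(Any, NoReuse))                               // false
}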
Force-pushed from 92f7911 to e83078f
This patch reimplements the roachtest test runner. Some issues with the previous version:
- roachtest --parallelism is ignored when running a single test
- repeated runs of that test don't reuse clusters
- cluster reuse is achieved through an inflexible subtest mechanism
- there's no way to control the quota used on GCE; there's only a way to control the number of tests that run in parallel

This patch takes a clean-slate approach to the runner. It introduces a runner that takes a --cpu-quota (# of cloud CPUs) and then, building on the cluster reuse policies introduced in the previous commit, tries to schedule tests most efficiently and with the greatest parallelism while staying under that quota (and also while staying under --parallelism, which now acts purely to protect against too many tests thrashing the local machine).

The runner starts --parallelism workers and lets them compete on a QuotaPool which manages the cloud quota. Tests are seen abstractly as units of work. There are no more subtests, and --count is supported naturally as more work.

The scheduling policy is not advanced: at any point in time, a worker has a cluster and will continuously select tests that can reuse that cluster. When no more tests can reuse it, the cluster is destroyed, new resources are acquired and a new cluster is created. Within multiple tests that can all reuse a cluster, ones that don't soil the cluster are preferred. Within otherwise equal tests, the ones with more runs remaining are preferred.

Release note: None
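Sketched very roughly, the worker loop described in this message could look like the following; every name is assumed for illustration, and the real runner also deals with quota units, failures and artifacts:

package main

// High-level sketch of the scheduling described above: each worker holds at
// most one cluster and keeps picking tests that can reuse it, preferring
// tests that don't soil the cluster and, among those, tests with more runs
// remaining.

type testRun struct {
    name         string
    nodes        int
    runsLeft     int
    soilsCluster bool
}

type cluster struct{ nodes int }

// pickTest selects the next test that can run on (or reuse) c.
func pickTest(work []*testRun, c *cluster) *testRun {
    var best *testRun
    for _, t := range work {
        if t.runsLeft == 0 || (c != nil && t.nodes != c.nodes) {
            continue
        }
        if best == nil ||
            (!t.soilsCluster && best.soilsCluster) ||
            (t.soilsCluster == best.soilsCluster && t.runsLeft > best.runsLeft) {
            best = t
        }
    }
    return best
}

// worker runs tests until no work is left, creating and destroying clusters
// as the reuse opportunities dictate. Quota accounting is stubbed out.
func worker(work []*testRun, acquireQuota, releaseQuota func(nodes int)) {
    var c *cluster
    for {
        t := pickTest(work, c)
        if t == nil {
            if c == nil {
                return // no work left for this worker
            }
            // Nothing can reuse the current cluster: destroy it, give the
            // quota back and try again with fresh resources.
            releaseQuota(c.nodes)
            c = nil
            continue
        }
        if c == nil {
            acquireQuota(t.nodes)
            c = &cluster{nodes: t.nodes}
        }
        // ... run the test against c ...
        t.runsLeft--
        if t.soilsCluster {
            releaseQuota(c.nodes)
            c = nil // a soiled cluster is not reused
        }
    }
}

func main() {
    work := []*testRun{
        {name: "kv/splits", nodes: 4, runsLeft: 2},
        {name: "import", nodes: 4, runsLeft: 1, soilsCluster: true},
    }
    worker(work, func(int) {}, func(int) {})
}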
Force-pushed from e83078f to 0a776f4
bors r+
:S
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @bobvawter, @nvanbenschoten, @petermattis, and @tbg)
30977: roachtest: rewrite the test runner r=andreimatei a=andreimatei

This patch reimplements the roachtest test runner. Some issues with the previous version:
- roachtest --parallelism is ignored when running a single test
- repeated runs of that test don't reuse clusters
- cluster reuse is achieved through an inflexible subtest mechanism
- there's no way to control the quota used on GCE; there's only a way to control the number of tests that run in parallel

This patch takes a clean-slate approach to the runner. It introduces a runner that takes a --cpu-quota (# of cloud CPUs) and then, building on the cluster reuse policies introduced in the previous commit, tries to schedule tests most efficiently and with the greatest parallelism while staying under that quota (and also while staying under --parallelism, which now acts purely to protect against too many tests thrashing the local machine).

The runner starts --parallelism workers and lets them compete on a ResourceGovernor which manages the cloud quota. Tests are seen abstractly as units of work. There are no more subtests, and --count is supported naturally as more work.

The scheduling policy is not advanced: at any point in time, a worker has a cluster and will continuously select tests that can reuse that cluster. When no more tests can reuse it, the cluster is destroyed, new resources are acquired and a new cluster is created. Within multiple tests that can all reuse a cluster, ones that don't soil the cluster are preferred. Within otherwise equal tests, the ones with more runs remaining are preferred.

Release note: None

Co-authored-by: Andrei Matei <[email protected]>
Build succeeded
Before the roachtest rewrite in cockroachdb#30977 we used to collect the logs directory from remote hosts unconditionally. We also used to write stats.json histogram files to the logs directory. This combination enabled the collection of histogram files for roachperf. Since the new roachtest does not unconditionally copy logs to the test runner, we have not gotten any perf artifacts since the change.

This PR introduces a new concept, perf artifacts, which live in a different directory, the perfArtifactsDir. If a testSpec indicates that it HasPerfArtifacts then that directory is copied to the test runner's artifactsDir for that test from each node in the cluster upon success. This change also necessitates a small change to roachperf to look in this new "perf" directory.

Fixes cockroachdb#38871.

Release note: None
Before the roachtest rewrite in cockroachdb#30977 we used to collect the logs directory from remote hosts unconditionally. We also used to write stats.json histogram files to the logs directory. This combination enabled the collection of histogram files for roachperf. Since the new roachtest only copies logs to the test runner in the case of failure, we have not gotten any perf artifacts since the change. This log policy is sane for logs and other debug artifacts like heap profiles that exist there but has been a bit of a bummer as we've missed out on the dopamine rush due to seeing charts go up and to the right in bespoke D3 glory. The truth is that we weren't all that excited about the previous behavior with regards to perf data either. Sometimes tests would fail and their lousy results would mess with our pretty pictures (https://github.com/cockroachdb/roachperf/issues/37).

This PR introduces a new concept, perf artifacts, which live in a different directory, the perfArtifactsDir ("perf"), and have the opposite fetching policy from the logs; the "perf" directory is copied from each remote host into the test's artifactsDir only upon successful completion of the test. One downside of this as implemented is that it can spam the test's log (but fortunately not roachtest's logs anymore, <3 @andreimatei). cockroachdb#38890 attempts to address this somewhat. It shouldn't be too big of a problem because it only happens on tests that already passed.

This change also necessitates a small change to roachperf to look in this new "perf" directory (and respect the rewrite's test artifact directory structure).

Fixes cockroachdb#38871.

Release note: None
38848: roachpb: further clarify correctness of observed timestamps r=nvanbenschoten a=nvanbenschoten

This commit expands on the explanation added in b7cc205. Specifically, it focuses on the inability to bound a transaction's uncertainty interval below the starting timestamp of the lease that it is executing a request on. This limitation was added in 85d4627. I tend to forget about it once every few months and get very confused how this all works.

Release note: None

38889: roachtest: create perf artifacts concept and collect them r=nvanbenschoten a=ajwerner

Before the roachtest rewrite in #30977 we used to collect the logs directory from remote hosts unconditionally. We also used to write stats.json histogram files to the logs directory. This combination enabled the collection of histogram files for roachperf. Since the new roachtest does not unconditionally copy logs to the test runner, we have not gotten any perf artifacts since the change.

This PR introduces a new concept, perf artifacts, which live in a different directory, the perfArtifactsDir. If a testSpec indicates that it HasPerfArtifacts then that directory is copied to the test runner's artifactsDir for that test from each node in the cluster upon success. This change also necessitates a small change to roachperf to look in this new "perf" directory.

Fixes #38871.

Release note: None

38895: roachprod: disable ssh logging r=ajwerner a=tbg

This was never fully introduced and there is currently no impetus to push it over the finish line. It would be useful to diagnose issues that are readily reproducible, but we haven't seen any of those.

Release note: None

Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Andrew Werner <[email protected]>
Co-authored-by: Tobias Schottdorf <[email protected]>