
metrics: amalgamate the two metrics invoke scripts #912

Open
grahamwhaley opened this issue Feb 15, 2018 · 5 comments

Comments

@grahamwhaley
Contributor

We currently have two metrics invoke scripts. They perform different roles (one literally runs all the metrics tests, the other runs just the CI metrics subset under controlled circumstances), but there is some crossover.

Evaluate if they can be combined into a single script. Some points:

  • It is not clear that they can, given that the CI script does things with KSM, and sets specific parameters for the tests to make the results suitable for CI checking (that is, stability is preferred over performance).
  • The metrics script also has knowledge of how and where to compare or report results.
  • Notionally, the 'run all' script should run for performance.

Maybe we need a third script (a level of indirection), which has parameters to control whether we are running for performance or for stability, and which other features (KSM, result comparison, reporting) we want enabled.
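A very rough sketch of what such a wrapper might look like (purely illustrative - the flag names, the KSM handling and the 'run all' script name below are my assumptions, not existing code):

```bash
#!/bin/bash
# run_metrics.sh - hypothetical single entry point (sketch only).
# Chooses between a stability-focused CI run and a full performance run,
# and optionally enables KSM.

set -e

mode="stability"   # default: CI-style run, tuned for stable results
enable_ksm="no"

while [ $# -gt 0 ]; do
    case "$1" in
        --performance) mode="performance" ;;
        --ksm)         enable_ksm="yes" ;;
        *) echo "unknown option: $1" >&2; exit 1 ;;
    esac
    shift
done

if [ "$enable_ksm" = "yes" ]; then
    # KSM is controlled via sysfs on Linux.
    echo 1 | sudo tee /sys/kernel/mm/ksm/run > /dev/null
fi

if [ "$mode" = "stability" ]; then
    # Run only the CI subset, with parameters tuned for repeatability.
    ./run_metrics_ci.sh
else
    # Run the full set of metrics tests for performance numbers
    # (hypothetical name for the 'run all' script).
    ./run_all_metrics.sh
fi

# Result comparison and reporting hooks would be gated on further flags here.
```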

tbh, after writing this, I suspect we are fine as we are, without adding in a bunch of script complexity that currently would not gain us any extra functionality.

@jodh-intel - wdyt?

@jodh-intel
Contributor

My concern is that if we're running only a subset of the metrics tests for the metrics CI, we might forget to review the disabled tests periodically (bitrot might set in, or we might determine in the future that we could re-enable certain tests). Hence, putting all the code together may minimise those risks (and reduce code duplication).

However, I'm not as close as @grahamwhaley is to all this code to comment further.

@chavafg - do you have any thoughts on this?

@grahamwhaley
Contributor Author

Right, valid point. Some of the disabled tests (but not all - some are outdated, or special cases, or duplicates etc.) can be used 'longer term' to detect regressions. My thoughts on this so far have been:

  1. People want a 'quick answer on a PR' to know whether they have seriously busted something or not - that is what the current run_metrics_ci.sh is aimed at. We will run that on every PR.
  2. And then we need a longer-term strategy to pick up the 'death of a thousand cuts' and/or pick up other tests that we don't run in the PR CI because they are too noisy or take too long. My idea here is that we run those, as a larger/longer CI test run, on the master branch merges. We already have the CI set up this way; it is just that today we run run_metrics_ci.sh on both trees. I also think we may not do the master 'bigger' CI tracking in a script, but more likely in some higher-level data tracking/processing system like maybe influxdb/grafana or elastic/kibana or similar. That is a WIP (see the rough sketch below).
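As a rough illustration only - assuming an InfluxDB 1.x instance and its standard HTTP line-protocol write endpoint; the host, database name, measurement and field values below are made up - pushing a single result from a master-branch run could look like:

```bash
#!/bin/bash
# Sketch: push one metrics result into InfluxDB via the 1.x HTTP write API,
# so that Grafana (or similar) can track it over time.
# The host, database, measurement and field names are all assumptions.

influx_host="http://localhost:8086"
database="metrics_results"              # hypothetical database name
commit="$(git rev-parse --short HEAD)"

# InfluxDB line protocol: measurement,tag=value field=value
curl -sS -XPOST "${influx_host}/write?db=${database}" \
    --data-binary "boot_time,branch=master,commit=${commit} seconds=1.23"
```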

@jodh-intel
Contributor

Sounds good to me. I guess if we do find regressions on master, it could take a while to bisect the issue, but that seems a reasonable price to pay for a relatively quick CI turnaround.
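For what it's worth, tracking down such a regression could be semi-automated with git bisect (a sketch only - the threshold-checking script here is hypothetical):

```bash
# Sketch: bisect a metrics regression between a known-good and a known-bad
# commit. check_metric.sh is a hypothetical script that runs one metric and
# exits non-zero if the result is worse than an agreed threshold.

git bisect start <bad-commit> <good-commit>
git bisect run ./check_metric.sh
git bisect reset
```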

wdyt @chavafg, @sboeuf?

@grahamwhaley
Contributor Author

One more thought somebody injected: as well as the 'quick' CI on the PRs, we could have a slower back-burner CI doing more extensive tests on the PRs. It would not report back to the blocking ack/nack status on GitHub, but if it found a regression it could post a message to the comment thread. That means it would not hold up PR merges, and it could also post results to an already-merged PR if it later found there actually was a regression. Just a thought.
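For illustration, the non-blocking job could post such a message using the standard GitHub REST API for issue/PR comments (a sketch only - the repo slug, PR number and message are placeholders, and a token is assumed in GITHUB_TOKEN):

```bash
#!/bin/bash
# Sketch: have the back-burner CI post a comment on a PR when it detects a
# regression, via the GitHub issues comments API.

repo="<org>/<repo>"       # placeholder repo slug
pr_number="<pr-number>"   # placeholder PR number

curl -sS -XPOST \
    -H "Authorization: token ${GITHUB_TOKEN}" \
    -H "Content-Type: application/json" \
    -d '{"body": "Metrics back-burner CI detected a possible regression on this PR."}' \
    "https://api.github.com/repos/${repo}/issues/${pr_number}/comments"
```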

@jodh-intel
Contributor

That also sounds interesting. I think that realistically we're going to have to have this bipartite PnP system to keep up the project momentum. But another dimension is release PRs: we don't want to make a release and create the corresponding OBS binaries if, at some point after the release, we then discover PnP issues.

Hence, we should tie some PnP process into the release flow. We could run the full battery of PnP tests and, crucially, wait for the full results, either:

On a release PR

Pros

  • ✅ Simple
  • ✅ Logical
    (Run the full PnP on the tip of master, which is what we plan to release)

Cons

  • ❌ Will slow down the release process

On all PRs that will go into a release

Pros

  • ✅ Means the release process itself will be faster
    (as we'll have all the PnP results for the new PRs in that release)

Cons

  • ❌ Potentially counter-intuitive
  • ❌ Would require that we stop landing PRs into master for some time before we plan to release, to allow the full PnP to run (might be > 24 hours).

In summary, would we rather take the pain (time hit):

  • early ("On all PRs that will go into a release"), or
  • late ("On a release PR")?

/cc @jcvenegas?


Aside: I'm amazed there is no decent pair of emojis for tick and cross. Call me "thumbist", but I don't like the look of 👍 and 👎 😄
