
metrics: amalgamate the two metrics invoke scripts #912

Open
grahamwhaley opened this issue Feb 15, 2018 · 5 comments

Comments

@grahamwhaley
Contributor

We currently have two metrics invoke scripts. They perform different roles (one literally runs all the metrics tests, the other runs just the CI metrics subset under controlled circumstances), but there is some crossover.

Evaluate if they can be combined into a single script. Some points:

  • It is not clear that they can, given that the CI script does things with KSM, and sets specific parameters for the tests to make the results suitable for CI checking (that is, stability is preferred over performance).
  • The metrics script also has knowledge of how and where to compare or report results.
  • Notionally, the 'run all' script should run for performance.

Maybe we need a third script (a level of indirection), which has parameters to control whether we are running for performance or for stability, and which other features (KSM, result comparison, reporting) we want enabled.
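A very rough sketch of what such a wrapper might look like (purely illustrative - the flag names, the KSM handling and the 'run all' script name below are my assumptions, not existing code):

```bash
#!/bin/bash
# run_metrics.sh - hypothetical single entry point (sketch only).
# Chooses between a stability-focused CI run and a full performance run,
# and optionally enables KSM.

set -e

mode="stability"   # default: CI-style run, tuned for stable results
enable_ksm="no"

while [ $# -gt 0 ]; do
    case "$1" in
        --performance) mode="performance" ;;
        --ksm)         enable_ksm="yes" ;;
        *) echo "unknown option: $1" >&2; exit 1 ;;
    esac
    shift
done

if [ "$enable_ksm" = "yes" ]; then
    # KSM is controlled via sysfs on Linux.
    echo 1 | sudo tee /sys/kernel/mm/ksm/run > /dev/null
fi

if [ "$mode" = "stability" ]; then
    # Run only the CI subset, with parameters tuned for repeatability.
    ./run_metrics_ci.sh
else
    # Run the full set of metrics tests for performance numbers
    # (hypothetical name for the 'run all' script).
    ./run_all_metrics.sh
fi

# Result comparison and reporting hooks would be gated on further flags here.
```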

tbh, after writing this, I suspect we are fine as we are, without adding in a bunch of script complexity that currently would not gain us any extra functionality.

@jodh-intel - wdyt?

@jodh-intel
Contributor

My concern is that if we're running only a subset of the metrics tests for the metrics CI, we might forget to review the disabled tests periodically (bitrot might set in, or we might determine in the future that we could re-enable certain tests). Hence, putting all the code together may minimise those risks (and reduce code duplication).

However, I'm not as close as @grahamwhaley is to all this code to comment further.

@chavafg - do you have any thoughts on this?

@grahamwhaley
Contributor Author

Right, valid point. Some of the disabled tests (but not all - some are outdated, or special cases, or duplicates etc.) can be used 'longer term' to detect regressions. My thoughts on this so far have been:

  1. People want a 'quick answer on a PR' to know whether they have seriously busted something or not - that is what the current run_metrics_ci.sh is aimed at. We will run that on every PR.
  2. And then we need a longer-term strategy to pick up the 'death of a thousand cuts' and/or pick up other tests that we don't run in the PR CI because they are too noisy or take too long. My idea here is that we run those, as a larger/longer CI test run, on the master branch merges. We already have the CI set up this way; it is just that today we run run_metrics_ci.sh on both trees. I also think we may not do the master 'bigger' CI tracking in a script, but more likely in some higher-level data tracking/processing system like maybe influxdb/grafana or elastic/kibana or similar. That is a WIP (see the rough sketch below).
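As a rough illustration only - assuming an InfluxDB 1.x instance and its standard HTTP line-protocol write endpoint; the host, database name, measurement and field values below are made up - pushing a single result from a master-branch run could look like:

```bash
#!/bin/bash
# Sketch: push one metrics result into InfluxDB via the 1.x HTTP write API,
# so that Grafana (or similar) can track it over time.
# The host, database, measurement and field names are all assumptions.

influx_host="http://localhost:8086"
database="metrics_results"              # hypothetical database name
commit="$(git rev-parse --short HEAD)"

# InfluxDB line protocol: measurement,tag=value field=value
curl -sS -XPOST "${influx_host}/write?db=${database}" \
    --data-binary "boot_time,branch=master,commit=${commit} seconds=1.23"
```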

@jodh-intel
Contributor

Sounds good to me. I guess if we do find regressions on master, it could take a while to bisect the issue, but that seems a reasonable price to pay for a relatively quick CI turnaround.
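For what it's worth, tracking down such a regression could be semi-automated with git bisect (a sketch only - the threshold-checking script here is hypothetical):

```bash
# Sketch: bisect a metrics regression between a known-good and a known-bad
# commit. check_metric.sh is a hypothetical script that runs one metric and
# exits non-zero if the result is worse than an agreed threshold.

git bisect start <bad-commit> <good-commit>
git bisect run ./check_metric.sh
git bisect reset
```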

wdyt @chavafg, @sboeuf?

@grahamwhaley
Contributor Author

One more thought somebody injected: as well as the 'quick' CI on the PRs, we could have a slower back-burner CI doing more extensive tests on the PRs. It would not report back to the blocking ack/nack status on GitHub, but if it found a regression it could post a message to the comment thread. That means it would not hold up PR merges, and it could also post results to an already-merged PR if it later found there actually was a regression. Just a thought.
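For illustration, the non-blocking job could post such a message using the standard GitHub REST API for issue/PR comments (a sketch only - the repo slug, PR number and message are placeholders, and a token is assumed in GITHUB_TOKEN):

```bash
#!/bin/bash
# Sketch: have the back-burner CI post a comment on a PR when it detects a
# regression, via the GitHub issues comments API.

repo="<org>/<repo>"       # placeholder repo slug
pr_number="<pr-number>"   # placeholder PR number

curl -sS -XPOST \
    -H "Authorization: token ${GITHUB_TOKEN}" \
    -H "Content-Type: application/json" \
    -d '{"body": "Metrics back-burner CI detected a possible regression on this PR."}' \
    "https://api.github.com/repos/${repo}/issues/${pr_number}/comments"
```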

@jodh-intel
Contributor

That also sounds interesting. I think that realistically we're going to have to have this bipartite PnP system to keep up the project momentum. But another dimension is release PRs: we don't want to make a release and create the corresponding OBS binaries if, at some point after the release, we then discover PnP issues.

Hence, we should tie some PnP process into the release flow. We could run the full battery of PnP tests and, crucially, wait for the full results, either:

On a release PR

Pros

  • ✅ Simple
  • ✅ Logical
    (Run the full PnP on the tip of master, which is what we plan to release)

Cons

  • ❌ Will slow down the release process

On all PRs that will go into a release

Pros

  • ✅ Means the release process itself will be faster
    (as we'll have all the PnP results for the new PRs in that release)

Cons

  • ❌ Potentially counter-intuitive
  • ❌ Would require that we stop landing PRs into master for some time before we plan to release, to allow the full PnP to run (might be > 24 hours).

In summary, would we rather take the pain (time hit):

  • early ("On all PRs that will go into a release"), or
  • late ("On a release PR")?

/cc @jcvenegas?


Aside: I'm amazed there is no decent pair of emojis for tick and cross. Call me "thumbist", but I don't like the look of 👍 and 👎 😄
