This repository has been archived by the owner on Nov 6, 2019. It is now read-only.

Perform a health/sanity check on results before serving them live #234

Closed
foolip opened this issue Nov 12, 2017 · 14 comments

Comments

@foolip
Member

foolip commented Nov 12, 2017

Related to #233.

We will most likely keep having runs that aren't complete. Before a run is treated as valid for exposing on the front page, we should have some checks that catch obviously broken things, like:

  • The total number of results is no more than 1000 less than in the last good run.
  • The number of regressions (pass to not pass) is less than 1000.

Any condition we pick will sometimes be violated for good reasons, so some manual override will be needed. Alternatively, we could serve the results but show a warning and auto-file a GitHub issue to investigate.
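
A minimal sketch of what such a gate could look like, assuming (purely for illustration) that each run's results are available as a mapping from test path to (passing, total) subtest counts; the thresholds mirror the checks listed above, and run_looks_healthy is a hypothetical helper, not dashboard code:

MAX_MISSING_RESULTS = 1000  # total results may drop by at most this much
MAX_REGRESSIONS = 1000      # allowed pass -> not-pass transitions


def run_looks_healthy(new_run, last_good_run):
    """Return (ok, reasons); the reasons give a human enough context to override."""
    reasons = []

    missing = len(last_good_run) - len(new_run)
    if missing > MAX_MISSING_RESULTS:
        reasons.append(f"{missing} fewer results than the last good run")

    regressions = sum(
        1
        for test, (passing, total) in last_good_run.items()
        if passing == total                      # fully passing before
        and test in new_run
        and new_run[test][0] < new_run[test][1]  # not fully passing now
    )
    if regressions >= MAX_REGRESSIONS:
        reasons.append(f"{regressions} tests regressed from pass to not-pass")

    return (not reasons, reasons)

If a run fails the gate, the caller could block publishing, or, per the alternative above, publish with a warning and auto-file a GitHub issue.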

@rwaldron
Contributor

rwaldron commented Nov 13, 2017

In support of this, the test run for "chrome 63.0 linux 3.16 @e0a58d4cac Nov 13 2017" has only 943 tests recorded: https://storage.googleapis.com/wptd/e0a58d4cac/chrome-63.0-linux-summary.json.gz

Compare this with:

Browser | Tests run and recorded | Summary
edge-15-windows-10-sauce | 23,927 | https://storage.googleapis.com/wptd/1db039e5b6/edge-15-windows-10-sauce-summary.json.gz
firefox-57.0-linux | 11,149 | https://storage.googleapis.com/wptd/5d55258739/firefox-57.0-linux-summary.json.gz
safari-11.0-macos-10.12-sauce | 23,930 | https://storage.googleapis.com/wptd/e0a58d4cac/safari-11.0-macos-10.12-sauce-summary.json.gz
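
For reference, a rough sketch of how counts like these could be reproduced, assuming the *-summary.json.gz files are gzipped JSON objects keyed by test path (the exact summary format and the count_recorded_tests helper are assumptions for illustration):

import gzip
import json
import urllib.request


def count_recorded_tests(summary_url):
    """Download a published summary and count the tests it recorded."""
    with urllib.request.urlopen(summary_url) as response:
        data = response.read()
    if data[:2] == b"\x1f\x8b":  # payload is still gzip-compressed
        data = gzip.decompress(data)
    return len(json.loads(data))


# For the broken Chrome run mentioned above, this would report only 943 entries:
# count_recorded_tests(
#     "https://storage.googleapis.com/wptd/e0a58d4cac/chrome-63.0-linux-summary.json.gz")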

@foolip
Member Author

foolip commented Nov 14, 2017

Thanks @rwaldron, that's very helpful! (I'll edit the table to say firefox-57.0-linux on the second row, matching the URL.)

@mattl
Contributor

mattl commented Nov 20, 2017

@mdittmer This would be a good place to discuss monitoring these things. I think monitoring is related to this task.

@lukebjerring
Collaborator

Some monitoring would be great to catch issues like #273 early.

@foolip
Member Author

foolip commented Jan 2, 2018

An example from the automated Chrome updates over the holidays that shows the need for this:
[screenshot from 2018-01-02 9:48 PM]

@foolip
Member Author

foolip commented Jan 26, 2018

I was going to file a new urgent issue, but since this is the root cause, I'll just note here that the latest Safari run is very busted:
[screenshot from 2018-01-26 3:28 PM]

@foolip
Member Author

foolip commented Jan 26, 2018

The last Safari run that was pretty OK was https://wpt.fyi/?sha=13eaad17a4

@mattl
Contributor

mattl commented Jan 26, 2018

Fixed up Safari (it ran four times! all four were broken) and uploaded new Fx and Edge runs.

@foolip
Member Author

foolip commented Jan 27, 2018

> Fixed up Safari (it ran four times! all four were broken) and uploaded new Fx and Edge runs.

Not sure if you mean that you ran Safari a fifth time successfully, but most directories after /editing/ now have timeouts instead of no results at all.

@mariestaver
Collaborator

We plan to address this, but as the architecture is currently in flux, it's not clear what the best path for implementing a check of this nature is. Returning this to the backlog until things have stabilized.

@mattl mattl assigned jugglinmike and unassigned mattl Feb 12, 2018
@jugglinmike
Collaborator

I'd like to frame this discussion in terms of tests, not test files or subtests. (I'll emphasize those terms in this comment. Sorry if this seems heavy-handed, but the distinction is both confusing and important for this discussion.)

For example, here are the contents of /console/console-timeline-timelineEnd-historical.any.js:

"use strict";

test(() => {
  assert_equals(console.timeline, undefined, "console.timeline should be undefined");
}, "'timeline' function should not exist on the console object");

test(() => {
  assert_equals(console.timelineEnd, undefined, "console.timelineEnd should be undefined");
}, "'timelineEnd' function should not exist on the console object");

This is one test file. In WPT, it's known as a "multi-global test", so it actually expands to two tests. And since it invokes the test function provided by testharness.js twice, it describes a total of four subtests.
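
Concretely, that one file shows up at three levels of granularity, roughly like this (the expanded URLs assume the default .any.js scopes of window and dedicated worker; the structure below is illustrative, not the dashboard's actual schema):

multi_global_expansion = {
    "file": "/console/console-timeline-timelineEnd-historical.any.js",
    "tests": [
        # the two generated tests that actually appear in results
        "/console/console-timeline-timelineEnd-historical.any.html",
        "/console/console-timeline-timelineEnd-historical.any.worker.html",
    ],
    "subtests_per_test": 2,  # one per test() call in the file
}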

Test files are mostly useless for any sort of reporting purpose. That's because all results are reported in terms of tests; the fact that some of them are expanded from the same file is an implementation detail and largely hidden in the results data (both on the dashboard and as provided by the WPT CLI itself).

The current WPT Dashboard UI describes results in terms of subtests, but this is not a great metric for the performance of the testing infrastructure. That's because in a large portion of tests, subtests are defined programmatically. This makes determining the expected number of subtests impossible. It also means that runtime behavior can alter the number of results reported (e.g. WPT issue #9180).

Evaluating performance based on test count is preferable because:

  1. We can statically determine how many tests are expected to run for any revision of WPT (via the --list-tests option of the WPT CLI)
  2. The actual number of tests executed is not dependent on implementation status

(These aspects might also inform future UI enhancements, but that's a topic for another day.)

I wrote a script to analyze the data sets published on the front page of wpt.fyi today. Here are the results (again, in terms of tests):

  • edge-15-windows-10-sauce @ 7a4c0d6617 - 2018-01-22 12:03:31 +0800
    Actual: 24778
    Expected: 24907
    99.4820733127%
  • chrome-63.0-linux @ fade9e6a04 - 2018-02-09 10:32:42 -0800
    Actual: 25151
    Expected: 25217
    99.7382718008%
  • firefox-57.0-linux @ 9d90463821 - 2018-02-11 18:19:54 -0800
    Actual: 25218
    Expected: 25225
    99.9722497522%
  • safari-11.0-macos-10.12-sauce @ 1f2505669c - 2018-02-02 11:21:17 -0800
    Actual: 25067
    Expected: 25128
    99.7572429163%

So by these standards, the WPT Dashboard is actually doing quite well. The holes in the data are most evident for directories that contain a small number of tests (e.g. device-memory and picture-in-picture), which are particularly vulnerable to small errors. They tend to cover the most cutting-edge technologies (which explains why there are so few tests).

In a meeting yesterday, @rwaldron suggested a concrete heuristic for rejecting "partial" data: that WPTD should refuse to publish data sets where any top-level directory has zero subtest results. This sounds like a goal with user-facing value, but because of the vulnerability described above, it would disqualify many (possibly all) of the datasets that have ever been published.

Here's how I'd like to proceed:

  • For now, I would like to move forward with the following heuristic: at least 98% of the expected tests have results (sketched after this list). With that in place, we'll be safe from the kinds of egregious errors that Philip reported above.
  • In the medium-term (and to @rwaldron's point), I'd like to resolve whatever it is that is consistently causing harness failure in those top-level directories.
  • In the long term, we should insist on 100% test execution. This will likely entail enhancements to the WPT continuous integration process: we shouldn't accept tests that do not run in some browsers!
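
A sketch of that short-term heuristic, assuming the expected test list comes from the WPT CLI's --list-tests output saved somewhere as one path per line (the exact invocation, file format, and meets_coverage_threshold helper are assumptions):

COVERAGE_THRESHOLD = 0.98


def meets_coverage_threshold(recorded_tests, expected_tests):
    """Both arguments are iterables of test paths; returns (ok, coverage)."""
    expected = set(expected_tests)
    recorded = set(recorded_tests) & expected
    coverage = len(recorded) / len(expected)
    return coverage >= COVERAGE_THRESHOLD, coverage


# e.g. for the edge-15 run above: 24778 / 24907 ≈ 99.48%, comfortably above 98%.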

@foolip and @rwaldron does this sound good to you?

@rwaldron
Contributor

@jugglinmike that works for me.

@jugglinmike
Collaborator

This issue was created without a concrete goal, and the discussion that ensued obscures our current target. In light of that, I've created two new issues to make our current objectives (and progress towards them) a little clearer: gh-465 and gh-466.

@foolip
Member Author

foolip commented Feb 24, 2018

Thanks @jugglinmike for the detailed analysis and suggestions. The main problem to solve here is indeed the very incomplete runs, and setting the bar at 98% sounds like a great starting point.

On the terminology: I have in the past insisted on calling the individual testharness.js test, async_test and promise_test "instances" simply tests, and the files just files, but in the face of multi-global tests and even plain reftests this terminology doesn't work. So I'm going to embrace "subtest" and hope we can eventually be consistent in code and documentation as well.
