This repository has been archived by the owner on Nov 6, 2019. It is now read-only.

Perform a health/sanity check on results before serving them live #234

Closed
foolip opened this issue Nov 12, 2017 · 14 comments

Comments

@foolip
Member

foolip commented Nov 12, 2017

Related to #233.

We will most likely keep having runs that aren't complete. Before a run is treated as valid for exposing on the front page, we should have some checks that catch obviously broken things, like:

  • The total number of results is no more than 1000 less than in the last good run.
  • The number of regressions (pass to not pass) is less than 1000.

Any condition we pick will sometimes be violated for good reasons, so some manual override will be needed. Alternatively, we could serve the results but show a warning and auto-file a GitHub issue to investigate.
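
A minimal sketch of what such a gate could look like, assuming (purely for illustration) that each run's results are available as a mapping from test path to (passing, total) subtest counts; the thresholds mirror the checks listed above, and run_looks_healthy is a hypothetical helper, not dashboard code:

MAX_MISSING_RESULTS = 1000  # total results may drop by at most this much
MAX_REGRESSIONS = 1000      # allowed pass -> not-pass transitions


def run_looks_healthy(new_run, last_good_run):
    """Return (ok, reasons); the reasons give a human enough context to override."""
    reasons = []

    missing = len(last_good_run) - len(new_run)
    if missing > MAX_MISSING_RESULTS:
        reasons.append(f"{missing} fewer results than the last good run")

    regressions = sum(
        1
        for test, (passing, total) in last_good_run.items()
        if passing == total                      # fully passing before
        and test in new_run
        and new_run[test][0] < new_run[test][1]  # not fully passing now
    )
    if regressions >= MAX_REGRESSIONS:
        reasons.append(f"{regressions} tests regressed from pass to not-pass")

    return (not reasons, reasons)

If a run fails the gate, the caller could block publishing, or, per the alternative above, publish with a warning and auto-file a GitHub issue.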

@rwaldron
Contributor

rwaldron commented Nov 13, 2017

In support of this, the test run for "chrome 63.0 linux 3.16 @e0a58d4cac Nov 13 2017" has only 943 tests recorded: https://storage.googleapis.com/wptd/e0a58d4cac/chrome-63.0-linux-summary.json.gz

Compare this with:

Browser | Tests run and recorded | Summary
edge-15-windows-10-sauce | 23,927 | https://storage.googleapis.com/wptd/1db039e5b6/edge-15-windows-10-sauce-summary.json.gz
firefox-57.0-linux | 11,149 | https://storage.googleapis.com/wptd/5d55258739/firefox-57.0-linux-summary.json.gz
safari-11.0-macos-10.12-sauce | 23,930 | https://storage.googleapis.com/wptd/e0a58d4cac/safari-11.0-macos-10.12-sauce-summary.json.gz
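
For reference, a rough sketch of how counts like these could be reproduced, assuming the *-summary.json.gz files are gzipped JSON objects keyed by test path (the exact summary format and the count_recorded_tests helper are assumptions for illustration):

import gzip
import json
import urllib.request


def count_recorded_tests(summary_url):
    """Download a published summary and count the tests it recorded."""
    with urllib.request.urlopen(summary_url) as response:
        data = response.read()
    if data[:2] == b"\x1f\x8b":  # payload is still gzip-compressed
        data = gzip.decompress(data)
    return len(json.loads(data))


# For the broken Chrome run mentioned above, this would report only 943 entries:
# count_recorded_tests(
#     "https://storage.googleapis.com/wptd/e0a58d4cac/chrome-63.0-linux-summary.json.gz")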

@foolip
Member Author

foolip commented Nov 14, 2017

Thanks @rwaldron, that's very helpful! (I'll edit the table to say firefox-57.0-linux on the second row, matching the URL.)

@mattl
Contributor

mattl commented Nov 20, 2017

@mdittmer This would be a good place to discuss monitoring these things. I think monitoring is related to this task.

@lukebjerring
Collaborator

Some monitoring would be great to catch issues like #273 early.

@foolip
Member Author

foolip commented Jan 2, 2018

An example from the automated Chrome updates over the holidays that shows the need for this:
[screenshot from 2018-01-02 9:48 PM]

@foolip
Member Author

foolip commented Jan 26, 2018

I was going to file a new urgent issue, but since this is the root cause, I'll just note here that the latest Safari run is very busted:
[screenshot from 2018-01-26 3:28 PM]

@foolip
Member Author

foolip commented Jan 26, 2018

The last Safari run that was pretty OK was https://wpt.fyi/?sha=13eaad17a4

@mattl
Contributor

mattl commented Jan 26, 2018

Fixed up Safari (it ran four times! all four were broken) and uploaded new Fx and Edge runs.

@foolip
Member Author

foolip commented Jan 27, 2018

> Fixed up Safari (it ran four times! all four were broken) and uploaded new Fx and Edge runs.

Not sure if you mean that you ran Safari a fifth time successfully, but most directories after /editing/ now have timeouts instead of no results at all.

@mariestaver
Collaborator

We plan to address this, but as the architecture is currently in flux, it's not clear what the best path for implementing a check of this nature is. Returning this to the backlog until things have stabilized.

@mattl mattl assigned jugglinmike and unassigned mattl Feb 12, 2018
@jugglinmike
Collaborator

I'd like to frame this discussion in terms of tests, not test files or subtests. (I'll emphasize those terms in this comment. Sorry if this seems heavy-handed, but the distinction is both confusing and important for this discussion.)

For example, here are the contents of /console/console-timeline-timelineEnd-historical.any.js:

"use strict";

test(() => {
  assert_equals(console.timeline, undefined, "console.timeline should be undefined");
}, "'timeline' function should not exist on the console object");

test(() => {
  assert_equals(console.timelineEnd, undefined, "console.timelineEnd should be undefined");
}, "'timelineEnd' function should not exist on the console object");

This is one test file. In WPT, it's known as a "multi-global test", so it actually expands to two tests. And since it invokes the test function provided by testharness.js twice, it describes a total of four subtests.
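
Concretely, that one file shows up at three levels of granularity, roughly like this (the expanded URLs assume the default .any.js scopes of window and dedicated worker; the structure below is illustrative, not the dashboard's actual schema):

multi_global_expansion = {
    "file": "/console/console-timeline-timelineEnd-historical.any.js",
    "tests": [
        # the two generated tests that actually appear in results
        "/console/console-timeline-timelineEnd-historical.any.html",
        "/console/console-timeline-timelineEnd-historical.any.worker.html",
    ],
    "subtests_per_test": 2,  # one per test() call in the file
}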

Test files are mostly useless for any sort of reporting purpose. That's because all results are reported in terms of tests; the fact that some of them are expanded from the same file is an implementation detail and largely hidden in the results data (both on the dashboard and as provided by the WPT CLI itself).

The current WPT Dashboard UI describes results in terms of subtests, but this is not a great metric for the performance of the testing infrastructure. That's because in a large portion of tests, subtests are defined programmatically. This makes determining the expected number of subtests impossible. It also means that runtime behavior can alter the number of results reported (e.g. WPT issue #9180).

Evaluating performance based on test count is preferable because:

  1. We can statically determine how many tests are expected to run for any revision of WPT (via the --list-tests option of the WPT CLI)
  2. The actual number of tests executed is not dependent on implementation status

(These aspects might also inform future UI enhancements, but that's a topic for another day.)

I wrote a script to analyze the data sets published on the front page of wpt.fyi today. Here are the results (again, in terms of tests):

  • edge-15-windows-10-sauce @ 7a4c0d6617 - 2018-01-22 12:03:31 +0800
    Actual: 24778
    Expected: 24907
    99.4820733127%
  • chrome-63.0-linux @ fade9e6a04 - 2018-02-09 10:32:42 -0800
    Actual: 25151
    Expected: 25217
    99.7382718008%
  • firefox-57.0-linux @ 9d90463821 - 2018-02-11 18:19:54 -0800
    Actual: 25218
    Expected: 25225
    99.9722497522%
  • safari-11.0-macos-10.12-sauce @ 1f2505669c - 2018-02-02 11:21:17 -0800
    Actual: 25067
    Expected: 25128
    99.7572429163%

So by these standards, the WPT Dashboard is actually doing quite well. The holes in the data are most evident for directories that contain a small number of tests (e.g. device-memory and picture-in-picture), which are particularly vulnerable to small errors. They tend to cover the most cutting-edge technologies (which explains why there are so few tests).

In a meeting yesterday, @rwaldron suggested a concrete heuristic for rejecting "partial" data: that WPTD should refuse to publish data sets where any top-level directory has zero subtest results. This sounds like a goal with user-facing value, but because of the vulnerability described above, it would disqualify many (possibly all) of the datasets that have ever been published.

Here's how I'd like to proceed:

  • For now, I would like to move forward with the following heuristic: at least 98% of the expected tests have results (sketched after this list). With that in place, we'll be safe from the kinds of egregious errors that Philip reported above.
  • In the medium-term (and to @rwaldron's point), I'd like to resolve whatever it is that is consistently causing harness failure in those top-level directories.
  • In the long term, we should insist on 100% test execution. This will likely entail enhancements to the WPT continuous integration process: we shouldn't accept tests that do not run in some browsers!
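
A sketch of that short-term heuristic, assuming the expected test list comes from the WPT CLI's --list-tests output saved somewhere as one path per line (the exact invocation, file format, and meets_coverage_threshold helper are assumptions):

COVERAGE_THRESHOLD = 0.98


def meets_coverage_threshold(recorded_tests, expected_tests):
    """Both arguments are iterables of test paths; returns (ok, coverage)."""
    expected = set(expected_tests)
    recorded = set(recorded_tests) & expected
    coverage = len(recorded) / len(expected)
    return coverage >= COVERAGE_THRESHOLD, coverage


# e.g. for the edge-15 run above: 24778 / 24907 ≈ 99.48%, comfortably above 98%.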

@foolip and @rwaldron does this sound good to you?

@rwaldron
Contributor

@jugglinmike that works for me.

@jugglinmike
Collaborator

This issue was created without a concrete goal, and the discussion that ensued obscures our current target. In light of that, I've created two new issues to make our current objectives (and progress towards them) a little clearer: gh-465 and gh-466.

@foolip
Member Author

foolip commented Feb 24, 2018

Thanks @jugglinmike for the detailed analysis and suggestions. The main problem to solve here is indeed the very incomplete runs, and setting the bar at 98% sounds like a great starting point.

On the terminology: I have in the past insisted on calling the individual testharness.js test, async_test and promise_test "instances" simply tests, and the files just files, but in the face of multi-global tests and even plain reftests this terminology doesn't work. So I'm going to embrace "subtest" and hope we can eventually be consistent in code and documentation as well.
