Perform a health/sanity check on results before serving them live #234
Comments
In support of this, the test run for "chrome 63.0 linux 3.16 @e0a58d4cac Nov 13 2017" has only 943 tests recorded: https://storage.googleapis.com/wptd/e0a58d4cac/chrome-63.0-linux-summary.json.gz
Compare this with:
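As an aside, a minimal sketch of how a count like this can be reproduced from a published summary file, assuming each summary is a (possibly gzipped) JSON object keyed by test path:

```python
# Sketch: count how many tests a published run recorded. Assumes the summary
# is a JSON object whose keys are test paths; the gzip handling is defensive
# because the server may or may not decompress the payload in transit.
import gzip
import json
import urllib.request

SUMMARY_URL = ("https://storage.googleapis.com/wptd/e0a58d4cac/"
               "chrome-63.0-linux-summary.json.gz")

def count_tests(url):
    with urllib.request.urlopen(url) as response:
        raw = response.read()
    try:
        raw = gzip.decompress(raw)
    except OSError:
        pass  # the payload was already decompressed
    return len(json.loads(raw))

print(count_tests(SUMMARY_URL))  # should roughly match the 943 tests mentioned above
```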
Thanks @rwaldron, that's very helpful! (I'll edit the table to say firefox-57.0-linux on the second row, matching the URL.)
@mdittmer This would be a good place to discuss monitoring these things. I think monitoring is related to this task.
Some monitoring would be great to catch issues like #273 early.
The last Safari run that was pretty OK was https://wpt.fyi/?sha=13eaad17a4
Fixed up Safari (it ran four times! all four were broken) and uploaded new Fx and Edge runs.
Not sure if you mean that you ran Safari a fifth time with success, but most directories past editing now have timeouts instead of no results at all.
We plan to address this, but as the architecture is currently in motion, it's not clear what the best path for implementing a check of this nature is. Returning this to the backlog until things are more stable.
I'd like to frame this discussion in terms of tests, not test files or subtests. (I'll emphasize those terms in this comment. Sorry if this seems heavy-handed, but the distinction is both confusing and important for this discussion.) For example, here are the contents of
This is one test file. In WPT, it's known as a "multi-global test", so it actually expands to two tests. And since it invokes the

Test files are mostly useless for any sort of reporting purpose. That's because all results are in terms of tests; the fact that some of them are expanded from the same file is an implementation detail and is largely hidden in the results data (both on the dashboard and as provided by the WPT CLI itself).

The current WPT Dashboard UI describes results in terms of subtests, but this is not a great metric for the performance of the testing infrastructure. That's because in a large portion of tests, subtests are defined programmatically. This makes determining the expected number of subtests impossible. It also means that runtime behavior can alter the number of results reported (e.g. issue #9180 of

Evaluating performance based on test count is preferable because:
(These aspects might also inform future UI enhancements, but that's a topic for another day.) I wrote a script to analyze the data sets published on the front page of wpt.fyi today. Here are the results (again, in terms of tests):
So by these standards, the WPT Dashboard is actually doing quite well. The holes in the data are most evident for directories that contain a small number of tests (i.e. device-memory and picture-in-picture). These are particularly vulnerable to small errors, and they tend to describe the most cutting-edge technologies (which explains why there are so few tests).

In a meeting yesterday, @rwaldron suggested a concrete heuristic for rejecting "partial" data: that WPTD should refuse to publish data sets where any top-level directory has zero subtest results. This sounds like a goal with user-facing value, but because of the vulnerability described above, it would disqualify many (possibly all) of the data sets that have ever been published. Here's how I'd like to proceed:
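A rough sketch of the kind of per-directory tally described above, assuming the published summary JSON maps each test path (e.g. "/dom/nodes/Node-cloneNode.html") to a [passing, total] pair of subtest counts; this is an illustration, not the script referenced in the comment:

```python
# Sketch: group a run's results by top-level directory, counting tests
# (summary keys) separately from subtests (the totals in each value).
from collections import defaultdict

def per_directory_counts(summary):
    """Return {top_level_directory: (test_count, subtest_count)}."""
    counts = defaultdict(lambda: [0, 0])
    for test_path, (_passing, total) in summary.items():
        top_level = test_path.lstrip("/").split("/", 1)[0]
        counts[top_level][0] += 1      # one entry per test, even when several
                                       # tests expand from a single file
        counts[top_level][1] += total  # subtest count as reported by the run
    return {d: tuple(pair) for d, pair in counts.items()}
```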
@jugglinmike that works for me.
Thanks @jugglinmike for the detailed analysis and suggestions. The main problem to solve here is indeed the very incomplete runs, and setting the bar at 98% sounds like a great starting point.

On terminology, I have in the past insisted on calling the individual testharness.js test(), async_test() and promise_test() instances simply "tests", and the files just "files", but in the face of multi-global tests and even plain reftests that terminology doesn't work. So I'm going to embrace "subtest" and hope we can eventually be consistent in code and documentation as well.
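A sketch of what a check with that bar might look like, assuming it is applied per top-level directory against a known-good reference run; the function names and the choice of granularity are assumptions, not the project's actual implementation:

```python
# Sketch: reject a run for front-page publication if any top-level directory
# has fewer than 98% of the tests seen in a reference run. Both arguments map
# directory name -> number of tests with results (e.g. the first element of
# each pair from the earlier per_directory_counts sketch).
COMPLETENESS_THRESHOLD = 0.98

def run_is_complete_enough(new_counts, reference_counts,
                           threshold=COMPLETENESS_THRESHOLD):
    for directory, expected in reference_counts.items():
        actual = new_counts.get(directory, 0)
        if expected and actual / expected < threshold:
            return False
    return True
```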
Related to #233.
We will most likely keep having runs that aren't complete. Before a run is treated as valid for exposing on the front page, we should have some checks that catch obviously broken things, like:
Any condition we pick will sometimes be violated for good reasons, so some manual override will be needed. Alternatively, we could serve the results but show a warning and auto-file a GitHub issue to investigate.
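A sketch of that alternative flow, assuming the checks are plain callables that return a failure message (or None) and that the auto-filed issue goes through the standard GitHub REST API; the repository slug, token handling, and run structure are placeholders:

```python
# Sketch: run sanity checks before exposing a run. A manual override lets a
# run through anyway; otherwise failing runs are served with a warning and a
# GitHub issue is auto-filed for investigation.
import os
import requests

GITHUB_REPO = "owner/repo"  # placeholder, not the project's actual repo slug

def evaluate_run(run, checks):
    """Each check returns a failure message or None; collect the failures."""
    failures = []
    for check in checks:
        message = check(run)
        if message is not None:
            failures.append(message)
    return failures

def decide_publication(run, checks, manual_override=False):
    """Return ('serve', []) or ('serve-with-warning', failure_messages)."""
    failures = evaluate_run(run, checks)
    if not failures or manual_override:
        return "serve", []
    file_investigation_issue(run, failures)
    return "serve-with-warning", failures

def file_investigation_issue(run, failures):
    # Uses the standard GitHub issues endpoint; GITHUB_TOKEN is assumed to be
    # set in the environment.
    requests.post(
        f"https://api.github.com/repos/{GITHUB_REPO}/issues",
        headers={"Authorization": f"token {os.environ['GITHUB_TOKEN']}"},
        json={
            "title": f"Suspicious results in run {run.get('id', 'unknown')}",
            "body": "The following checks failed:\n"
                    + "\n".join(f"- {f}" for f in failures),
        },
        timeout=30,
    )
```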