Clarify / clean-up Interop 2021 labeling #46
@jensimmons thanks for filing the issue. I have taken a look at the number of tests used in each scoring script, and it turns out my suspicion was wrong. The exact same number of tests goes into both scripts:
I didn't check that the test names are the same, but I now doubt a test mismatch is the right explanation here. I'll self-assign this and look into whether the scores end up being different even for the exact same runs, which I haven't confirmed yet.
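To rule out a test mismatch completely, something like the following sketch could diff the test names used by the two scripts. The file names and the one-test-path-per-line format are assumptions for illustration, not the actual script output:

```js
// Minimal sketch: compare the test names consumed by the two scoring scripts.
// "old-tests.txt" and "new-tests.txt" are hypothetical dumps, one test path per line.
const fs = require('fs');

const readNames = (path) =>
  new Set(fs.readFileSync(path, 'utf8').split('\n').filter(Boolean));

const oldNames = readNames('old-tests.txt');
const newNames = readNames('new-tests.txt');

const onlyOld = [...oldNames].filter((name) => !newNames.has(name));
const onlyNew = [...newNames].filter((name) => !oldNames.has(name));

console.log('only in old script:', onlyOld);
console.log('only in new script:', onlyNew);
```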
I have taken a close look at this. The 4 data files that get loaded by https://wpt.fyi/compat2021 are:
The summary files have the same data as the last entry in the unified scores files. Here's the data from
That data was from 2022-01-31 (sha 29ce70d915) so that's what I've compared. Here are the summary numbers I get from the new script, from the same commit:
All the numbers seem to match. Next I'll look at the total score, explain how it was computed for Compat 2021, and check whether it'll match for Interop 2022.
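As a sanity check, the "summary file matches the last entry of the unified scores file" claim can be verified with a small script. The file names and data shapes below are assumptions for illustration, not the real wpt.fyi data layout:

```js
// Sketch only: file names and shapes are assumed, not the actual data format.
const fs = require('fs');

// Hypothetical summary file: {"chrome": 96, "firefox": 92, "safari": 93}
const summary = JSON.parse(fs.readFileSync('summary.json', 'utf8'));

// Hypothetical unified scores CSV with one "date,chrome,firefox,safari" row per day.
const rows = fs.readFileSync('unified-scores.csv', 'utf8').trim().split('\n');
const [, chrome, firefox, safari] = rows[rows.length - 1].split(',');

console.log('chrome matches:', Number(chrome) === summary.chrome);
console.log('firefox matches:', Number(firefox) === summary.firefox);
console.log('safari matches:', Number(safari) === summary.safari);
```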
Here's how the summary score was computed for Compat 2021. I can get the same scores as are shown on the Compat 2021 dashboard like this:

```js
let chrome = [0.9935572941, 0.9784480375, 0.9805697246, 0.9811258278, 1];
let firefox = [0.970230608, 0.9904512234, 0.9127763988, 0.9614199965, 0.8928571429];
let safari = [0.9650135108, 0.948549192, 0.9631824371, 0.8872781623, 1];

function sum(list) { return list.reduce((acc, x) => acc + x, 0); }

sum(chrome.map(score => Math.floor(score * 20)));  // 96%
sum(firefox.map(score => Math.floor(score * 20))); // 92%
sum(safari.map(score => Math.floor(score * 20)));  // 93%
```

So the way this worked: because each area was first truncated to a 0-20 score, the only way to increase the overall score was to pass a 5% threshold in an individual area. This is something I'm proposing to change. If we were to score just these 5 areas using the method I'm suggesting, it would be:

```js
let chrome = [993, 978, 980, 981, 1000];
let firefox = [970, 990, 912, 961, 892];
let safari = [965, 948, 963, 887, 1000];

function sum(list) { return list.reduce((acc, x) => acc + x, 0); }

Math.floor(sum(chrome) / 5);  // 986 i.e. 98.6%
Math.floor(sum(firefox) / 5); // 945 i.e. 94.5%
Math.floor(sum(safari) / 5);  // 952 i.e. 95.2%
```

I believe this is better since smaller improvements in the individual area scores get reflected in the overall score. Note that this isn't just due to adding a decimal point (which we can debate) but also because we get rid of the truncation to a 0-20 score for each area.
One thing remains to be explained here:
Taking this Safari run as the example, since it's what I used in the previous comments, there are 755 tests. The Safari score in our metrics is:
So this is mainly explained by that truncation to a 0-20 score per area. Unfortunately, it's hard to read that 88.7% from wpt.fyi, because wpt.fyi can't show normalized scores, where each test counts the same regardless of the number of subtests. This is a feature request in web-platform-tests/wpt.fyi#2290 and I think we might get to it this year, but not before the launch of Interop 2022. It is possible to verify the score with careful counting of tests and subtests, however.
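For reference, here's a minimal sketch of what "normalized scores" means in this context, where each test counts equally regardless of its number of subtests. The `results` data shape is made up for illustration:

```js
// Sketch: normalized scoring where each test contributes equally, however many
// subtests it has. The `results` shape is assumed for illustration only.
const results = [
  { test: 'a.html', passingSubtests: 9, totalSubtests: 10 },
  { test: 'b.html', passingSubtests: 1, totalSubtests: 1 },
  { test: 'c.html', passingSubtests: 50, totalSubtests: 200 },
];

// Each test contributes a 0-1 score; the area score is the mean of those.
const perTest = results.map((r) => r.passingSubtests / r.totalSubtests);
const areaScore = perTest.reduce((acc, x) => acc + x, 0) / perTest.length;

console.log(areaScore.toFixed(4)); // 0.7167 for this made-up data
```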
Interop 2021 now has labels for the tests:
@foolip pulled data (published here: https://gist.github.com/foolip/25c9ed482a0dd802f9bf2eea4544ccac )
(Where 993 = 99.3% pass rate, based on weighted calculations.)
Meanwhile, the Compat 2021 dashboard shows the following scores:
What I don't understand is how the labeled tests, given the pass rates quoted above, translate into the points given. There seems to be another layer of computation that's happening.
For example, let's look at Safari's transform score. We are passing 847 of 1000 tests, which is 84.7%. Yet we have a score of 16, which is the equivalent of 80%. Applying this across all the tests, it seems both Safari and Firefox are being underscored.
Philip, you mentioned in the meeting that perhaps this is happening because the wrong tests are labeled?
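For what it's worth, the 0-20-per-area truncation described in the comments above reproduces exactly this gap; a quick check:

```js
// Quick check: with each area truncated to a 0-20 score, an 84.7% pass rate
// maps to 16 points, i.e. the 80% shown on the dashboard for this area.
const passRate = 847 / 1000;                  // 0.847
const areaPoints = Math.floor(passRate * 20); // floor(16.94) = 16
console.log(areaPoints, `${(areaPoints / 20) * 100}%`); // 16 '80%'
```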