Clarify / clean-up Interop 2021 labeling #46
@jensimmons thanks for filing the issue. I have taken a look at the number of tests used in each scoring script, and it turns out my suspicion was wrong. The exact same number of tests goes into both scripts:
I didn't check that the test names are the same, but I now doubt a test mismatch is the right explanation here. I'll self-assign this and look into whether the scores end up being different even for the exact same runs, which I haven't confirmed yet.
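To rule out a test mismatch completely, something like the following sketch could diff the test names used by the two scripts. The file names and the one-test-path-per-line format are assumptions for illustration, not the actual script output:

```js
// Minimal sketch: compare the test names consumed by the two scoring scripts.
// "old-tests.txt" and "new-tests.txt" are hypothetical dumps, one test path per line.
const fs = require('fs');

const readNames = (path) =>
  new Set(fs.readFileSync(path, 'utf8').split('\n').filter(Boolean));

const oldNames = readNames('old-tests.txt');
const newNames = readNames('new-tests.txt');

const onlyOld = [...oldNames].filter((name) => !newNames.has(name));
const onlyNew = [...newNames].filter((name) => !oldNames.has(name));

console.log('only in old script:', onlyOld);
console.log('only in new script:', onlyNew);
```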
I have taken a close look at this. The 4 data files that get loaded by https://wpt.fyi/compat2021 are:
The summary files have the same data as the last entry in the unified scores files. Here's the data from
That data was from 2022-01-31 (sha 29ce70d915) so that's what I've compared. Here are the summary numbers I get from the new script, from the same commit:
All the numbers seem to match. Next I'll look at the total score, explain how it was computed for Compat 2021, and check whether it'll match for Interop 2022.
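As a sanity check, the "summary file matches the last entry of the unified scores file" claim can be verified with a small script. The file names and data shapes below are assumptions for illustration, not the real wpt.fyi data layout:

```js
// Sketch only: file names and shapes are assumed, not the actual data format.
const fs = require('fs');

// Hypothetical summary file: {"chrome": 96, "firefox": 92, "safari": 93}
const summary = JSON.parse(fs.readFileSync('summary.json', 'utf8'));

// Hypothetical unified scores CSV with one "date,chrome,firefox,safari" row per day.
const rows = fs.readFileSync('unified-scores.csv', 'utf8').trim().split('\n');
const [, chrome, firefox, safari] = rows[rows.length - 1].split(',');

console.log('chrome matches:', Number(chrome) === summary.chrome);
console.log('firefox matches:', Number(firefox) === summary.firefox);
console.log('safari matches:', Number(safari) === summary.safari);
```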
Here's how the summary score was computed for Compat 2021. I can get the same scores as are shown on the Compat 2021 dashboard like this:

```js
let chrome = [0.9935572941, 0.9784480375, 0.9805697246, 0.9811258278, 1];
let firefox = [0.970230608, 0.9904512234, 0.9127763988, 0.9614199965, 0.8928571429];
let safari = [0.9650135108, 0.948549192, 0.9631824371, 0.8872781623, 1];

function sum(list) { return list.reduce((acc, x) => acc + x, 0); }

sum(chrome.map(score => Math.floor(score * 20)));  // 96%
sum(firefox.map(score => Math.floor(score * 20))); // 92%
sum(safari.map(score => Math.floor(score * 20)));  // 93%
```

So the way this worked: because each area was first truncated to a 0-20 score, the only way to increase the overall score was to pass a 5% threshold in an individual area. This is something I'm proposing to change. If we were to score just these 5 areas using the method I'm suggesting, it would be:

```js
let chrome = [993, 978, 980, 981, 1000];
let firefox = [970, 990, 912, 961, 892];
let safari = [965, 948, 963, 887, 1000];

function sum(list) { return list.reduce((acc, x) => acc + x, 0); }

Math.floor(sum(chrome) / 5);  // 986 i.e. 98.6%
Math.floor(sum(firefox) / 5); // 945 i.e. 94.5%
Math.floor(sum(safari) / 5);  // 952 i.e. 95.2%
```

I believe this is better since smaller improvements in the individual area scores get reflected in the overall score. Note that this isn't just due to adding a decimal point (which we can debate) but also because we get rid of the truncation to a 0-20 score for each area.
One thing remains to be explained here:
Taking this Safari run as the example, since it's what I used in the previous comments, there are 755 tests. The Safari score in our metrics is:
So this is mainly explained by that truncation to a 0-20 score per area. Unfortunately, it's hard to read that 88.7% from wpt.fyi, because wpt.fyi can't show normalized scores, where each test counts the same regardless of the number of subtests. This is a feature request in web-platform-tests/wpt.fyi#2290 and I think we might get to it this year, but not before the launch of Interop 2022. It is possible to verify the score with careful counting of tests and subtests, however.
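For reference, here's a minimal sketch of what "normalized scores" means in this context, where each test counts equally regardless of its number of subtests. The `results` data shape is made up for illustration:

```js
// Sketch: normalized scoring where each test contributes equally, however many
// subtests it has. The `results` shape is assumed for illustration only.
const results = [
  { test: 'a.html', passingSubtests: 9, totalSubtests: 10 },
  { test: 'b.html', passingSubtests: 1, totalSubtests: 1 },
  { test: 'c.html', passingSubtests: 50, totalSubtests: 200 },
];

// Each test contributes a 0-1 score; the area score is the mean of those.
const perTest = results.map((r) => r.passingSubtests / r.totalSubtests);
const areaScore = perTest.reduce((acc, x) => acc + x, 0) / perTest.length;

console.log(areaScore.toFixed(4)); // 0.7167 for this made-up data
```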
Interop 2021 now has labels for the tests:
@foolip pulled data (published here: https://gist.github.com/foolip/25c9ed482a0dd802f9bf2eea4544ccac )
(Where 993 = 99.3% pass rate, based on weighted calculations.)
Meanwhile, the Compat 2021 dashboard shows the following scores:
What I don't understand is how the labeled tests, given the pass rates quoted above, translate into the points given. There seems to be another layer of computation that's happening.
For example, let's look at Safari's transform score. We are passing 847 of 1000 tests, which is 84.7%. Yet we have a score of 16, which is the equivalent of 80%. Applying this across all the tests, it seems both Safari and Firefox are being underscored.
Philip, you mentioned in the meeting that perhaps this is happening because the wrong tests are labeled?
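For what it's worth, the 0-20-per-area truncation described in the comments above reproduces exactly this gap; a quick check:

```js
// Quick check: with each area truncated to a 0-20 score, an 84.7% pass rate
// maps to 16 points, i.e. the 80% shown on the dashboard for this area.
const passRate = 847 / 1000;                  // 0.847
const areaPoints = Math.floor(passRate * 20); // floor(16.94) = 16
console.log(areaPoints, `${(areaPoints / 20) * 100}%`); // 16 '80%'
```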