Score "Investigate" progress as part of the overall metric #49
Comments
I think this makes sense, and 15% is a reasonable portion of the score. I do think spelling out the scoring of these areas ahead of time might be challenging, in particular for Pointer Events where the issue was that existing failures didn't seem representative of issues that web developers might face, but do we know in any detail what things do need to be tested? @jensimmons, what do you think of this proposal, in particular as regards #41?
I also think it's ok, and good to pay attention to these research tasks in addition to fixing bugs. My suggestion is to start with a 0 score in this area for all three browser implementations and then increase it at the end of each quarter if progress has been made, via a consensus decision. I think each research group can and should define its own agenda and ways of organizing; it's not necessary for the whole group to pre-agree on these plans. The group can then review the results at the end of the quarter and give a reasonable score.
This makes sense to me for Viewports. Perhaps we make the (few) automated tests 50%, and the manual testing that we plan and do the other 50%? We won't be able to manually test over and over, however, so I'm open to debate on this.
I have a serious problem with reopening and re-litigating decisions already made by changing the process after the fact.
At this stage, we have a number of concerns about substantially changing the scoring approach, especially when we are so close to the proposed announcement date. This isn’t to say we don’t think there’s substantial value in documenting progress on the things marked as investigate, but I think we’d need a much more concrete proposal as to how the scoring would affect the overall scores. It’s unclear how the Investigate items will be scored, but presuming that gets figured out, it’s still very unclear how those scores will impact the overall score for each browser. Is the proposal that the resulting score be applied to each browser’s total in an equal fashion? Or that browsers will be able to earn more points than others by participating in the standards and testing work needed? If the proposal is to have the investigation scoring apply equally to all browsers, it’s not clear it provides much value aside from making it harder for everyone to reach 100%; it lacks the competitive pressure that the metric otherwise provides. If the proposal is to have investigation scoring apply differently to each browser, based on some sort of measure of participation or contribution, then much more detail would be needed as to how this would be measured. In either case, we are dubious that consensus about how each investigation item is scored can be reached in time for the announcement date, and we very strongly want any discussions about scoring to be concluded by that date. Our preference, as far as 2022 is concerned (and we can develop further proposals over the coming year for how we want to score investigate items in future), is to list the investigation as a separate item on the dashboard, aside from the browsers and their scores. In that regard, we’re much less concerned about ongoing discussion about the scoring of the items, provided we believe we can reach consensus within the first quarter of the year.
As a point of clarity, the proposal is to have a uniform score that applies to all implementations. Investigation work intrinsically requires collaboration and so it makes sense to score it in a way that rewards everyone for collective progress. That does come with the tradeoff that you can't use that score in a "competitive" sense against other implementations. But you can use it to demonstrate that you've made good on a commitment to improving the web ecosystem as a whole. Although competitive pressure is certainly one way that the web makes progress, it's not the only way, and scoring an interoperability metric entirely on the basis of what specific implementations accomplish in terms of landing new feature work is missing the bigger picture.
This is implemented in https://staging.wpt.fyi/interop-2022 now, in all places except one. In the summary graph, I've forgotten to scale score to 90%. Leaving this issue as a reminder of that...
I fixed the summary graph at some point, closing this as fixed.
Mozilla are concerned that the score being based purely on the test pass rate for the accepted proposals doesn't provide any visible reward for progress on the areas that were marked as "Investigate". Although these are not at the point where we can define a set of tests for implementors to target, several of the "Investigate" areas are places where we see a lot of compatibility problems in practice, and completing the pre-implementation work required to address those problems will pay dividends in terms of improving the experience of the web for end users and authors. Therefore, that work should be recognised alongside implementation work in terms of improving the Interop-2022 score.
Given this we propose the following:
For example, we might decide that 85% of the total score comes from the test pass rate, and 15% of the score comes from progress on the "Investigate" areas. If we completed two thirds of the investigate work, but failed to make progress on the final third, that would mean the most that any one implementation could score on the overall metric would be 95%.
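To make the arithmetic in that example concrete, here is a minimal sketch in Python. The function name `overall_score` and its parameters are hypothetical, and the 85/15 split is only the illustrative weighting from the example above, not an agreed formula.

```python
def overall_score(test_pass_rate, investigate_progress, investigate_weight=0.15):
    """Combine a browser's test pass rate with the shared "Investigate" progress.

    Both inputs are fractions between 0 and 1. investigate_progress is the
    same for every implementation, since the proposal scores it uniformly.
    The default weight of 0.15 is an assumption taken from the example.
    """
    for value in (test_pass_rate, investigate_progress, investigate_weight):
        if not 0.0 <= value <= 1.0:
            raise ValueError("all inputs must be fractions between 0 and 1")
    test_weight = 1.0 - investigate_weight
    return test_weight * test_pass_rate + investigate_weight * investigate_progress


# An implementation passing every test, with two thirds of the investigate
# work complete: 0.85 * 1 + 0.15 * (2 / 3)
best_case = overall_score(1.0, 2 / 3)
print(f"{best_case:.0%}")  # prints "95%"
```

Because the investigate term is identical for every implementation, it shifts all scores by the same amount and caps the maximum achievable score until the investigation work is done.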
Concretely, I think the areas we marked as "Investigate" that should form part of this proposal are:
I've excluded AVIF because, as I understand it, the scope of "Investigate" there wasn't about the state of the spec or tests, and accent-color because the concern there seemed to be that there weren't actually any failing tests.
In terms of scoring I don't think that all these areas necessarily need to be given equal weight, or that each "Investigate" area has to be exactly equal to an implementation area.