Finalize assignments: Chapter 3. Markup #5
Comments
@bkardell can you think of anyone who might be interested in reviewing this chapter?
I'm not sure if this is the best place to post a suggestion, but I think it would be very interesting to chronicle the advent of non-semantic HTML that's essentially auto-generated by build tools (think of the project Twitter is using to take React Native code and build it for the web, resulting in div/span tag soup; example here: https://twitter.com/jaredcwhite/status/1090283063320276992). This isn't new, of course. I remember tool-generated "tag soup" HTML being a thing since the 90s, but it felt like that mostly went away in the HTML5 era, and now it's rearing its (IMHO) ugly head again.
@zcorpan would you be interested in reviewing this chapter? @jaredcwhite +1, I think that's a great idea. Is that something that would require looking back at older datasets, or do you think it would be sufficient to look at the current dataset and do something like measure the proportion of div/span against all tags? We're adding instrumentation in HTTPArchive/legacy.httparchive.org#159 to extract tags from the document, so this will only be something we can get easily going forward. Also, we'd love to have you as a reviewer if you're up for it!
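A minimal sketch of that proportion measurement, using Python's standard-library HTML parser. Note this is illustrative only: `html.parser` is not a spec-compliant HTML5 parser, so counts on real-world pages are approximate, and the names here are made up rather than taken from the Almanac codebase.

```python
from collections import Counter
from html.parser import HTMLParser

class TagCounter(HTMLParser):
    """Counts occurrences of each element name in a document."""
    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1

def div_span_proportion(html: str) -> float:
    """Fraction of all start tags that are <div> or <span>."""
    counter = TagCounter()
    counter.feed(html)
    total = sum(counter.counts.values())
    if total == 0:
        return 0.0
    return (counter.counts["div"] + counter.counts["span"]) / total

# 2 of the 4 elements below are div/span
doc = "<html><body><div><span>hi</span></div></body></html>"
print(div_span_proportion(doc))  # 0.5
```

In practice the Almanac computed this kind of statistic with SQL over the custom-metric data in BigQuery rather than by reparsing response bodies, but the idea is the same: tally element names, then divide.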
Looking up context, I checked the README of this repo and also found https://discuss.httparchive.org/t/planning-the-web-almanac-2019/1553. Exciting! I'm happy to review. Is the thing that needs review written yet? If so, where? When does the review need to be done?
Great! Glad to have you on board. The current status is that we're planning which metrics to include in each chapter. If you have any ideas or feedback we'd love to hear them. Hoping to lock the metrics down by June 3. The writing phase will start in ~August.
Ideas:
Nice! I see that I never suggested @zcorpan here apparently, but I know I did somewhere! Glad this worked out, as I think he'd have been a better author even 😁
Another idea for a metric, though I don't know if it would be difficult to implement:
One way would be to add use counters for each parse error in Chromium's HTML parser. Another way (likely simpler) would be to run the response body through an HTML parser that can log parse errors. |
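The second option, running response bodies through an error-logging parser, can be sketched with the standard library. The caveat is that Python's `html.parser` does not report the HTML Standard's defined parse errors at all, so the sketch below only approximates one class of them (stray or misnested end tags, plus unclosed elements) by tracking a tag stack; a real analysis would want a spec-compliant parser such as html5lib, whose parser object exposes an `errors` list.

```python
from html.parser import HTMLParser

# Void elements never take an end tag, so keep them off the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "param", "source", "track", "wbr"}

class MismatchLogger(HTMLParser):
    """Logs end tags that don't match the most recently opened element.

    A rough approximation of one class of parse error, not the full
    set defined by the HTML Standard.
    """
    def __init__(self):
        super().__init__()
        self.stack = []
        self.errors = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        else:
            line, col = self.getpos()
            self.errors.append(f"{line}:{col}: unexpected </{tag}>")

def log_errors(body: str) -> list:
    parser = MismatchLogger()
    parser.feed(body)
    parser.close()
    # Anything still open at end-of-input is also worth flagging.
    for tag in parser.stack:
        parser.errors.append(f"unclosed <{tag}>")
    return parser.errors

print(log_errors("<div><b>text</div>"))
```

Running this over HTTP Archive response bodies would be the "secondary data pass" mentioned below, which is exactly the complexity the team was hoping to avoid.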
Yeah, that sounds good. Use counters would actually be the easiest thing from the analysis perspective. Any kind of secondary data pass would be adding a lot of complexity.
OK, though it might not be trivial to add error reporting to Chromium's HTML parser, since it currently doesn't care about any errors (I believe). For example, it could require adding new branches to the state machine to differentiate between inputs that are errors and inputs that are not but otherwise have the same effect. So there's a risk of regressing performance and a risk of introducing new bugs to the parser, on top of just implementing the error reporting. To make it more worthwhile, maybe we could check whether browser devtools would also want to make use of HTML parse errors? (Firefox highlights parse errors in View Source, but doesn't show them in devtools, AFAIK.)
@bkardell could you sign off on this chapter? If you think we have enough metrics and reviewers you can close this issue.
I filed https://bugs.chromium.org/p/chromium/issues/detail?id=971851 about implementing HTML parse error reporting.
Thanks @zcorpan!
lgtm.
@zcorpan @bkardell Does the metric "Attribute usage (stretch goal)" (Page Content/Markup) refer to the usage of HTML attributes? In that case we may be able to find out the distribution (https://discuss.httparchive.org/t/usage-of-aria-attributes/778).
@rviscomi @zcorpan @bkardell going ahead (as the July crawl is scheduled to start) with the "list of valid HTML attributes" interpretation for the metric "Attribute usage (stretch goal)" (Page Content/Markup). In that case, we should be able to extract it from response_bodies.
The thing is, the new custom metrics for markup collect data from the parsed tree, so you wind up with far less data to deal with, and it's far more accurate than running a regexp across the HTML. We had discussed whether it would somehow make sense to do the same for attributes, but given all the questions people are interested in asking about them and their relationship to tags, or URLs, or parent elements, or... whatever... even what to collect was unclear. I had proposed a few potential approaches, I think, but I believe we were worried about this exploding the size of the data, defeating the purpose, or just not being that useful.
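To make the size concern concrete, here is a hypothetical sketch of what an attribute-usage collector would have to do. Names are illustrative, not from the Almanac's actual custom metrics (which are JavaScript run against the live DOM); the point is that every element contributes several attribute names, and keeping their relationships to tags or values multiplies the payload further.

```python
from collections import Counter
from html.parser import HTMLParser

class AttributeCounter(HTMLParser):
    """Tallies attribute names seen on start tags in a document.

    Only names are kept; recording (tag, attribute) pairs or values
    would grow the collected data substantially, which is the
    size-explosion concern discussed above.
    """
    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        for name, _value in attrs:
            self.counts[name] += 1

doc = '<div class="a" data-x="1"><span class="b" aria-label="hi"></span></div>'
counter = AttributeCounter()
counter.feed(doc)
# 'class' is counted twice, the other attribute names once each
print(counter.counts.most_common())
```

Even this minimal name-only tally produces an open-ended vocabulary (data-* and custom attributes are unbounded), which helps explain why the metric was hard to scope.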
@rviscomi it was a bit of a struggle to understand adding a custom metric, but I was finally able to test a sample/almanac.js using the instructions in #33.
Some more context here: https://discuss.httparchive.org/t/use-of-custom-elements-with-attributes/1592 and here: HTTPArchive/httparchive.org#138. I'd recommend marking this one as Not Feasible due to the complexity.
Due date: To help us stay on schedule, please complete the action items in this issue by June 3.
To do:
Current list of metrics:
👉 AI (@bkardell): Assign peer reviewers. These are trusted experts who can support you when brainstorming metrics, interpreting results, and writing the report. Ideally this chapter will have 2 or more reviewers who can promote a diversity of perspectives.
👉 AI (@bkardell): Finalize which metrics you might like to include in an annual "state of markup" report powered by HTTP Archive. Community contributors have initially sketched out a few ideas to get the ball rolling, but it's up to you, the subject matter experts, to know exactly which metrics we should be looking at. You can use the brainstorming doc to explore ideas.
The metrics should paint a holistic, data-driven picture of the markup landscape. The HTTP Archive does have its limitations and blind spots, so if there are metrics out of scope it's still good to identify them now during the brainstorming phase. We can make a note of them in the final report so readers understand why they're not discussed and the HTTP Archive team can make an effort to improve our telemetry for next year's Almanac.
Next steps: Over the next couple of months analysts will write the queries and generate the results, then hand everything off to you to write up your interpretation of the data.
Additional resources: