Finalize assignments: Chapter 3. Markup #5

Closed
3 tasks done
rviscomi opened this issue May 20, 2019 · 20 comments
@rviscomi
Member

rviscomi commented May 20, 2019

Section: I. Page Content
Chapter: 3. Markup
Author: @bkardell
Reviewers: @zcorpan

Due date: To help us stay on schedule, please complete the action items in this issue by June 3.

To do:

  • Assign subject matter expert (author)
  • Assign peer reviewers
  • Finalize metrics

Current list of metrics:

  • Deprecated elements
  • Popular elements
  • Custom elements (“slang”)
  • Attribute usage (stretch goal)
  • count of shadowRoots

👉 AI (@bkardell): Assign peer reviewers. These are trusted experts who can support you when brainstorming metrics, interpreting results, and writing the report. Ideally this chapter will have 2 or more reviewers who can promote a diversity of perspectives.

👉 AI (@bkardell): Finalize which metrics you might like to include in an annual "state of markup" report powered by HTTP Archive. Community contributors have initially sketched out a few ideas to get the ball rolling, but it's up to you, the subject matter experts, to know exactly which metrics we should be looking at. You can use the brainstorming doc to explore ideas.

The metrics should paint a holistic, data-driven picture of the markup landscape. The HTTP Archive does have its limitations and blind spots, so if there are metrics out of scope it's still good to identify them now during the brainstorming phase. We can make a note of them in the final report so readers understand why they're not discussed and the HTTP Archive team can make an effort to improve our telemetry for next year's Almanac.

Next steps: Over the next couple of months analysts will write the queries and generate the results, then hand everything off to you to write up your interpretation of the data.

Additional resources:

@rviscomi rviscomi transferred this issue from HTTPArchive/httparchive.org May 21, 2019
@rviscomi rviscomi added this to the Chapter planning complete milestone May 21, 2019
@rviscomi rviscomi changed the title from "[Web Almanac] Finalize assignments: Chapter 3. Markup" to "Finalize assignments: Chapter 3. Markup" May 21, 2019
@rviscomi
Member Author

@bkardell can you think of anyone who might be interested in reviewing this chapter?

@jaredcwhite

I'm not sure if this is the best place to post a suggestion, but I think it would be very interesting to chronicle the advent of non-semantic HTML that's essentially auto-generated by build tools (think the project Twitter's using to take React Native code and build it for the web…resulting in div/span tag soup. Example here: https://twitter.com/jaredcwhite/status/1090283063320276992). This isn't new of course—I remember "tag soup" tool-generated HTML being a thing since the 90s, but it feels like that sort of went away in the HTML5 era and now it's rearing its (IMHO) ugly head again.

@rviscomi
Member Author

rviscomi commented May 24, 2019

@zcorpan would you be interested in reviewing this chapter?

@jaredcwhite +1 I think that's a great idea. Is that something that would require looking back at older datasets or do you think it'd be sufficient to look at the current dataset and do something like measure the proportion of div/span against all tags? We're adding instrumentation in HTTPArchive/legacy.httparchive.org#159 to extract tags from the document, so this will only be something we can get easily going forward. Also, we'd love to have you as a reviewer if you're up for it!
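One way the div/span proportion mentioned above could be measured is sketched below. This is an illustrative Python sketch only, not the instrumentation being added to HTTP Archive; the names `TagCounter` and `div_span_share` are hypothetical, and it assumes the page HTML is available as a string.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts start tags by element name (HTMLParser lowercases tag names)."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def div_span_share(html: str) -> float:
    """Return the fraction of start tags that are <div> or <span>."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    total = sum(parser.counts.values())
    if total == 0:
        return 0.0
    return (parser.counts["div"] + parser.counts["span"]) / total


# Hypothetical sample page: 6 start tags, 3 of which are div/span.
sample = "<html><body><div><span>a</span><div><p>b</p></div></div></body></html>"
share = div_span_share(sample)
```

A per-page ratio like this could then be aggregated across the crawl; whether a high ratio really indicates tool-generated "tag soup" is a question of interpretation that the number alone does not settle.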

@zcorpan
Contributor

zcorpan commented May 24, 2019

Looking up context I checked the readme of this repo, and also found https://discuss.httparchive.org/t/planning-the-web-almanac-2019/1553

Exciting! I'm happy to review.

Is the thing that needs review written yet? If so, where? When does the review need to be done?

@rviscomi
Member Author

Great! Glad to have you on board. The current status is that we're planning which metrics to include in each chapter. If you have any ideas or feedback we'd love to hear them. Hoping to lock the metrics down by June 3. The writing phase will start in ~August.

@zcorpan
Contributor

zcorpan commented May 24, 2019

Ideas:

  • Quirks mode
  • Character encoding

@bkardell
Contributor

Nice! I see that I never suggested @zcorpan here apparently, but I know I did somewhere! Glad this worked out as I think he'd have been a better author even 😁

@zcorpan
Contributor

zcorpan commented Jun 3, 2019

Another idea for a metric, though I don't know if it would be difficult to implement:

One way would be to add use counters for each parse error in Chromium's HTML parser. Another way (likely simpler) would be to run the response body through an HTML parser that can log parse errors.
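As a rough illustration of the second approach (running the response body through a parser and tallying errors), here is a toy Python sketch. It is an assumption-laden simplification: Python's standard `html.parser` does not report spec parse errors, so the sketch detects just one error type, duplicate attributes on a start tag, and tallies it under the HTML Standard's `duplicate-attribute` error code. The class and function names are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser


class DuplicateAttributeCounter(HTMLParser):
    """Tallies one kind of parse error: an attribute repeated on a start tag."""

    def __init__(self):
        super().__init__()
        self.error_counts = Counter()

    def handle_starttag(self, tag, attrs):
        seen = set()
        for name, _value in attrs:
            if name in seen:
                # Works like a per-error "use counter" for this page.
                self.error_counts["duplicate-attribute"] += 1
            seen.add(name)


def count_parse_errors(html: str) -> Counter:
    """Return a Counter mapping error code to number of occurrences."""
    parser = DuplicateAttributeCounter()
    parser.feed(html)
    parser.close()
    return parser.error_counts


# Hypothetical response body: the <p> repeats its id attribute once.
errors = count_parse_errors('<p id="a" id="b" class="x">hi</p><img src="a.png" alt="">')
```

A real measurement would need a parser that implements the full set of parse errors defined in the HTML Standard; this sketch only shows the counting shape of the idea.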

@rviscomi
Member Author

rviscomi commented Jun 3, 2019

Yeah that sounds good. Use counters would actually be the easiest thing, from the analysis perspective. Any kind of secondary data pass would be adding a lot of complexity.

@rviscomi
Member Author

rviscomi commented Jun 3, 2019

@bkardell hoping to have the list of metrics finalized today. Could you take one last pass through the list of metrics (here or in the doc) and update it with whatever we're missing? We're aiming for 10+ metrics for each chapter.

@zcorpan
Contributor

zcorpan commented Jun 4, 2019

OK, though it might not be trivial to add error reporting to Chromium's HTML parser since it currently doesn't care about any errors (I believe). For example, it could require adding new branches in the state machine to be able to differentiate between things that are errors vs non-errors but otherwise have the same effect. So there's a risk of regressing performance and risk of introducing new bugs to the parser, on top of just implementing the error reporting.

To make it more worthwhile, maybe we could check if browser devtools would also want to make use of HTML parse errors? (Firefox highlights parse errors in View Source, but doesn't show them in devtools, AFAIK.)

@rviscomi rviscomi added the "ASAP" label (This issue is blocking progress) Jun 6, 2019
@rviscomi
Member Author

rviscomi commented Jun 6, 2019

@bkardell could you sign off on this chapter? If you think we have enough metrics and reviewers you can close this issue.

@zcorpan
Contributor

zcorpan commented Jun 6, 2019

I filed https://bugs.chromium.org/p/chromium/issues/detail?id=971851 about implementing HTML parse error reporting.

@rviscomi
Member Author

rviscomi commented Jun 6, 2019

Thanks @zcorpan!

@bkardell
Contributor

bkardell commented Jun 7, 2019

lgtm.

@bkardell bkardell closed this as completed Jun 7, 2019
@rviscomi rviscomi removed the "ASAP" label (This issue is blocking progress) Jun 7, 2019
@raghuramakrishnan71
Contributor

@zcorpan @bkardell Does the metric "Attribute usage (stretch goal)" (Page Content/Markup) refer to the usage of HTML attributes? In that case we may be able to find out the distribution (https://discuss.httparchive.org/t/usage-of-aria-attributes/778).
It was marked as "Custom Metric Required" because I was not very clear on it initially.

@raghuramakrishnan71
Contributor

raghuramakrishnan71 commented Jun 29, 2019

@rviscomi @zcorpan @bkardell I am going ahead (as the July crawl is scheduled to start) with the "list of valid HTML attributes" interpretation for the metric "Attribute usage (stretch goal)" (Page Content/Markup). In that case, we should be able to extract it from response_bodies.
Example:

<script type="text/javascript"> ; type is an HTML attribute
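A minimal sketch of what counting attribute names in a response body might look like, written in Python for illustration. The function name `count_attributes` and the sample body are hypothetical, and this is not the query or custom metric actually used; it only assumes the response body is available as a string.

```python
from collections import Counter
from html.parser import HTMLParser


class AttributeCounter(HTMLParser):
    """Counts attribute names across all start tags in a document."""

    def __init__(self):
        super().__init__()
        self.attribute_counts = Counter()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names are lowercased.
        for name, _value in attrs:
            self.attribute_counts[name] += 1


def count_attributes(response_body: str) -> Counter:
    """Return a Counter mapping attribute name to number of occurrences."""
    parser = AttributeCounter()
    parser.feed(response_body)
    parser.close()
    return parser.attribute_counts


# Hypothetical response body.
body = (
    '<script type="text/javascript"></script>'
    '<a href="/" class="nav">home</a>'
    '<p class="x">hi</p>'
)
counts = count_attributes(body)
```

Counting names alone keeps the output small; recording which element each attribute appears on would multiply the number of distinct keys.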

@bkardell
Contributor

The thing is that the new custom metrics for markup collect data from the parsed tree, so you wind up with far less data to deal with, and the results are far more accurate than a regexp across HTML. We had discussed whether it would make sense to do the same for attributes, but given all the questions people are interested in asking about them and their relationship to tags, URLs, parent elements, or whatever, even what to collect was unclear. I think I had proposed a few possibilities, but I believe we were worried about this exploding the size, defeating the purpose, or just not being that useful.
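To illustrate the accuracy point, here is a small Python sketch comparing a naive regular expression with a parser on the same input. The regex, the class name, and the sample markup are all hypothetical, and Python's `html.parser` only stands in for the browser's parsed tree that the custom metrics actually read.

```python
import re
from html.parser import HTMLParser

# Naive pattern: anything that looks like the start of a tag.
NAIVE_TAG_RE = re.compile(r"<([a-zA-Z][a-zA-Z0-9-]*)")


class StartTagCollector(HTMLParser):
    """Records element names for start tags the parser actually recognizes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


# Hypothetical page: markup-like text inside a comment and a script string.
html = (
    "<!-- <marquee>old banner</marquee> -->"
    "<script>var tpl = '<blink>hi</blink>';</script>"
    "<p>real content</p>"
)

regex_tags = NAIVE_TAG_RE.findall(html)

collector = StartTagCollector()
collector.feed(html)
collector.close()
parser_tags = collector.tags

# The regex also reports the commented-out and in-script "elements".
false_positives = [t for t in regex_tags if t not in parser_tags]
```

Here the regex reports `marquee` and `blink` even though neither element exists in the document, which is the kind of overcounting a tree-based metric avoids.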

@raghuramakrishnan71
Contributor

@rviscomi it was a bit of a struggle to understand how to add a custom metric, but I was finally able to test a sample/almanac.js using the instructions in #33.
@bkardell in the custom metric, are we looking only at attributes and their counts, or at element+attribute pairs and their counts? You mentioned that you had proposed a few possibilities; are they in some other thread? I'm not sure I will be able to include them before the July crawl kicks in, but I can give it a try.

@rviscomi
Member Author

Some more context here: https://discuss.httparchive.org/t/use-of-custom-elements-with-attributes/1592 and here: HTTPArchive/httparchive.org#138

I'd recommend marking this one as Not Feasible due to the complexity.

allemas added a commit that referenced this issue Mar 6, 2020
* start traduction

* process trad

* # This is a combination of 9 commits.
# This is the 1st commit message:

update

# The commit message #2 will be skipped:

# review

# The commit message #3 will be skipped:

# review #2

# The commit message #4 will be skipped:

# advance

# The commit message #5 will be skipped:

# update

# The commit message #6 will be skipped:

# update translation

# The commit message #7 will be skipped:

# update

# The commit message #8 will be skipped:

# update
#
# update

# The commit message #9 will be skipped:

# update

* First quick review

(typofixes, translating alternatives)

* Preserve original line numbers

    To facilitate the review of original text vs. translation side-by-side.

Also: microtypo fixes.

* Review => l338

* End of fine review

* Adding @allemas to translators

* Rename mise-en-cache to caching

* final updates

* update accessibility

* merge line

* Update src/content/fr/2019/caching.md

Co-Authored-By: Barry Pollard <[email protected]>

* Update src/content/fr/2019/caching.md

If it's not too much effort, could you also fix this in the English version as part of this PR as looks wrong there:

6% of requests have a time to time (TTL)

should be:

6% of requests have a Time to Live (TTL)

Co-Authored-By: Barry Pollard <[email protected]>

* Update src/content/fr/2019/caching.md

Do we need to state that all the directives are English language terms or is that overkill? If so need to check this doesn't mess up the markdown->HTML script.

Co-Authored-By: Barry Pollard <[email protected]>

Co-authored-by: Boris SCHAPIRA <[email protected]>
Co-authored-by: Barry Pollard <[email protected]>
tunetheweb added a commit that referenced this issue Mar 6, 2020
* start traduction

* process trad

* # This is a combination of 9 commits.
# This is the 1st commit message:

update

# The commit message #2 will be skipped:

# review

# The commit message #3 will be skipped:

# review #2

# The commit message #4 will be skipped:

# advance

# The commit message #5 will be skipped:

# update

# The commit message #6 will be skipped:

# update translation

# The commit message #7 will be skipped:

# update

# The commit message #8 will be skipped:

# update
#
# update

# The commit message #9 will be skipped:

# update

* First quick review

(typofixes, translating alternatives)

* Preserve original line numbers

    To facilitate the review of original text vs. translation side-by-side.

Also: microtypo fixes.

* Review => l338

* End of fine review

* Adding @allemas to translators

* Rename mise-en-cache to caching

* final updates

* update accessibility

* merge line

* Update src/content/fr/2019/caching.md

Co-Authored-By: Barry Pollard <[email protected]>

* Update src/content/fr/2019/caching.md

If it's not too much effort, could you also fix this in the English version as part of this PR as looks wrong there:

6% of requests have a time to time (TTL)

should be:

6% of requests have a Time to Live (TTL)

Co-Authored-By: Barry Pollard <[email protected]>

* Update src/content/fr/2019/caching.md

Do we need to state that all the directives are English language terms or is that overkill? If so need to check this doesn't mess up the markdown->HTML script.

Co-Authored-By: Barry Pollard <[email protected]>

* generate caching content in french

* Update src/content/fr/2019/caching.md

Co-Authored-By: Barry Pollard <[email protected]>

* Update src/content/fr/2019/caching.md

Co-Authored-By: Barry Pollard <[email protected]>

Co-authored-by: Boris SCHAPIRA <[email protected]>
Co-authored-by: Barry Pollard <[email protected]>
@tunetheweb tunetheweb mentioned this issue Jul 3, 2020
10 tasks
@gregorywolf gregorywolf mentioned this issue Sep 12, 2020
17 tasks