Markup 2020 TODOs #1614

Merged
merged 37 commits into from
Dec 7, 2020
Changes from 22 commits
d582cf8
Merge pull request #1 from HTTPArchive/main
j9t Nov 8, 2020
ee4c97e
Merge pull request #2 from HTTPArchive/main
j9t Nov 8, 2020
b7846b4
Merge pull request #3 from HTTPArchive/main
j9t Nov 9, 2020
50cd23b
Merge pull request #4 from HTTPArchive/main
j9t Nov 11, 2020
51f03d6
Merge pull request #5 from HTTPArchive/main
j9t Nov 12, 2020
87041d0
Merge pull request #6 from HTTPArchive/main
j9t Nov 13, 2020
8c9e781
Merge pull request #7 from HTTPArchive/main
j9t Nov 15, 2020
69ca37b
Merge pull request #8 from HTTPArchive/main
j9t Nov 16, 2020
66e28ea
chore: review and address TODOs
j9t Nov 16, 2020
750c94b
Merge remote-tracking branch 'origin/main' into main
j9t Nov 16, 2020
4af6840
Merge pull request #9 from HTTPArchive/main
j9t Nov 20, 2020
980e21f
Merge pull request #10 from HTTPArchive/main
j9t Nov 21, 2020
079d2ec
Update src/content/en/2020/markup.md
j9t Nov 22, 2020
0c708bf
Update src/content/en/2020/markup.md
j9t Nov 22, 2020
0da6a62
Merge pull request #11 from HTTPArchive/main
j9t Nov 22, 2020
c7d99ee
Merge pull request #12 from HTTPArchive/main
j9t Nov 24, 2020
37659b4
Merge pull request #13 from HTTPArchive/main
j9t Nov 25, 2020
abaa49c
Merge pull request #14 from HTTPArchive/main
j9t Nov 27, 2020
2b0dba1
Merge pull request #15 from HTTPArchive/main
j9t Nov 29, 2020
9fa288e
Merge pull request #16 from HTTPArchive/main
j9t Dec 1, 2020
5514c08
Merge pull request #17 from HTTPArchive/main
j9t Dec 2, 2020
5ba7822
chore: address TODOs
j9t Dec 2, 2020
39d61e6
chore: update numbers to divide bytes by 1,024 (instead of 1,000)
j9t Dec 3, 2020
79d36b9
Merge pull request #18 from HTTPArchive/main
j9t Dec 3, 2020
c93c67b
Merge remote-tracking branch 'origin/main' into main
j9t Dec 3, 2020
b157073
Merge pull request #19 from HTTPArchive/main
j9t Dec 5, 2020
dbf3a26
chore: update `lang` section wording (via @bazzadp)
j9t Dec 5, 2020
f261f63
chore: compress image
j9t Dec 5, 2020
13d6c58
chore: align graph title and name with other charts
j9t Dec 5, 2020
bca5e67
chore: compress image
j9t Dec 5, 2020
7475b53
Merge pull request #20 from HTTPArchive/main
j9t Dec 6, 2020
f6e171c
docs: add note on little popular elements
j9t Dec 6, 2020
88c936a
chore: remove “unedited” flag
j9t Dec 6, 2020
2def826
Update src/content/en/2020/markup.md
rviscomi Dec 6, 2020
ccf22c0
chore: correct number (per @Tiggerito)
j9t Dec 7, 2020
570a0c3
Merge pull request #21 from HTTPArchive/main
j9t Dec 7, 2020
d3f2d19
Merge remote-tracking branch 'origin/main' into main
j9t Dec 7, 2020
24 changes: 7 additions & 17 deletions src/content/en/2020/markup.md
@@ -80,8 +80,7 @@ A page's document size refers to the amount of HTML bytes transferred over the n
* The largest document by far weighs 64.16 _MB_, almost deserving its own analysis and chapter in the Web Almanac.

{# TODO(analysts): Should 25,237 bytes be divided by 1000 or 1024 to convert to KB? 1000 seems to be used here but most chapters use 1024. Are the stats above also off? #}
Contributor

I agree, 1024 should be used. I'm not confident in the above stats. I'd like to use an example we can prove. I think the problem was that bytesHtmlDoc, which I think is closer to what we would call the document, got cut off at 16,777,215.

I used Screaming Frog to crawl all the ones in the list, and double checked the big ones in Chrome. Drum roll...

https://www.linkshops.com/ contains a js bundle that is 28.1MB.

https://www.boonterm.com/web/index1.php has 22.8MB of html (embedded images) and references a 34.7MB mp4 file.

https://www.aci.edu.sg/ has 33.7MB of html and took me over a minute to load. Also embedded images.

Tiggerito marked this conversation as resolved.
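
For reference, the two conventions discussed in this thread differ at the second decimal place. This is a minimal sketch, assuming the median of 25,237 bytes quoted in the TODO; the helper name `bytes_to_kb` is hypothetical. It also checks that the 16,777,215 cut-off mentioned above equals 2^24 - 1, which would be consistent with a 24-bit size limit (an assumption, not something confirmed in this thread).

```python
MEDIAN_BYTES = 25_237  # median document size quoted in the TODO above


def bytes_to_kb(size_in_bytes: int, divisor: int) -> float:
    """Convert bytes to kilobytes with the given divisor, rounded to 2 places."""
    return round(size_in_bytes / divisor, 2)


# Dividing by 1,000 gives the value currently in the chapter text.
decimal_kb = bytes_to_kb(MEDIAN_BYTES, 1000)

# Dividing by 1,024 gives the value most other chapters would report.
binary_kb = bytes_to_kb(MEDIAN_BYTES, 1024)

# The reported cut-off for bytesHtmlDoc is exactly 2**24 - 1.
CUTOFF = 16_777_215
is_24_bit_max = CUTOFF == 2**24 - 1

print(decimal_kb, binary_kb, is_24_bit_max)
```

With the 1,024 divisor the median reads 24.65 KB instead of 25.24 KB.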
-{# TODO(authors): What's the implication and your interpretation of this value? Is this a surprisingly big number? Or does it align with your expectations? #}
-How is this situation in general, then? The median document weighs 25.24 KB:
+How is this situation in general, then? The median document weighs 25.24 KB, which comes [without surprises](https://httparchive.org/reports/page-weight):

{{ figure_markup(
image="document-size.png",
@@ -121,8 +120,6 @@ Here are the 10 most popular (normalized) languages in our sample. At first we c
<figcaption>{{ figure_link(caption="Top 10 <code>lang</code> attribute values.", sheets_gid="2047285366", sql_file="pages_almanac_by_device_and_html_lang.sql") }}</figcaption>
</figure>

-{# TODO(authors): Add an interpretation of the lang results. #}
Member Author (@j9t, Dec 2, 2020)

Background for removal: I’d argue there’s little to interpret here. Alternatively, we could defer to the methodology, as anything we see here may have been introduced on that end.

Member

What conclusions would you like to see readers draw from this figure? If there's nothing to say about it, do we need it at all?

Member Author

This raises a very good point! If the data set suggests that this data may not be representative of the wider Web, then we should probably take this out, because what is there to see?

Member

The data set is crawled from a US crawler with en-US set as the preferred language. Not sure how common it is to redirect to a home page based on that locale?

Additionally, the set of URLs is based on CrUX data, which comes from Chrome users. Is that swayed towards western users (therefore explaining the low numbers of Asian sites other than Japanese)?

Therefore I think it raises a very good question as to whether we can rely on this data. At the very least we should add a caveat explaining these influences.

Member

I remember the point of this section was to help the readers better understand the data set they're looking at: e.g. where do most of the pages come from and what is the main language used/detected on them.

In my opinion, I think a pie chart with en, en-us, ja, es etc would do it here. I'd also add in the chart the 22.36% of all documents that specify no lang attribute.

Member (@tunetheweb, Dec 5, 2020)

Doesn't look like I can make suggestions on a deleted line, but I would suggest something like this:

Here are the 10 most popular (normalized on case) languages in our sample. It is important to note that the HTTP Archive crawls from US data centres with English language settings, so the languages that pages are written in will be skewed towards English. Nevertheless, we present the lang attributes seen to give some context to the sites analysed.

{{ figure_markup(
  image="top-html-lang.png",
  alt="The top HTML lang attributes.",
  caption="The top HTML `lang` attributes.",
  description="Bar chart showing the top 10 `lang` attributes used in our crawl, with 22.82% of desktop and 22.36% of mobile not setting this, `en` being used on 20.09% and 18.08% respectively, `ja` on 15.17% and 13.27%, `es` on 4.86% and 4.09%, `pt-br` on 2.65% and 2.84%, `ru` on 2.21% and 2.53%, `en-gb` on 2.35% and 2.19%, `de` on 1.50% and 1.92%, and finally `fr` being used on 1.55% and 1.43% respectively.",
  sheets_gid="2047285366",
  chart_url="https://docs.google.com/spreadsheets/d/e/2PACX-1vQPKzFb574UnGTcfw5mcD1qR7RYHyGjQTc2hiMuYix0QoTH1DPe54Q2JucXL8bfZ6kjRoAfhk3ckudc/pubchart?oid=1873310240&format=interactive",
  width=600,
  height=371,
  sql_file="pages_almanac_by_device_and_html_lang.sql"
  )
}}
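
To make "normalized on case" concrete, here is a minimal sketch of how such a tally could work. It is not the query in `pages_almanac_by_device_and_html_lang.sql`; the function names, the whitespace trimming, the `(not set)` placeholder, and the sample values are all assumptions for illustration.

```python
from collections import Counter
from typing import Iterable, List, Optional, Tuple


def normalize_lang(value: Optional[str]) -> str:
    """Lowercase a lang value; map missing or empty values to a placeholder."""
    if value is None or not value.strip():
        return "(not set)"  # hypothetical label for pages without a lang value
    return value.strip().lower()


def top_langs(values: Iterable[Optional[str]], n: int = 10) -> List[Tuple[str, int]]:
    """Return the n most common normalized lang values with their counts."""
    return Counter(normalize_lang(v) for v in values).most_common(n)


# Hypothetical sample: one entry per page, None meaning no lang attribute.
sample = ["en", "EN", "en-US", "en-us", None, "ja", "", "pt-BR"]
result = top_langs(sample)
print(result)
```

Note that this only folds case: `en`, `en-us`, and `en-gb` remain separate buckets, as they do in the chart description above.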

Member Author

Updated the page (with just minor edits).

Member

Suggestion to remove SGTM.

Member Author

Remove? Or keep the update? (We had updated this section.)

Member

I was referring to the removal of the table. Either way, LGTM.


### Comments

Adding comments to code is generally a good practice and HTML comments are there to add notes to HTML documents, without having them rendered by user agents.
@@ -131,9 +128,7 @@ Adding comments to code is generally a good practice and HTML comments are there
```html
<!-- This is a comment in HTML -->
```

-Although many pages will have been stripped of comments for production, we found that index pages in the 90th percentile are using about 73 comments on mobile, respectively 79 comments on desktop, while in the 10th percentile the number of the comments is about 2.
-
-{# TODO(authors): How about the median number for a typical website? #}
+Although many pages will have been stripped of comments for production, we found that index pages in the 90th percentile are using about 73 comments on mobile, respectively 79 comments on desktop, while in the 10th percentile the number of the comments is about 2. The median page uses 16 (mobile) or 17 comments (desktop).

Around 89% of pages contain at least one HTML comment, while about 46% of them contain a conditional comment.
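
As an illustration of what is being counted here, the following sketch tallies HTML comments and conditional comments in a single document using Python's standard `html.parser`. It is an assumption about how such a count could be made, not the chapter's actual metric code; the class and function names are hypothetical, and it only recognizes conditional comments written in the `<!--[if ...]>` comment form.

```python
from html.parser import HTMLParser
from typing import Tuple


class CommentCounter(HTMLParser):
    """Counts all HTML comments and, separately, conditional comments."""

    def __init__(self) -> None:
        super().__init__()
        self.total = 0
        self.conditional = 0

    def handle_comment(self, data: str) -> None:
        # data is the text between "<!--" and "-->".
        self.total += 1
        if data.lstrip().lower().startswith("[if"):
            self.conditional += 1


def count_comments(html: str) -> Tuple[int, int]:
    """Return (number of comments, number of conditional comments)."""
    parser = CommentCounter()
    parser.feed(html)
    parser.close()
    return parser.total, parser.conditional


# Hypothetical sample document.
sample = (
    "<!-- header -->"
    "<!--[if lt IE 9]><p>Please upgrade your browser.</p><![endif]-->"
    "<p>Hello</p>"
    "<!-- footer -->"
)
total, conditional = count_comments(sample)
print(total, conditional)
```

A page would count toward the 89% figure if `total` is at least 1, and toward the 46% figure if `conditional` is at least 1.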

@@ -151,7 +146,7 @@ Still, on the above percentile extremes, we found that web pages are using about

For production, HTML comments are usually stripped by build tools. Considering all the above counts and percentages, and referring to the use of comments in general, we suppose that lots of pages are served without involving an HTML minifier.

-### Script use
+### Script use

As shown in the [Top elements](#top-elements) section below, the `script` element is the 6th most frequently used HTML element. For the purposes of this chapter, we were interested in the ways the `script` element is used across these millions of pages from the data set.

@@ -173,7 +168,7 @@ At the opposite end of the spectrum, the numbers show that about 97% of pages co
)
}}

-When scripting is unsupported or turned off in the browser, the `noscript` element helps to add an HTML section within a page. Considering the above script numbers, we were curious about the `noscript` element as well.
+When scripting is unsupported or turned off in the browser, the `noscript` element helps to add an HTML section within a page. Considering the above script numbers, we were curious about the `noscript` element as well.

Following the analysis, we found that about 49% of pages are using a `noscript` element. At the same time, about 16% of `noscript` elements were containing an `iframe` with a `src` value referring to "googletagmanager.com".
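
The second figure can be pictured with a small sketch: for each `noscript` element, check whether it contains an `iframe` whose `src` mentions "googletagmanager.com". This assumes Python's standard `html.parser` with its default settings and is not the analysis code behind the numbers; the class name and the sample markup (including the `GTM-XXXX` placeholder ID) are hypothetical.

```python
from html.parser import HTMLParser


class NoscriptScanner(HTMLParser):
    """Counts noscript elements and those wrapping a Google Tag Manager iframe."""

    def __init__(self) -> None:
        super().__init__()
        self.in_noscript = False
        self.current_has_gtm = False
        self.noscript_count = 0
        self.gtm_noscript_count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "noscript":
            self.in_noscript = True
            self.current_has_gtm = False
            self.noscript_count += 1
        elif tag == "iframe" and self.in_noscript:
            src = dict(attrs).get("src") or ""
            # Count each noscript element at most once.
            if "googletagmanager.com" in src and not self.current_has_gtm:
                self.current_has_gtm = True
                self.gtm_noscript_count += 1

    def handle_endtag(self, tag):
        if tag == "noscript":
            self.in_noscript = False


# Hypothetical sample with two noscript elements, one of them a GTM fallback.
sample = (
    '<noscript><iframe src="https://www.googletagmanager.com/ns.html?id=GTM-XXXX"'
    ' height="0" width="0"></iframe></noscript>'
    '<noscript><img src="pixel.gif" alt=""></noscript>'
)
scanner = NoscriptScanner()
scanner.feed(sample)
scanner.close()
print(scanner.noscript_count, scanner.gtm_noscript_count)
```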

@@ -183,13 +178,11 @@ This seems to confirm the theory that the total number of `noscript` elements in

What `type` attribute values are used with `script` elements?

-{# TODO(authors, analysts): Should this be a figure? #}
rviscomi marked this conversation as resolved.
-{# TODO(authors): Explain the significance of the "!" in text. #}
- `text/javascript`: 60.03%
- `application/ld+json`: 1.68%
- `application/json`: 0.41%
- `text/template`: 0.41%
-- `text/html` (!) 0.27%
+- `text/html` 0.27%

When it comes to loading [JavaScript module scripts](https://jakearchibald.com/2017/es-modules-in-browsers/) using `type="module"`, we found that 0.13% of `script` elements currently specify this attribute-value combination. `nomodule` is used by 0.95% of all tested pages. (Note that one metric relates to elements, the other to pages.)
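
The parenthetical about elements versus pages matters because the two shares use different denominators. A minimal sketch with made-up data (the page list, attribute dictionaries, and function names are all hypothetical) shows how the same sample yields two unrelated percentages:

```python
from typing import Dict, List

# Each page is a list of script elements; each element is a dict of attributes.
Page = List[Dict[str, object]]

pages: List[Page] = [
    [{"type": "module"}, {"nomodule": True}, {}],
    [{"type": "text/javascript"}, {}],
    [{}, {}, {}],
    [{"type": "module"}, {}],
]


def module_share_of_elements(sample: List[Page]) -> float:
    """Percentage of all script elements that specify type="module"."""
    scripts = [script for page in sample for script in page]
    modules = sum(1 for s in scripts if str(s.get("type", "")).lower() == "module")
    return 100 * modules / len(scripts)


def nomodule_share_of_pages(sample: List[Page]) -> float:
    """Percentage of pages with at least one script carrying nomodule."""
    hits = sum(1 for page in sample if any("nomodule" in s for s in page))
    return 100 * hits / len(sample)


element_share = module_share_of_elements(pages)  # 2 of 10 script elements
page_share = nomodule_share_of_pages(pages)      # 1 of 4 pages
print(element_share, page_share)
```

In this toy sample the element-level share is 20% and the page-level share is 25%, so the two numbers cannot be compared directly.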

@@ -352,13 +345,11 @@ Standard elements are those that are or were part of the HTML specification. Whi
<figcaption>{{ figure_link(caption="Low probabilities of finding a given element in pages of the sample.", sheets_gid="184700688", sql_file="pages_element_count_by_device_and_element_type_present.sql") }}</figcaption>
</figure>

-{# TODO(authors): Interpet results. #}
Member Author (@j9t, Dec 2, 2020)

Background: Suggesting to skip this, or to ask whether @catalinred or @iandevlin would like to draft something.

Member

Ditto

Member Author

Again, I’d ask @catalinred or @iandevlin whether they would like to expand on this.

Other than that, as opposed to the lang section here I don’t agree. Unless you’re coming from an Almanac convention that requires every data point to be interpreted, I wouldn’t be convinced that data can’t stand by itself. I’d suggest that not only are our readers capable of drawing conclusions, but that this can even be refreshing from an editorial perspective, if not to suggest—“wait a minute; what’s this, what does this mean?”

If I misunderstand you, please let me know; otherwise I’d appreciate a bit of leeway here for us to also decide what not to interpret.

Member (@tunetheweb, Dec 3, 2020)

My opinion is we should be selective about what we put in the chapter. We have a lot of data - some of it will be interesting and some not. I think we need to pick and choose what to include to keep the chapter interesting.

The point of the almanac is "The Web Almanac is a comprehensive report on the state of the web, backed by real data and trusted web experts" rather than "a list of stats, with experts interpreting them".

Saying that, I think these particular stats ARE interesting. Why are they not used? Are they old elements which are not useful? Have they been replaced by better alternatives? Or are they new elements that haven't taken off yet?

Looking at MDN, two of them are obsolete (dir and basefont) and one (rb) is Ruby specific. To me there's a question of whether that should really be a standard HTML element since it's Ruby specific. Ultimately, I'd never heard of these particular elements and had to look them up to figure this information out, so I think a summary explaining this would be useful here to save other readers doing the same.

However, I do think this is the authors' work, so if you still feel this explanation is not necessary and the stats stand on their own, then I can accept this. But I think it's right for @rviscomi to at least ask these questions, in case you hadn't considered that interpretation or were making assumptions about what readers might know because you are HTML experts (and other readers might not be).

Member

Suggestion to skip SGTM. Could you update the md to remove the content?

Member Author

I actually felt like adding info here. Added a paragraph. What do you think?


### Custom elements

The 2019 edition of the Web Almanac handled [custom elements](../2019/markup#custom-elements) by discussing several non-standard elements. This year, we found it valuable to have a closer look at custom elements. How did we determine these? Roughly by looking at [their definition](https://html.spec.whatwg.org/multipage/custom-elements.html#custom-elements-core-concepts), notably their use of a hyphen. Let's focus on the top elements, in this case elements used on ≥1% of all URLs in the sample:
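
The hyphen rule mentioned above could be sketched as follows. This is a simplified, ASCII-only approximation of the name rules in the linked definition, not the query used for the chapter: the real production also allows a range of non-ASCII characters, and the function name and sample names are hypothetical.

```python
import re

# Hyphenated names that the HTML specification reserves and that are
# therefore not valid custom element names.
RESERVED_NAMES = {
    "annotation-xml",
    "color-profile",
    "font-face",
    "font-face-src",
    "font-face-uri",
    "font-face-format",
    "font-face-name",
    "missing-glyph",
}

# Simplified pattern: starts with a lowercase ASCII letter, contains at
# least one hyphen, and uses no uppercase ASCII letters.
NAME_PATTERN = re.compile(r"[a-z][a-z0-9._]*-[a-z0-9._-]*")


def looks_like_custom_element(name: str) -> bool:
    """Return True if the tag name plausibly denotes a custom element."""
    return NAME_PATTERN.fullmatch(name) is not None and name not in RESERVED_NAMES


# Hypothetical tag names for illustration.
for tag in ["amp-img", "div", "font-face", "My-Element"]:
    print(tag, looks_like_custom_element(tag))
```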

-{# TODO(authors, analysts): Clarify occurrences and percentages _of what_. Pages? Elements? And for desktop or mobile? #}
+{# TODO(authors, analysts): Clarify occurrences and percentages _of what_. Pages? Elements? #}
rviscomi marked this conversation as resolved.

<figure markdown>
| Element | Occurrences | Percentage |
@@ -697,7 +688,7 @@ Using `target="_blank"` has been known to be a [security vulnerability](https://
<figcaption>{{ figure_link(caption="Blank relationships.", sheets_gid="1876528165", sql_file="pages_wpt_bodies_by_device.sql") }}</figcaption>
</figure>

-As a rule of thumb and for [usability reasons](https://www.nngroup.com/articles/new-browser-windows-and-tabs/), prefer not to use `target="_blank"` in the first place.
+As a rule of thumb and for [usability reasons](https://www.nngroup.com/articles/new-browser-windows-and-tabs/), prefer not to use `target="_blank"` in the first place.

<p class="note">Within the latest Safari and Firefox versions, setting <code>target="_blank"</code> on <code>a</code> elements implicitly provides the same <code>rel</code> behavior as setting <code>rel="noopener"</code>. This is already <a href="https://chromium-review.googlesource.com/c/chromium/src/+/1630010">implemented in Chromium</a> as well and will land in Chrome 88.</p>
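
For browsers without that implicit behavior, links can be audited for a missing `rel` value. The sketch below, which assumes Python's standard `html.parser` and uses a hypothetical class name and sample markup, collects `a` elements that set `target="_blank"` without `noopener` or `noreferrer` (the latter also implies `noopener`).

```python
from html.parser import HTMLParser
from typing import List, Optional


class BlankTargetAudit(HTMLParser):
    """Collects hrefs of links that open a new context without rel protection."""

    def __init__(self) -> None:
        super().__init__()
        self.unprotected: List[Optional[str]] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        target = (attributes.get("target") or "").lower()
        rel_tokens = set((attributes.get("rel") or "").lower().split())
        if target == "_blank" and not rel_tokens & {"noopener", "noreferrer"}:
            self.unprotected.append(attributes.get("href"))


# Hypothetical sample markup.
sample = (
    '<a href="https://example.com/a" target="_blank">A</a>'
    '<a href="https://example.com/b" target="_blank" rel="noopener noreferrer">B</a>'
    '<a href="https://example.com/c">C</a>'
)
audit = BlankTargetAudit()
audit.feed(sample)
audit.close()
print(audit.unprotected)
```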

@@ -713,7 +704,6 @@ We've touched on some observations throughout the chapter, but as a reflection o
sql_file="summary_pages_by_device_and_doctype.sql"
) }}

-{# TODO(authors): Changed Simon's quote to a paraphrase, since it's not clear which part is verbatim. If there's a quote, let's wrap it in quotes. #}
Fewer pages land in quirks mode. In 2016, that number was at [around 7.4%](https://discuss.httparchive.org/t/how-many-and-which-pages-are-in-quirks-mode/777). At the end of 2019, we observed [4.85%](https://twitter.com/zcorpan/status/1205242913908838400). And now, we're at about 3.97%. This trend, to paraphrase [Simon Pieters](./contributors#zcorpan) in his review of this chapter, seems clear and encouraging.

Although we lack historic data to draw the full development picture, "meaningless" `div`, `span`, and `i` markup has pretty much [replaced](#top-elements) the `table` markup we've observed in the 1990s and early 2000s. While one may question whether `div` and `span` elements are always used without there being a semantically more appropriate alternative, these elements are still preferable to `table` markup, though, as during the heyday of the old web, these were seemingly used for everything but tabular data.