Markup 2020 TODOs #1614

Merged
merged 37 commits into from
Dec 7, 2020

Conversation

@j9t j9t commented Dec 2, 2020

Progress on #899 #1432

j9t and others added 22 commits November 8, 2020 11:22
Signed-off-by: Jens Oliver Meiert <[email protected]>
Signed-off-by: Jens Oliver Meiert <[email protected]>
@@ -121,8 +120,6 @@ Here are the 10 most popular (normalized) languages in our sample. At first we c
<figcaption>{{ figure_link(caption="Top 10 <code>lang</code> attribute values.", sheets_gid="2047285366", sql_file="pages_almanac_by_device_and_html_lang.sql") }}</figcaption>
</figure>

{# TODO(authors): Add an interpretation of the lang results. #}
Member Author

@j9t j9t Dec 2, 2020

Background for removal: I’d argue there’s little to interpret, or to defer to methodology as anything we see here may have been introduced on that end.

Member

What conclusions would you like to see readers draw from this figure? If there's nothing to say about it, do we need it at all?

Member Author

This may raise a very good point here! If the data set suggests that this data may not be representative for the wider Web then we should probably take this out, because what is there to see?

Member

The data set is crawled by a US-based crawler with en-US set as the preferred language. I'm not sure how common it is for sites to redirect to a localized home page based on that locale.

Additionally, the set of URLs comes from CrUX, which is based on Chrome usage data. Is that skewed towards Western users, which would explain the low number of Asian sites other than Japanese ones?

So I think it raises a very good question as to whether we can rely on this data. At the very least we should add a caveat explaining these influences.

Member

I remember the point of this section was to help the readers better understand the data set they're looking at: e.g. where do most of the pages come from and what is the main language used/detected on them.

In my opinion, I think a pie chart with en, en-us, ja, es etc would do it here. I'd also add in the chart the 22.36% of all documents that specify no lang attribute.

Member

@tunetheweb tunetheweb Dec 5, 2020

Doesn't look like I can make suggestions on a deleted line, but I would suggest something like this:

Here are the 10 most popular (normalized on case) languages in our sample. It is important to note that the HTTP Archive crawls from US data centres with English language settings, so the languages that pages are written in will be skewed towards English. Nevertheless, we present the lang attributes seen to give some context to the sites analysed.

{{ figure_markup(
  image="top-html-lang.png",
  alt="The top HTML lang attributes.",
  caption="The top HTML `lang` attributes.",
  description="Bar chart showing the top 10 `lang` attributes used in our crawl, with 22.82% of desktop and 22.36% of mobile pages not setting this, `en` being used on 20.09% and 18.08% respectively, `ja` on 15.17% and 13.27%, `es` on 4.86% and 4.09%, `pt-br` on 2.65% and 2.84%, `ru` on 2.21% and 2.53%, `en-gb` on 2.35% and 2.19%, `de` on 1.50% and 1.92%, and finally `fr` being used on 1.55% and 1.43% respectively.",
  sheets_gid="2047285366",
  chart_url="https://docs.google.com/spreadsheets/d/e/2PACX-1vQPKzFb574UnGTcfw5mcD1qR7RYHyGjQTc2hiMuYix0QoTH1DPe54Q2JucXL8bfZ6kjRoAfhk3ckudc/pubchart?oid=1873310240&format=interactive",
  width=600,
  height=371,
  sql_file="pages_almanac_by_device_and_html_lang.sql"
  )
}}
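
For readers wondering what "normalized on case" could mean in practice, here is a minimal sketch. It is not the query behind `pages_almanac_by_device_and_html_lang.sql`; the function name `top_lang_values`, the `(not set)` label, and the sample values are assumptions made purely for illustration.

```python
from collections import Counter


def top_lang_values(raw_values, n=10):
    """Return the n most common lang values with their share in percent.

    Values are lowercased and stripped so that "EN" and "en" count as the
    same language. Missing or empty values are grouped under a hypothetical
    "(not set)" label.
    """
    counts = Counter()
    for value in raw_values:
        key = (value or "").strip().lower()
        counts[key or "(not set)"] += 1

    total = sum(counts.values())
    if total == 0:
        return []
    return [
        (lang, round(100 * count / total, 2))
        for lang, count in counts.most_common(n)
    ]


# Hypothetical sample, not taken from the crawl.
sample = ["en", "EN", "en-US", "ja", None, "", "JA", "es"]
shares = dict(top_lang_values(sample))
```

Note that this only folds case; `en` and `en-us` remain separate buckets, as they do in the chart description above.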

Member Author

Updated the page (with just minor edits).

Member

Suggestion to remove SGTM.

Member Author

Remove? Or keep the update? (We had updated this section.)

Member

I was referring to the removal of the table. Either way, LGTM.

@@ -352,13 +345,11 @@ Standard elements are those that are or were part of the HTML specification. Whi
<figcaption>{{ figure_link(caption="Low probabilities of finding a given element in pages of the sample.", sheets_gid="184700688", sql_file="pages_element_count_by_device_and_element_type_present.sql") }}</figcaption>
</figure>

{# TODO(authors): Interpet results. #}
Member Author

@j9t j9t Dec 2, 2020

Background: Suggesting to skip this, or to ask whether @catalinred or @iandevlin would like to draft something.

Member

Ditto

Member Author

Again, I’d check with @catalinred or @iandevlin whether they’d like to expand on this.

Other than that, as opposed to the lang section here I don’t agree. Unless you’re coming from an Almanac convention that requires every data point to be interpreted, I wouldn’t be convinced that data can’t stand by itself. I’d suggest that not only are our readers capable of drawing conclusions, but that this can even be refreshing from an editorial perspective, if not to suggest—“wait a minute; what’s this, what does this mean?”

If I misunderstand you, please let me know, otherwise I’d appreciate a bit of leeway here for us to decide on what also not to interpret.

Member

@tunetheweb tunetheweb Dec 3, 2020

My opinion is that we should be selective about what we put in the chapter. We have a lot of data: some of it will be interesting and some not. I think we need to pick and choose what to include to keep the chapter interesting.

The point of the almanac is "The Web Almanac is a comprehensive report on the state of the web, backed by real data and trusted web experts" rather than "a list of stats, with experts interpreting them".

Saying that, I think these particular stats ARE interesting. Why are they not used? Are they old elements which are not useful? Have they been replaced by better alternatives? Or are they new elements that haven't taken off yet?

Looking at MDN, two of them are obsolete (dir and basefont) and one (rb) is Ruby specific. To me there's a question of whether that should really be a standard HTML element, since it's Ruby specific. Ultimately, I'd never heard of these particular elements and had to look them up to figure this out, so I think a summary explaining this would be useful here to save other readers doing the same.

However, I do think this is the authors' work, so if you still feel this explanation is not necessary and the stats stand on their own, then I can accept this. But I think it's right for @rviscomi to at least ask these questions, in case you hadn't considered that interpretation or were making assumptions about what readers might know because you are HTML experts (and other readers might not be).

Member

Suggestion to skip SGTM. Could you update the md to remove the content?

Member Author

I actually felt like I want to add info here. Added a paragraph. What do you think?

@@ -80,8 +80,7 @@ A page's document size refers to the amount of HTML bytes transferred over the n
* The largest document by far weighs 64.16 _MB_, almost deserving its own analysis and chapter in the Web Almanac.

{# TODO(analysts): Should 25,237 bytes be divided by 1000 or 1024 to convert to KB? 1000 seems to be used here but most chapters use 1024. Are the stats above also off? #}
Member Author

Contributor

I agree, 1024 should be used. I'm not confident in the above stats. I'd like to use an example we can prove. I think the problem was that bytesHtmlDoc, which I think is closer to what we would call the document, got cut off at 16,777,215.

I used Screaming Frog to crawl all the ones in the list, and double checked the big ones in Chrome. Drum roll...

https://www.linkshops.com/ contains a js bundle that is 28.1MB.

https://www.boonterm.com/web/index1.php has 22.8MB of html (embedded images) and references a 34.7MB mp4 file.

https://www.aci.edu.sg/ has 33.7MB of html and took me over a minute to load. Also embedded images.
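
The 1000-versus-1024 question from the TODO can be checked with a few lines of arithmetic. This is only a sketch: the helper names `bytes_to_kb` and `bytes_to_mb` are made up for illustration, and it assumes the 64.16 MB figure in the diff was computed with a divisor of 1000 from exactly 64,160,000 bytes, which the thread does not confirm.

```python
def bytes_to_kb(n_bytes, base=1024):
    """Convert bytes to kilobytes using the given base (1024 or 1000)."""
    return n_bytes / base


def bytes_to_mb(n_bytes, base=1024):
    """Convert bytes to megabytes using the given base (1024 or 1000)."""
    return n_bytes / base ** 2


# The 25,237 bytes mentioned in the TODO, under both conventions.
kb_decimal = round(bytes_to_kb(25_237, base=1000), 2)
kb_binary = round(bytes_to_kb(25_237, base=1024), 2)

# Assumption: 64.16 MB was derived with base 1000, i.e. 64,160,000 bytes.
largest_bytes = 64_160_000
mb_decimal = round(bytes_to_mb(largest_bytes, base=1000), 2)
mb_binary = round(bytes_to_mb(largest_bytes, base=1024), 2)

# The cut-off value quoted above is exactly 2**24 - 1.
cutoff_is_24_bit_max = 16_777_215 == 2 ** 24 - 1
```

Under that assumption, the same byte count reads as 64.16 MB with a divisor of 1000 and as roughly 61.19 MB with 1024, so the two conventions differ by about 3 MB at this size.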

@rviscomi rviscomi added this to the 2020 Content Writing milestone Dec 3, 2020
@rviscomi rviscomi changed the title chore: address remaining Markup chapter TODOs Markup 2020 TODOs Dec 3, 2020
Member

@rviscomi rviscomi left a comment

@j9t this is looking great and ready for the unedited: true label to be removed. Can you do the honors?

j9t added 2 commits December 6, 2020 21:07
Signed-off-by: Jens Oliver Meiert <[email protected]>

j9t commented Dec 6, 2020

Included another update.

Checked the other TODOs but these may be okay right now. @catalinred, @iandevlin, @Tiggerito, do you have a chance to review those maybe later?

Removed the “unedited” flag with this PR.

Any other thoughts, @rviscomi, @bazzadp?

Contributor

@Tiggerito Tiggerito left a comment

I still think it should be 61.19MB, but what's a few bytes between friends.


j9t commented Dec 7, 2020

> I still think it should be 61.19MB, but what's a few bytes between friends.

😅 Maybe I missed that. Updated.

Member

@rviscomi rviscomi left a comment

I think this is good to go! 🚀
Thanks @j9t and everyone for your help!

@rviscomi rviscomi merged commit 3f222c5 into HTTPArchive:main Dec 7, 2020