Query metrics: Chapter 20. HTTP/2 #101

Closed
14 tasks done
rviscomi opened this issue Jul 23, 2019 · 17 comments · Fixed by #176
Labels: analysis (Querying the dataset), ASAP (This issue is blocking progress)

@rviscomi
Member

rviscomi commented Jul 23, 2019

| Part | Chapter | Authors | Reviewers | Tracking Issue |
|---|---|---|---|---|
| IV. Content Distribution | 20. HTTP/2 | @bazzadp | @bagder @rmarx @dotjs | #22 |

READ ME!

All of the metrics in the table below have been marked as Able To Query during the metrics triage. The analyst assigned to each metric is expected to write the corresponding query and submit a PR to have it reviewed and added to the repo.

In order to stay on schedule and have the data ready for authors, please have all metrics reviewed and merged by August 5.

Assignments

| ID | Metric description | Analyst | Notes |
|---|---|---|---|
| 20.01 | Adoption rate of HTTP/2 by site (home page only) and by requests (all requests on page) over the years. Trend graph over all available years. | @paulcalvano | |
| 20.03 | Average percentage of resources loaded over HTTP/2 (or gQUIC) versus HTTP/1.1 per site. Trend graph over all available years. | @paulcalvano | |
| 20.04 | Number of HTTP (not HTTPS) sites which return upgrade HTTP header containing h2. Once off stat for last crawl. | @paulcalvano | |
| 20.05 | Number of HTTPS sites using HTTP/2 which return upgrade HTTP header containing h2. Once off stat for last crawl. | @paulcalvano | |
| 20.06 | Number of HTTPS sites not using HTTP/2 which return upgrade HTTP header containing h2. Once off stat for last crawl. | @paulcalvano | |
| 20.07 | % of sites affected by CDN prioritization issues (H2 and served by CDN) - https://github.com/andydavies/http2-prioritization-issues#cdns--cloud-hosting-services. Once off stat for last crawl. | @paulcalvano | Possible per Barry's comment |
| 20.08 | Count of HTTP/2 sites grouped by server HTTP header value but strip version numbers (e.g. Apache and Apache 2.4.28 and Apache 2.4.29 should all report as Apache, but Apache Tomcat should report as Tomcat. Probably need to massage the results to achieve this). Once off stat for last crawl. | @paulcalvano | |
| 20.09 | Count of non-HTTP/2 sites grouped by server HTTP header value but strip version numbers. Once off stat for last crawl. | @paulcalvano | |
| 20.10 | Count of HTTP/2 sites which use HTTP/2 Push. Trend graph over all available years. | @paulcalvano | HAR files - "_was_pushed": 1 |
| 20.11 | Average number of HTTP/2 Pushed resources and average bytes. Once off stat for last crawl. | @paulcalvano | HAR files - "_was_pushed": 1 |
| 20.12 | Count and number of bytes pushed by asset type (CSS, JS, Images...etc.). Once off stat for last crawl. | @paulcalvano | HAR files - "_was_pushed": 1 |
| 20.13 | Count of preload HTTP Headers with nopush attribute set. Once off stat for last crawl. | @paulcalvano | |
| 20.15 | Measure number of TCP Connections per site. Average number of domains per site still going down year on year as per HTTP Archive State of the Web report? Trend graph over all available years. | @paulcalvano | |
| 20.16 | Measure average number of TCP Connections per site for HTTP/1.1 sites versus HTTP/2 sites. Once off stat for last crawl. | @paulcalvano | |
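
As a rough illustration of the grouping asked for in 20.08 and 20.09, the sketch below strips version details from a `Server` header value in Python. The function name `normalize_server` and the splitting rule are assumptions made for illustration only; the actual queries would do this in SQL and would probably still need manual clean-up afterwards.

```python
import re

def normalize_server(header):
    """Reduce a Server header value to a product name (hypothetical helper)."""
    value = header.strip()
    # Special case from the metric description: Apache Tomcat reports as Tomcat.
    if "tomcat" in value.lower():
        return "Tomcat"
    # Cut at the first "/", "(" or whitespace followed by a version-like digit,
    # e.g. "Apache/2.4.29 (Unix)" or "Apache 2.4.28" both become "Apache".
    name = re.split(r"[/(]|\s+v?\d", value, maxsplit=1)[0].strip()
    return name or "(unknown)"

if __name__ == "__main__":
    for raw in ["Apache", "Apache 2.4.28", "Apache/2.4.29 (Unix)", "Apache Tomcat"]:
        print(raw, "->", normalize_server(raw))
```

Multi-word names without a version (for example a value with two plain words) are kept whole under this rule, which may or may not be what the chapter wants.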

Checklist of metrics to be merged

  • 20.01 Adoption rate of HTTP/2 by site (home page only) and by requests (all requests on page) over the years. Trend graph over all available years.
  • 20.03 Average percentage of resources loaded over HTTP/2 (or gQUIC) versus HTTP/1.1 per site. Trend graph over all available years.
  • 20.04 Number of HTTP (not HTTPS) sites which return upgrade HTTP header containing h2. Once off stat for last crawl.
  • 20.05 Number of HTTPS sites using HTTP/2 which return upgrade HTTP header containing h2. Once off stat for last crawl.
  • 20.06 Number of HTTPS sites not using HTTP/2 which return upgrade HTTP header containing h2. Once off stat for last crawl.
  • 20.07 % of sites affected by CDN prioritization issues (H2 and served by CDN) - https://github.com/andydavies/http2-prioritization-issues#cdns--cloud-hosting-services. Once off stat for last crawl.
  • 20.08 Count of HTTP/2 sites grouped by server HTTP header value but strip version numbers (e.g. Apache and Apache 2.4.28 and Apache 2.4.29 should all report as Apache, but Apache Tomcat should report as Tomcat. Probably need to massage the results to achieve this). Once off stat for last crawl.
  • 20.09 Count of non-HTTP/2 sites grouped by server HTTP header value but strip version numbers. Once off stat for last crawl.
  • 20.10 Count of HTTP/2 sites which use HTTP/2 Push. Trend graph over all available years.
  • 20.11 Average number of HTTP/2 Pushed resources and average bytes. Once off stat for last crawl.
  • 20.12 Count and number of bytes pushed by asset type (CSS, JS, Images...etc.). Once off stat for last crawl.
  • 20.13 Count of preload HTTP Headers with nopush attribute set. Once off stat for last crawl.
  • 20.15 Measure number of TCP Connections per site. Average number of domains per site still going down year on year as per HTTP Archive State of the Web report? Trend graph over all available years.
  • 20.16 Measure average number of TCP Connections per site for HTTP/1.1 sites versus HTTP/2 sites. Once off stat for last crawl.
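
For the push metrics (20.10 to 20.12), the notes point at the `"_was_pushed": 1` flag in the HAR files. A minimal sketch of tallying pushed resources and bytes could look like the following. It assumes each entry is a HAR-style dict and uses the response MIME type as a stand-in for "asset type"; the function name `pushed_by_type` is hypothetical and this is not the query that was merged.

```python
from collections import defaultdict

def pushed_by_type(entries):
    """Return {mime_type: (count, total_bytes)} for entries flagged as pushed."""
    totals = defaultdict(lambda: [0, 0])
    for entry in entries:
        # Only count requests carrying the "_was_pushed": 1 flag.
        if entry.get("_was_pushed") != 1:
            continue
        response = entry.get("response", {})
        mime = response.get("content", {}).get("mimeType", "unknown")
        # HAR uses -1 when the body size is not available; treat that as 0 bytes.
        size = max(response.get("bodySize", 0), 0)
        totals[mime][0] += 1
        totals[mime][1] += size
    return {mime: tuple(pair) for mime, pair in totals.items()}
```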
rviscomi added the analysis (Querying the dataset) label Jul 23, 2019
rviscomi added this to the Content written milestone Jul 23, 2019
@rviscomi
Member Author

@paulcalvano any progress on this?

rviscomi added the ASAP (This issue is blocking progress) label Sep 4, 2019
@paulcalvano
Contributor

Working on some of these queries tonight. Quick question on the yearly trend - each yearly trend on these queries will process more than 100 TB of data. Is this ok @rviscomi ?

@paulcalvano
Contributor

8 of these queries were added to this PR - #127

Results from the H2 queries are here - https://docs.google.com/spreadsheets/d/1z1gdS3YVpe8J9K3g2UdrtdSPhRywVQRBz5kgBeqCnbw/edit?usp=sharing

@rviscomi
Member Author

rviscomi commented Sep 5, 2019

100 TB per query is too expensive. Maybe just compare July 2019 vs July 2018?

@paulcalvano
Contributor

Good idea. I think doing a 1 year comparison is still very useful, and it will certainly keep the query cost down.

@tunetheweb
Member

Some of the stats are already available as a yearly trend. For example:

Happy to use those. Are they cheaper because they only use summary tables, or are they generated in a different way? Either way, happy to use that data where we have it. Will the State of the Web report still hang around for future years, or is the idea that the Web Almanac replaces it? For other data that isn't as easily and cheaply queried, a year-on-year comparison sounds fine.

@paulcalvano
Contributor

20.15 is easily trendable since it uses the summary tables.

Most of the H2 analysis requires the requests table, which is currently the only place where we can query the protocol. Since it's part of the monthly pipeline, the trend data has accumulated over time instead of being run at once. I think working with the curated report results should be fine for your needs on this 20.02 (H2 requests over time). But 20.01 is not covered by that report since we need to look at H2 base page requests.

@tunetheweb
Member

OK that's what I thought.

And to be clear you meant 20.01 part 2 (total HTTP/2 requests) is fine, but 20.01 part 1 (total sites - aka home pages only) is not fine? You said 20.02 in your comment but 20.02 was dropped. I guess I should really have separated 20.01 out into two asks to avoid this confusion :-)
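
To make the two halves of 20.01 concrete, here is a small sketch of the difference between adoption by site and adoption by request. The input shape (`home_protocol`, `request_protocols`) and the protocol strings accepted by `is_h2` are assumptions for illustration and do not reflect the HTTP Archive table schema.

```python
def is_h2(protocol):
    """Treat "HTTP/2" or "h2" (any case) as HTTP/2 (assumed spellings)."""
    return protocol.strip().lower() in {"http/2", "h2"}

def h2_adoption(pages):
    """Return (share of sites, share of requests) served over HTTP/2.

    pages: non-empty list of dicts, each with at least one request.
    """
    # Part 1: a site counts as HTTP/2 if its home page document is HTTP/2.
    sites_h2 = sum(1 for page in pages if is_h2(page["home_protocol"]))
    # Part 2: every request on every page counts individually.
    requests = [proto for page in pages for proto in page["request_protocols"]]
    requests_h2 = sum(1 for proto in requests if is_h2(proto))
    return sites_h2 / len(pages), requests_h2 / len(requests)
```

The two numbers can differ a lot: one HTTP/2 home page with many requests can push the request share well above the site share.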

@rviscomi
Member Author

rviscomi commented Sep 5, 2019

> Are they cheaper because they only use summary tables, or are they generated in a different way? Either way, happy to use that data where we have it. Will the State of the Web report still hang around for future years, or is the idea that the Web Almanac replaces it?

Both will coexist indefinitely. The distinction being that httparchive.org is the live historical view of the state of the web while almanac.httparchive.org is a companion report that elaborates on what that year's results actually mean.

@rviscomi
Member Author

rviscomi commented Sep 8, 2019

@paulcalvano I've checked off the metrics that have been covered by #127 although there are still 7 remaining. Once those are finalized we can pass this over to @bazzadp to start writing.

@paulcalvano
Contributor

@bazzadp - I should have these done soon. Few questions on the remaining queries:

20.07 - I'm not clear on how to detect H2 prioritization issues within the HTTP Archive data. One way we could estimate it is by categorizing sites as using web servers that pass/fail, as well as CDNs that pass/fail. If we did that we could consider a fail to be:

  • use of a CDN that fails prioritization test
    OR
  • use of a webserver that fails prioritization test + No CDN
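
The rule above could be sketched roughly as follows. The lookup dictionaries and the names in them (`cdn-pass`, `server-fail`, and so on) are placeholders, not real test results, and treating an unlisted CDN or server as "not a known fail" is an assumption.

```python
# Placeholder pass/fail lookups: True = passes the prioritization test.
CDN_RESULTS = {"cdn-pass": True, "cdn-fail": False}
SERVER_RESULTS = {"server-pass": True, "server-fail": False}

def fails_prioritization(cdn, server, cdn_results=CDN_RESULTS,
                         server_results=SERVER_RESULTS):
    """Apply the proposed rule; cdn is None when the site uses no CDN."""
    if cdn is not None:
        # A CDN is in front, so only the CDN's result matters.
        return cdn_results.get(cdn) is False
    # No CDN: fall back to the web server's result.
    return server_results.get(server) is False
```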

20.13 - Are you just interested in HTTP headers with preload, or are HTTP response bodies containing preload of interest here?

20.15 - Do you want the average number of TCP connections? Or a breakdown of how many sites have 1 connection, 2 connections, n connections?

@tunetheweb
Member

Hey @paulcalvano,

20.07 - I think we should limit this to CDNs as those are the only ones we have a definitive list for and they should be reasonably consistent compared to server setup which varies a lot in installed version, config and O/S TCP stacks...etc. So if we could report like this based on Andy's list (with additional "Not using CDN" and "Other CDN" lines at the top):

| CDN | Prioritises correctly | Percentage of Home Pages |
|---|---|---|
| Not using CDN | Unknown | 56% |
| Other CDN | Unknown | 12% |
| Akamai | Yes | 10% |
| Amazon CloudFront | No | 5% |
| ...etc. for rest of Andy's list | ... | ... |

Not sure how easy it is to import Andy's table, or whether we have to do this lookup in Google Sheets afterwards?
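
If the lookup did have to happen outside the query, something like this sketch would produce the report shape above. It assumes one detected CDN name per home page (`None` when there is no CDN); `PRIORITIZES` holds only the two example rows from the table above, and `cdn_breakdown` is a hypothetical name.

```python
from collections import Counter

# Only the two example rows from the table above; the real list is longer.
PRIORITIZES = {"Akamai": True, "Amazon CloudFront": False}

def cdn_breakdown(home_page_cdns, prioritizes=PRIORITIZES):
    """Return (cdn, prioritises_correctly, percent_of_home_pages) rows."""
    counts = Counter()
    for cdn in home_page_cdns:
        if cdn is None:
            counts[("Not using CDN", "Unknown")] += 1
        elif cdn not in prioritizes:
            counts[("Other CDN", "Unknown")] += 1
        else:
            counts[(cdn, "Yes" if prioritizes[cdn] else "No")] += 1
    total = len(home_page_cdns)  # assumed non-empty
    return [(name, status, 100 * n / total)
            for (name, status), n in counts.most_common()]
```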

20.13 Just HTTP Headers. Preload in HTML is not a signal to HTTP/2 push. Plus for this stat I'm actually explicitly looking at usage of "nopush" in the HTTP Header to prevent HTTP/2 push.
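
For reference, a header like `Link: </style.css>; rel=preload; as=style; nopush` is what 20.13 is counting. A naive sketch of detecting it is below. It splits links on commas, which breaks for URLs containing commas, and the function name `count_preload_nopush` is hypothetical.

```python
def count_preload_nopush(link_header):
    """Count links in a Link header value with rel=preload and nopush."""
    count = 0
    for link in link_header.split(","):
        # Everything after the first ";" is a parameter of this link.
        params = [p.strip().lower() for p in link.split(";")[1:]]
        rels = [p.split("=", 1)[1].strip('"').split()
                for p in params if p.startswith("rel=")]
        is_preload = any("preload" in rel for rel in rels)
        if is_preload and "nopush" in params:
            count += 1
    return count
```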

20.15 Interesting question. I guess what I'm trying to show is whether adoption of HTTP/2 is resulting in 1) fewer connections (as it uses 1 connection rather than 6) and 2) less usage of sharding (e.g. static.example.com type domains). It is probably most simply measured with the current stat on TCP connections per page, split by HTTP/1 and HTTP/2 home pages (where an HTTP/2 site is one whose main index.html page, or equivalent, is served over HTTP/2). I think your suggestion of a breakdown by number of connections would be too influenced by marketing stuff (e.g. if example.com stopped using 6 connections but still loads 100 ad tech things, we go down from 106 connections to 100 connections, which is not really that noticeable).
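
In other words, something along the lines of this sketch, where each page is classified by the protocol of its main document and the connection counts are averaged per group. The field names `home_protocol` and `tcp_connections` are assumed for illustration.

```python
from statistics import mean

def avg_connections_by_protocol(pages):
    """Average TCP connections per page, split by home page protocol."""
    groups = {"HTTP/2": [], "HTTP/1.x": []}
    for page in pages:
        # Classify the whole site by how its main document was served.
        h2 = page["home_protocol"].strip().lower() in {"http/2", "h2"}
        groups["HTTP/2" if h2 else "HTTP/1.x"].append(page["tcp_connections"])
    # Skip groups with no pages so mean() never sees an empty list.
    return {name: mean(values) for name, values in groups.items() if values}
```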

BTW did you see my comment #22 (comment) on some of the stats you dropped off? Is it possible to look at adding 20.02 and 20.17 back in based on those comments?

@paulcalvano
Contributor

Thanks. I added httparchive.almanac.h2_prioritization_cdns_201909 with the latest results from Andy's table and was able to create a query matching the table in your example.

@paulcalvano
Contributor

Just submitted a PR with more H2 queries. Some notes

  • I wrote another query named 20_04a_05a_06a, which contains details for 20.04, 20.05 and 20.06
  • 20.15 and 20.16 are very similar, and I think the query that I wrote for 20.15 satisfies both. However if I'm mistaken, let me know and I'll update.

I'll review comment #22 in the morning and see if we can add 20.02 and 20.17 back in.

@paulcalvano
Contributor

paulcalvano commented Sep 26, 2019

Added 20.02. Will take a look at 20.17 now...

20.02 - Measure of all HTTP versions (0.9, 1.0, 1.1, 2, QUIC) for main page of all sites, and for HTTPS sites. Table for last crawl.

@tunetheweb
Copy link
Member

Thanks @paulcalvano! Marked the results with a few questions. Big one being that I'm not seeing QUIC anywhere - or is the blank protocol QUIC?
