Query metrics: Chapter 5. Third Parties #86

Closed
13 tasks done
rviscomi opened this issue Jul 23, 2019 · 7 comments
Labels
analysis Querying the dataset

Comments

@rviscomi
Member

rviscomi commented Jul 23, 2019

Part Chapter Authors Reviewers Tracking Issue
I. Page Content 5. Third Parties @patrickhulce @simonhearne @flowlabs @jasti @zeman #8

READ ME!

All of the metrics in the table below have been marked as Able To Query during the metrics triage. The analyst assigned to each metric is expected to write the corresponding query and submit a PR to have it reviewed and added to the repo.

In order to stay on schedule and have the data ready for authors, please have all metrics reviewed and merged by August 5.

Assignments

| ID | Metric description | Analyst | Notes |
| --- | --- | --- | --- |
| 05.01 | Percentage of pages that include at least one third-party resource. | @patrickhulce | |
| 05.02 | Percentage of pages that include at least one ad resource. | @patrickhulce | |
| 05.03 | Percentage of requests that are third party requests broken down by third party category by resource type. | @patrickhulce | |
| 05.04 | Percentage of total bytes that are from third party requests broken down by third party category by resource type. | @patrickhulce | |
| 05.05 | Percentage of total script execution time that is from third party scripts broken down by third party category. | @patrickhulce | |
| 05.06 | Top 100 third party domains by request volume | @patrickhulce | |
| 05.07 | Top 100 third party domains by total byte weight | @patrickhulce | |
| 05.08 | Top 100 third party domains by total script execution time | @patrickhulce | |
| 05.09 | Top 100 third party requests by request volume | @patrickhulce | |
| 05.10 | Top 100 third party requests by total script execution time | @patrickhulce | |
| 05.11 | Percentile breakdown page-relative percentage of requests that are third party requests broken down by third party category. | @patrickhulce | |
| 05.12 | Percentile breakdown page-relative percentage of total bytes that are from third party requests broken down by third party category. | @patrickhulce | |
| 05.13 | Percentile breakdown page-relative percentage of total script execution time that is from third party scripts. | @patrickhulce | |

Checklist of metrics to be merged

  • 05.01 Percentage of pages that include at least one third-party resource.
  • 05.02 Percentage of pages that include at least one ad resource.
  • 05.03 Percentage of requests that are third party requests broken down by third party category by resource type.
  • 05.04 Percentage of total bytes that are from third party requests broken down by third party category by resource type.
  • 05.05 Percentage of total script execution time that is from third party scripts broken down by third party category.
  • 05.06 Top 100 third party domains by request volume (Chapter 5: Add All Byte and Request Count Queries #107)
  • 05.07 Top 100 third party domains by total byte weight (Chapter 5: Add All Byte and Request Count Queries #107)
  • 05.08 Top 100 third party domains by total script execution time
  • 05.09 Top 100 third party requests by request volume
  • 05.10 Top 100 third party requests by total script execution time
  • 05.11 Percentile breakdown page-relative percentage of requests that are third party requests broken down by third party category.
  • 05.12 Percentile breakdown page-relative percentage of total bytes that are from third party requests broken down by third party category.
  • 05.13 Percentile breakdown page-relative percentage of total script execution time that is from third party scripts.
@rviscomi
Member Author

rviscomi commented Nov 8, 2019

Hey @patrickhulce, I'm going through your chapter to create the data viz you requested, but some of the query results don't match up with the values you're writing about. For example:

> Categories
>
> If the ubiquity of third-party content is unsurprising, perhaps more interesting is the breakdown of third-party content by provider type.
>
> While advertising might be the most user-visible example of third-party presence on the web, analytics providers are the most common third-party category with 76% of sites including at least one analytics request. CDNs at 63%, ads at 57%, and developer utilities like Sentry, Stripe, and Google Maps SDK at 56% follow up as a close second, third, and fourth for appearing on the most web properties. The popularity of these categories forms the foundation of our web usage patterns identified later in the chapter.
>
> <insert graphic of metric 05_11>

Looking at the results of 05_11, I'm not seeing analytics with 76%. The median percentAnalyticsRequestsQuantiles is 2.91% for desktop and 2.82% for mobile. Were you looking at a different metric? Did you modify the query in some way not reflected by the results? FWIW I tweaked your query to only show the 10/25/50/75/90 percentiles as opposed to all 100, but the results are the same.

In your text you're mentioning the percent of sites having analytics (as opposed to requests), which sounds more accurate. But still, I don't know where you got that number for reference.

@rviscomi rviscomi reopened this Nov 8, 2019
@patrickhulce
Contributor

patrickhulce commented Nov 8, 2019

> FWIW I tweaked your query to only show the 10/25/50/75/90 percentiles as opposed to all 100, but the results are the same.

It will be a little more difficult to see the results I'm talking about with this change. I'm saying that analytics is "the most common third-party category with 76% of sites including at least one analytics request", not that analytics requests make up 76% of requests. If you look at the 25th percentile for desktop, you'll see analytics requests make up 0.93% of requests. That value is non-zero, so pages at that percentile have at least 1 analytics request, meaning at least 75% of pages have at least 1.
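
The percentile reading described above can be sketched in a few lines of Python. This is only an illustration, not the query behind 05_11: the function name and the toy numbers are made up, and it assumes a list of 101 values where index p holds the value at percentile p (BigQuery's `APPROX_QUANTILES(x, 100)` returns 101 elements, from the approximate minimum to the approximate maximum).

```python
def share_with_at_least_one(quantiles):
    """Return a rough lower bound on the percent of pages with a non-zero value.

    `quantiles` holds 101 values, where index p is the value at percentile p.
    """
    for percentile, value in enumerate(quantiles):
        if value > 0:
            # Every page at or above this percentile also has a non-zero value.
            return 100 - percentile
    return 0


# Hypothetical quantiles: zero through the 29th percentile, non-zero from the 30th.
toy_quantiles = [0.0] * 30 + [0.001 * (i + 1) for i in range(71)]
print(share_with_at_least_one(toy_quantiles))  # 70
```

Trimming the output to the 10/25/50/75/90 percentiles makes this lookup coarser, since the first non-zero percentile may fall between the retained points.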

@rviscomi
Member Author

rviscomi commented Nov 8, 2019

Sorry, what's special about 0.93% to indicate that there is at least 1 request? Is 1% == 1 request? Is there a more straightforward way to query this metric? Even if there isn't time to rewrite the query, how can we visualize the current 05_11 results to show what you're referring to?

@patrickhulce
Contributor

Well, it's not possible to make fractional requests, so anything non-zero indicates that sites at that percentile made at least one request. My goal with the analysis was to point out interesting tidbits rather than just regurgitate what could be seen from a graph at first glance, but it sounds like this reached a little too far from the obvious insights and I should have stayed more closely aligned with the pure results of the query?

> Is there a more straightforward way to query this metric?

This is the query that was optimized to be the most flexible and can show the widest range of insights. I get that it makes it difficult to cajole the raw data into a visualization that matches what can be said about that data, though.

If we just want to match the analysis then we can basically throw out the quantiles and repeat the line for pages with a third party for each named category (https://github.com/HTTPArchive/almanac.httparchive.org/pull/107/files#diff-561fc8f885c05295633879f79753feffR6)
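
As a rough illustration of that more direct approach, the sketch below counts, for each category, the share of pages with at least one request in it. It is not the query from PR #107: the function name, the per-page dictionaries, and the category labels are all hypothetical, and the counts are toy data.

```python
from collections import Counter


def percent_pages_with_category(pages):
    """Return {category: percent of pages with at least one request in it}.

    `pages` is a list of dicts mapping a category name to that page's
    request count for the category.
    """
    pages_with = Counter()
    for page in pages:
        for category, request_count in page.items():
            if request_count > 0:
                pages_with[category] += 1
    total = len(pages)
    return {category: 100 * n / total for category, n in pages_with.items()}


# Toy data: four hypothetical pages with per-category request counts.
toy_pages = [
    {"analytics": 2, "ad": 5},
    {"analytics": 1, "cdn": 3},
    {"cdn": 1},
    {"analytics": 4, "ad": 0},
]
print(percent_pages_with_category(toy_pages))
```

Because this counts pages rather than requests, the result maps directly onto statements of the form "X% of sites include at least one request in this category".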

@rviscomi
Member Author

Reverted my query change. I'll close this out and open a new PR with any changes.

@patrickhulce
Contributor

Ok, sounds good. Sorry for the trouble @rviscomi, and thanks very much for tackling those! Let me know if there's something specific I can knock out :) (will be flying starting at ~6pm PST today though, see ya soon!)

@rviscomi
Member Author

@patrickhulce just want to make sure you didn't misinterpret your own data. Here's how one of the data points in 05_11 is queried:

APPROX_QUANTILES(numberOfThirdPartyRequests / numberOfRequests, 100)

Each percentile is a percent of requests (count / total), not the number of requests. That's why I was asking about 1% != 1 request.
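
A small Python sketch may make the distinction concrete. The function name and the page numbers are hypothetical and chosen purely for illustration. It shows that the ratio is a share of requests rather than a count, while a non-zero share still implies at least one request.

```python
def request_share(category_requests, total_requests):
    """Return the fraction of a page's requests that belong to one category."""
    if total_requests == 0:
        return 0.0
    return category_requests / total_requests


# Hypothetical pages as (category requests, total requests).
toy_pages = [(0, 50), (1, 107), (3, 40), (1, 100)]
for category_requests, total_requests in toy_pages:
    share = request_share(category_requests, total_requests)
    # A share is non-zero exactly when the page made at least one such request.
    assert (share > 0) == (category_requests >= 1)
    print(f"{category_requests} of {total_requests} requests = {share:.2%}")
```

In this toy data, a single request out of 107 yields a share of about 0.93%, and a single request out of 100 yields 1%, so the percentage on its own does not say how many requests a page made.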
