Write queries and add to the repo #62

rviscomi · 2019-06-25T11:27:25Z

When the Analyst team generates queries for each metric, they should create a PR to merge it into the repo. This has two benefits: the PR process provides an opportunity for peer review, and it is a place to share and maintain the canonical queries. On the Almanac website we can link directly to the queries from each respective chapter/figure so readers can see exactly how it was calculated and fork it for their own analysis.

create a new directory system to organize queries (@KJLarson)
test queries (analysts)
file a PR to merge the queries into their respective directory (analysts)

For testing queries, you can query the new almanac dataset, which contains desktop/mobile sample tables for 1,000 websites. This smaller dataset should help you refine your queries without incurring the full cost for all ~5M websites.

Query guidelines:

must specify #standardSQL on the first line and use Standard SQL
must include a short description of the metric it's analyzing, eg:

# Percentage of requests that are third party requests
# broken down by third party category by resource type.

must query the 2019_07_01 dataset (unless otherwise needed)
must be reasonably optimized where possible
file must be named according to its metric ID, eg 05_03.sql
file must be placed in the directory according to its chapter, eg 05_ThirdParties/05_03.sql

The text was updated successfully, but these errors were encountered:

rviscomi · 2019-06-25T11:33:04Z

@HTTPArchive/developers any takers for the first task? We need to figure out a good home for the queries and create the directory structure.

I think it should be named according to this pattern /sql/2019/05/05.3.sql, but I'm open to suggestions. Some open questions:

should the chapter directories be named numerically or by their title, eg 05 vs Third Parties?
should the chapter.metric.sql format use a . delimiter between chapter and metric, or does that interfere/confuse at all with the file extension?
should the file names also include the chapter number, eg 05.3.sql, or omit the (redundant) chapter, eg 3.sql?

rviscomi · 2019-06-25T12:41:01Z

@KJLarson FYI this is another good first issue (task 1 of 3, creating directory system)

KJLarson · 2019-06-25T14:32:39Z

Oooh, oooh! I can do this! I've been doing records management for the last couple years and have a Masters of Library and Information Science...this is totally in my realm!

rviscomi · 2019-06-25T15:22:48Z

Sold! Thanks @KJLarson!

Have a look at the open questions in #62 (comment) and feel free to start a PR with the new sql directories.

KJLarson · 2019-06-25T19:29:45Z

Here are some initial thoughts and questions I have after looking at the questions from comment #62, the metrics triage spreadsheet, the file structure of HTTPArchive.org, and some records management naming convention articles:

I don't think the metric IDs will sort in order the way they are written now. Adding leading zeros to the metric number should fix sorting issues.
Underscores would probably be the best way to delimit the components of the directory and file names
I think it would be helpful to have the chapter number and name in the directory name. This way, if someone is looking at the list of chapter directories, it doesn't matter if they know just the chapter number or just the chapter name...they can still find what they are looking for
I think having the chapter number with the the metric number in the file name would be helpful and add context in case a file gets separated from its chapter's metric family somehow. And the chapter numbers won't add much length to the file names/paths.
Will the same queries be used every year?

KJLarson · 2019-06-25T19:31:50Z

Here's my first directory structure thought (I didn't fill it all in; hopefully this is enough to get a picture of what it would look like):

src
+--- sql
     +--- 2019
     |    +--- 01_JavaScript
     |    |       01_01.sql
     |    |       01_02.sql
     |    +--- 02_CSS
     |    |       02_01.sql
     |    |       02_02.sql
     |    |       02_03.sql
     |    +--- 03_Markup
     |    |       03_01.sql
     |    |       03_02.sql
     |    +--- 04_Media
     |    |       04_01.sql
     |    +--- 05_ThirdParties
     |    |       05_01.sql
     |    +--- 06_Fonts
     |    +--- 07_Performance
     |    +--- 08_Security
     |    +--- 09_Accessibility
     |    +--- 10_SEO
     |    +--- 11_PWA
     |    +--- 12_MobileWeb
     |    +--- 13_Ecommerce
     |    +--- 14_CMS
     |    +--- 15_Compression
     |    +--- 16_Caching
     |    +--- 17_CDN
     |    +--- 18_PageWeight
     |    +--- 19_ResourceHints
     |    +--- 20_HTTP2
     +--- 2020
     |    +--- 01_

The metric IDs look a bit different than they do in the spreadsheet. Not sure how much room, if any, there is to stray from how we write the numbers.

rviscomi · 2019-06-25T19:38:47Z

That looks perfect! Thanks for the thoughtful approach.

I don't think the metric IDs will sort in order the way they are written now. Adding leading zeros to the metric number should fix sorting issues.

Very good catch.

I think it would be helpful to have the chapter number and name in the directory name. This way, if someone is looking at the list of chapter directories, it doesn't matter if they know just the chapter number or just the chapter name...they can still find what they are looking for

Good suggestion.

Will the same queries be used every year?

This won't be the case because the 2019 queries will explicitly reference the 2019_07_01 dataset.

The metric IDs look a bit different than they do in the spreadsheet. Not sure how much room, if any, there is to stray from how we write the numbers.

Not a problem at all.

I only have two small followup questions: since this isn't used directly by the web server, can we move it out of src? for chapters with multiple words could we delimit words by _?

Bonus question: for queries that could possibly be used in multiple chapters, how do you think they should be named?

KJLarson · 2019-06-25T19:52:49Z

Yes and yes.

Bonus question: Hmmm...I will have to think about this one. Does multiple mean a couple chapters, most chapters, or somewhere in between? I suppose it wouldn't be ideal to save these queries in different directories with different names. Is there value added knowing that the same query was used in multiple chapters?

…ach chapter. Progress on HTTPArchive#62

rviscomi · 2019-06-27T15:49:41Z

I think at most there will be overlap for 10 metrics, but probably closer to 2 or 3. Thinking more about this, we should probably still name them according to the chapter they appear in and have duplicates because casual readers who want to explore the queries won't care if it's used somewhere else, they just want to find the corresponding query.

* Create contributors.json * Add data structure and samples. Progress on #51 * Add code to get and display contributors data. Progress on #51 * Remove import jsonify Co-Authored-By: Rick Viscomi <[email protected]> * Update render_template in src/main.py. Progress on #57. Co-Authored-By: Rick Viscomi <[email protected]> * Restructure team data. Progress on #57 * Add method to update contributors. Progress on #57 Co-Authored-By: Rick Viscomi <[email protected]> * Declare global variable first Co-Authored-By: Rick Viscomi <[email protected]> * Remove comments * Add loop to render contributor's team. Progress on #57 Co-Authored-By: Rick Viscomi <[email protected]> * make user IDs lowercase * remove contributors GitHub and Twitter * Add directories for sql queries and sql queries for first metric in each chapter. Progress on #62 * Delete contributor changes. Progress on #65 * Remove contributors.json. Progress on #65

rviscomi · 2019-09-26T05:36:14Z

PRs with queries for all metrics have been merged or are being reviewed! 🎉

rviscomi added development Building the Almanac tech stack analysis Querying the dataset labels Jun 25, 2019

rviscomi added this to the Infrastructure prepared milestone Jun 25, 2019

rviscomi added the good first issue Good for newcomers label Jun 25, 2019

rviscomi assigned KJLarson Jun 25, 2019

KJLarson added a commit to KJLarson/almanac.httparchive.org that referenced this issue Jun 26, 2019

Add directories for sql queries and sql queries for first metric in e…

a1ec011

…ach chapter. Progress on HTTPArchive#62

KJLarson mentioned this issue Jun 27, 2019

Create directory structure for metric queries #65

Merged

rviscomi modified the milestones: Infrastructure prepared, Content written Jun 30, 2019

rviscomi unassigned KJLarson Jun 30, 2019

rviscomi removed the development Building the Almanac tech stack label Jun 30, 2019

rviscomi mentioned this issue Jul 1, 2019

Assign analysts to chapters #71

Closed

rviscomi removed the good first issue Good for newcomers label Jul 3, 2019

patrickhulce mentioned this issue Jul 29, 2019

Chapter 5: Add All Byte and Request Count Queries #107

Merged

rviscomi added the ASAP This issue is blocking progress label Sep 4, 2019

rviscomi closed this as completed Sep 26, 2019

rviscomi mentioned this issue Mar 16, 2020

Better image directory folders #683

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Write queries and add to the repo #62

Write queries and add to the repo #62

rviscomi commented Jun 25, 2019 •

edited

Loading

rviscomi commented Jun 25, 2019

rviscomi commented Jun 25, 2019

KJLarson commented Jun 25, 2019

rviscomi commented Jun 25, 2019

KJLarson commented Jun 25, 2019

KJLarson commented Jun 25, 2019 •

edited

Loading

rviscomi commented Jun 25, 2019

KJLarson commented Jun 25, 2019

rviscomi commented Jun 27, 2019

rviscomi commented Sep 26, 2019

Write queries and add to the repo #62

Write queries and add to the repo #62

Comments

rviscomi commented Jun 25, 2019 • edited Loading

rviscomi commented Jun 25, 2019

rviscomi commented Jun 25, 2019

KJLarson commented Jun 25, 2019

rviscomi commented Jun 25, 2019

KJLarson commented Jun 25, 2019

KJLarson commented Jun 25, 2019 • edited Loading

rviscomi commented Jun 25, 2019

KJLarson commented Jun 25, 2019

rviscomi commented Jun 27, 2019

rviscomi commented Sep 26, 2019

rviscomi commented Jun 25, 2019 •

edited

Loading

KJLarson commented Jun 25, 2019 •

edited

Loading