Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Write queries and add to the repo #62

Closed
3 tasks done
rviscomi opened this issue Jun 25, 2019 · 10 comments
Closed
3 tasks done

Write queries and add to the repo #62

rviscomi opened this issue Jun 25, 2019 · 10 comments
Labels
analysis Querying the dataset ASAP This issue is blocking progress

Comments

@rviscomi
Copy link
Member

rviscomi commented Jun 25, 2019

When the Analyst team generates queries for each metric, they should create a PR to merge it into the repo. This has two benefits: the PR process provides an opportunity for peer review, and it is a place to share and maintain the canonical queries. On the Almanac website we can link directly to the queries from each respective chapter/figure so readers can see exactly how it was calculated and fork it for their own analysis.

  • create a new directory system to organize queries (@KJLarson)
  • test queries (analysts)
  • file a PR to merge the queries into their respective directory (analysts)

For testing queries, you can query the new almanac dataset, which contains desktop/mobile sample tables for 1,000 websites. This smaller dataset should help you refine your queries without incurring the full cost for all ~5M websites.

Query guidelines:

  • must specify #standardSQL on the first line and use Standard SQL
  • must include a short description of the metric it's analyzing, eg:
# Percentage of requests that are third party requests
# broken down by third party category by resource type.
  • must query the 2019_07_01 dataset (unless otherwise needed)
  • must be reasonably optimized where possible
  • file must be named according to its metric ID, eg 05_03.sql
  • file must be placed in the directory according to its chapter, eg 05_ThirdParties/05_03.sql
@rviscomi rviscomi added development Building the Almanac tech stack analysis Querying the dataset labels Jun 25, 2019
@rviscomi rviscomi added this to the Infrastructure prepared milestone Jun 25, 2019
@rviscomi
Copy link
Member Author

@HTTPArchive/developers any takers for the first task? We need to figure out a good home for the queries and create the directory structure.

I think it should be named according to this pattern /sql/2019/05/05.3.sql, but I'm open to suggestions. Some open questions:

  • should the chapter directories be named numerically or by their title, eg 05 vs Third Parties?
  • should the chapter.metric.sql format use a . delimiter between chapter and metric, or does that interfere/confuse at all with the file extension?
  • should the file names also include the chapter number, eg 05.3.sql, or omit the (redundant) chapter, eg 3.sql?

@rviscomi rviscomi added the good first issue Good for newcomers label Jun 25, 2019
@rviscomi
Copy link
Member Author

@KJLarson FYI this is another good first issue (task 1 of 3, creating directory system)

@KJLarson
Copy link
Contributor

Oooh, oooh! I can do this! I've been doing records management for the last couple years and have a Masters of Library and Information Science...this is totally in my realm!

@rviscomi
Copy link
Member Author

Sold! Thanks @KJLarson!

Have a look at the open questions in #62 (comment) and feel free to start a PR with the new sql directories.

@KJLarson
Copy link
Contributor

Here are some initial thoughts and questions I have after looking at the questions from comment #62, the metrics triage spreadsheet, the file structure of HTTPArchive.org, and some records management naming convention articles:

  • I don't think the metric IDs will sort in order the way they are written now. Adding leading zeros to the metric number should fix sorting issues.
  • Underscores would probably be the best way to delimit the components of the directory and file names
  • I think it would be helpful to have the chapter number and name in the directory name. This way, if someone is looking at the list of chapter directories, it doesn't matter if they know just the chapter number or just the chapter name...they can still find what they are looking for
  • I think having the chapter number with the the metric number in the file name would be helpful and add context in case a file gets separated from its chapter's metric family somehow. And the chapter numbers won't add much length to the file names/paths.
  • Will the same queries be used every year?

@KJLarson
Copy link
Contributor

KJLarson commented Jun 25, 2019

Here's my first directory structure thought (I didn't fill it all in; hopefully this is enough to get a picture of what it would look like):

src
+--- sql
     +--- 2019
     |    +--- 01_JavaScript
     |    |       01_01.sql
     |    |       01_02.sql
     |    +--- 02_CSS
     |    |       02_01.sql
     |    |       02_02.sql
     |    |       02_03.sql
     |    +--- 03_Markup
     |    |       03_01.sql
     |    |       03_02.sql
     |    +--- 04_Media
     |    |       04_01.sql
     |    +--- 05_ThirdParties
     |    |       05_01.sql
     |    +--- 06_Fonts
     |    +--- 07_Performance
     |    +--- 08_Security
     |    +--- 09_Accessibility
     |    +--- 10_SEO
     |    +--- 11_PWA
     |    +--- 12_MobileWeb
     |    +--- 13_Ecommerce
     |    +--- 14_CMS
     |    +--- 15_Compression
     |    +--- 16_Caching
     |    +--- 17_CDN
     |    +--- 18_PageWeight
     |    +--- 19_ResourceHints
     |    +--- 20_HTTP2
     +--- 2020
     |    +--- 01_

The metric IDs look a bit different than they do in the spreadsheet. Not sure how much room, if any, there is to stray from how we write the numbers.

@rviscomi
Copy link
Member Author

That looks perfect! Thanks for the thoughtful approach.

I don't think the metric IDs will sort in order the way they are written now. Adding leading zeros to the metric number should fix sorting issues.

Very good catch.

I think it would be helpful to have the chapter number and name in the directory name. This way, if someone is looking at the list of chapter directories, it doesn't matter if they know just the chapter number or just the chapter name...they can still find what they are looking for

Good suggestion.

Will the same queries be used every year?

This won't be the case because the 2019 queries will explicitly reference the 2019_07_01 dataset.

The metric IDs look a bit different than they do in the spreadsheet. Not sure how much room, if any, there is to stray from how we write the numbers.

Not a problem at all.

I only have two small followup questions: since this isn't used directly by the web server, can we move it out of src? for chapters with multiple words could we delimit words by _?

Bonus question: for queries that could possibly be used in multiple chapters, how do you think they should be named?

@KJLarson
Copy link
Contributor

Yes and yes.

Bonus question: Hmmm...I will have to think about this one. Does multiple mean a couple chapters, most chapters, or somewhere in between? I suppose it wouldn't be ideal to save these queries in different directories with different names. Is there value added knowing that the same query was used in multiple chapters?

KJLarson added a commit to KJLarson/almanac.httparchive.org that referenced this issue Jun 26, 2019
@rviscomi
Copy link
Member Author

I think at most there will be overlap for 10 metrics, but probably closer to 2 or 3. Thinking more about this, we should probably still name them according to the chapter they appear in and have duplicates because casual readers who want to explore the queries won't care if it's used somewhere else, they just want to find the corresponding query.

rviscomi pushed a commit that referenced this issue Jun 29, 2019
* Create contributors.json

* Add data structure and samples. Progress on #51

* Add code to get and display contributors data. Progress on #51

* Remove import jsonify

Co-Authored-By: Rick Viscomi <[email protected]>

* Update render_template in src/main.py. Progress on #57.

Co-Authored-By: Rick Viscomi <[email protected]>

* Restructure team data. Progress on #57

* Add method to update contributors. Progress on #57

Co-Authored-By: Rick Viscomi <[email protected]>

* Declare global variable first

Co-Authored-By: Rick Viscomi <[email protected]>

* Remove comments

* Add loop to render contributor's team. Progress on #57

Co-Authored-By: Rick Viscomi <[email protected]>

* make user IDs lowercase

* remove contributors GitHub and Twitter

* Add directories for sql queries and sql queries for first metric in each chapter. Progress on #62

* Delete contributor changes. Progress on #65

* Remove contributors.json. Progress on #65
@rviscomi rviscomi removed the development Building the Almanac tech stack label Jun 30, 2019
@rviscomi rviscomi removed the good first issue Good for newcomers label Jul 3, 2019
@rviscomi
Copy link
Member Author

PRs with queries for all metrics have been merged or are being reviewed! 🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
analysis Querying the dataset ASAP This issue is blocking progress
Projects
None yet
Development

No branches or pull requests

2 participants