Write queries and add to the repo #62
@HTTPArchive/developers any takers for the first task? We need to figure out a good home for the queries and create the directory structure. I think it should be named according to this pattern
@KJLarson FYI this is another good first issue (task 1 of 3, creating the directory structure)
Oooh, oooh! I can do this! I've been doing records management for the last couple of years and have a Master's in Library and Information Science... this is totally in my realm!
Sold! Thanks @KJLarson! Have a look at the open questions in #62 (comment) and feel free to start a PR with the new sql directories.
Here are some initial thoughts and questions I have after looking at the questions from comment #62, the metrics triage spreadsheet, the file structure of HTTPArchive.org, and some records management naming convention articles:
Here's my first pass at the directory structure (I didn't fill it all in; hopefully this is enough to give a picture of what it would look like):
The metric IDs look a bit different from how they appear in the spreadsheet. I'm not sure how much room, if any, there is to stray from how we write the numbers.
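As a rough sketch, the layout could look something like the tree below. Only `05_ThirdParties/05_03.sql` comes from this thread; the other chapter and metric names are placeholders for illustration:

```
sql/
  01_SomeChapter/      <- placeholder chapter name
    01_01.sql
    01_02.sql
  05_ThirdParties/
    05_03.sql
```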
That looks perfect! Thanks for the thoughtful approach.
Very good catch.
Good suggestion.
This won't be the case because the 2019 queries will explicitly reference the 2019_07_01 dataset.
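Concretely, pinning a query to a dated crawl might look like the sketch below. The table name is an assumption based on HTTP Archive's usual `YYYY_MM_DD_client` naming scheme, not something specified in this thread:

```sql
#standardSQL
-- Sketch only: a 2019 query that references the 2019_07_01 crawl
-- explicitly, so rerunning it later still reads the same snapshot.
-- The table name is assumed from HTTP Archive's naming convention.
SELECT
  COUNT(0) AS pages
FROM
  `httparchive.summary_pages.2019_07_01_desktop`
```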
Not a problem at all. I only have two small follow-up questions: since this isn't used directly by the web server, can we move it out of ...? Bonus question: for queries that could possibly be used in multiple chapters, how do you think they should be named?
Yes and yes. Bonus question: Hmmm... I will have to think about this one. Does multiple mean a couple of chapters, most chapters, or somewhere in between? I suppose it wouldn't be ideal to save these queries in different directories with different names. Is there value added in knowing that the same query was used in multiple chapters?
I think at most there will be overlap for 10 metrics, but probably closer to 2 or 3. Thinking more about this, we should probably still name them according to the chapter they appear in and have duplicates, because casual readers who want to explore the queries won't care whether a query is used somewhere else; they just want to find the one corresponding to their chapter.
* Create contributors.json
* Add data structure and samples. Progress on #51
* Add code to get and display contributors data. Progress on #51
* Remove import jsonify (Co-Authored-By: Rick Viscomi <[email protected]>)
* Update render_template in src/main.py. Progress on #57 (Co-Authored-By: Rick Viscomi <[email protected]>)
* Restructure team data. Progress on #57
* Add method to update contributors. Progress on #57 (Co-Authored-By: Rick Viscomi <[email protected]>)
* Declare global variable first (Co-Authored-By: Rick Viscomi <[email protected]>)
* Remove comments
* Add loop to render contributor's team. Progress on #57 (Co-Authored-By: Rick Viscomi <[email protected]>)
* Make user IDs lowercase
* Remove contributors GitHub and Twitter
* Add directories for sql queries and sql queries for first metric in each chapter. Progress on #62
* Delete contributor changes. Progress on #65
* Remove contributors.json. Progress on #65
PRs with queries for all metrics have been merged or are being reviewed! 🎉
When the Analyst team generates queries for each metric, they should create a PR to merge them into the repo. This has two benefits: the PR process provides an opportunity for peer review, and the repo becomes the place to share and maintain the canonical queries. On the Almanac website we can link directly to the queries from each respective chapter/figure, so readers can see exactly how each result was calculated and fork the query for their own analysis.
For testing queries, you can query the new almanac dataset, which contains desktop/mobile sample tables for 1,000 websites. This smaller dataset should help you refine your queries without incurring the full cost for all ~5M websites.
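As a sketch, a test run against the sample dataset might look like the query below; the exact table and column names in the `almanac` dataset are assumptions for illustration, so check the dataset before relying on them:

```sql
#standardSQL
-- Sketch: develop against the 1,000-site sample tables first, then
-- swap the FROM clause to the full dataset once the query is right.
-- Table and column names here are assumptions, not confirmed names.
SELECT
  APPROX_QUANTILES(reqTotal, 100)[OFFSET(50)] AS median_requests
FROM
  `httparchive.almanac.summary_pages_desktop_1k`
```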
Query guidelines:
* Include `#standardSQL` on the first line and use Standard SQL.
* Name each query file after its metric ID, e.g. `05_03.sql`.
* Save each file in its chapter's directory, e.g. `05_ThirdParties/05_03.sql`.