Privacy 2020 queries #1129

Merged
merged 12 commits into from
Nov 28, 2020

Conversation

max-ostapenko
Contributor

@max-ostapenko max-ostapenko commented Jul 30, 2020

Progress on #913

Online tracking

  • top100_trackers_by_websites.sql - using EasyList; by device/category
  • percent_of_tracked_websites_by_country.sql
  • percent_of_websites_with_fingerprint.sql
  • Persistent identifiers (browser storage)

Cookies

  • top100_cookies_set_from_header.sql
  • percent_of_websites_with_iab_tcf_banner.sql
  • percent_of_websites_with_atsdd_schema.sql
  • percent_of_websites_with_cmp.sql
  • percent_of_websites_by_cmp.sql

Privacy policies

  • percent_of_websites_with_privacy_links.sql

@rviscomi rviscomi added the analysis Querying the dataset label Jul 30, 2020
@rviscomi rviscomi added this to the 2020 Analysis milestone Jul 30, 2020
@max-ostapenko max-ostapenko self-assigned this Jul 30, 2020
@max-ostapenko max-ostapenko linked an issue Jul 30, 2020 that may be closed by this pull request
10 tasks
@max-ostapenko max-ostapenko removed a link to an issue Jul 30, 2020
10 tasks
@rviscomi
Member

Be sure to update this from "Draft" to "Ready for review" so we can get more eyes on it

@rviscomi rviscomi marked this pull request as ready for review September 20, 2020 00:01
@rviscomi rviscomi requested a review from a team September 20, 2020 00:01
@rviscomi
Member

Hi @max-ostapenko can you give us an update on the status of this chapter's analysis? Are there only 2 queries?

@rviscomi
Member

rviscomi commented Oct 6, 2020

@max-ostapenko how's this one coming along? Do you need any help?

@max-ostapenko
Contributor Author

@rviscomi I lost my billing account last week by accident, so I spent a couple of days getting a new free trial credit.
Now I'm finishing the 3rd parties chapter.
I look forward to your review once I push the new queries.

@rviscomi rviscomi removed the request for review from a team October 13, 2020 16:20
@rviscomi rviscomi added the ASAP This issue is blocking progress label Oct 13, 2020
@foxdavidj
Contributor

@max-ostapenko How are the queries coming along? Is this something you think you can finish this week?

@max-ostapenko
Contributor Author

@OBTo @ydimova I'm finishing the 3rd parties chapter today, and expect to have all queries for the Privacy chapter drafted and the data available in Sheets within the next couple of days.

@rviscomi rviscomi mentioned this pull request Nov 3, 2020
15 tasks
@rviscomi
Member

rviscomi commented Nov 5, 2020

@max-ostapenko how is this analysis coming along? This is the last week to get the analysis in.

@rviscomi
Member

Hi @max-ostapenko. Checking in again on this PR. Do you think you can get this finished by the end of the day today?

@ydimova you've also expressed an interest in contributing to the analysis, are you able to help complete the remaining queries?

@max-ostapenko
Contributor Author

@rviscomi I've visualised half of the queries and am finishing the CMP and tracker stats.
Maybe you could have a look at the ones already committed?

@rviscomi
Member

rviscomi commented Nov 11, 2020

Hi @max-ostapenko, I've made some changes to the chart in the first sheet so that desktop and mobile are shown as separate series/colors:

image

Note that I hid the cell with the column name of the percentage to avoid it showing up in the chart with its raw SQL name.

Could you copy/paste that chart to the other sheets to retain the same formatting, and update them with the new sheets' data?

When do you expect to have the remaining queries implemented and ready for review? I'm very concerned about this chapter meeting the December 9 deadline given that it's been blocked on analysis for a while.

@max-ostapenko
Contributor Author

@rviscomi I still need some time to work out how to build a query on top of EasyList.

@rviscomi
Member

Is this the list? https://easylist.to/easylist/easylist.txt

The way I'd do it would be to parse the list in JS and generate something that can be processed by SQL, like an array of objects. The prefixes also look significant but I'm not sure what they mean. They can be fields of the object, for example: [{prefix: '@@', params: '&foo&bar'}]. Then in the SQL you can return that from a UDF to get it strictly typed. Unnest the list in SQL, join it with the request URLs, and you're on your way. Hope that helps.
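
A minimal sketch of that parsing step, not a finished implementation. The function name parseEasyList and the field names (exception, prefix, pattern, options) are hypothetical. It assumes the Adblock Plus filter conventions: "!" starts a comment, "@@" marks an exception rule, "||" anchors to a domain name, "$" introduces a comma-separated option list, and "##"-style rules are element-hiding rules that do not apply to request URLs. The sample rules are made up for illustration.

```javascript
// Hypothetical helper: turn raw EasyList text into an array of plain objects.
function parseEasyList(text) {
  const rules = [];
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    // Skip blank lines, "!" comments and the "[Adblock Plus 2.0]" header.
    if (line === '' || line.startsWith('!') || line.startsWith('[Adblock')) continue;
    // Skip element-hiding rules (##, #@#, #?#); they target page elements, not URLs.
    if (/#[@?]?#/.test(line)) continue;

    let rest = line;
    let exception = false;
    if (rest.startsWith('@@')) {
      exception = true;
      rest = rest.slice(2);
    }

    let prefix = '';
    if (rest.startsWith('||')) {
      prefix = '||';
      rest = rest.slice(2);
    } else if (rest.startsWith('|')) {
      prefix = '|';
      rest = rest.slice(1);
    }

    // Simplification: treat everything after the last "$" as the option list.
    // Rules written as raw regular expressions would need extra care here.
    let options = [];
    const dollar = rest.lastIndexOf('$');
    if (dollar !== -1) {
      options = rest.slice(dollar + 1).split(',');
      rest = rest.slice(0, dollar);
    }

    rules.push({ exception, prefix, pattern: rest, options });
  }
  return rules;
}

// Made-up sample in EasyList syntax.
const sample = [
  '[Adblock Plus 2.0]',
  '! Title: EasyList',
  '&ad_box_',
  '||example-ads.com^$third-party',
  '@@||example.com/ads.js$script,domain=example.org',
  '###ad-banner',
].join('\n');

const parsedRules = parseEasyList(sample);
console.log(JSON.stringify(parsedRules, null, 2));
```

The resulting array could then be serialized as JSON and returned from a UDF, as suggested above, so that SQL can unnest it.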

@ydimova
Contributor

ydimova commented Nov 13, 2020

@max-ostapenko here is an overview of the different rules and what they mean: https://adblockplus.org/forum/viewtopic.php?t=7702&start=0. If we can parse the rules correctly, we could, for instance, turn them into a list of regular expressions.
Another option is to use WhoTracksMe, which has the eTLD+1 of the trackers.
Let me know if you want me to give it a try for EasyList or WhoTracksMe.
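
To illustrate the regular-expression idea, here is a rough sketch that converts one URL filter pattern into a RegExp. The function name filterToRegExp is hypothetical, and the translation is only an approximation of the Adblock Plus matching rules: "*" is a wildcard, "^" is a separator (any character other than a letter, a digit, or one of _ - . %, or the end of the URL), "||" anchors at the start of a domain name, and a single "|" anchors at the start or end of the URL. It assumes any "$options" suffix has already been stripped.

```javascript
// Separator placeholder "^": a non-word character other than . % -, or end of URL.
const SEPARATOR = '(?:[^\\w.%-]|$)';
// "||" anchor: a scheme, "://", then an optional run of subdomain labels.
const DOMAIN_ANCHOR = '^[a-z][a-z0-9+.-]*://(?:[^/?#]+\\.)?';

// Hypothetical helper: approximate one filter pattern as a RegExp.
function filterToRegExp(filter) {
  let body = filter;
  let start = '';
  let end = '';

  if (body.startsWith('||')) {
    start = DOMAIN_ANCHOR;
    body = body.slice(2);
  } else if (body.startsWith('|')) {
    start = '^';
    body = body.slice(1);
  }
  if (body.endsWith('|')) {
    end = '$';
    body = body.slice(0, -1);
  }

  const middle = body
    // Escape regex metacharacters, leaving "*" and "^" for the next steps.
    .replace(/[.+?$|{}()[\]\\]/g, '\\$&')
    .replace(/\*/g, '.*')
    .replace(/\^/g, () => SEPARATOR);

  return new RegExp(start + middle + end, 'i');
}

// Made-up rule and URLs for illustration.
const domainRule = filterToRegExp('||example-ads.com^');
console.log(domainRule.test('https://cdn.example-ads.com/pixel.gif')); // true
console.log(domainRule.test('https://example-ads.company/'));          // false
```

Patterns in this form could also be emitted as strings for use with a SQL regular-expression function, though matching the WhoTracksMe eTLD+1 list would avoid the rule parsing entirely.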

@rviscomi
Member

rviscomi commented Nov 14, 2020

There are also ~3k known ad domains in the httparchive.almanac.third_parties table:

SELECT
  domain AS host
FROM
  `httparchive.almanac.third_parties`
WHERE
  date = '2020-08-01' AND
  category = 'ad'

You could join those with NET.HOST(request_url) to determine if it's an ad. cc @patrickhulce

@tunetheweb
Member

Can you merge main into this branch to pick up #1535? That will avoid the check errors that would otherwise happen the next time you change this branch.

@rviscomi
Member

@max-ostapenko @ydimova can you give a status update on this analysis? Friendly reminder that we're launching in ~2 weeks ☺️

@rviscomi
Member

I'm going to merge these queries as-is and any changes can be applied in a follow-up PR.

@rviscomi rviscomi merged commit 145fec4 into main Nov 28, 2020
@rviscomi rviscomi deleted the privacy-sql-2020 branch November 28, 2020 18:54
5 participants