Design protocol for determining crawl accuracy over time #136

Open
SebastianZimmeck opened this issue Sep 18, 2024 · 5 comments

@SebastianZimmeck
Member

@katehausladen provided some initial accuracy analysis, as shown in our draft paper (section 3.5). Starting with the September crawl (#118), we should come up with a protocol for manually checking 100 randomly selected sites per crawl going forward to verify that the crawl results are accurate. As we crawl over longer periods of time, we might otherwise see a drift in accuracy, for example, due to code changes or site changes, and we should therefore keep an eye on it.

I am particularly concerned about the following:

  • uspapi_before_gpc, uspapi_after_gpc
  • usp_cookies_before_gpc, usp_cookies_after_gpc
  • OptanonConsent_before_gpc, OptanonConsent_after_gpc
  • gpp_before_gpc, gpp_after_gpc
  • usps_before_gpc, usps_after_gpc
  • USPS implementation
  • error
  • Well-known

A few comments:

  • @katehausladen came up with a strategy to check the ground truth while the crawl was running. That was useful because site loads may differ in terms of which ad networks and other third parties a site loads from one run to another. If we instead do a post-hoc analysis after the crawl, we cannot do that because the crawl has already happened. I think that is OK because I am less concerned about the urlClassification results than about the values above. Question: do the above change from load to load? I do not think they should since, for example, a site should set the OptanonConsent cookie on every load. But we need to check that.
  • Also, how do we ensure that our random selection covers all the cases? Having one instance of the GPP implementation behaving correctly is not meaningful; we should aim for around ten instances per condition. So, how do we select random sites while ensuring sufficient instances? Our set of sites is skewed. For example, most sites do not have a GPP string, so a purely random selection may not lead to sufficient coverage. Maybe randomly select 10 sites from the set of sites that have a GPP implementation per the crawl to confirm positive instances, and 10 from the set of non-GPP sites to confirm negative instances (see the sketch after this list). The analysis is further complicated by different sub-conditions, for example, the USP API opting out after receiving a GPC signal vs. the USP API already being opted out before receiving a GPC signal. In the first case, we have an instance of a string change that our crawl needs to capture accurately.
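
A minimal sketch of that stratified selection, assuming the crawl output is a CSV with hypothetical gpp_before_gpc and site_url columns (the real schema may differ):

```python
import pandas as pd

# Hypothetical file and column names; adjust to the actual crawl output schema.
df = pd.read_csv("crawl_results.csv")

# Split sites by whether the crawl observed a GPP string before sending GPC.
gpp_positive = df[df["gpp_before_gpc"].notna()]
gpp_negative = df[df["gpp_before_gpc"].isna()]

# Draw up to 10 sites from each stratum; a fixed seed keeps the draw reproducible.
sample = pd.concat([
    gpp_positive.sample(n=min(10, len(gpp_positive)), random_state=42),
    gpp_negative.sample(n=min(10, len(gpp_negative)), random_state=42),
])

sample["site_url"].to_csv("manual_check_gpp_sample.csv", index=False)
```

The same per-stratum draw could be repeated for each of the other conditions listed above.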

The bottom line is, we need a protocol that allows us to check the analysis accuracy of our different conditions (including sub-conditions) for every crawl to keep track of analysis accuracy over time. Since we need to do it every crawl and it involves manual work, it should be manageable time-wise but also meaningful.

@natelevinson10 will take the lead here and work with @franciscawijaya and @eakubilo before starting the next crawl.

@SebastianZimmeck added the crawl (Perform crawl or crawl feature-related) label on Sep 18, 2024
@natelevinson10
Collaborator

Did a quick review of the manual data we collected a couple of weeks ago, targeting instances of a mismatch (marked in red) where our ground truth differed from the crawl data. I used several VPN locations (California, multiple Colorado, Virginia, and no VPN (CT)) and gave ample time to let all of the site content load.

I was not able to find a single instance of our manual data changing from what we had reported, except for bumble.com's USPapi_before being "1YNN" instead of the reported "1YYN", which I would chalk up to a manual error on our end. It would seem that for the mismatches between crawl and manual data, the manual data is more accurate.

As for the comments left by @SebastianZimmeck above, cookies do seem to load on every refresh; I have yet to find an instance where an OptanonConsent cookie does not load where it should. I plan to do some more testing over the next few days to be certain. As for our site sample skew, I believe it could be worth having a subset of websites we know to have GPP/OTGPPConsent data. One thought is to compile a list of websites we know to exhibit all the behaviors we need (e.g., the USP API opting out after receiving a GPC signal vs. already being opted out before receiving a GPC signal, etc.), as these are crucial in our crawl list to get a holistic representation of results. I plan to see if there is a list or directory of websites with certain attributes that could simplify the search for these websites, if that is something we choose to do.
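
For the load-to-load question, a minimal sketch of a repeated-load check with Selenium, assuming a hypothetical list of sites that the crawl reported as setting OptanonConsent; it reloads each site a few times and reports whether the cookie appears on every load:

```python
import time
from selenium import webdriver

# Hypothetical site list; fill in with sites the crawl reported as setting OptanonConsent.
SITES = ["https://example.com"]
LOADS_PER_SITE = 3

driver = webdriver.Firefox()
try:
    for url in SITES:
        for i in range(LOADS_PER_SITE):
            driver.get(url)
            time.sleep(10)  # give ad networks and the CMP time to set cookies
            names = {c["name"] for c in driver.get_cookies()}
            print(f"{url} load {i + 1}: OptanonConsent set = {'OptanonConsent' in names}")
            driver.delete_all_cookies()  # start the next load of this site clean
finally:
    driver.quit()
```

A site whose output flips between loads would be a candidate for the drift @SebastianZimmeck is worried about.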

@SebastianZimmeck
Member Author

Thanks, @natelevinson10!

@SebastianZimmeck
Member Author

As discussed today, @franciscawijaya and @natelevinson10 will come up with a protocol for selecting 100 sites for a manual spot check of the first batch, with sufficient coverage (say, at least 5 positive instances, if possible) for each item we test for.

@franciscawijaya and @natelevinson10, you can write the protocol here in the issue for the time being.

@natelevinson10
Collaborator

natelevinson10 commented Nov 3, 2024

To assess the accuracy of the crawl data across the crawl as a whole, our protocol should focus on selecting a representative, stratified sample of 100 sites for manual review. The plan is to run a crawl batch and select 100 sites to review via the constraints below (a sketch of how the sample could be assembled follows the list). We will then compare the crawl results to our manual review.

  • Ensure 15 instances of the USP API, USP cookie, and OptanonConsent data points changing from before to after GPC. This accounts for 30% of our site list and will give us ample information on false positives/negatives, as these are the most common instances of data.
  • Ensure 15 instances of gpp before/after. In our CO crawl, we only had 3 instances, which is not enough to draw any conclusion about accuracy.
  • Ensure 15 instances of OTGPPConsent before/after. Same as above, but with only 2 instances.
  • The next 40 sites will be selected randomly from alternating blocks of the crawl list (i.e., the first 10 from sites 1-200, the next 10 from 201-400, the next 10 from 401-600, and so on). This will make sure we pull sites from all points of progression in the crawl.
  • The final 15 sites will be sites flagged with human check errors. This is to address our "edge cases" and confirm that sites flagged with human check errors are not slipping through unaccounted for, and that any data mismatches are caught.
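
A rough sketch of how such a stratified sample could be assembled from the crawl output. The file name, column names (site_url, the before/after value columns, and an error flag), and the assumption of a roughly 800-site crawl are all placeholders; the real selection would follow whatever the actual schema and crawl size are:

```python
import pandas as pd

# Hypothetical file and column names; adjust to the actual crawl output schema.
df = pd.read_csv("crawl_results.csv")

def draw(frame, n, seed=0):
    """Sample up to n rows from a stratum (fewer if the stratum is small)."""
    return frame.sample(n=min(n, len(frame)), random_state=seed)

def changed(before, after):
    """Rows whose value differs between the before-GPC and after-GPC columns."""
    return df[df[before].fillna("") != df[after].fillna("")]

buckets = [
    draw(changed("uspapi_before_gpc", "uspapi_after_gpc"), 15),            # USP API / cookie / OptanonConsent changes
    draw(changed("gpp_before_gpc", "gpp_after_gpc"), 15),                  # GPP string changes
    draw(changed("OTGPPConsent_before_gpc", "OTGPPConsent_after_gpc"), 15),
    draw(df[df["error"].notna()], 15),                                     # sites flagged with human check errors
]

# 40 sites spread across the crawl order: 10 from each consecutive block of 200 sites.
spread = [draw(df.iloc[start:start + 200], 10) for start in range(0, 800, 200)]

review = pd.concat(buckets + spread).drop_duplicates(subset="site_url").head(100)
review.to_csv("manual_review_sample.csv", index=False)
```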

After compiling this list of 100 sites, we will manually check them in a similar fashion as we did with the CO crawl here, using the same methodology for verifying the manual results here. This is the initial plan @franciscawijaya and I reviewed; let us know what you think.

@Mattm27
Member

Mattm27 commented Feb 18, 2025

The methodology outlined above provides a strong approach by ensuring a randomized, stratified sample that is representative of the entire crawl. The constraints balance random selection with targeted verification of key values. However, it may be beneficial to adjust the methodology slightly depending on the output and what the complete crawl data looks like. For example, if OTGPPConsent appears in only a very small portion of the data, we may decide to skip manual checking of those instances and focus on more prevalent conditions. We will also include manual checks of the .well-known data.

  • Ensure 15 instances of the USP API, USP cookie, and OptanonConsent data points changing from before to after GPC. This accounts for 30% of our site list and will give us ample information on false positives/negatives, as these are the most common instances of data.
  • Ensure 15 instances of gpp before/after. In our CO crawl, we only had 3 instances, which is not enough to draw any conclusion about accuracy.
  • Ensure 15 instances of OTGPPConsent before/after. Same as above, but with only 2 instances.
  • The next 40 sites will be selected randomly from alternating blocks of the crawl list (i.e., the first 10 from sites 1-200, the next 10 from 201-400, the next 10 from 401-600, and so on). This will make sure we pull sites from all points of progression in the crawl.
  • The final 15 sites will be sites flagged with human check errors. This is to address our "edge cases" and confirm that sites flagged with human check errors are not slipping through unaccounted for, and that any data mismatches are caught.

The sample size of just 10 sites discussed in last week's meeting would not be sufficient to draw meaningful conclusions about accuracy, as I don't think it would capture enough variability across different conditions. However, we could explore whether a slightly reduced sample of around 50-75 sites instead of 100 would still provide reliable insights while making the manual review process more efficient.
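
For a rough sense of what different sample sizes buy us, here is an illustrative margin-of-error calculation (95% confidence, normal approximation), assuming each manually checked site is treated as an independent pass/fail observation for a given condition:

```python
import math

# Approximate 95% margin of error for an observed accuracy p at sample size n.
for n in (10, 50, 75, 100):
    for p in (0.90, 0.95):
        moe = 1.96 * math.sqrt(p * (1 - p) / n)
        print(f"n={n:3d}, accuracy={p:.2f}: +/- {moe:.3f}")
```

Under these assumptions, at n=50 the margin is roughly ±0.08 for a 90% observed accuracy versus roughly ±0.06 at n=100, so a reduced sample mostly costs precision rather than validity, while n=10 gives a margin close to ±0.19.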

Since we haven’t run a full crawl since the fall, moving forward with this methodology is the best way to verify the accuracy of the crawl results currently being collected by @samir-cerrato. Running this manual check now will provide a baseline for future comparisons and help us detect any drift in accuracy later on. The manual check site list should not remain constant across months but should instead change based on the output data of each crawl.

We have laid out a systematic way to execute these manual checks from our CO crawl to ensure consistency in our approach here. To help track discrepancies over time, I can put together a new sheet in the GPC folder that logs the discrepancies found each month and lets us monitor how they change from crawl to crawl.
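
One possible structure for that log, sketched as a small helper that appends rows to a shared CSV; the column names and file path are placeholders, and a Google Sheet with the same columns would work just as well:

```python
import csv
import os
from datetime import date

# Hypothetical schema for the monthly discrepancy log (one row per mismatch found).
FIELDS = ["crawl_month", "site_url", "field", "crawl_value", "manual_value", "notes"]

def log_discrepancy(path, site_url, field, crawl_value, manual_value, notes=""):
    """Append one discrepancy row to the CSV log, writing a header if the file is new."""
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerow({
            "crawl_month": date.today().strftime("%Y-%m"),
            "site_url": site_url,
            "field": field,
            "crawl_value": crawl_value,
            "manual_value": manual_value,
            "notes": notes,
        })

# Example entry based on the bumble.com mismatch mentioned earlier in this thread.
log_discrepancy("discrepancy_log.csv", "bumble.com", "uspapi_before_gpc", "1YYN", "1YNN")
```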
