Questions on Information Flow #5
Hi Michael, thank you for your reply and interest in SPARROW! Before going through the various questions - we think that you made a valid point and that having the Gatekeeper role assumed by those handling the auction could add additional privacy protections. In this context, what do you think of the following propositions:
Taking into account the above, please find below the answers to your questions:
That’s an assumption we also make. We think that it could be enforced through contractual agreements and audit procedures, but we would welcome any technical idea (cryptography?) that would further ensure the Gatekeeper's trustworthiness. The Gatekeeper role definitely needs to be discussed and developed through the next W3C discussions.
The contextual data for the interest group bid is computed by the browser, in accordance with the publisher policy. It should contain the page URL, the user-agent, and information about the ad (format, placement...). To make sure that the URL doesn't convey any user-identifying information, the browser or the Gatekeeper could redact it. An example: if the exact URL is lemonde.fr/specific_section/Specific_articles/userid=****, the browser or the Gatekeeper should only keep "lemonde.fr/specific_section/Specific_articles" in the IG request. Please note that the contextual bid, run through the ad network and the DSP, is fully independent from the interest group bid.
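The redaction step described above can be sketched in a few lines. This is a minimal illustration, not part of either proposal; the function name is hypothetical, and it simply drops the query string and fragment, the most common carriers of user-identifying tokens such as `userid=...`:

```python
from urllib.parse import urlsplit, urlunsplit

def trim_url_for_ig_request(url: str) -> str:
    """Keep only the scheme, host, and path of a URL, dropping the
    query string and fragment before the URL is placed in the
    interest-group request. Illustrative sketch only."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

# Hypothetical example in the spirit of the one above:
trimmed = trim_url_for_ig_request(
    "https://lemonde.fr/specific_section/Specific_articles?userid=1234"
)
# -> "https://lemonde.fr/specific_section/Specific_articles"
```

A real deployment would likely need stricter rules (e.g. trimming path segments that encode identifiers), since the path itself can carry PII.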
Yes, the Gatekeeper running the auction would actually solve this concern.
Our understanding is that in the case where the interest group bid wins, the publisher gets:
The advertiser gets:
Adding some noise in bid time reporting should be enough to prevent bridging the data between advertiser and publisher in almost all cases, making such an attack useless. Some similar corner cases could be found with TURTLEDOVE, but without any material impact for the proposal.
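As a sketch of the "noise in bid time reporting" idea, a reporting timestamp could be shifted by a random delay so that an observer cannot line reports up with the original requests. The function name, the uniform distribution, and the delay bound are all illustrative assumptions; SPARROW does not fix a noise mechanism:

```python
import random

def jitter_report_time(event_time_s: float, max_delay_s: float = 3600.0) -> float:
    """Return a noised reporting timestamp: the event is reported at a
    uniformly random point within the next `max_delay_s` seconds.
    Purely illustrative; not a specified part of SPARROW."""
    return event_time_s + random.uniform(0.0, max_delay_s)
```

The larger `max_delay_s` is, the harder the timing correlation becomes, at the cost of delayed reporting for legitimate uses.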
As we said above, the browser sends the request which contains contextual information. It is responsible for ensuring (by trimming the URL, down to the domain level should it be necessary) that no PII is contained in the request. Should the browser see actors systematically trying to share PII using SPARROW or TURTLEDOVE, it could choose (via a well-defined procedure) to prevent them from participating in interest group bids.
Bonjour Basile, hi Michael. Great proposals and discussion -- it feels like good progress is being made. I've been following both the TURTLEDOVE and SPARROW proposals with great interest. Quick comments on a couple of the points to add to the discussion:
If I'm not mistaken, this sounds more and more like certain SSPs/Exchanges could play the role of Gatekeeper, provided they keep the Interest Group based bid req path completely separate and isolated from the Contextual bid req path.
I believe the concern is that if the advertiser/DSP also gets access to a publisher-provided "1st party" user ID in the Contextual bid req path (sans Interest Group), there exists an attack vector where a bidding advertiser/DSP can try to link that pub-provided user ID (and all associated on-pub-site behaviours) as observed from the Contextual bid req path, with the Gatekeeper-sent Interest Group bid req events (and slightly delayed reporting events), which also now identify the publisher and context. Therefore, I'm assuming there's an implicit suggestion being made that the Contextual request path should not contain any personal identifiers, including 1st party/pub-provided "user IDs"?
As above... I think there's an assumption being made that the publisher-triggered Contextual request path shall contain no pub-provided User ID.
Hi @bmilekic, SSPs are, from our perspective, in a good position to step up and assume the role of gatekeepers. Other actors such as cloud providers could be interested as well. Another possibility is for current buyers to "split" into independent entities with a strict Chinese wall implementation. All in all, there are definitely several pre-existing actors that could assume this new role, which would provide a hefty dose of variety and competition, resulting in more innovation. I want to clarify one point: in SPARROW, contrary to TURTLEDOVE, there is only one request including contextual AND interest-based signals. In the diagram here, the contextual request with grey arrows passes through a direct relationship between the advertiser and the publisher, completely outside of the Privacy Sandbox (such direct contextual calls would also exist in a more complete TURTLEDOVE diagram, on top of those going through inside the privacy sandbox). The point you make about the potential attack vector is still a valid one. However, we think that the delay we propose would give this kind of attack such a low return on investment that it becomes irrelevant from a business perspective.
If possible, I'd like to keep the question "How can we trust the Gatekeeper?" separate from this issue. We'll certainly need to talk about who could be appropriately trustworthy. But I plan to focus on what we can design if we assume the Gatekeeper is trusted by browsers and ad tech alike.
@BasileLeparmentier This seems hard even in the case where it's unintentional — which evidently happens plenty, if the search results for [PII in URLs] are any indication. It would be much worse if the information were being deliberately hidden. I would much rather have a system in which the contextual ad request is allowed to contain all the context, including the real URL and any other first-party information the publisher wants to use. (Note that this is exactly the opposite of what @bmilekic said, but I think in line with @Pl-Mrcy's reply.) And more broadly, I don't want to put the browser or the Gatekeeper in the position of needing to police the information sent through some channel. Instead I want a design in which there just is no channel to join up information that needs to remain separate. Suppose we agree that (1) we don't want a way for the publisher to learn the interest groups of a visitor to their site, and (2) we don't want to police the contents of the contextual targeting URL. The only way to satisfy those requirements is if the Gatekeeper is the only server that gets to know both the URL and the interest group at the same time. Is there some variant of SPARROW that meets this bar?
I think we first need to call out whether the advertiser needs to know the publisher or, more granularly, the page (not necessarily the full url). I think we can solve for this either way, but we should start with what is needed. I don't think the solution lies in policing per se, but in having a protocol for providing only the necessary information for economic viability in a way that promotes the privacy goals.
We understand your concerns about how user information may leak using the contextual information in the request. Please be assured that we share your concerns and that we want to find the best solution for all actors. Even interest-based advertising requires some form of contextual data. Information about the "printing environment" (placement size, nature of the page content, etc.) is not a nice bonus but a must-have if we want the solution to be actually used by advertisers and publishers. The contextual information could be used for many different purposes:
We obviously want to curtail the first and champion the others, since they (among others) are essential to the ad business. It would still theoretically be possible to leverage it in a very limited way, and it indeed doesn't cover all possible cases on paper. However, we want to make sure that the attack is arduous enough to make it economically irrelevant. Although an attacker could theoretically work to expose the interest groups of some users and eventually succeed every so often, we can be assured that this attack won't ever occur at scale, thus making the cost-benefit ratio strongly unfavourable and eroding the very motive for such attacks. We are working on putting figures on the actual privacy risks associated with different levels of granularity for publisher information/latency, in a way that would be easily replicable by other parties. Is this line of reasoning acceptable to you? Or are you keen to accept only a solution that would present no theoretical breach, no matter how small, and would cover ALL cases by technical means only?
I'm confused about what you think requires less privacy here. Take the use case of brand safety for publisher and advertiser. This is explicitly within scope even for TURTLEDOVE; I don't see why SPARROW changes any of this. Quoting from my original explainer:
Publisher brand safety could work the same way: the metadata about an interest-group-targeted ad includes its topics (as determined by a server, just like today), publisher controls let them pick what topics are allowed/blocked, and on-device JS compares the two. The key point here is that the sell-side contributes contextual topics, and rules about ad topics, while the buy-side contributes ad topics, and rules about publisher context. Sure, maybe the DSP wants to evaluate the URL on its own and not trust the SSP; it can do that as part of the contextual ad request. Likewise the SSP may want to run its own analysis of the creative, and it can do that during some review, just as it does today. None of that needs to change. At some point these two things need to be joined, with each set of rules evaluating the corresponding state. Whether that's implemented in JS (TURTLEDOVE) or on a trusted server (SPARROW) doesn't change the fact that it can be done without giving any information back to the publisher or advertiser. I understand many advantages of moving things to a trusted server — for example, freedom to have large ML models, real-time adjustment of campaigns, and not needing to expose your decision logic to competitors. Those are all clear benefits. But if we worked out the server trust question, it seems like you can get all of those without offering new opportunities for tracking.
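The two-sided check described above can be sketched as a single comparison. All names here are illustrative assumptions, not APIs from either proposal; the point is only that each side contributes its own topics and rules, and the join can happen without reporting anything back:

```python
def ad_allowed(ad_topics: set,
               page_topics: set,
               publisher_blocked_ad_topics: set,
               advertiser_blocked_page_topics: set) -> bool:
    """Sketch of the two-sided brand-safety check: the sell side
    supplies page topics plus rules about ad topics, the buy side
    supplies ad topics plus rules about page context. Illustrative
    only; not code from TURTLEDOVE or SPARROW."""
    if ad_topics & publisher_blocked_ad_topics:
        return False  # publisher blocks ads on these topics
    if page_topics & advertiser_blocked_page_topics:
        return False  # advertiser avoids pages on these topics
    return True
```

Whether this comparison runs in on-device JS or on a trusted server, only the boolean outcome needs to influence the auction.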
I am glad that we agree on the benefits brought by a gatekeeper. Our perspective is that brand safety is two-fold: there is a live component (at bidding time) and a reporting element. One cannot work without the other. If you can't observe any "wrong-doing" in the reporting, you cannot properly update the rules or block-list applied at bid time. Let me take two simple examples to highlight where the system you describe wouldn't be enough:
These two examples particularly underline the fact that brand safety cannot be handled with an "on average" policy. Even one case could go viral and damage the advertiser's brand or the publisher's brand. Currently, the availability of detailed information at the display granularity ensures that any infringement upon the defined policy (by a deliberate attacker or by mistake) can be spotted and those responsible held accountable.
Based on your description, it seems like this has nothing to do with the question of whether the auction happens on a trusted server vs in the browser. In both of your examples, the goal is the publisher seeing a report of "all interest group ads that appeared on my site." The aggregate reporting API would already offer that capability for any interest group that triggers showing enough ads on the site. Your worry is about ads that appear very few times (below some aggregate reporting threshold) and also are mis-classified by the ad servers responsible for filtering ad eligibility. But it seems to me that TURTLEDOVE puts the publisher in a better position to handle this threat than in the current RTB market, for two different reasons:
These guarantees, which the browser can be sure are true and which dramatically improve accountability, seem to me like a huge win to offset the risk of a mis-classified ad campaign that happens to show a single-digit number of impressions on a site. |
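The thresholding behaviour under discussion can be made concrete with a small sketch. The function name and the default threshold are illustrative assumptions (the thread itself asks whether the threshold "would be around 10"); an aggregate reporting API would only release a (site, interest group) row once its count clears the threshold:

```python
from collections import Counter

def thresholded_report(events, k=10):
    """Aggregate (site, interest_group) impression counts and drop
    any row whose count is below the threshold k, as an aggregate
    reporting API might. Illustrative sketch; the real API and
    threshold value are not specified in this discussion."""
    counts = Counter(events)
    return {key: n for key, n in counts.items() if n >= k}
```

Rows suppressed by the threshold are exactly the "single-digit number of impressions" cases the comment above weighs against the accountability gains.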
Whether the auction happens on a trusted server vs in the browser has nothing to do with it, indeed. We really like this idea of a central ad-control dashboard such as the one you described above.
Are you saying that the threshold for reporting would be around 10? Lastly, the risk here is that there could be many campaigns with |
We published an analysis on the impact of thresholds on publisher and advertiser reporting in this repository. We added the analysis pseudo-script so that other actors in the industry can run it on their own data.
Thank you for your attention to TURTLEDOVE and your desire to improve on it! To help me understand SPARROW, let me ask some questions about the flow of information between the browser, the Gatekeeper, and the ad network.
For the purposes of this Issue, I'll assume that everyone agrees that the Gatekeeper is perfectly trustworthy.
Does this mean the "contextual data" is all computed by the browser and the publisher's ad network? In TURTLEDOVE, by contrast, it's possible for each DSP to learn the URL of the page and compute its own contextual signals (discussion).
First, this seems like a very large information channel, probably hundreds of bits. What prevents the set of bids from encoding lots of data that we are trying to keep private from the ad network?
Second, it seems like the ad network sees these bids (from the Gatekeeper) and the contextual request (directly from the browser) at nearly the same moment. It seems like it would be straightforward to match the two up, since the bids can be influenced by (and therefore can encode) contextual signals. This would make the information leak in the previous paragraph even worse.
If the Gatekeeper is already being trusted to run the ad network's code faithfully and keep it secret (when producing the bids), could it run the auction code that selects the winning bid as well?
This event-level reporting seems like another opportunity for the ad network to join contextual information with interest group membership directly (or to conclusively join the two ad requests of a minute earlier, if that hasn't already been done).
TURTLEDOVE deals with this by only allowing aggregated reporting for information derived from both contextual and interest-group information, and we've had some discussion about the latency. But it doesn't sound like this is just about an hour vs a minute, since you later say "reporting data at the display level, with the interest group and publisher information, allows for the advertisers to learn better ML models".
It seems like knowing the interest group, the publisher, the bid, and the minute is more than enough to know the exact event, and so join the interest group with the user's publisher-site identity.
Can you see any way to avoid this?
Maybe this sentence sums up part of my worries about this proposal. The publisher knows the user's first-party identity ("Hi! I'm NYTimes subscriber 12345!"). All publisher data could be influenced by this: any signals that feed into bidding, the contents of the page, even the URL might contain PII.
I don't see any way to separate publisher data from user-level information. Are you looking at something differently?