Choice of parameters in the aggregation API #249
Comments
Thanks for filing:
Feel free to add to the agenda if you want to discuss in the meeting (https://bit.ly/ara-meeting-notes). |
Hi,
As a side note, we are aware that it is possible to sample the keys so as to keep only "a small number of keys" on each display, and that the noise would be scaled up anyway when the number of keys is large. To summarize, this restriction to a "small number of histogram contributions" might cause yet another big performance drop for the models we could learn; I am personally not confident that, at the end of the road, the models would be "good enough" to sustain a viable ecosystem. |
Thanks @AlexandreGilotte! Yes, I agree the ML use-case is a clear case where it seems useful to contribute to many aggregate keys at once. I am happy to consider changes to the API to make sure that the use-case can be done in a performant and private way. One thing we should think about is how to make sure the entire system knows how to scale the noise. Our current design only has a single L1 sensitivity, so some of the techniques necessary for more advanced DP composition don't really work assuming worst-case input. This has great benefits in terms of simplicity (you can allocate the L1 budget however you want, and the system can be oblivious to it), but it does not maximize utility for this use-case. This is on the agenda for the meeting today, so let's try to get into these problems :) |
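To make the trade-off above concrete, here is a minimal sketch assuming Laplace noise calibrated to a single per-event L1 budget. The budget value, the epsilon, and the function names are illustrative assumptions and are not taken from this thread or from the API.

```python
import math

L1_BUDGET = 65536.0  # assumed per-event L1 contribution cap (illustrative)
EPSILON = 1.0        # assumed privacy parameter (illustrative)


def laplace_noise_std(l1_budget: float, epsilon: float) -> float:
    """Standard deviation of Laplace noise with scale b = l1_budget / epsilon."""
    return math.sqrt(2.0) * l1_budget / epsilon


def relative_noise_per_key(num_keys: int) -> float:
    """Noise std divided by one event's per-key value when the event
    splits its L1 budget evenly over num_keys keys."""
    per_key_value = L1_BUDGET / num_keys
    return laplace_noise_std(L1_BUDGET, EPSILON) / per_key_value


for k in (1, 3, 100):
    print(k, relative_noise_per_key(k))
```

Under these assumptions the noise per bucket does not depend on how the budget is split, so the noise relative to a single event's per-key contribution grows linearly with the number of keys.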
We discussed this in the call yesterday (minutes). I raised two concerns with supporting events contributing to many buckets (~hundreds):
|
Thanks for your answer.
|
|
On a loosely related subject, it seems to me that Gaussian noise (made worthwhile by the Linf bound) would be uniquely suited to MPC, as the two (or more) servers could independently add their noise, and the sum is again Gaussian. A quick look at appendix E of the DPF paper shows they use a sum of Laplace noises, which does not have this summing property. |
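The summing property mentioned here can be sketched as follows: if each of n servers adds independent Gaussian noise with standard deviation sigma / sqrt(n), the total noise is Gaussian with standard deviation sigma. The function names are hypothetical, and the sketch says nothing about how sigma should be calibrated.

```python
import math
import random


def per_server_sigma(total_sigma: float, num_servers: int) -> float:
    """Std each server uses so that the summed noise has std total_sigma."""
    return total_sigma / math.sqrt(num_servers)


def noisy_sum(true_value: float, total_sigma: float,
              num_servers: int, rng: random.Random) -> float:
    """Add one independent Gaussian noise share per server to true_value."""
    share_sigma = per_server_sigma(total_sigma, num_servers)
    noise = sum(rng.gauss(0.0, share_sigma) for _ in range(num_servers))
    return true_value + noise


rng = random.Random(0)
print(noisy_sum(1000.0, total_sigma=3.0, num_servers=2, rng=rng))
```

One caveat: a server that knows its own share sees a result carrying only the other servers' noise, so whether each server should add a share or the full amount depends on the threat model, which this thread does not settle.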
Yeah that's exactly right. The L2 sensitivity is just sqrt(L1*Linf).
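As a quick check of that formula: for any contribution vector, the sum of squares is at most max|x_i| times the sum of |x_i|, so the L2 norm is at most sqrt(L1*Linf), and the bound is attained by putting Linf into L1/Linf buckets. The numbers below are arbitrary illustrative values.

```python
import math


def worst_case_l2(l1_bound: float, linf_bound: float) -> float:
    """Worst-case L2 norm of a vector with the given L1 and Linf bounds."""
    return math.sqrt(l1_bound * linf_bound)


def l2_norm(values: list[float]) -> float:
    return math.sqrt(sum(v * v for v in values))


l1_bound, linf_bound = 64.0, 4.0

# Extreme case: the maximum per-bucket value in L1 / Linf = 16 buckets.
extreme = [linf_bound] * int(l1_bound / linf_bound)
# Spread-out case: same L1 mass, but a smaller value in each of 64 buckets.
spread = [1.0] * 64

print(l2_norm(extreme), l2_norm(spread), worst_case_l2(l1_bound, linf_bound))
```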
I don't think the DPF paper is doing the optimal approach. If both servers sampled from the difference of two Polya RVs the sum would be distributed according to Discrete Laplace (or two-sided geometric). This link has more details. I don't think MPC constrains the design choices here. |
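A minimal sketch of that construction, under the following assumptions: each of n servers draws the difference of two Polya (negative binomial with real shape 1/n) variables with parameter alpha = exp(-epsilon / sensitivity), so that the n shares sum to a difference of two geometric variables, i.e. a discrete Laplace variable. The function names are hypothetical, and the Poisson step uses a simple method that is only suitable when alpha is not close to 1 (small sensitivity relative to epsilon).

```python
import math
import random


def sample_polya(shape: float, alpha: float, rng: random.Random) -> int:
    """Sample k >= 0 with P(k) proportional to Gamma(k + shape) / k! * alpha**k.

    Uses the gamma-Poisson mixture: lam ~ Gamma(shape, scale=alpha / (1 - alpha)),
    then k ~ Poisson(lam).
    """
    lam = rng.gammavariate(shape, alpha / (1.0 - alpha))
    # Knuth's multiplication method; adequate only for the small rates used here.
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def server_noise_share(num_servers: int, epsilon: float,
                       sensitivity: float, rng: random.Random) -> int:
    """One server's share: the difference of two Polya(1 / num_servers) draws."""
    alpha = math.exp(-epsilon / sensitivity)
    shape = 1.0 / num_servers
    return sample_polya(shape, alpha, rng) - sample_polya(shape, alpha, rng)


def total_noise(num_servers: int, epsilon: float,
                sensitivity: float, rng: random.Random) -> int:
    """Sum of all servers' shares; distributed as discrete Laplace."""
    return sum(server_noise_share(num_servers, epsilon, sensitivity, rng)
               for _ in range(num_servers))


rng = random.Random(0)
print([total_noise(2, epsilon=1.0, sensitivity=1.0, rng=rng) for _ in range(10)])
```

The total should have mean 0 and variance 2 * alpha / (1 - alpha)**2, which is the variance of the discrete Laplace distribution with P(k) proportional to alpha**|k|.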
Revisiting this, I am wondering if it is feasible for parties to advertise via some global configuration what kind of sensitivity bounding they are interested in. This would have to be global because we'd need all users to obey the same constraints, and have noise applied in a uniform way downstream in the aggregation service. This would introduce a lot of complications, but it seems like a technique that would generally work without just picking a place in the constraint-space that's a middle ground position. |
I am not sure I follow your proposal. What do you mean by parties and users here? |
Parties: reporting origins. Essentially I am thinking of a speculative new mechanism where e.g. criteo.com hosts a file saying "Please bound my contributions such that the L2 sensitivity <= xxx". Then browsers have a mechanism to read these files and apply the appropriate sensitivity bounds on the user contributions, instead of the default L1 bounds we have in the API today. This would increase the contributions allowed per user in cases when you are guaranteed to "spread it out" across multiple buckets. |
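To illustrate why an L2 bound would allow more when contributions are spread out, here is a hypothetical sketch. Nothing in it is an existing browser mechanism: the function names and the bound value are assumptions made up for illustration.

```python
import math


def max_per_bucket_under_l1(bound: float, num_buckets: int) -> float:
    """Largest equal per-bucket value when sum(|x_i|) <= bound."""
    return bound / num_buckets


def max_per_bucket_under_l2(bound: float, num_buckets: int) -> float:
    """Largest equal per-bucket value when sqrt(sum(x_i**2)) <= bound."""
    return bound / math.sqrt(num_buckets)


def fits_l2_bound(contributions: dict[str, float], l2_bound: float) -> bool:
    """Hypothetical browser-side check against an advertised L2 bound."""
    norm = math.sqrt(sum(v * v for v in contributions.values()))
    return norm <= l2_bound


BOUND = 100.0  # hypothetical advertised bound
for k in (1, 4, 100):
    print(k, max_per_bucket_under_l1(BOUND, k), max_per_bucket_under_l2(BOUND, k))
```

With the same numeric bound, spreading over k buckets allows sqrt(k) times more per bucket under L2 than under L1. The comparison is only indicative: noise calibrated to an L2 bound (for example Gaussian) is not on the same scale as Laplace noise calibrated to an L1 bound, so the net gain is smaller than sqrt(k) by a factor that depends on the privacy parameters.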
Hi,
Thanks again for all the proposals and the interesting discussions. We believe the conversion measurement API has great potential. There are a few limits we would like to understand though, especially in the aggregation API.
In the explainer for the aggregation API, there are limitations on the contributions which can be made to the histogram. While the need for a cap on the L1 value is obvious for differential privacy to work, we do not understand the limit on the number of contributions that can be made (small, e.g. 3).
Also, why should the aggregate report be scoped on the tuple (source_site, attribution_destination)? We believe that there are great benefits in aggregating reports from multiple source_site values (or attribution_destination values, depending on the use case) in a single request, to lower the overall level of noise.
Thanks a lot for your answer, and we would be happy to discuss this live in the next meeting if needed.
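The noise argument for aggregating across several source sites in one request can be sketched as follows, assuming each (source_site, attribution_destination) scope is noised independently with Laplace noise of the same scale. The scale value and function names are illustrative assumptions.

```python
import math


def laplace_std(scale: float) -> float:
    """Standard deviation of a Laplace variable with the given scale."""
    return math.sqrt(2.0) * scale


def std_of_summed_queries(num_queries: int, scale: float) -> float:
    """Noise std when num_queries independently noised results are summed."""
    return math.sqrt(num_queries) * laplace_std(scale)


SCALE = 1.0  # illustrative Laplace scale, identical for every scope
for m in (1, 25, 100):
    # Summing m per-scope results vs. one request covering all m scopes.
    print(m, std_of_summed_queries(m, SCALE), laplace_std(SCALE))
```

Under these assumptions, summing m separately noised results has sqrt(m) times the noise of a single request covering the same reports.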