Accommodating randomized inputs (a la DP) #19
This seems orthogonal to the core protocol logic, since its applicability depends on the data type.
There's an open question that fell out of the review of #91 that feels appropriate for this issue. Aggregators are encouraged to apply differential privacy noise to outputs. Collectors and aggregators may need to agree on DP parameters (e.g., the probability with which values are flipped) so that collectors can de-noise outputs. I think we might need to have a field in […]
It's worth noting that in many situations (especially with server-added noise), the collectors don't need to do anything special to "de-noise", since the noise often has mean 0. However, collectors may want to know things like the variance of the noise, or which distribution the noise was sampled from, so they can understand the effective error.
I was originally going to post this as a new issue, but I think this one should suffice. Originally titled "Optional DAP configuration that achieves differential privacy". Apologies in advance for the wall of text; I hope it is useful to the discussion.

Differential privacy (DP) is a useful property of an aggregation system. Among other things, it is a mitigation against Sybil attacks that aim to violate user privacy, because differential privacy holds under worst-case assumptions, including when the adversary knows the contents of the entire database except for the targeted user. In other words, malicious clients submitting data to the system do not adversely affect the privacy (in this worst-case sense) of any honest user.

We should consider whether we want to specify an optional configuration in DAP which deployments can use to achieve DP output. @cjpatton mentions here it would be useful, so here are my thoughts. DP typically involves noise addition, and we ought to consider two flavors, since I think both are relevant and useful to this work:

1. Clients add noise to their input

When clients add noise to their input, it is possible to show the system is differentially private (in the "local" model) without any downstream changes in DAP or VDAF. This is a trivial result, though, and does not take advantage of anything relevant to DAP. The security model of DAP is that input is sent to the system through input shares, which eliminates any knowledge in the system of which client sent which report. I believe that, by using a similar line of argument as the ENPA white paper (section 4.5), we can show that using local noise in DAP should always achieve differential privacy in the shuffle model, without any change to the protocol (or restriction on VDAF). This yields much stronger privacy bounds than purely local DP. The bound depends on the number of users contributing to a given batch, so increasing the minimum batch size should increase privacy.

Side note: with some specific VDAFs and some specific types of local noise, you could show stronger privacy bounds than shuffle DP, even if noise is only added on the client. This is similar to the case of each aggregator adding independent noise, where you consider the client to be a 1-record aggregator.

Side note: if clients add noise to their input, it puts the "verifiable" aspect of VDAF at risk, since the input is falsified from the very start. However, we can still add noise such that the output is well-bounded, and downstream VDAFs can expect data within the expected bounds.

2. Noise is added during aggregation

There are a couple of ways I could see specifying this. This is an example of central DP, or more specifically distributed DP, since it is done via distributed noise addition.

2A. VDAFs add noise

The idea here is that a particular VDAF could be implemented such that noise is added to each […]

For example, n aggregators adding noise distributed according to difference-of-Polya distributions would, summed together, yield noise distributed according to a two-sided geometric distribution (link, theorem 5.1), which we can show central DP bounds for. It is worth mentioning that in these schemes you can often show (reduced) DP bounds even if only a subset of the aggregators are honestly adding noise, which is exactly the kind of property you want out of DAP. In general, this is possible when the final distribution you want the output noise to follow is infinitely divisible, or has good infinitely divisible approximations.
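To make the infinite-divisibility idea concrete, here is a minimal sketch (my own illustration, not from any spec or implementation) of n aggregators each contributing a difference-of-Polya noise share, so that the summed noise follows a two-sided geometric distribution calibrated for epsilon-DP on a sensitivity-1 count. The function names, the use of numpy, and the Gamma-Poisson sampling trick are assumptions made for this example:

```python
# Illustrative sketch only: distributed two-sided geometric noise via
# difference-of-Polya shares. Names (epsilon, sensitivity, n_aggregators)
# are assumptions for this example, not DAP/VDAF spec parameters.
import numpy as np

rng = np.random.default_rng()

def polya_sample(r, alpha):
    """Negative binomial NB(r, 1 - alpha) for possibly non-integer r,
    sampled as a Gamma-Poisson mixture."""
    lam = rng.gamma(shape=r, scale=alpha / (1.0 - alpha))
    return rng.poisson(lam)

def aggregator_noise_share(epsilon, sensitivity, n_aggregators):
    """One aggregator's additive noise share. Summing the shares of all
    n_aggregators yields two-sided geometric noise with parameter
    alpha = exp(-epsilon / sensitivity)."""
    alpha = np.exp(-epsilon / sensitivity)
    r = 1.0 / n_aggregators
    return int(polya_sample(r, alpha)) - int(polya_sample(r, alpha))

# Example: 2 aggregators, a counting query (sensitivity 1), epsilon = 1.
shares = [aggregator_noise_share(1.0, 1.0, 2) for _ in range(2)]
total_noise = sum(shares)  # distributed as two-sided geometric noise
print(shares, total_noise)
```

If only one of the two aggregators honestly adds its share, the output still carries NB(1/2, 1-alpha) noise, which is where the "reduced bounds under partial honesty" property comes from.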
VDAFs should probably not make any assumptions about their input shares (i.e. how many are coming from which users, etc.). However, if they know both: […]

This means a DP-aware VDAF could parametrize itself with a new […]

With VDAFs supporting record-level DP, it should be easy to achieve user-level DP in DAP. The easiest way would be to include a new constraint that each user produces exactly the same number of inputs as any other user (say, N inputs) in a given batch (see #210 also). It is straightforward to show (via group privacy, see Lemma 2.2 here) that this would result in user-level privacy loss at most […]

2B. Add noise as a post-processing step after VDAF aggregation

VDAFs are already complicated things, so it might be possible for noise to be added outside that system (e.g. by DAP). This is appealing, but introduces some new challenges. DAP will need to know the effective sensitivity of the VDAF, i.e. how much each record can maximally impact the aggregate result, in order to calibrate the noise. Note that there are multiple ways of measuring sensitivity (L0, L1, L2, Linf, etc.) that could be relevant here, and it can be more optimal to add noise when you know the L2 sensitivity rather than only the L1 sensitivity. When this information is known, and using the type information that DAP should already know about the VDAF, DAP could specify techniques for adding noise to each agg share before it is sent to the collector, based on the exposed sensitivity, a desired privacy level, and the maximum number of input records a user could contribute (see the sketch after this comment). We would want generic methods for supporting DP given an arbitrary function F, its sensitivity, type, and # […]

In offline discussions with @marianapr, she preferred this approach, but in my opinion it feels pretty complicated to decouple noise addition generically from the aggregation function itself. I'd be interested in other opinions, though.

Of course, we can also discuss parametrizing DAP as well, as @tgeoghegan mentions above.

cc @badih, who double-checked some of my work here.
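As a rough illustration of approach 2B (not part of DAP), the sketch below adds centrally calibrated Gaussian noise to a finished aggregate, given a per-record L2 sensitivity, a desired (epsilon, delta), and the maximum number of records one user can contribute. All names and the particular calibration formula are assumptions for this example, and a real deployment would likely prefer a discrete distribution (see the next comment):

```python
# Illustrative sketch only: DP as a post-processing step outside the VDAF
# (approach 2B). The function names and parameters here are assumptions,
# not part of DAP.
import math
import numpy as np

rng = np.random.default_rng()

def gaussian_sigma(epsilon, delta, l2_sensitivity):
    """Classic Gaussian-mechanism calibration (valid for epsilon <= 1):
    sigma >= sqrt(2 ln(1.25/delta)) * sensitivity / epsilon."""
    return math.sqrt(2.0 * math.log(1.25 / delta)) * l2_sensitivity / epsilon

def add_central_noise(aggregate, record_l2_sensitivity,
                      max_records_per_user, epsilon, delta):
    """Add user-level (epsilon, delta)-DP Gaussian noise to an aggregate
    vector, scaling the per-record sensitivity by the maximum number of
    records one user can contribute."""
    user_sensitivity = record_l2_sensitivity * max_records_per_user
    sigma = gaussian_sigma(epsilon, delta, user_sensitivity)
    aggregate = np.asarray(aggregate, dtype=float)
    return aggregate + rng.normal(0.0, sigma, size=aggregate.shape)

# Example: a histogram aggregate where each record changes one bucket by 1
# (L2 sensitivity 1) and each user contributes at most 3 records.
noisy = add_central_noise([120, 45, 9], 1.0, 3, epsilon=0.5, delta=1e-9)
print(noisy)
```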
Some information regarding de-risking of approach 2B: a prototype extension for libprio based on this approach can be found here; it is meant to be used as a backend for federated learning. Properties of the distributed discrete Gaussian can be found in this paper by Google Research. There are now prototype tools to infer the sensitivity as well as the privacy guarantees of functions, so it looks more feasible by the day to determine the sensitivity of a given VDAF without it being too labor-intensive. In any case, it is advisable to use discrete distributions to avoid being vulnerable to floating-point attacks.
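One practical reason discrete noise is convenient here (my own toy illustration, not taken from the libprio prototype) is that integer noise can be added directly to a field-encoded aggregate share without breaking reconstruction; the modulus and helper names below are assumptions:

```python
# Illustrative sketch only: integer noise composes with additive secret
# sharing over a prime field. The modulus and helpers are assumptions,
# not the actual libprio field or API.
FIELD_MODULUS = 2**61 - 1  # example prime modulus for illustration

def add_noise_to_share(agg_share, integer_noise):
    """Add (possibly negative) integer noise to a field element."""
    return (agg_share + integer_noise) % FIELD_MODULUS

def reconstruct(shares):
    """Collector-side recombination of additive shares."""
    return sum(shares) % FIELD_MODULUS

# Two aggregators hold additive shares of the value 100.
share_0, share_1 = 70, 30
noisy_0 = add_noise_to_share(share_0, +4)   # e.g. a discrete noise sample
noisy_1 = add_noise_to_share(share_1, -1)
assert reconstruct([noisy_0, noisy_1]) == 100 + 4 - 1
```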
Thanks @degregat! I think distrusted aggregators adding "too much noise" to compromise the result is a good thing to try to protect against. However, I am not sure how necessary it is, given that malicious aggregators can already compromise the result in the DAP threat model. In any case, it's not obvious to me whether this is easier to achieve in 2A or 2B.
That is very interesting! Thanks for sharing.
I agree. I think it's generally easier in the MPC to support only integers, so I don't think that's a big constraint.
My preference would be 2B, if we can make it work. I think the threat of an Aggregator fiddling with the noise distribution only matters insofar as it impacts privacy. If only correctness is impacted, we already assume that Aggregators execute the protocol correctly.
We might have a need for 2B. Local/shuffle DP is harder to reason about from a privacy/utility perspective. I would also be OK with the aggregates being initialized with noise before accepting any uploads; the effect is the same. Has any effort been put into specifying this? It seems like it might be an extension to the task creation process, though perhaps that is considered out of scope.
Can you say more about why you think privacy/utility is harder to reason about in the local model compared to the central model?
Because I'm not smart enough, basically. |
Oh, I didn't mean it like that (and generally dismiss your imposter syndrome!). I was just wondering if there was something y'all learned during IPA's development that might be relevant here. For example, perhaps due to the nature of local DP, choosing epsilon in a way where you don't have full visibility into the distribution of input reports means you make more conservative choices (and worsen utility)? 🤷
Mostly, the balance between utility and privacy is poor with local DP. Individuals don't get great assurances about how much their contributions are protected, and the aggregates tend to be overly noisy. That's even if you use all the tricks available, like shuffle DP. The main concern I'd have from the privacy perspective is that you might do something like randomized response with an effective/local epsilon of 10 or even higher, which is basically meaningless for that contribution. A 1/20 or lower chance of picking a false option isn't worth much if you happen to be singled out; worse if that can happen more than once (as is likely here, as in IPA). But from a utility perspective, that's a whole lot of noise. 10000 people, each with a 1/20 chance of lying, suddenly looks like […] This system might be different in that the aggregates might be a bit less diffuse. Measuring higher-probability events or contributions that are more consistent might lend itself to higher rates of noise in the input, but it is hard to compete with central DP whatever way you cut it.
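For a rough sense of the scale being described, here is a back-of-the-envelope calculation (my own, using the 10,000-report, 1-in-20 flip numbers above) of the noise left after the collector de-biases a binary randomized-response count:

```python
# Back-of-the-envelope only: residual noise after de-biasing binary
# randomized response. The parameters match the example above and are
# not normative for DAP.
import math

n = 10_000        # reports in the batch
p_flip = 1 / 20   # probability each client reports the opposite bit

# Each reported bit has variance p_flip * (1 - p_flip), and the unbiased
# estimate of the true count is (observed - p_flip * n) / (1 - 2 * p_flip),
# so its standard deviation is:
std_dev = math.sqrt(n * p_flip * (1 - p_flip)) / (1 - 2 * p_flip)
print(f"standard deviation of the de-biased count: ~{std_dev:.0f}")  # ~24
```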
Shuffle DP is a very generic framework, and statements like "utility and privacy is poor" are not accurate for all deployments and algorithms, especially tuned mechanisms that aren't just generic LDP mechanisms "amplified" by shuffling. Recent research into some shuffle DP frameworks can show very competitive performance with central DP, especially in the "multi-message" shuffle framework. The main benefit of a shuffle notion here is that it's simpler for the system to handle, as the noise mechanism comes "for free" outside of the aggregator and won't require much infrastructure to support. For that reason it seems like we should just support it, IMO. We already have deployed systems (e.g. ENPA) that use this kind of privacy, and it would be beneficial if they could be specified here.
We have some text in https://www.ietf.org/archive/id/draft-ietf-ppm-dap-07.html#section-7.5. We're also working on a dedicated draft for DP in DAP, which we hope will be ready for WG adoption call at IETF 118: https://github.com/wangshan/draft-wang-ppm-differential-privacy. As a result of this work, we may end up needing changes to the DAP spec, but for now we think it's fine to shift this conversation to the draft.
@tlepoint pointed out that PrioV2 requires all clients to submit randomized values for the purposes of ensuring that identical inputs don't yield the same output. Basically, with some very small probability, clients flip some bits in their data to maximize aggregate utility and privacy (in terms of differential privacy bounds). Apple's paper on this topic has details.
We should probably design the system such that either clients or aggregators can inject this sort of noise.
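As a non-normative sketch of that flow, the example below has each client flip the bits of a one-hot input with a small probability, and has the collector de-bias the summed counts. The flip probability, names, and de-biasing step are illustrative assumptions, not the actual PrioV2 or Apple mechanism:

```python
# Illustrative sketch only: client-side bit flipping plus collector-side
# de-biasing for a vector of 0/1 buckets. Parameter names are assumptions;
# see Apple's paper for the actual PrioV2-style mechanism.
import random

P_FLIP = 0.02  # small per-bit flip probability (illustrative)

def client_randomize(bits, p_flip=P_FLIP, rng=random):
    """Flip each bit independently with probability p_flip before sharing."""
    return [b ^ (rng.random() < p_flip) for b in bits]

def collector_debias(noisy_counts, num_reports, p_flip=P_FLIP):
    """Recover unbiased estimates of the true per-bucket counts, using
    E[noisy] = true*(1 - p) + (num_reports - true)*p."""
    return [(c - p_flip * num_reports) / (1 - 2 * p_flip) for c in noisy_counts]

# Example: 3 clients, 4 buckets each (one-hot).
reports = [client_randomize(b) for b in ([1, 0, 0, 0], [0, 1, 0, 0], [1, 0, 0, 0])]
noisy_counts = [sum(col) for col in zip(*reports)]
print(collector_debias(noisy_counts, num_reports=len(reports)))
```

Whether the flipping happens on the client (as here) or via noise shares added by the aggregators, the collector-side de-biasing and the resulting error behave the same way, which is why the design could reasonably allow either.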