
Accommodating randomized inputs (a la DP) #19

Closed
chris-wood opened this issue Mar 10, 2021 · 14 comments

Comments

@chris-wood
Collaborator

@tlepoint pointed out that PrioV2 requires all clients to submit randomized values so that identical inputs don't yield identical outputs. Basically, with some very small probability, clients flip some bits in their data, balancing aggregate utility and privacy (in terms of differential privacy bounds). Apple's paper on this topic has the details.

We should probably design the system such that either clients or aggregators can inject this sort of noise.
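
For concreteness, here is a minimal sketch of that kind of client-side randomization (the flip probability, bit-vector representation, and function name are illustrative, not taken from PrioV2 or the Apple paper):

```python
import secrets

def randomize_bits(bits, flip_probability=0.01, precision=1 << 32):
    """Flip each bit of a client's contribution independently with a small
    probability before it is split into input shares. A sketch only; a real
    deployment would tune flip_probability to the desired DP bound."""
    threshold = int(flip_probability * precision)
    return [
        b ^ 1 if secrets.randbelow(precision) < threshold else b
        for b in bits
    ]

# e.g. a one-hot contribution a client would otherwise submit verbatim
noisy_report = randomize_bits([0, 1, 0, 0, 0, 0, 0, 0])
```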

@cjpatton
Collaborator

This seems orthogonal to the core protocol logic, since its applicability depends on the data type.

@cjpatton changed the title from "Accommodating randomized inputs" to "Accommodating randomized inputs (a la DP)" on Apr 14, 2021
@tgeoghegan
Collaborator

There's an open question that fell out of the review of #91 that feels appropriate for this issue. Aggregators are encouraged to apply differential privacy noise to outputs. Collectors and aggregators may need to agree on DP parameters (e.g., the probability with which values are flipped) so that collectors can de-noise outputs. I think we might need to have a field in PDAParams for differential privacy parameters.

@csharrison
Contributor

It's worth noting that in many situations (especially with server-added noise), the collectors don't need to do anything special to "de-noise", since the noise will often have a mean of 0. However, collectors may want to know things like the variance of the noise, or which distribution the noise was sampled from, so they can understand the effective error.

@csharrison
Contributor

I was originally going to post this as a new issue, but I think this one should suffice. It was originally titled "Optional DAP configuration that achieves differential privacy". Apologies in advance for the wall of text; I hope it is useful to the discussion.


Differential privacy (DP) is a useful property of an aggregation system. Among other things, it is a mitigation against Sybil attacks that aim to violate user privacy. This is because differential privacy holds under worst-case assumptions, including when the adversary knows the contents of the entire database except for the targeted user. In other words, malicious clients submitting data to the system do not adversely affect the privacy (in this worst-case sense) of any honest user.

We should consider whether we want to specify an optional configuration in DAP which deployments can use to achieve DP output. @cjpatton mentions here it would be useful, so here are my thoughts:

DP typically involves noise addition. We ought to consider two flavors, since I think both are relevant and useful to this work:

1. Clients add noise to their input

When clients add noise to their input, it is possible to show the system is differentially private (in the “local” model) without any downstream changes in DAP or VDAF. This is a trivial result though, and does not take advantage of anything relevant to DAP.

The security model of DAP is that input is sent to the system through input shares, which eliminates any knowledge in the system of which client sent which report. I believe that, by using a similar line of argument as the ENPA white paper (section 4.5), we can show that using local noise in DAP should always achieve differential privacy in the shuffle model, without any change to the protocol (or restriction on VDAF). This yields much stronger privacy bounds than purely local DP. It will depend on the number of users contributing in a given batch, so increasing the minimum batch size should increase privacy.

Side note: with some specific VDAFs, and some specific types of local noise, you could show stronger privacy bounds than shuffle DP, even if noise is only added on the client. This is similar to the case of each aggregator adding independent noise, where you consider the client to be a 1-record aggregator.

Side note: If clients add noise to their input, it puts at risk the "verifiable" aspect of VDAF, since the input is falsified from the very start. However, we can still add noise such that the output is well-bounded, and downstream VDAFs can expect data within the expected bounds.

2. Noise is added during aggregation

There are a couple of ways I could see specifying this. This is an example of central DP, or more specifically, distributed DP since it is done via distributed noise addition.

2A. VDAFs add noise

The idea here is that a particular VDAF could be implemented such that noise is added to each agg_share independently, resulting in a noisy agg_result that could be shown to be differentially private. Noise should be added in such a way that agg_result has summed noise distributed according to a particular distribution that we know achieves DP (e.g. discrete Laplace).

For example, n aggregators adding noise distributed according to difference-of-Polya distributions would, summed together, yield noise distributed according to a two-sided geometric distribution (link, theorem 5.1), for which we can show central DP bounds. It is worth mentioning that in these schemes, you can often show (reduced) DP bounds even if only a subset of the aggregators are honestly adding noise, which is exactly the kind of property that you want out of DAP. In general, this is possible when the final distribution you want the output noise to follow is infinitely divisible, or has good infinitely divisible approximations.
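
As a rough illustration of that construction, here is a sketch of one aggregator's noise share (the Polya-via-negative-binomial parametrization and the choice alpha = exp(-epsilon / sensitivity) are my assumptions about how the referenced theorem would be instantiated; they are not part of any specified VDAF):

```python
import numpy as np

def polya_noise_share(num_aggregators, epsilon, sensitivity, rng):
    """One aggregator's additive noise. Summing num_aggregators independent
    such shares is intended to yield two-sided geometric noise with parameter
    alpha, matching the difference-of-Polya construction described above."""
    alpha = np.exp(-epsilon / sensitivity)
    r = 1.0 / num_aggregators
    # numpy's negative_binomial(n, p) has pmf C(k+n-1, k) * p**n * (1-p)**k,
    # so Polya(r, alpha) corresponds to p = 1 - alpha here.
    x = rng.negative_binomial(r, 1.0 - alpha)
    y = rng.negative_binomial(r, 1.0 - alpha)
    return int(x) - int(y)

rng = np.random.default_rng()
shares = [polya_noise_share(2, epsilon=1.0, sensitivity=1.0, rng=rng) for _ in range(2)]
total_noise = sum(shares)  # what ends up added to agg_result
```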

VDAFs should probably not make any assumptions about their input shares (i.e. how many are coming from which users, etc). However, if they know both:

  • The range that inputs to the VDAF must fall under, and
  • The amount of noise the VDAF adds

then we can specify, in some cases, the record-level differential privacy claim: that is, how much privacy loss we could expect (in the worst case) for any particular record in the input.

This means a DP-aware VDAF could parametrize itself with a new PublicParam that encodes a (epsilon, delta) tuple describing the desired record-level differential privacy bounds. Then the VDAF could configure itself to add noise scaled properly to achieve this bound. It’s possible the VDAF would need new constants to advertise bounds on epsilon / delta if only some configurations are possible. VDAFs can also have public parameters specifying the distribution they are sampling from to add noise, so users of the VDAF can properly understand the expected variance from aggregating.

With VDAFs supporting record-level DP, it should be easy to achieve user-level DP in DAP. The easiest way would be to include a new constraint that each user produces exactly the same number of inputs as any other user (say, N inputs) in a given batch (see #210 also). It is straightforward to show (via group privacy, see Lemma 2.2 here) that this would result in user-level privacy loss at most (N*epsilon, N*exp(N*epsilon)*delta). It should be possible to make DP claims in other settings, but it will be a bit more challenging.
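
A quick worked instance of that bound, with purely illustrative numbers:

```python
import math

epsilon, delta = 0.1, 1e-9   # record-level DP parameters
N = 5                        # inputs each user contributes to the batch

user_epsilon = N * epsilon                        # 0.5
user_delta = N * math.exp(N * epsilon) * delta    # ~8.2e-9
print(user_epsilon, user_delta)
```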

2B. Add noise as a post-processing step after VDAF aggregation

VDAFs are already complicated things, so it might be possible for noise to be added outside that system (e.g. by DAP). This is appealing, but introduces some new challenges.

DAP will need to know the effective sensitivity of the VDAF (i.e., how much each record can maximally impact the agg result) in order to calibrate the noise. Note that there are multiple ways of measuring sensitivity (L0, L1, L2, Linf, etc.) that could be relevant here, and noise can sometimes be added more efficiently if you know, say, the L2 sensitivity rather than only the L1 sensitivity.

When this information is known, and using the type information that DAP should already know about the VDAF, DAP could specify techniques for adding noise to each agg share before they get sent to the collector, based on the exposed sensitivity, a desired privacy level, and the maximum number of input records a user could contribute. We would want generic methods for supporting DP given an arbitrary function F, its sensitivity, type, and #SHARES that a DAP deployment could use.
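
A sketch of what such a generic recipe could look like, assuming the classic continuous Gaussian-mechanism calibration (which requires epsilon <= 1) and an even split of the noise variance across #SHARES aggregators; a real specification would presumably use a discrete mechanism, per the floating-point concern raised later in this thread:

```python
import math
import numpy as np

def per_share_noise_stddev(l2_sensitivity, epsilon, delta,
                           max_records_per_user, num_shares):
    """Gaussian noise scale each aggregator would add to its agg share,
    so that the summed shares carry the full calibrated noise. A sketch only."""
    # Worst case: a user contributing max_records_per_user records scales
    # the effective sensitivity for user-level privacy.
    sensitivity = l2_sensitivity * max_records_per_user
    # Classic bound: sigma >= sensitivity * sqrt(2 ln(1.25/delta)) / epsilon.
    sigma_total = sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon
    # Splitting the target variance evenly across num_shares aggregators.
    return sigma_total / math.sqrt(num_shares)

sigma = per_share_noise_stddev(l2_sensitivity=1.0, epsilon=0.5, delta=1e-9,
                               max_records_per_user=1, num_shares=2)
noisy_offset = np.random.default_rng().normal(0.0, sigma)  # added to one agg share
```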

In offline discussions with @marianapr, she preferred this approach, but in my opinion it feels pretty complicated to decouple noise addition generically from the aggregation function itself. I'd be interested in other opinions though.

Of course, we can also discuss parametrizing DAP as well as @tgeoghegan mentions above.

cc @badih who double checked some of my work here.

@degregat

degregat commented May 31, 2022

Some information regarding de-risking of approach 2B:

A prototype extension for libprio based on this approach can be found here; it is meant to be used as a backend for federated learning.

Properties of the distributed discrete Gaussian can be found in this paper by Google Research.
Looking at the above, one could use noise scaling techniques similar to 2A and would get an analogous graceful privacy reduction per defecting aggregator (implemented here). Though I'm wondering how to think about the case when a subset is defecting, since then the validity checking from the PCPs breaks down, and we get different problems as well. (This should be true for 2A as well, no?)
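
To illustrate the graceful degradation (a sketch under the assumptions that the total noise variance is split evenly across aggregators and that, as for a (discrete) Gaussian mechanism at fixed delta, epsilon scales inversely with the noise standard deviation):

```python
import math

def degraded_epsilon(epsilon_target, num_aggregators, num_honest):
    """Rough effective epsilon when only num_honest of num_aggregators
    actually add their noise share."""
    # Realized variance is (num_honest / num_aggregators) of the target, so
    # the standard deviation shrinks by sqrt(num_honest / num_aggregators)
    # and epsilon grows by the inverse of that factor.
    return epsilon_target * math.sqrt(num_aggregators / num_honest)

# With 2 aggregators and only 1 adding its share, epsilon grows by ~1.41x.
print(degraded_epsilon(0.5, 2, 1))
```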

There are now prototype tools to infer the sensitivity as well as the privacy guarantees of functions, so it sounds more feasible by the day to determine the sensitivity of a given VDAF without it being too labor intensive.

In any case, it is advisable to use discrete distributions to prevent being vulnerable to floating point attacks.

@csharrison
Contributor

Thanks @degregat! I think distrusted aggregators adding "too much noise" to compromise the result is a good thing to try to protect against. However, I am not sure how necessary it is, given that malicious aggregators can already compromise the result in the DAP threat model. In any case, it's not obvious to me whether this is easier to achieve in 2A or 2B.

  • In 2A we could add another round to the initialization step to pass around noise shares.
  • In 2B, perhaps the leader crafts noise shares before interacting with the other aggregators. This would move the idea from "post-processing" to "pre-processing", to take advantage of the VDAF's verification.

> There are now prototype tools to infer the sensitivity as well as the privacy guarantees of functions, so it sounds more feasible by the day to determine the sensitivity of a given VDAF without it being too labor intensive.

That is very interesting! Thanks for sharing.

> In any case, it is advisable to use discrete distributions to prevent being vulnerable to floating point attacks.

I agree. I think it's generally easier in the MPC to support only integers, so I don't think that's a big constraint.

@cjpatton
Collaborator

My preference would be 2B, if we can make it work. I think the threat of an Aggregator fiddling with the noise distribution only matters insofar as it impacts privacy. If only correctness is impacted, we already concede that Aggregators are assumed to execute the protocol correctly.

@martinthomson
Contributor

We might have a need for 2B. Local/shuffle DP is harder to reason about from a privacy/utility perspective. I would also be OK with the aggregates being initialized with noise before accepting any uploads; the effect is the same.

Has anyone done anything to specify this, or put any effort into defining something here?

It seems like it might be an extension to the task creation process, though that is maybe considered out of scope.

@chris-wood
Collaborator Author

> Local/shuffle DP is harder to reason about from a privacy/utility perspective. I would also be OK with the aggregates being initialized with noise before accepting any uploads; the effect is the same.

Can you say more about why you think privacy/utility is harder to reason about in the local model compared to the central model?

@martinthomson
Contributor

Because I'm not smart enough, basically.

@chris-wood
Collaborator Author

Oh, I didn't mean it like that (and generally dismiss your imposter syndrome!). I was just wondering if there was something y'all learned during IPA's development that might be relevant here. For example, perhaps due to the nature of local DP, choosing epsilon in a way where you don't have full visibility into the distribution of input reports means you make more conservative choices (and worsen utility)? 🤷

@martinthomson
Contributor

Mostly, the balance between utility and privacy is poor with local DP. Individuals don't get great assurances about how much their contributions are protected and the aggregates tend to be overly noisy. That's even if you use all the tricks available, like shuffle DP.

The main concern I'd have from the privacy perspective is that you might do something like randomized response with an effective/local epsilon of 10 or even higher, which is basically meaningless for that contribution. A 1/20 or lower chance of picking a false option isn't worth much if you happen to be singled out; worse if that can happen more than once (as is likely here, as in IPA). But from a utility perspective, that's a whole lot of noise. 10000 people, each with a 1/20 chance of lying, suddenly looks like $\mathcal{N}(0, 475)$, which is quite a bit when you have conversion rates of 1% being considered very high, such that your signal is $\sim 100$ relative to that noise.
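
A quick check of that arithmetic (a sketch, treating each of the 10000 responses as flipping independently with probability 1/20 and ignoring the de-biasing rescaling a collector would apply):

```python
p = 1 / 20            # chance a respondent reports a false option
n = 10_000            # respondents in the batch

variance = n * p * (1 - p)    # ~475, the variance behind N(0, 475) above
stddev = variance ** 0.5      # ~21.8
signal = 0.01 * n             # a "very high" 1% conversion rate: ~100
print(variance, stddev, signal)
```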

This system might be different in that the aggregates might be a bit less diffuse. Measuring higher probability events or contributions that are more consistent might lend itself to higher rates of noise in the input, but it is hard to compete with central DP whatever way you cut it.

@csharrison
Contributor

Shuffle DP is a very generic framework, and statements like "the balance between utility and privacy is poor" are not accurate for all deployments and algorithms, especially tuned mechanisms that aren't just generic LDP mechanisms "amplified" by shuffling. Recent research into some shuffle DP frameworks shows performance very competitive with central DP, especially in the "multi-message" shuffle framework.

The main benefit of a shuffle notion here is that it's simpler for the system to handle, as the noise mechanism comes "for free" outside of the aggregator and won't require much infrastructure to support. For that reason it seems like we should just support it IMO. We already have deployed systems (e.g. ENPA) that use this kind of privacy and it would be beneficial if they could be specified here.

@cjpatton
Collaborator

We have some text in https://www.ietf.org/archive/id/draft-ietf-ppm-dap-07.html#section-7.5. We're also working on a dedicated draft for DP in DAP, which we hope will be ready for WG adoption call at 118: https://github.com/wangshan/draft-wang-ppm-differential-privacy

As a result of this work, we may end up needing changes to the DAP spec, but for now we think it's fine to shift this conversation to that draft.

@cjpatton closed this as not planned on Sep 20, 2023