
Accommodating randomized inputs (a la DP) #19

Closed
chris-wood opened this issue Mar 10, 2021 · 14 comments

Comments

@chris-wood
Collaborator

@tlepoint pointed out that PrioV2 requires all clients to submit randomized values so that identical inputs don't yield identical outputs. Basically, with some very small probability, clients flip some bits in their data, balancing aggregate utility and privacy (in terms of differential privacy bounds). Apple's paper on this topic has the details.

We should probably design the system such that either clients or aggregators can inject this sort of noise.
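
For concreteness, here is a minimal sketch of that kind of client-side randomization (the flip probability, bit-vector representation, and function name are illustrative, not taken from PrioV2 or the Apple paper):

```python
import secrets

def randomize_bits(bits, flip_probability=0.01, precision=1 << 32):
    """Flip each bit of a client's contribution independently with a small
    probability before it is split into input shares. A sketch only; a real
    deployment would tune flip_probability to the desired DP bound."""
    threshold = int(flip_probability * precision)
    return [
        b ^ 1 if secrets.randbelow(precision) < threshold else b
        for b in bits
    ]

# e.g. a one-hot contribution a client would otherwise submit verbatim
noisy_report = randomize_bits([0, 1, 0, 0, 0, 0, 0, 0])
```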

@cjpatton
Collaborator

This seems orthogonal to the core protocol logic, since its applicability depends on the data type.

@cjpatton changed the title from "Accommodating randomized inputs" to "Accommodating randomized inputs (a la DP)" on Apr 14, 2021
@tgeoghegan
Collaborator

There's an open question that fell out of the review of #91 that feels appropriate for this issue. Aggregators are encouraged to apply differential privacy noise to outputs. Collectors and aggregators may need to agree on DP parameters (e.g., the probability with which values are flipped) so that collectors can de-noise outputs. I think we might need to have a field in PDAParams for differential privacy parameters.

@csharrison
Contributor

It's worth noting that in many situations (especially with server-added noise), the collectors don't need to do anything special to "de-noise", since the noise will often have a mean of 0. However, collectors may want to know things like the variance of the noise, or which distribution the noise was sampled from, so they can understand the effective error.

@csharrison
Contributor

I was originally going to post this as a new issue, but I think this one should suffice. It was originally titled "Optional DAP configuration that achieves differential privacy". Apologies in advance for the wall of text; I hope it is useful to the discussion.


Differential privacy (DP) is a useful property of an aggregation system. Among other things, it is a mitigation against Sybil attacks that aim to violate user privacy. This is because differential privacy holds under worst-case assumptions, including when the adversary knows the contents of the entire database except for the targeted user. In other words, malicious clients submitting data to the system do not adversely affect the privacy (in this worst-case sense) of any honest user.

We should consider whether we want to specify an optional configuration in DAP which deployments can use to achieve DP output. @cjpatton mentions here it would be useful, so here are my thoughts:

DP typically involves noise addition. We ought to consider two flavors, since I think both are relevant and useful to this work:

1. Clients add noise to their input

When clients add noise to their input, it is possible to show the system is differentially private (in the “local” model) without any downstream changes in DAP or VDAF. This is a trivial result though, and does not take advantage of anything relevant to DAP.

The security model of DAP is that input is sent to the system through input shares, which eliminates any knowledge in the system of which client sent which report. I believe that, by using a similar line of argument as the ENPA white paper (section 4.5), we can show that using local noise in DAP should always achieve differential privacy in the shuffle model, without any change to the protocol (or restriction on VDAF). This yields much stronger privacy bounds than purely local DP. It will depend on the number of users contributing in a given batch, so increasing the minimum batch size should increase privacy.

Side note: with some specific VDAFs, and some specific types of local noise, you could show stronger privacy bounds than shuffle DP, even if noise is only added on the client. This is similar to the case of each aggregator adding independent noise, where you consider the client to be a 1-record aggregator.

Side note: If clients add noise to their input, it puts at risk the "verifiable" aspect of VDAF, since the input is falsified from the very start. However, we can still add noise such that the output is well-bounded, and downstream VDAFs can expect data within the expected bounds.

2. Noise is added during aggregation

There are a couple of ways I could see specifying this. This is an example of central DP, or more specifically, distributed DP since it is done via distributed noise addition.

2A. VDAFs add noise

The idea here is that a particular VDAF could be implemented such that noise is added to each agg_share independently, resulting in a noisy agg_result that could be shown to be differentially private. Noise should be added in such a way that agg_result has summed noise distributed according to a particular distribution that we know achieves DP (e.g. discrete Laplace).

For example, n aggregators adding noise distributed according to difference-of-Polya distributions would, summed together, yield noise distributed according to a two-sided geometric distribution (link, theorem 5.1), for which we can show central DP bounds. It is worth mentioning that in these schemes, you can often show (reduced) DP bounds even if only a subset of the aggregators are honestly adding noise, which is exactly the kind of property that you want out of DAP. In general, this is possible when the final distribution you want the output noise to follow is infinitely divisible, or has good infinitely divisible approximations.
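
As a rough illustration of that construction, here is a sketch of one aggregator's noise share (the Polya-via-negative-binomial parametrization and the choice alpha = exp(-epsilon / sensitivity) are my assumptions about how the referenced theorem would be instantiated; they are not part of any specified VDAF):

```python
import numpy as np

def polya_noise_share(num_aggregators, epsilon, sensitivity, rng):
    """One aggregator's additive noise. Summing num_aggregators independent
    such shares is intended to yield two-sided geometric noise with parameter
    alpha, matching the difference-of-Polya construction described above."""
    alpha = np.exp(-epsilon / sensitivity)
    r = 1.0 / num_aggregators
    # numpy's negative_binomial(n, p) has pmf C(k+n-1, k) * p**n * (1-p)**k,
    # so Polya(r, alpha) corresponds to p = 1 - alpha here.
    x = rng.negative_binomial(r, 1.0 - alpha)
    y = rng.negative_binomial(r, 1.0 - alpha)
    return int(x) - int(y)

rng = np.random.default_rng()
shares = [polya_noise_share(2, epsilon=1.0, sensitivity=1.0, rng=rng) for _ in range(2)]
total_noise = sum(shares)  # what ends up added to agg_result
```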

VDAFs should probably not make any assumptions about their input shares (i.e. how many are coming from which users, etc). However, if they know both:

  • The range that inputs to the VDAF must fall under, and
  • The amount of noise the VDAF adds

then we can specify, in some cases, the record-level differential privacy claim: that is, how much privacy loss we could expect (in the worst case) for any particular record in the input.

This means a DP-aware VDAF could parametrize itself with a new PublicParam that encodes a (epsilon, delta) tuple describing the desired record-level differential privacy bounds. Then the VDAF could configure itself to add noise scaled properly to achieve this bound. It’s possible the VDAF would need new constants to advertise bounds on epsilon / delta if only some configurations are possible. VDAFs can also have public parameters specifying the distribution they are sampling from to add noise, so users of the VDAF can properly understand the expected variance from aggregating.

With VDAFs supporting record-level DP, it should be easy to achieve user-level DP in DAP. The easiest way would be to include a new constraint that each user produces exactly the same number of inputs as any other user (say, N inputs) in a given batch (see #210 also). It is straightforward to show (via group privacy, see Lemma 2.2 here) that this would result in user-level privacy loss at most (N*epsilon, N*exp(N*epsilon)*delta). It should be possible to make DP claims in other settings, but it will be a bit more challenging.
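
A quick worked instance of that bound, with purely illustrative numbers:

```python
import math

epsilon, delta = 0.1, 1e-9   # record-level DP parameters
N = 5                        # inputs each user contributes to the batch

user_epsilon = N * epsilon                        # 0.5
user_delta = N * math.exp(N * epsilon) * delta    # ~8.2e-9
print(user_epsilon, user_delta)
```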

2B. Add noise as a post-processing step after VDAF aggregation

VDAFs are already complicated things, so it might be possible for noise to be added outside that system (e.g. by DAP). This is appealing, but introduces some new challenges.

DAP will need to know the effective sensitivity of the VDAF (i.e., how much each record can maximally impact the agg result) in order to calibrate the noise. Note that there are multiple ways of measuring sensitivity (L0, L1, L2, Linf, etc.) that could be relevant here, and noise can sometimes be added more efficiently if you know, say, the L2 sensitivity rather than only the L1 sensitivity.

When this information is known, and using the type information that DAP should already know about the VDAF, DAP could specify techniques for adding noise to each agg share before they get sent to the collector, based on the exposed sensitivity, a desired privacy level, and the maximum number of input records a user could contribute. We would want generic methods for supporting DP given an arbitrary function F, its sensitivity, type, and #SHARES that a DAP deployment could use.
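
A sketch of what such a generic recipe could look like, assuming the classic continuous Gaussian-mechanism calibration (which requires epsilon <= 1) and an even split of the noise variance across #SHARES aggregators; a real specification would presumably use a discrete mechanism, per the floating-point concern raised later in this thread:

```python
import math
import numpy as np

def per_share_noise_stddev(l2_sensitivity, epsilon, delta,
                           max_records_per_user, num_shares):
    """Gaussian noise scale each aggregator would add to its agg share,
    so that the summed shares carry the full calibrated noise. A sketch only."""
    # Worst case: a user contributing max_records_per_user records scales
    # the effective sensitivity for user-level privacy.
    sensitivity = l2_sensitivity * max_records_per_user
    # Classic bound: sigma >= sensitivity * sqrt(2 ln(1.25/delta)) / epsilon.
    sigma_total = sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon
    # Splitting the target variance evenly across num_shares aggregators.
    return sigma_total / math.sqrt(num_shares)

sigma = per_share_noise_stddev(l2_sensitivity=1.0, epsilon=0.5, delta=1e-9,
                               max_records_per_user=1, num_shares=2)
noisy_offset = np.random.default_rng().normal(0.0, sigma)  # added to one agg share
```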

In offline discussions with @marianapr, she preferred this approach, but in my opinion it feels pretty complicated to decouple noise addition generically from the aggregation function itself. I'd be interested in other opinions though.

Of course, we can also discuss parametrizing DAP as well as @tgeoghegan mentions above.

cc @badih who double checked some of my work here.

@degregat

degregat commented May 31, 2022

Some information regarding de-risking of approach 2B:

A prototype extension for libprio based on this approach can be found here; it is meant to be used as a backend for federated learning.

Properties of the distributed discrete Gaussian can be found in this paper by Google Research.
Looking at the above, one could use noise scaling techniques similar to 2A and would get an analogous graceful privacy reduction per defecting aggregator (implemented here). Though I'm wondering how to think about the case when a subset is defecting, since then the validity checking from the PCPs breaks down, and we get different problems as well. (This should be true for 2A as well, no?)
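
To illustrate the graceful degradation (a sketch under the assumptions that the total noise variance is split evenly across aggregators and that, as for a (discrete) Gaussian mechanism at fixed delta, epsilon scales inversely with the noise standard deviation):

```python
import math

def degraded_epsilon(epsilon_target, num_aggregators, num_honest):
    """Rough effective epsilon when only num_honest of num_aggregators
    actually add their noise share."""
    # Realized variance is (num_honest / num_aggregators) of the target, so
    # the standard deviation shrinks by sqrt(num_honest / num_aggregators)
    # and epsilon grows by the inverse of that factor.
    return epsilon_target * math.sqrt(num_aggregators / num_honest)

# With 2 aggregators and only 1 adding its share, epsilon grows by ~1.41x.
print(degraded_epsilon(0.5, 2, 1))
```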

There are now prototype tools to infer the sensitivity as well as the privacy guarantees of functions, so it sounds more feasible by the day to determine the sensitivity of a given VDAF without it being too labor intensive.

In any case, it is advisable to use discrete distributions to prevent being vulnerable to floating point attacks.

@csharrison
Contributor

Thanks @degregat! I think distrusted aggregators adding "too much noise" to compromise the result is a good thing to try to protect against. However, I am not sure how necessary it is, given that malicious aggregators can already compromise the result in the DAP threat model. In any case, it's not obvious to me whether this is easier to achieve in 2A or 2B.

  • In 2A we could add another round to the initialization step to pass around noise shares.
  • In 2B, perhaps the leader crafts noise shares before interacting with the other aggregators. This would move the idea from "post-processing" to "pre-processing", to take advantage of the VDAF's verification.

> There are now prototype tools to infer the sensitivity as well as the privacy guarantees of functions, so it sounds more feasible by the day to determine the sensitivity of a given VDAF without it being too labor intensive.

That is very interesting! Thanks for sharing.

> In any case, it is advisable to use discrete distributions to prevent being vulnerable to floating point attacks.

I agree. I think it's generally easier in the MPC to support only integers, so I don't think that's a big constraint.

@cjpatton
Collaborator

My preference would be 2B, if we can make it work. I think the threat of an Aggregator fiddling with the noise distribution only matters insofar as it impacts privacy. If only correctness is impacted, we already concede that Aggregators are assumed to execute the protocol correctly.

@martinthomson
Contributor

We might have a need for 2B. Local/shuffle DP is harder to reason about from a privacy/utility perspective. I would also be OK with the aggregates being initialized with noise before accepting any uploads; the effect is the same.

Has anyone done anything to specify this, or put any effort into defining something here?

It seems like it might be an extension to the task creation process, though that is maybe considered out of scope.

@chris-wood
Collaborator Author

> Local/shuffle DP is harder to reason about from a privacy/utility perspective. I would also be OK with the aggregates being initialized with noise before accepting any uploads; the effect is the same.

Can you say more about why you think privacy/utility is harder to reason about in the local model compared to the central model?

@martinthomson
Contributor

Because I'm not smart enough, basically.

@chris-wood
Collaborator Author

Oh, I didn't mean it like that (and generally dismiss your imposter syndrome!). I was just wondering if there was something y'all learned during IPA's development that might be relevant here. For example, perhaps due to the nature of local DP, choosing epsilon in a way where you don't have full visibility into the distribution of input reports means you make more conservative choices (and worsen utility)? 🤷

@martinthomson
Contributor

Mostly, the balance between utility and privacy is poor with local DP. Individuals don't get great assurances about how much their contributions are protected and the aggregates tend to be overly noisy. That's even if you use all the tricks available, like shuffle DP.

The main concern I'd have from the privacy perspective is that you might do something like randomized response with an effective/local epsilon of 10 or even higher, which is basically meaningless for that contribution. A 1/20 or lower chance of picking a false option isn't worth much if you happen to be singled out; worse if that can happen more than once (as is likely here, as in IPA). But from a utility perspective, that's a whole lot of noise. 10000 people, each with a 1/20 chance of lying, suddenly looks like $\mathcal{N}(0, 475)$, which is quite a bit when you have conversion rates of 1% being considered very high, such that your signal is $\sim 100$ relative to that noise.
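
A quick check of that arithmetic (a sketch, treating each of the 10000 responses as flipping independently with probability 1/20 and ignoring the de-biasing rescaling a collector would apply):

```python
p = 1 / 20            # chance a respondent reports a false option
n = 10_000            # respondents in the batch

variance = n * p * (1 - p)    # ~475, the variance behind N(0, 475) above
stddev = variance ** 0.5      # ~21.8
signal = 0.01 * n             # a "very high" 1% conversion rate: ~100
print(variance, stddev, signal)
```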

This system might be different in that the aggregates might be a bit less diffuse. Measuring higher probability events or contributions that are more consistent might lend itself to higher rates of noise in the input, but it is hard to compete with central DP whatever way you cut it.

@csharrison
Contributor

Shuffle DP is a very generic framework, and statements like "the balance between utility and privacy is poor" are not accurate for all deployments and algorithms, especially tuned mechanisms that aren't just generic LDP mechanisms "amplified" by shuffling. Recent research into some shuffle DP frameworks shows performance very competitive with central DP, especially in the "multi-message" shuffle framework.

The main benefit of a shuffle notion here is that it's simpler for the system to handle, as the noise mechanism comes "for free" outside of the aggregator and won't require much infrastructure to support. For that reason it seems like we should just support it IMO. We already have deployed systems (e.g. ENPA) that use this kind of privacy and it would be beneficial if they could be specified here.

@cjpatton
Collaborator

We have some text in https://www.ietf.org/archive/id/draft-ietf-ppm-dap-07.html#section-7.5. We're also working on a dedicated draft for DP in DAP, which we hope will be ready for WG adoption call at 118: https://github.com/wangshan/draft-wang-ppm-differential-privacy

As a result of this work, we may end up needing changes to the DAP spec, but for now we think it's fine to shift this conversation to that draft.

@cjpatton closed this as not planned on Sep 20, 2023