Private model training: Improving the efficacy of modelingSignals
#1017
Thanks a lot Charlie for raising this issue, which is of high interest for Criteo: the performance of the ad campaigns we serve is tightly linked to how efficient and precise our machine learning (ML) algorithms are. First, I want to provide some background and details on the Criteo ML use-cases where more support from Chrome would be needed. I will restrict these to bid optimisation, since it is an important ML use-case on our side and also the focus of this issue. We operate a whole AI system for bidding which involves the components you mentioned at the end of the issue.
We appreciate that you are considering several alternatives to improve the status quo around model training, and we are committed to helping you define the best one for Criteo and the whole industry. As of today, since no precise technical specification is available, it is quite difficult to arbitrate on which approach to push for. To help in that direction, could you provide more details/insights on the following points:

- Approach 1: `modelingSignals` processed in a trusted server
- Approach 2: `modelingSignals` released to `reportWin` but with local DP
Thanks, Maxime
Thank you, Maxime, for sharing detailed thoughts on the use case. We are still in the early stages of exploration, and some of these details may change as we finalize the API design. Sharing some early thoughts on your specific questions below:
We are exploring mechanisms that allow you to generate a custom value of modelingSignals exactly the same way as today, but with a relaxed size constraint and noising mechanisms different from randomized response. This comes with the caveat that the payload is encrypted and only accessible in TEEs.
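For context, a minimal sketch of how a buyer's `generateBid()` sets `modelingSignals` today versus the relaxed design being explored. The feature names and the structured-payload shape below are illustrative assumptions on our part, not a committed API:

```javascript
// Hypothetical sketch only. Today generateBid() returns modelingSignals
// as a small numeric value subject to randomized response; the explored
// relaxation would allow a larger payload that the browser encrypts so
// it is only readable inside a TEE. Field names are illustrative.
function generateBid(interestGroup, auctionSignals, perBuyerSignals,
                     trustedBiddingSignals, browserSignals) {
  const features = {
    recencyBucket: 3,              // example buyer-defined features
    pastClickCount: 17,
    campaignId: interestGroup.name,
  };
  return {
    bid: 1.0,
    render: interestGroup.ads[0].renderURL,
    // Explored relaxation: a larger structured payload, encrypted
    // by the browser and processable only in a TEE.
    modelingSignals: features,
  };
}
```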
We are considering passing encrypted modelingSignals via reportWin the same way as today and not restricting access to other signals in reportWin functionality. The complete reports (which can contain contextual signals) collected via reportWin should be processable in TEEs for model training.
Yes, at a very high-level, the API feature should look very similar to processing reports in ARA aggregate reporting.
Could you confirm whether the question is about training models without DP when serving happens in TEEs, or about not applying DP at inference? In general, we are only exploring model training techniques which guarantee differential privacy, irrespective of where we serve the models. Allowing inference on a model trained without differential privacy risks leaking sensitive user information. In the above setting, where the model is trained with DP, inference can potentially be shared in non-noised form.
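To make the "training with DP" guarantee concrete, here is a sketch of the core step used by DP training algorithms such as DP-SGD: per-example gradients are clipped to a norm bound and Gaussian noise scaled to that bound is added before the averaged update is applied. The function and parameter names (`clipNorm`, `noiseMultiplier`) are ours, not from any Privacy Sandbox spec:

```javascript
// Clip a gradient vector to L2 norm at most clipNorm.
function clip(grad, clipNorm) {
  const norm = Math.sqrt(grad.reduce((s, g) => s + g * g, 0));
  const scale = Math.min(1, clipNorm / norm);
  return grad.map(g => g * scale);
}

// Average per-example gradients with DP noise: sum of clipped gradients
// plus Gaussian noise with std = noiseMultiplier * clipNorm, divided by n.
function dpAverageGradient(perExampleGrads, clipNorm, noiseMultiplier, rng = Math.random) {
  const dim = perExampleGrads[0].length;
  const sum = new Array(dim).fill(0);
  for (const grad of perExampleGrads) {
    const c = clip(grad, clipNorm);
    for (let i = 0; i < dim; i++) sum[i] += c[i];
  }
  const n = perExampleGrads.length;
  return sum.map(s => {
    // Box-Muller transform for a standard Gaussian sample.
    const u1 = 1 - rng(), u2 = rng();
    const gauss = Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
    return (s + noiseMultiplier * clipNorm * gauss) / n;
  });
}
```

The clipping bounds any single example's influence on the update, which is what lets the added noise translate into a formal DP guarantee.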
Could you confirm which metadata you are referring to? In general, depending on their nature, metadata can be considered sensitive (for example, evaluation loss) and will need to be constrained by differential privacy and privacy budgets.
This remains an open area, where we expect adtech companies to explore different ways of provisioning their privacy budget across production and experimental model needs.
Yes, we are thinking early solutions might need to train on single TEE machines to ensure data security and user privacy. Adtech companies might have to balance training data size against training speed.
We are actively investigating the above mentioned settings and their impact on privacy and utility. We will try to share more details soon.
We are exploring label DP as well as hybrid-DP (sensitive features would be noised, non-sensitive features would not) for local DP.
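As an illustration of the label-DP direction, here is a sketch of binary randomized response applied to the label only (features left in the clear), plus the standard debiasing of aggregate rates. Function names are ours; the flip probability `1 / (1 + e^ε)` is the usual choice that yields ε-label-DP:

```javascript
// Noise only the binary label (e.g. click / no-click) with flip
// probability p = 1 / (1 + e^epsilon); features are untouched.
function noisyLabel(label, epsilon, rng = Math.random) {
  const flipProb = 1 / (1 + Math.exp(epsilon));
  return rng() < flipProb ? 1 - label : label;
}

// Recover an unbiased estimate of the true positive rate from the
// observed rate over many noisy labels.
function debiasedRate(observedRate, epsilon) {
  const p = 1 / (1 + Math.exp(epsilon));
  return (observedRate - p) / (1 - 2 * p);
}
```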
Yes, we are exploring ways to release noisy counts, value estimates with local DP noise and understand privacy utility tradeoffs for different approaches.
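A minimal sketch of one such mechanism: releasing a count with local DP by adding Laplace noise of scale sensitivity/ε. The sensitivity of 1 assumes each user event changes the count by at most 1; all names are illustrative:

```javascript
// Draw a sample from Laplace(0, scale) via inverse-CDF sampling.
function laplaceSample(scale, rng = Math.random) {
  const u = rng() - 0.5;                       // uniform in (-0.5, 0.5)
  return -scale * Math.sign(u) * Math.log(1 - 2 * Math.abs(u));
}

// Release a count with epsilon-local-DP; sensitivity 1 assumes one
// user event changes the count by at most 1.
function noisyCount(trueCount, epsilon, rng = Math.random) {
  const sensitivity = 1;
  return trueCount + laplaceSample(sensitivity / epsilon, rng);
}
```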
I think this can be considered as long as the budget-split does not regress the privacy of the current modelingSignals API. We can also model this after the Flexible Event API, which similarly allows for users to modify the reports they receive in order to minimize noise while ensuring DP.
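To illustrate what a buyer-controlled budget split could look like (loosely analogous to how Flexible Event configurations trade report granularity against noise), here is an entirely hypothetical sketch in which a total per-user ε is divided across named uses:

```javascript
// Hypothetical: split a total epsilon budget across named uses.
// Shares are fractions of the total and must not sum to more than 1,
// so the composed privacy cost never exceeds the overall budget.
function splitBudget(totalEpsilon, shares) {
  const total = Object.values(shares).reduce((a, b) => a + b, 0);
  if (total > 1 + 1e-12) throw new Error("shares exceed the budget");
  const out = {};
  for (const [use, share] of Object.entries(shares)) {
    out[use] = totalEpsilon * share;
  }
  return out;
}
```

Under basic sequential composition, the per-use epsilons sum to at most the original budget, which is the "no regression" property mentioned above.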
With full local DP (or hybrid DP) on features, additional obfuscation strategies might not be necessary.
We are actively exploring the mechanisms and trying to understand the impact on privacy and utility. We will try to share more details soon.
This issue aims to help improve support for bid optimization in Protected Audiences without impacting the privacy stance of the API. This use-case typically involves predicting outcomes (clicks, conversions, etc.) for ad opportunities. Models to learn these predictions are typically trained via supervised learning, i.e. on examples labeled with an outcome (click, conversion, etc.).
There are two techniques we are exploring to improve the status quo here:

1. `modelingSignals` can be encrypted and processed in a trusted server environment, where we can offer private model training algorithms.
2. `modelingSignals` can be released directly to `reportWin` with local noise. This could look like changes to the existing randomized response mechanism.

Of these two techniques, we think (1) will provide the most utility for this use-case, although it introduces the most complexity to the system.
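For reference, the existing mechanism mentioned in (2) can be sketched as randomized response over a k-bit value: with some probability, the browser replaces the buyer's value with a uniformly random one. The 12-bit width matches the current `modelingSignals` API; the 1% noise probability below is an illustrative assumption, not a quoted constant:

```javascript
// Randomized-response sketch: with probability noiseProb, replace the
// k-bit value with a uniform random k-bit value; otherwise pass it
// through (truncated to k bits).
function randomizedResponse(value, bits = 12, noiseProb = 0.01, rng = Math.random) {
  const domain = 2 ** bits;                    // 4096 values for 12 bits
  if (rng() < noiseProb) {
    return Math.floor(rng() * domain);         // uniform random replacement
  }
  return value & (domain - 1);                 // truncate to k bits
}
```

The low-dimensionality shortcoming discussed below follows directly from the small domain: only 2^12 distinct values can be conveyed per report.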
I am filing this issue to collect feedback about the model training use-case. I think we have a pretty good understanding of the shortcomings of the existing `modelingSignals` approach (mainly from a low-dimensionality standpoint). However, there are lots of auxiliary use-cases / developer journeys involved with training models. We're interested in better understanding these kinds of use-cases. What are we missing? Please let us know, through this issue, if there are other use-cases we should consider when thinking through improvements here.
cc @nikunj101 @michaelkleber