This directory contains a prototype for aggregating data across browsers with privacy protection. The mechanism is explained here. Note that the current MPC protocol has some known flaws. This prototype is a proof of concept and is not the final design.
To build the code, install Bazel first. The Go files contain detailed instructions for running the binaries. You can follow our Terraform setup to set up an environment.
The following pipelines are built on IDPF (Incremental Distributed Point Functions) and Apache Beam. As instructed in the Go files, you can run the pipelines locally or on another runner engine such as Google Cloud Dataflow. The latter requires a Google Cloud project.
There are three main pipelines for the DPF protocol:
- dpf_aggregate_partial_report expands the DPF keys to histograms and combines the histograms to get partial aggregation results.
- dpf_generate_partial_report converts a batch of raw input conversions into partial reports that can be processed by pipeline dpf_aggregate_partial_report for testing. The raw input conversion data is in the CSV format of conversion_key,value, where conversion_key and value are integers. pipeline/dpf_test_conversion_data.csv is an example input conversion file.
- dpf_merge_partial_aggregation shows an example of how the report origins can obtain the complete aggregation result from the DPF partial results.
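The merge step described for dpf_merge_partial_aggregation can be pictured with a minimal sketch. It assumes that each helper holds an additive share of every bucket total over uint64 with wraparound, and that a bucket missing from one helper contributes zero. The function name and the map representation are hypothetical and are not taken from the pipeline code.

```go
package main

import (
	"fmt"
	"sort"
)

// mergePartialResults adds two helpers' partial sums bucket by bucket.
// Assumption: the shares are additive over uint64, so the sum wraps
// around on overflow and yields the true total.
func mergePartialResults(a, b map[uint64]uint64) map[uint64]uint64 {
	merged := make(map[uint64]uint64, len(a))
	for id, v := range a {
		merged[id] = v
	}
	for id, v := range b {
		merged[id] += v // uint64 addition wraps around on overflow
	}
	return merged
}

func main() {
	// Hypothetical shares: the true totals are 7 for bucket 1 and 3 for bucket 2.
	helper1 := map[uint64]uint64{1: 18446744073709551615, 2: 10}
	helper2 := map[uint64]uint64{1: 8, 2: 18446744073709551609}

	merged := mergePartialResults(helper1, helper2)

	// Print in a stable order, since Go map iteration order is random.
	ids := make([]uint64, 0, len(merged))
	for id := range merged {
		ids = append(ids, id)
	}
	sort.Slice(ids, func(i, j int) bool { return ids[i] < ids[j] })
	for _, id := range ids {
		fmt.Printf("bucket %d: %d\n", id, merged[id])
	}
}
```

Neither share reveals the total on its own; only the sum of both does, which is why the report origin needs the partial results from both helpers.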
There is also a binary dpf_generate_raw_conversion that shows an example of how we can generate conversions to test the hierarchical DPF key expansion.
- collector_server receives the encrypted partial reports sent by the browsers, and batches them according to the specified helper servers.
- aggregator_server hosts two services: a. providing the shared helper information, including the location where the other helper can find the intermediate results for inter-helper communication; and b. processing the aggregation requests passed by PubSub messages.
- browser_simulator simulates how the browser creates the partial reports and sends them to the collector_server endpoints.
With the aggregator_server set up, users can query the aggregation results by sending a request with the binary service/aggregation_query_tool. There are two aggregation modes, each with its own type of configuration passed to the query tool.
In the hierarchical mode, the aggregation is finished in multiple rounds, one per hierarchy level. At each level, the partial reports are aggregated to prefixes of a given length of the original bucket IDs. After each round, the two helpers exchange and merge the noised hierarchical results so they can determine which prefixes to expand further at the next level. For each level, users need to specify the prefix length and the threshold used to filter out prefixes with small values. Example of the configuration (HierarchicalConfig):
{
    prefix_lengths: [5, 10, 20, 25],
    expansion_threshold_per_prefix: [10, 5, 5, 5],
    privacy_budget_per_prefix: [.2, .1, .3, .4]
}
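The per-level logic can be illustrated on plaintext values with the sketch below. It is not the DPF protocol itself: the real helpers work on key shares and add noise, both of which are omitted here, so privacy_budget_per_prefix is carried in the struct but not used. The Go struct, the validation rules (equal list lengths, strictly increasing prefix lengths), the 25-bit bucket ID width, and the "keep if sum >= threshold" rule are all assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// hierarchicalConfig mirrors the fields of the example configuration.
// The Go names and types are assumptions.
type hierarchicalConfig struct {
	PrefixLengths               []int
	ExpansionThresholdPerPrefix []uint64
	PrivacyBudgetPerPrefix      []float64
}

// validate checks consistency rules that are assumed, not confirmed by the source.
func (c hierarchicalConfig) validate() error {
	n := len(c.PrefixLengths)
	if n == 0 {
		return errors.New("at least one hierarchy level is required")
	}
	if len(c.ExpansionThresholdPerPrefix) != n || len(c.PrivacyBudgetPerPrefix) != n {
		return errors.New("all lists must have one entry per hierarchy level")
	}
	for i := 1; i < n; i++ {
		if c.PrefixLengths[i] <= c.PrefixLengths[i-1] {
			return errors.New("prefix lengths must be strictly increasing")
		}
	}
	return nil
}

// aggregateToPrefixes sums values by the leading prefixLen bits of each
// keyBits-wide bucket ID and drops prefixes whose sum is below threshold.
// It requires prefixLen <= keyBits.
func aggregateToPrefixes(values map[uint64]uint64, keyBits, prefixLen int, threshold uint64) map[uint64]uint64 {
	sums := make(map[uint64]uint64)
	for id, v := range values {
		sums[id>>uint(keyBits-prefixLen)] += v
	}
	for prefix, sum := range sums {
		if sum < threshold {
			delete(sums, prefix)
		}
	}
	return sums
}

func main() {
	cfg := hierarchicalConfig{
		PrefixLengths:               []int{5, 10, 20, 25},
		ExpansionThresholdPerPrefix: []uint64{10, 5, 5, 5},
		PrivacyBudgetPerPrefix:      []float64{.2, .1, .3, .4},
	}
	if err := cfg.validate(); err != nil {
		panic(err)
	}
	const keyBits = 25 // assumed width of the original bucket IDs
	// Made-up plaintext totals per bucket ID.
	values := map[uint64]uint64{1 << 24: 8, 1<<24 + 1: 4, 3: 2}
	for i, length := range cfg.PrefixLengths {
		kept := aggregateToPrefixes(values, keyBits, length, cfg.ExpansionThresholdPerPrefix[i])
		// Only bucket IDs under a surviving prefix are expanded at the next level.
		for id := range values {
			if _, ok := kept[id>>uint(keyBits-length)]; !ok {
				delete(values, id)
			}
		}
		fmt.Printf("prefix length %d: %d prefixes kept\n", length, len(kept))
	}
}
```

Pruning after every level keeps the number of expanded prefixes small, which is the point of splitting the aggregation into rounds.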
In the direct mode, the aggregation is finished in one round. Users need to specify the bucket IDs they want in the results returned by the helpers. IDs that are not included in the configuration are ignored, while every ID in the configuration gets a noised result. Example of the configuration (DirectConfig):
{
    bucket_ids: [5, 10, 20, 25],
}
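The selection rule for the listed bucket IDs can be sketched on plaintext totals as follows. Noise is omitted, and the function name selectBuckets and the map representation are hypothetical. The sketch only shows that unlisted IDs are dropped and that every listed ID appears in the output, with a zero total when no report contributed to it.

```go
package main

import "fmt"

// selectBuckets returns exactly one entry per requested bucket ID.
// Assumption: a requested ID with no contributions has a total of 0
// before noise is added; IDs that were not requested are dropped.
func selectBuckets(aggregated map[uint64]uint64, bucketIDs []uint64) map[uint64]uint64 {
	result := make(map[uint64]uint64, len(bucketIDs))
	for _, id := range bucketIDs {
		result[id] = aggregated[id] // a missing key reads as 0
	}
	return result
}

func main() {
	// Made-up plaintext totals; bucket 7 is not in the configuration.
	aggregated := map[uint64]uint64{5: 3, 7: 9, 20: 1}
	bucketIDs := []uint64{5, 10, 20, 25}

	result := selectBuckets(aggregated, bucketIDs)
	for _, id := range bucketIDs {
		fmt.Printf("bucket %d: %d\n", id, result[id])
	}
}
```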
Contributions to this repository are always welcome and highly encouraged.
See CONTRIBUTING for more information on how to get started.
Apache 2.0 - See LICENSE for more information.
This is not an officially supported Google product.