This repo provides a unified interface for accessing and evaluating the same aggregation functionalities across different open-source differential privacy (DP) libraries. With a simple CLI, one can choose the library, the aggregation function, and many other experimental parameters, then apply the specified DP measurement to data stored in a .csv file. The repo also provides both synthetic and real-world example datasets for evaluation purposes. Evaluation results are stored in a .json file, and metrics are provided for repeated experiments. The repo also provides a CLI tool to generate configuration groups for larger-scale comparison experiments.
Get hands-on in 1 minute with our tutorial notebook.
Currently supported aggregation operations:
- COUNT
- SUM
- MEAN
- VAR
- MEDIAN
- QUANTILE
Currently supported libraries:
- diffprivlib 0.5.2 [Homepage] [Example Usage]
- python-dp 1.1.1 [Homepage] [Example Usage]
- opendp 0.6.1 [Homepage] [Example Usage]
- tmlt.analytics 0.4.1 [Homepage] [Example Usage]
- chorus 0.1.3 [Homepage] [Example Usage]
To install dplab, use the package on PyPI:
pip install dplab
Or install from source: clone the repo, switch to its directory, and install the dependencies:
git clone git@github.com:camelop/dp_lab.git
cd dp_lab
pip install -e .
To use tmlt, set up its dependencies first:
export PYSPARK_PYTHON=/usr/bin/python3
sudo apt install openjdk-8-jre-headless
pip3 install -i https://d3p0voevd56kj6.cloudfront.net python-flint
pip3 install tmlt.analytics
To use chorus, please make sure you have Java runtime installed. (If you have already installed tmlt, it should be fine.)
Run a specific library with the CLI
dplab_run <library> <operation> <input_file> <output_file> <other options>
For example:
dplab_run pydp sum data/1.csv data/1.json -f -r 1000
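When many runs need to be scripted, the same invocation can be assembled from Python. The sketch below is a minimal illustration and not part of dplab: the helper names `build_dplab_run_command` and `run_dplab` are hypothetical, and it relies only on the positional arguments and the `-f` and `-r` flags that appear in the example command, assuming they stand for the force and repeat options.

```python
import shutil
import subprocess
from typing import List


def build_dplab_run_command(library: str, operation: str, input_file: str,
                            output_file: str, repeat: int = 1,
                            force: bool = False) -> List[str]:
    """Assemble the argument list for one dplab_run invocation (hypothetical helper)."""
    if repeat < 1:
        raise ValueError("repeat must be at least 1")
    cmd = ["dplab_run", library, operation, input_file, output_file]
    if force:
        # Assumption: -f is the short form of the force option.
        cmd.append("-f")
    # Assumption: -r is the short form of the repeat option.
    cmd += ["-r", str(repeat)]
    return cmd


def run_dplab(cmd: List[str]) -> int:
    """Run the command and return its exit code; fail early if dplab is not installed."""
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"{cmd[0]} not found on PATH; is dplab installed?")
    return subprocess.run(cmd, check=False).returncode


if __name__ == "__main__":
    cmd = build_dplab_run_command("pydp", "sum", "data/1.csv", "data/1.json",
                                  repeat=1000, force=True)
    # Print the command instead of running it, so the sketch works without dplab.
    print(" ".join(cmd))
```

Passing the arguments as a list avoids shell quoting issues when file paths contain spaces.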
Other options include:
- mode: Evaluation mode. Choose from "plain" (no timing/memory measurement), "internal" (internal measurement), or "external" (external tracking).
- epsilon: DP parameter; defaults to 1.
- quant: Quantile value for the QUANTILE operation, a float between 0 and 1.
- lb: Optional estimated lower bound of the values, used by certain differential privacy aggregations.
- ub: Optional estimated upper bound of the values, used by certain differential privacy aggregations.
- repeat: How many times the evaluation should be repeated.
- force: Force overwriting the output file.
- debug: Include debugging information in the output file.
- python_command: Python command used to run the script in external mode.
- external_sample_interval: Sampling interval for timing/memory consumption in external mode.
For more information, please check the main entry file.
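To give a feel for how epsilon, lb, and ub interact, here is a generic textbook sketch of a differentially private SUM using the Laplace mechanism. It is not the implementation used by dplab or by any of the supported libraries. The function names are hypothetical, and the sensitivity formula assumes neighboring datasets differ by adding or removing one record. Naive floating-point Laplace sampling like this is not suitable for production use.

```python
import random
from typing import Iterable


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def dp_sum(values: Iterable[float], epsilon: float, lb: float, ub: float,
           rng: random.Random) -> float:
    """Clip values to [lb, ub], sum them, and add Laplace noise (illustrative only)."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if lb > ub:
        raise ValueError("lb must not exceed ub")
    # Clipping bounds the contribution of any single record.
    clipped = [min(max(v, lb), ub) for v in values]
    # Adding or removing one record changes the sum by at most this amount.
    sensitivity = max(abs(lb), abs(ub))
    if sensitivity == 0:
        return float(sum(clipped))
    # Smaller epsilon or wider bounds both mean more noise.
    return sum(clipped) + laplace_noise(sensitivity / epsilon, rng)


if __name__ == "__main__":
    rng = random.Random(0)
    data = [1.0, 2.0, 3.0, 100.0]  # 100.0 is clipped to ub
    print(dp_sum(data, epsilon=1.0, lb=0.0, ub=10.0, rng=rng))
```

The sketch shows why the bounds matter: a loose ub inflates the noise scale, while a tight one clips real values and biases the result.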
# Make sure you are in the root directory of the repo
# Data will be generated in the ./data/ directory
# The procedure will generate about 28GB of data
# To avoid the risk of running out of disk space, you can comment out the performance test lines (Line26-27) in SYN_TARGETS defined in the script
python3 scripts/gen_data.py
Generate the experiment commands. This creates an ./exp.db.json file under the working directory (you can also use --location to specify a different place).
dplab_exp plan --repeat 100 --group_num 100
Queue the experiments for execution
dplab_exp launch --debug
The command writes the results to exp.db.json.
The results can be viewed with:
python3 scripts/view_exp_db.py
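If the viewer script is unavailable, the file can also be inspected by hand. The sketch below is a minimal, schema-agnostic illustration: it assumes only that exp.db.json is valid JSON and makes no claim about which keys dplab writes. The `summarize` helper is hypothetical.

```python
import json
from pathlib import Path
from typing import Any, List


def summarize(node: Any, depth: int = 0, max_depth: int = 2) -> List[str]:
    """Return indented lines describing the shape of a parsed JSON value."""
    indent = "  " * depth
    if isinstance(node, dict):
        lines = [f"{indent}dict with {len(node)} keys"]
        if depth < max_depth:
            # Show at most the first five keys to keep the output short.
            for key, value in list(node.items())[:5]:
                lines.append(f"{indent}  {key!r}:")
                lines.extend(summarize(value, depth + 2, max_depth))
        return lines
    if isinstance(node, list):
        lines = [f"{indent}list with {len(node)} items"]
        if node and depth < max_depth:
            # Describe only the first item as a representative.
            lines.extend(summarize(node[0], depth + 1, max_depth))
        return lines
    return [f"{indent}{type(node).__name__}"]


if __name__ == "__main__":
    path = Path("exp.db.json")
    if path.exists():
        with path.open() as f:
            print("\n".join(summarize(json.load(f))))
    else:
        print(f"{path} not found; generate it with dplab_exp first")
```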