PrivBayes-Implementation is a comprehensive repository that extensively covers the implementation of PrivBayes, a cutting-edge methodology designed to facilitate the release of private datasets while ensuring individual data privacy. The core concept involves leveraging Bayesian Networks to generate synthetic data that mimics the statistical properties of the original dataset while protecting individual privacy through local differentiation techniques.
The main reference paper detailing the PrivBayes algorithm is titled "PrivBayes: Private Data Release via Bayesian Networks", authored by Jun Zhang, Graham Cormode, Cecilia M. Procopiuc, Divesh Srivastava, and Xiaokui Xiao. Published in the ACM Transactions on Database Systems (TODS) in 2017 (Volume 42, Issue 4, Article 25, 41 pages), the paper introduces the PrivBayes algorithm as a robust framework for releasing private data while preserving the confidentiality of individuals' sensitive information. The DOI for the paper is 10.1145/3134428.
PrivBayes represents a groundbreaking solution aimed at securely releasing private datasets. The algorithm operates by generating synthetic data using Bayesian Networks. This synthesized data retains similar statistical properties to the original dataset while ensuring individual privacy through various privacy-preserving mechanisms. The implementation focuses on locally differentiating privacy levels across records while maintaining the utility of the released dataset.
The repository's file structure is meticulously organized:
-
code: Contains the central implementation of the PrivBayes algorithm.
- util: Houses essential utility functions utilized within the primary functions.
-
data: Manages dataset and output file storage.
- preprocessed-outputs: Stores preprocessed data in categorical format.
- synthetic-outputs: Contains categorized synthetic datasets pre-postprocessing.
- postprocessed-outputs: Holds data post-de-categorization, restoring synthetic datasets to their original format.
- graphs: Stores graphical representations of datasets (original and synthetic), illustrating pairwise attribute occurrences.
- outputs: Contains output images from various test runs, providing an overview of expected results.
- research-paper: Houses the referenced research paper.
The repository offers a variety of datasets. Notably, experiments primarily focus on 'adult.csv' and 'adult_tiny.csv'. Users are encouraged to explore and test other datasets available in the /data/ folder to diversify experimental scenarios.
All necessary requirements are listed in the 'requirements.txt' file. These requirements are automatically installed during the execution of 'setup.py'.
Before executing the main functions, it's crucial to ensure the installation of ektelo and autograd.
Execute 'privbayes.py' to initiate PrivBayes. Input the desired dataset name for processing (from /data/ folder). Upon completion, a graphical representation comparing the two-way occurrences of attributes in the original and synthetic datasets will be stored in /data/graphs/.
The original rights to the PrivBayes paper belong to its authors. The code in this repository references various sources and undergoes numerous tweaks and reconfigurations.
- Arun Ashok Badri : [email protected]
- Sandhya V : [email protected]
- Pavan Kumar J : [email protected]
The referenced paper for PrivBayes: Zhang, J., Cormode, G., Procopiuc, C. M., Srivastava, D., & Xiao, X. (2017). PrivBayes: Private Data Release via Bayesian Networks. ACM Transactions on Database Systems (TODS), 42(4), Article 25, 41 pages. https://doi.org/10.1145/3134428
Feel free to expand or modify any section further to suit your requirements.