COPULA-SHIRLEY

Implementation of the COPULA-SHIRLEY framework for differentially-private synthetic data generation. This implementation is used in the following paper:

Gambs, S., Ladouceur, F., Laurent, A. and Roy-Gaumond, A., 2021. Growing synthetic data through differentially-private vine copulas. Proceedings on Privacy Enhancing Technologies, 3, pp.122-141. https://www.petsymposium.org/2021/files/papers/issue3/popets-2021-0040.pdf

Requirements

R

r-base (4.0.3)
rvinecopulib (0.5.5.1.1)

Python

numpy (1.19.5)
scipy (1.6.0)
pandas (1.2.1)
scikit-learn (0.24.1)
category_encoders (2.2.2)
diffprivlib (0.4.0)
xgboost (1.3.0)
rpy2 (3.4.2)

Example of setting up the conda environment

With conda installed, create a new environment (named copulashirley here): conda create -n copulashirley
Activate the environment: conda activate copulashirley
You might have to add conda-forge as a channel for the following commands: conda config --add channels conda-forge
In the copulashirley environment, first install r-base with the following command: conda install r-base=4.0.3
Start the R CLI inside the conda environment by running: R
In the R CLI, install the rvinecopulib package and its dependencies with the command: install.packages('rvinecopulib')
Quit the R CLI and, in the copulashirley environment, install the Python packages with the following command: conda install numpy=1.19.5 scipy=1.6.0 pandas=1.2.1 scikit-learn=0.24.1 category_encoders=2.2.2 diffprivlib=0.4.0 xgboost=1.3.0 rpy2=3.4.2
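Once the environment is set up, it can be useful to confirm that the Python pins were honored. The following is a minimal sketch and not part of the repository: `PINNED` and `check_versions` are hypothetical names, the pins are copied from the Requirements section, and the distribution names are assumed to match the names listed there (a package that cannot be found under that name is simply reported as missing).

```python
from importlib import metadata

# Pins copied from the Requirements section of this README.
PINNED = {
    "numpy": "1.19.5",
    "scipy": "1.6.0",
    "pandas": "1.2.1",
    "scikit-learn": "0.24.1",
    "category_encoders": "2.2.2",
    "diffprivlib": "0.4.0",
    "xgboost": "1.3.0",
    "rpy2": "3.4.2",
}


def check_versions(pins):
    """Return {package: (pinned, installed_or_None, matches)}."""
    report = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # not installed, or installed under another name
        report[name] = (wanted, installed, installed == wanted)
    return report


if __name__ == "__main__":
    for name, (wanted, installed, ok) in check_versions(PINNED).items():
        status = "ok" if ok else "MISMATCH"
        print(f"{name}: pinned {wanted}, installed {installed} [{status}]")
```

Run it inside the activated copulashirley environment. A mismatch does not necessarily mean the code will fail, only that the environment differs from the one listed above.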

How-to

To output synthetic data: run main.py
To output test scores using k-fold cross-validation: run crossval.py
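Both scripts are configured through command-line flags (listed in the next sections). As a hedged illustration, the sketch below assembles such a command line from Python. `build_command` is a hypothetical helper, not something the repository provides; the flag names and values it is called with are taken from the parameter lists in this README.

```python
import shlex
import sys


def build_command(script, **options):
    """Turn keyword options into a command line for main.py or crossval.py.

    Underscores in keyword names become hyphens, so dp_epsilon -> --dp-epsilon.
    """
    cmd = [sys.executable, script]
    for key, value in options.items():
        cmd += ["--" + key.replace("_", "-"), str(value)]
    return cmd


cmd = build_command(
    "main.py",
    dataset="adult",
    model="cop-shirl",
    dp_epsilon=1.0,
    output_dir="./out",
)
print(shlex.join(cmd))
# To actually launch it from the repository root:
#   import subprocess; subprocess.run(cmd, check=True)
```

Printing the command before running it makes it easy to copy into a shell or a log of experiments.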

Parameters for main.py

--dataset, default='adult' #Input dataset ('adult', 'compas' or 'texas_hospital')
--model, default='cop-shirl', choices=['cop-shirl', 'privbayes', 'dpcopula', 'dp-histogram'] #Generative model to use
--seed, type=int, default=76543
--n-sample, type=int, default=None #Number of synthetic samples to generate
--categorical-encoder, default=None, choices=('ORD', 'WOE', 'GLMM', 'OHE') #The encoder for categorical attributes
--cat-encoder-target, type=str, default=None #The target attribute for supervised categorical encoders ('WOE' and 'GLMM')
--dp-epsilon, type=float, default=1.0 #The global budget for differential-privacy
--dp-mechanism, default='Laplace', choices=['Laplace', 'Gaussian', 'Geometric'] #The mechanism used for dp-histograms computation in copula-shirley
--dp-global-sens, type=int, default=2 #The global sensitivity for the dp mechanism
--dp-gaussian-delta, type=float, default=0.001 #The delta for the Gaussian mechanism
--vine-sample-ratio, type=float, default=0.5 #The ratio for model vs. dp-histogram training (0.7 means 70% of data will be used as pseudo-observations for the vine-copula model and 30% will be used for dp-histograms)
--vine-family-set, type=str, default='all' #See rvinecopulib reference
--vine-par-method, type=str, default='mle' #See rvinecopulib reference
--vine-nonpar-method, type=str, default='constant' #See rvinecopulib reference
--vine-selcrit, type=str, default='aic' #See rvinecopulib reference
--vine-trunc-lvl, type=int, default=None #See rvinecopulib reference
--vine-tree-crit, type=str, default='tau' #See rvinecopulib reference
--privbayes-degree-max, type=int, default=3 #The maximum number of children for PrivBayes network  
--output-dir, type=str, default='./out' #Output directory
--n-cores, type=int, default=None #Number of cores to use (if None, inferred)
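To make the roles of --vine-sample-ratio, --dp-epsilon and --dp-global-sens more concrete, here is a minimal standard-library sketch. It is not the repository's implementation (which lists diffprivlib as a requirement and may rely on it); `split_rows` and `laplace_histogram` are hypothetical names. The split follows the description of --vine-sample-ratio above, and the noise is the textbook Laplace mechanism with scale sensitivity / epsilon. How the global budget is divided between the histograms is not modeled here: the sketch spends one epsilon on a single histogram.

```python
import random


def split_rows(rows, vine_sample_ratio, seed):
    """Shuffle rows, then split them into (vine part, dp-histogram part)."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * vine_sample_ratio)
    return shuffled[:cut], shuffled[cut:]


def laplace_histogram(values, bins, epsilon, sensitivity, seed):
    """Count values per bin and add Laplace(0, sensitivity / epsilon) noise.

    Assumes every value belongs to one of the given bins.
    """
    rng = random.Random(seed)
    scale = sensitivity / epsilon
    counts = {b: 0 for b in bins}
    for v in values:
        counts[v] += 1
    noisy = {}
    for b, c in counts.items():
        # The difference of two i.i.d. exponentials with mean `scale`
        # is Laplace-distributed with location 0 and that scale.
        noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
        # Clamping at zero is post-processing and keeps the DP guarantee.
        noisy[b] = max(0.0, c + noise)
    return noisy


rows = list(range(10))
vine_part, hist_part = split_rows(rows, vine_sample_ratio=0.5, seed=76543)
noisy = laplace_histogram(
    ["a", "a", "b"], bins=["a", "b", "c"], epsilon=1.0, sensitivity=2, seed=76543
)
print(len(vine_part), len(hist_part), noisy)
```

A smaller epsilon or a larger sensitivity increases the noise scale, so the noisy counts drift further from the true ones.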

Parameters for crossval.py

--n-folds, type=int, default=5 #The number of folds for k-fold cross-validation
--dataset, default='adult' #Input dataset ('adult', 'compas' or 'texas_hospital')
--model, default='cop-shirl', choices=['cop-shirl', 'privbayes', 'dpcopula', 'dp-histogram'] #Generative model to use
--seed, type=int, default=76543
--n-sample, type=int, default=None #Number of synthetic samples to generate
--categorical-encoder, default=None, choices=('ORD', 'WOE', 'GLMM', 'OHE') #The encoder for categorical attributes
--cat-encoder-target, type=str, default=None #The target attribute for supervised categorical encoders ('WOE' and 'GLMM')
--dp-epsilon, type=float, default=1.0 #The global budget for differential-privacy
--dp-mechanism, default='Laplace', choices=['Laplace', 'Gaussian', 'Geometric'] #The mechanism used for dp-histograms computation in copula-shirley
--dp-global-sens, type=int, default=2 #The global sensitivity for the dp mechanism
--dp-gaussian-delta, type=float, default=0.001 #The delta for the Gaussian mechanism
--vine-sample-ratio, type=float, default=0.5 #The ratio for model vs. dp-histogram training (0.7 means 70% of data will be used as pseudo-observations for the vine-copula model and 30% will be used for dp-histograms)
--vine-family-set, type=str, default='all' #See rvinecopulib reference
--vine-par-method, type=str, default='mle' #See rvinecopulib reference
--vine-nonpar-method, type=str, default='constant' #See rvinecopulib reference
--vine-selcrit, type=str, default='aic' #See rvinecopulib reference
--vine-trunc-lvl, type=int, default=None #See rvinecopulib reference
--vine-tree-crit, type=str, default='tau' #See rvinecopulib reference
--privbayes-degree-max, type=int, default=3 #The maximum number of children for PrivBayes network  
--MIA-n, type=int, default=500 #The number of synthetic profiles used per iteration of the Membership Inference Attack
--MIA-metric, type=str, default='hamming' #The distance used for the MIA
--MIA-n-iter, type=int, default=50 #The number of iterations of the MIA
--output-dir, type=str, default='./out' #Output directory
--n-cores, type=int, default=None #Number of cores to use (if None, inferred)
--verbose, type=bool, default=True
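For readers unfamiliar with k-fold cross-validation, the sketch below shows how --n-folds and --seed could determine the train/test partitions. It is an illustration only: `k_fold_indices` is a hypothetical name, and crossval.py may split the data differently (for example with scikit-learn, which is listed in the requirements).

```python
import random


def k_fold_indices(n_rows, n_folds, seed):
    """Yield (train_idx, test_idx) pairs; each row is held out exactly once."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin into n_folds nearly equal folds.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for held_out in range(n_folds):
        test_idx = folds[held_out]
        train_idx = [
            i for f, fold in enumerate(folds) if f != held_out for i in fold
        ]
        yield train_idx, test_idx


splits = list(k_fold_indices(n_rows=23, n_folds=5, seed=76543))
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))
```

In a setup like this, the generative model would be fitted on each training part and the scores computed against the corresponding held-out part, then averaged over the folds.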

Code Author

Alexandre Roy-Gaumond

