Download Datasets 🎨

This benchmark also integrates LEAF and supports FEMNIST and CelebA. It does not partition these datasets any further.

Most of the datasets supported by this benchmark are integrated into torchvision.datasets, except Tiny-ImageNet-200, Covid-19, Organ-S/A/CMNIST, and DomainNet.

For those datasets, I prepare download scripts (in folder data/download) for you. 🤗


cd data/download

Generic Arguments 🔧

📢 All arguments have default values.

| Arguments for general datasets | Description |
| --- | --- |
| --dataset, -d | The name of the dataset. |
| --iid | Non-zero value for randomly partitioning data and disabling all other Non-IID partition methods. |
| --client_num, -cn | The number of clients. |
| --split, -sp | One of [sample, user]. user: partition clients into train-test groups; sample: partition each client's data samples into train-test groups. |
| --val_ratio, -vr | Proportion of validation-set data/clients. |
| --test_ratio, -tr | Proportion of test-set data/clients. |
| --plot_distribution, -pd | Non-zero value for saving an image of the data distribution. |
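
The difference between the two --split modes can be sketched as follows. This is a minimal illustration, not the benchmark's implementation: the function names, the shuffling, and the rounding of the ratios are assumptions made for the example.

```python
import random


def split_indices(items, val_ratio, test_ratio, seed=0):
    """Shuffle `items` and cut them into (train, val, test) lists by ratio."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_val = int(len(items) * val_ratio)
    n_test = int(len(items) * test_ratio)
    n_train = len(items) - n_val - n_test
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]


def split_by_user(client_ids, val_ratio, test_ratio, seed=0):
    """`user` mode: whole clients are assigned to the train, val, or test group."""
    return split_indices(client_ids, val_ratio, test_ratio, seed)


def split_by_sample(client_data, val_ratio, test_ratio, seed=0):
    """`sample` mode: every client keeps its own train/val/test samples."""
    return {
        cid: split_indices(indices, val_ratio, test_ratio, seed + cid)
        for cid, indices in client_data.items()
    }


# Hypothetical toy input: 10 clients holding 20 sample indices each.
client_data = {cid: list(range(cid * 20, (cid + 1) * 20)) for cid in range(10)}

train_clients, val_clients, test_clients = split_by_user(client_data, 0.1, 0.2)
per_client = split_by_sample(client_data, 0.1, 0.2)
```

With user, the ratios apply to the list of clients; with sample, they apply to each client's own samples.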

⭐ For CIFAR-100 specifically, this benchmark supports partitioning by superclass (CIFAR-100's 100 classes are also grouped into 20 superclasses) by setting --super_class to a non-zero value.

Partition Schemes 🌌


IID

Partition data evenly, so that client data distributions are similar to each other. Note that this setting has the highest priority, meaning that activating this scheme disables all others.

✨ The IID partition can also be applied to only part of the dataset and combined with other Non-IID schemes.

  • --iid: Must be set to a value in [0, 1].
python -d cifar10 --iid 1 -cn 20


# 50% of the data is partitioned IID, and the remaining 50% is partitioned according to the Dirichlet partition scheme: Dir(0.1)
python -d cifar10 --iid 0.5 --alpha 0.1 -cn 20
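
A minimal sketch of how a fractional --iid value could combine an IID share with another scheme. The names mix_iid and by_label_blocks are hypothetical, and by_label_blocks is only a stand-in for a real Non-IID scheme such as the Dirichlet partition; the benchmark's own logic may differ.

```python
import random


def mix_iid(labels, client_num, iid_ratio, non_iid_partition, seed=0):
    """Assign `iid_ratio` of the samples uniformly at random and hand the
    remaining samples to `non_iid_partition` (any label-aware scheme)."""
    rng = random.Random(seed)
    indices = list(range(len(labels)))
    rng.shuffle(indices)
    n_iid = int(len(indices) * iid_ratio)
    iid_part, rest = indices[:n_iid], indices[n_iid:]

    # IID share: deal the shuffled indices round-robin across clients.
    clients = [iid_part[c::client_num] for c in range(client_num)]

    # Non-IID share: the callback returns one index list per client.
    if rest:
        for c, extra in enumerate(non_iid_partition(rest, labels, client_num, rng)):
            clients[c].extend(extra)
    return clients


def by_label_blocks(indices, labels, client_num, rng):
    """Hypothetical stand-in for a Non-IID scheme: sort by label, cut into blocks."""
    ordered = sorted(indices, key=lambda i: labels[i])
    size = -(-len(ordered) // client_num)  # ceiling division
    return [ordered[c * size:(c + 1) * size] for c in range(client_num)]


labels = [i % 10 for i in range(1000)]  # toy labels, 10 classes
clients = mix_iid(labels, client_num=20, iid_ratio=0.5,
                  non_iid_partition=by_label_blocks)
```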



Dirichlet

Refers to Measuring the Effects of Non-Identical Data Distribution for Federated Visual Classification (FedAvgM). The dataset is split according to $Dir(\alpha)$. A smaller $\alpha$ means stronger label heterogeneity.

  • --alpha, -a: The parameter controlling the intensity of label heterogeneity.
  • --min_samples_per_client, -ms: The minimum number of samples assigned to each client. A small --min_samples_per_client along with a small --alpha or a big --client_num might considerably prolong the partition.
python -d cifar10 -a 0.1 -cn 20
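
The following is a minimal, standard-library sketch of a label-wise Dirichlet partition, assuming each class is split across clients by its own $Dir(\alpha)$ draw and that the whole partition is redrawn until every client holds at least min_samples samples. The function name, the retry loop, and max_tries are assumptions for illustration; the benchmark's implementation may differ.

```python
import random


def dirichlet_partition(labels, client_num, alpha, min_samples=10,
                        seed=0, max_tries=1000):
    """Return one list of sample indices per client."""
    rng = random.Random(seed)
    classes = sorted(set(labels))
    for _ in range(max_tries):
        clients = [[] for _ in range(client_num)]
        for cls in classes:
            idx = [i for i, y in enumerate(labels) if y == cls]
            rng.shuffle(idx)
            # A Dirichlet(alpha) sample is a set of normalized Gamma(alpha, 1) draws.
            gammas = [rng.gammavariate(alpha, 1.0) for _ in range(client_num)]
            total = sum(gammas) or 1.0  # guard against an all-zero draw
            start, cum = 0, 0.0
            for c in range(client_num):
                cum += gammas[c] / total
                # The last client takes whatever remains, so no sample is lost.
                end = len(idx) if c == client_num - 1 else int(round(cum * len(idx)))
                clients[c].extend(idx[start:end])
                start = end
        # Reject the draw if any client ends up with too few samples.
        if min(len(c) for c in clients) >= min_samples:
            return clients
    raise RuntimeError("no draw satisfied min_samples; relax the constraints")


labels = [i % 10 for i in range(2000)]  # toy labels, 10 classes
clients = dirichlet_partition(labels, client_num=10, alpha=0.5, min_samples=10)
```

In this sketch, every rejected draw repeats the whole partition, which is one way the minimum-sample constraint can interact with --alpha and --client_num to lengthen the run.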



Shards

Refers to Communication-Efficient Learning of Deep Networks from Decentralized Data (FedAvg). The whole dataset is split evenly into many equal-size shards.

  • --shards, -s: Number of data shards that each client holds. The same partition method as in FedAvg.
python -d cifar10 -s 2 -cn 20
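
A minimal sketch of shard-based partitioning in the style of FedAvg: sort the samples by label, cut them into equal-size shards, and give each client a fixed number of shards. The function name and the handling of leftover samples are assumptions for illustration.

```python
import random


def shards_partition(labels, client_num, shards_per_client, seed=0):
    """Return one list of sample indices per client."""
    rng = random.Random(seed)
    # Sorting by label makes each shard cover very few classes.
    ordered = sorted(range(len(labels)), key=lambda i: labels[i])
    shard_count = client_num * shards_per_client
    # Samples beyond shard_count * shard_size are dropped in this sketch.
    shard_size = len(ordered) // shard_count
    shards = [ordered[s * shard_size:(s + 1) * shard_size] for s in range(shard_count)]
    rng.shuffle(shards)
    return [
        [i for shard in shards[c * shards_per_client:(c + 1) * shards_per_client]
         for i in shard]
        for c in range(client_num)
    ]


labels = [i % 10 for i in range(1000)]  # toy labels, 10 classes
clients = shards_partition(labels, client_num=20, shards_per_client=2)
```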


Randomly Assigning Classes

Each client is allocated data belonging to -c classes. The classes for each client are chosen randomly.

  • --classes, -c: Number of classes that each client's data belong to.
python -d cifar10 -c 2 -cn 20
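
A minimal sketch of random class assignment, assuming each client draws its classes without replacement and that the samples of a class are shared evenly among the clients that drew it. Both the function name and the sharing rule are assumptions, not the benchmark's exact behavior.

```python
import random


def random_classes_partition(labels, client_num, classes_per_client, seed=0):
    """Return (clients, chosen): index lists and the classes drawn per client."""
    rng = random.Random(seed)
    all_classes = sorted(set(labels))
    # Each client draws its own class subset without replacement.
    chosen = [rng.sample(all_classes, classes_per_client) for _ in range(client_num)]
    clients = [[] for _ in range(client_num)]
    for cls in all_classes:
        owners = [c for c in range(client_num) if cls in chosen[c]]
        if not owners:
            continue  # a class nobody drew stays unused in this sketch
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal the class samples round-robin to the clients that own the class.
        for pos, i in enumerate(idx):
            clients[owners[pos % len(owners)]].append(i)
    return clients, chosen


labels = [i % 10 for i in range(1000)]  # toy labels, 10 classes
clients, chosen = random_classes_partition(labels, client_num=20, classes_per_client=2)
```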


Semantic Partition

Refers to What Do We Mean by Generalization in Federated Learning?. Each client's data corresponds to a Gaussian distribution generated by a Gaussian mixture model. The whole procedure is described in Appendix D of the paper.

  • --semantic, -sm: Non-zero value for performing semantic partition.
  • --efficient_net_type: The type of EfficientNet for computing the embeddings of data.
  • --pca_components: The number of dimensions kept by the PCA decomposition.
  • --gmm_max_iter: The maximum number of fitting iterations for the Gaussian mixture model.
  • --gmm_init_params: The initialization method for the Gaussian mixture model (kmeans / random).
  • --use_cuda: Non-zero value for using CUDA to accelerate the computation.
python -d cifar10 -sm 1 -cn 20


Flower Partitioner 🌼

This benchmark also supports external partitioners provided by flwr_datasets, enabling comparison between the built-in partitioning schemes and the additional schemes available in flwr_datasets. To use flwr partitioners, you need to specify the class path of the partitioner you want to use and all of its parameters in a separate dictionary.


To use flwr's partitioners, a mock dataset with a column called label is created internally. If the partitioning scheme depends on label information, specify label as the label column.

This is how you would use the DirichletPartitioner from flwr:

python -d cifar10 -cn 10 \
-fpc DirichletPartitioner \
-fpk '{"alpha": 100.0, "partition_by": "label"}'
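
As a rough sketch of what the two options carry, the snippet below parses the -fpk string into keyword arguments and checks it against a mock dataset with a single label column. How the benchmark forwards these values internally is an assumption here; mock_dataset and the check are purely illustrative.

```python
import json

fpc = "DirichletPartitioner"                       # value passed to -fpc
fpk = '{"alpha": 100.0, "partition_by": "label"}'  # value passed to -fpk

# The -fpk value is a JSON object, so it parses into plain keyword arguments.
kwargs = json.loads(fpk)

# Toy stand-in for the internal mock dataset, which only has a `label` column.
mock_dataset = {"label": [0, 1, 2, 0, 1, 2]}

# A label-aware partitioner must point at a column that actually exists.
partition_by = kwargs.get("partition_by")
if partition_by is not None and partition_by not in mock_dataset:
    raise ValueError(f"unknown column {partition_by!r}; the mock dataset only has 'label'")

# Assumption: with flwr_datasets installed, the kwargs would typically be
# forwarded along the lines of DirichletPartitioner(num_partitions=10, **kwargs).
print(fpc, kwargs)
```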

Usage 🚀

Synthetic Dataset in FedProx

Refers to Federated Optimization in Heterogeneous Networks. The whole dataset is generated according to $(\alpha, \beta)$. Check the paper for all details.

  • (--gamma, --beta): The parameters $(\alpha, \beta)$ in the paper, respectively.
  • --dimension: The dimension of synthetic data.
  • --iid: Non-zero value for generating IID synthetic dataset.
python -d synthetic --beta 1 --gamma 1 -cn 20
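
A minimal, standard-library sketch of the synthetic data recipe from the FedProx paper: each client k draws a local model (W_k, b_k) around u_k ~ N(0, α) and local features around v_k, whose entries are centered at B_k ~ N(0, β), and labels are the argmax of the softmax output. Treating α and β as standard deviations and giving every client the same number of samples are simplifying assumptions made here; the function name and defaults are illustrative only.

```python
import math
import random


def generate_synthetic(alpha, beta, client_num, dimension=60, class_num=10,
                       samples_per_client=50, seed=0):
    """Return a list of (features, labels) pairs, one per client."""
    rng = random.Random(seed)
    # Diagonal feature covariance with entries j^(-1.2), j = 1..dimension.
    std = [math.sqrt((j + 1) ** -1.2) for j in range(dimension)]
    data = []
    for _ in range(client_num):
        u_k = rng.gauss(0.0, alpha)  # controls how much local models differ
        b_mean = rng.gauss(0.0, beta)  # B_k: controls how much local data differ
        weights = [[rng.gauss(u_k, 1.0) for _ in range(dimension)]
                   for _ in range(class_num)]
        bias = [rng.gauss(u_k, 1.0) for _ in range(class_num)]
        v_k = [rng.gauss(b_mean, 1.0) for _ in range(dimension)]
        xs, ys = [], []
        for _ in range(samples_per_client):
            x = [rng.gauss(v_k[j], std[j]) for j in range(dimension)]
            logits = [sum(w * xj for w, xj in zip(row, x)) + b
                      for row, b in zip(weights, bias)]
            # argmax of softmax(logits) equals argmax of the logits themselves.
            ys.append(max(range(class_num), key=logits.__getitem__))
            xs.append(x)
        data.append((xs, ys))
    return data


data = generate_synthetic(alpha=1.0, beta=1.0, client_num=20)
```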


The LEAF 🍂

Argument details are in data/femnist/ and data/celeba/

You should set all arguments well already when running

All generic arguments (except -d) in will be deactivated when processing LEAF datasets.

When processing LEAF datasets, only responsible for translating the output of (json data files) to data.npy and targets.npy.

So, in summary, for using LEAF datasets, you need to:

  1. sh [args...]
  2. python -d [femnist, celeba]

Processing DomainNet 🧾

Pre-requisite 🐾

  1. Download and decompress the DomainNet raw data through data/download/.
  2. cd to data/domain and run python (an interactive wizard).

Default Partitioning Scheme ❤

Running python -d domain without additional arguments builds a domain-separation partition (each client only has data from one domain).

  • Note that python -d domain is at the end of data/domain/ already, so you don't need to run that command by yourself after running

Extra Partitioning 💜💛

Combine the arguments of other schemes to build custom label heterogeneity.

Note that performing extra partitioning resets the number of clients to --client_num, overriding the number of clients per domain that you set previously.

python -d domain -a 1.0 -cn 10

Out-of-Distribution Settings 🕳

Set --ood_domains {domain_name...} to exclude domain(s) out of data partitioning and allocate corresponding data samples to test client(s).

Note that if --ood_domains is not empty, FL-bench maps the data labels from the class space to the domain space, so data heterogeneity is observed in the domain space instead of the class space.

Each OOD domain is assigned to one test client.

python -d domain -a 1.0 -cn 10 --ood_domains sketch

python -d domain -a 1.0 -cn 10 --ood_domains sketch quickdraw
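
The OOD setting can be sketched as follows: samples from the listed domains are held out, one test client per domain, and the remaining samples are relabeled by domain. The function name, the toy samples, and the alphabetical domain-to-index mapping are assumptions for illustration, not FL-bench's actual code.

```python
def ood_split(samples, ood_domains):
    """`samples` is a list of (domain_name, class_label) pairs."""
    domains = sorted({domain for domain, _ in samples})
    # Labels move from the class space to the domain space.
    domain_to_label = {domain: i for i, domain in enumerate(domains)}
    in_dist = []                                # (sample index, domain-space label)
    test_clients = {d: [] for d in ood_domains}  # one test client per OOD domain
    for i, (domain, _cls) in enumerate(samples):
        if domain in test_clients:
            test_clients[domain].append(i)
        else:
            in_dist.append((i, domain_to_label[domain]))
    return in_dist, test_clients, domain_to_label


# Toy data: the six DomainNet domains with three samples each.
names = ["clipart", "infograph", "painting", "quickdraw", "real", "sketch"]
samples = [(name, cls) for name in names for cls in range(3)]

in_dist, test_clients, mapping = ood_split(samples, ["sketch", "quickdraw"])
```

Only in_dist would then go through the regular partitioning step; the held-out domains never reach the training clients.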

Acknowledgement 🤗

data/femnist, data/celeba, and data/leaf_utils are copied from LEAF with subtle modifications so that they integrate into this benchmark. See data/femnist/ and data/celeba/ for full details.

FL-bench ignores the test set of Tiny-ImageNet-200 because it is unlabeled.