[paper (AAAI 2022)] [paper (arXiv)] [code]
Authors: Zinan Lin, Hao Liang, Giulia Fanti, Vyas Sekar
Abstract: We study the problem of learning generative adversarial networks (GANs) for a rare class of an unlabeled dataset subject to a labeling budget. This problem is motivated from practical applications in domains including security (e.g., synthesizing packets for DNS amplification attacks), systems and networking (e.g., synthesizing workloads that trigger high resource usage), and machine learning (e.g., generating images from a rare class). Existing approaches are unsuitable, either requiring fully-labeled datasets or sacrificing the fidelity of the rare class for that of the common classes. We propose RareGAN, a novel synthesis of three key ideas: (1) extending conditional GANs to use labelled and unlabelled data for better generalization; (2) an active learning approach that requests the most useful labels; and (3) a weighted loss function to favor learning the rare class. We show that RareGAN achieves a better fidelity-diversity tradeoff on the rare class than prior work across different applications, budgets, rare class fractions, GAN losses, and architectures.
This repo contains the code for reproducing the RareGAN experiments in the paper. The code was tested under Python 3.6.9 + TensorFlow 1.15.2 and Python 3.7.13 + TensorFlow 2.8.2.
The code can be easily extended to your own applications, like synthesizing images from rare classes, or synthesizing data of more general formats (e.g., network packets, texts) for rare events (e.g., attacks).
The code is based on the GPUTaskScheduler library, which automatically schedules jobs across GPU nodes. Please install it first. You may need to change the GPU configurations to match your devices. The configurations are set in config_generate_data.py in each directory. Please refer to GPUTaskScheduler's GitHub page for details on how to configure it properly.
To run with TensorFlow 2, please install TensorFlow-Slim:
pip install tf-slim
- Prepare the data according to the instructions here.
- Run
cd for_images
python -m scripts.CIFAR10.main_generate_data
- Prepare the data according to the instructions here.
- Run
cd for_images
python -m scripts.MNIST.main_generate_data
Simply add the data loading logic for your dataset here, and modify the training configuration file accordingly (example).
System Experiments: Generating Network Packets for DNS Amplification Attacks and Packet Classifier Attacks
- In this configuration file, replace <FILL IN IP ADDRESS> with the IP address of the DNS server.
- Run
cd for_systems
python -m scripts.DNS.main_generate_data
WARNING: During training, the code will generate a large number of DNS queries to the specified DNS server. Please make sure to use your own DNS servers in a sandboxed environment to avoid harming the public Internet.
To get an accurate evaluation of the packet processing time, we used separate servers for running RareGAN training and evaluating the processing time.
- On the server for evaluation, run
cd for_systems
python3 -m blackboxes.main_start_rpc_runner_server
- In this configuration file, replace <FILL IN IP ADDRESS> with the IP address of the evaluation server.
- Run
cd for_systems
python -m scripts.PC.main_generate_data
The code supports a general data format and can be extended to any application that needs samples with a large metric (e.g., packet amplification ratio in amplification attacks, or processing time of a system).
The following is all you need to do:
- Write a JSON configuration file that defines the data format. The format is defined as a list of fields. Examples: DNS requests, network packets.
- Extend the Blackbox class and implement the query interface, which takes a list of samples as input and returns their metrics. Examples: packet size amplification ratio for DNS requests, packet classification time.
- Add your blackbox creation logic here. Note that there is a list of handy Blackbox wrappers that you can use (e.g., off-loading the metric evaluation to a remote server, evaluating each sample multiple times, randomizing the order of samples, warming up the system by evaluating random samples before the actual evaluation happens).
- Modify the training configuration file accordingly (example).
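As a rough illustration of the blackbox step above, here is a minimal, self-contained sketch. It does not import the repository's code: the Blackbox base class below is a stand-in that assumes only what is stated above (a query method that takes a list of samples and returns their metrics). ToyAmplificationBlackbox, RepeatAverageBlackbox, and the simulated response size are hypothetical names and logic made up for this example, so check the actual Blackbox class and wrappers in the repository for the real signatures.

```python
from abc import ABC, abstractmethod
from typing import List, Sequence


class Blackbox(ABC):
    """Stand-in for the repository's Blackbox base class (assumed interface)."""

    @abstractmethod
    def query(self, samples: Sequence[bytes]) -> List[float]:
        """Return one metric per sample, in the same order as the input."""


class ToyAmplificationBlackbox(Blackbox):
    """Hypothetical blackbox: metric = response size / request size."""

    def _response_size(self, request: bytes) -> int:
        # Placeholder for sending the request to a sandboxed system and
        # measuring the reply. Here the simulated reply is 3x the request
        # size when the first byte is 0xFF, and the same size otherwise.
        factor = 3 if request[:1] == b"\xff" else 1
        return factor * len(request)

    def query(self, samples: Sequence[bytes]) -> List[float]:
        # Samples are assumed to be non-empty byte strings.
        return [self._response_size(s) / len(s) for s in samples]


class RepeatAverageBlackbox(Blackbox):
    """Hypothetical wrapper: evaluates each sample several times and averages."""

    def __init__(self, inner: Blackbox, repeats: int = 3):
        self.inner = inner
        self.repeats = repeats

    def query(self, samples: Sequence[bytes]) -> List[float]:
        # Each run returns one metric per sample; zip groups them per sample.
        runs = [self.inner.query(samples) for _ in range(self.repeats)]
        return [sum(per_sample) / self.repeats for per_sample in zip(*runs)]


blackbox = RepeatAverageBlackbox(ToyAmplificationBlackbox(), repeats=3)
metrics = blackbox.query([b"\xff\x01\x02\x03", b"\x00\x01\x02\x03"])
```

Wrapping one blackbox in another keeps the metric logic separate from measurement concerns such as repetition, which is the role the handy wrappers mentioned above appear to play.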
The code generates the following result files/folders:
- <code folder>/results/<hyper-parameters>/worker.log: Standard output and error from the code.
- <code folder>/results/<hyper-parameters>/generated_data/data.npz: Generated data from the rare class.
- <code folder>/results/<hyper-parameters>/sample/*.png (for image experiments only): Generated images during training.
- <code folder>/results/<hyper-parameters>/checkpoint/*: TensorFlow checkpoints and customized checkpoints.
- <code folder>/results/<hyper-parameters>/time.txt: Training iteration timestamps.
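To inspect a generated data.npz file, a small helper like the following may be useful. It is a sketch that assumes only that the file is a standard NumPy .npz archive; the names of the arrays stored inside are not documented here, so the helper lists whatever it finds instead of assuming particular keys. summarize_npz is a hypothetical helper name, not part of the repository.

```python
import numpy as np


def summarize_npz(path):
    """Return {array_name: (shape, dtype)} for every array in an .npz file."""
    # np.load returns an NpzFile for .npz archives; the context manager
    # closes the underlying file once the summary has been built.
    with np.load(path) as archive:
        return {
            name: (archive[name].shape, str(archive[name].dtype))
            for name in archive.files
        }

# Example call (fill in the placeholders with your own folder names):
# summarize_npz("<code folder>/results/<hyper-parameters>/generated_data/data.npz")
```

Checking the array names and shapes first makes it easier to write downstream evaluation code without guessing at the file layout.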