RareGAN: Generating Samples for Rare Classes

[paper (AAAI 2022)] [paper (arXiv)] [code]

Authors: Zinan Lin, Hao Liang, Giulia Fanti, Vyas Sekar

Abstract: We study the problem of learning generative adversarial networks (GANs) for a rare class of an unlabeled dataset subject to a labeling budget. This problem is motivated from practical applications in domains including security (e.g., synthesizing packets for DNS amplification attacks), systems and networking (e.g., synthesizing workloads that trigger high resource usage), and machine learning (e.g., generating images from a rare class). Existing approaches are unsuitable, either requiring fully-labeled datasets or sacrificing the fidelity of the rare class for that of the common classes. We propose RareGAN, a novel synthesis of three key ideas: (1) extending conditional GANs to use labelled and unlabelled data for better generalization; (2) an active learning approach that requests the most useful labels; and (3) a weighted loss function to favor learning the rare class. We show that RareGAN achieves a better fidelity-diversity tradeoff on the rare class than prior work across different applications, budgets, rare class fractions, GAN losses, and architectures.

This repo contains the code for reproducing the RareGAN experiments in the paper. The code was tested under Python 3.6.9 + TensorFlow 1.15.2 and Python 3.7.13 + TensorFlow 2.8.2.

The code can be easily extended to your own applications, such as synthesizing images from rare classes, or synthesizing data in more general formats (e.g., network packets, text) for rare events (e.g., attacks).

Prerequisites

The code is built on the GPUTaskScheduler library, which automatically schedules jobs across GPU nodes. Please install it first. You may need to change the GPU configuration to match the devices you have. The configuration is set in config_generate_data.py in each directory. Please refer to GPUTaskScheduler's GitHub page for details on how to write a proper configuration.

To run with TensorFlow 2, please also install TensorFlow-Slim with pip install tf-slim.

Image Experiments: Generating Rare Samples for CIFAR10 and MNIST

CIFAR10

Prepare the data according to the instructions here.
Run

cd for_images
python -m scripts.CIFAR10.main_generate_data

MNIST

Prepare the data according to the instructions here.
Run

cd for_images
python -m scripts.MNIST.main_generate_data

Your Own Image Dataset

Simply add the data loading logic for your dataset here, and modify the training configuration file accordingly (example).

System Experiments: Generating Network Packets for DNS Amplification Attacks and Packet Classifier Attacks

DNS Amplification Attacks

In this configuration file, replace <FILL IN IP ADDRESS> with the IP address of the DNS server.
Run

cd for_systems
python -m scripts.DNS.main_generate_data

WARNING: During training, the code will generate a large number of DNS queries to the specified DNS server. Please make sure to use your own DNS servers in a sandboxed environment to avoid harming the public Internet.
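As a small safeguard for the warning above, you could check the address before filling it into the configuration file. The following is a minimal sketch using Python's standard ipaddress module; the helper name is_sandbox_address is hypothetical and not part of this repo. A private or loopback address does not by itself guarantee a sandboxed environment, so treat this only as a first sanity check.

```python
import ipaddress


def is_sandbox_address(address: str) -> bool:
    """Return True if the address is in a private or loopback range.

    Hypothetical helper (not part of the RareGAN code): it only rejects
    obviously public addresses; it cannot verify network isolation.
    """
    ip = ipaddress.ip_address(address)
    return ip.is_private or ip.is_loopback


# Example: a private address passes, a public resolver address does not.
private_ok = is_sandbox_address("10.0.0.5")
public_ok = is_sandbox_address("8.8.8.8")
```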

Generating Packets that Trigger Long Processing Time for Packet Classifiers

To get an accurate evaluation of the packet processing time, we used separate servers for running RareGAN training and evaluating the processing time.

On the server for evaluation, run

cd for_systems
python3 -m blackboxes.main_start_rpc_runner_server

In this configuration file, replace <FILL IN IP ADDRESS> with the IP address of the evaluation server.
Run

cd for_systems
python -m scripts.PC.main_generate_data

Your Own Dataset or Application

The code supports a general data format and can be extended to any application that seeks samples with a large value of some metric (e.g., packet amplification ratio in amplification attacks, or processing time of a system).

The following is all you need to do:

Provide a JSON configuration file that defines the data format. The format is defined as a list of fields. Examples: DNS requests, network packets.
Extend the Blackbox class and implement the query interface, which takes a list of samples as input and returns their metrics. Examples: packet size amplification ratio for DNS requests, packet classification time.
Add your blackbox creation logic here. Note that there is a list of handy Blackbox wrappers that you can use (e.g., off-loading the metric evaluation to a remote server, evaluating each sample multiple times, randomizing the order of samples, warming up the system by evaluating random samples before the actual evaluation happens).
Modify the training configuration file accordingly (example).
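The blackbox step above can be sketched roughly as follows. This is an illustration under stated assumptions, not the repo's actual API: the Blackbox base class below is a stand-in defined locally, and PayloadLengthBlackbox and RepeatedQueryBlackbox are hypothetical names. Only the idea of a query interface that maps a list of samples to a list of metrics, and of a wrapper that evaluates each sample multiple times, comes from the description above. Check the real base class in for_systems for the exact signatures.

```python
from abc import ABC, abstractmethod
from statistics import mean
from typing import List, Sequence


class Blackbox(ABC):
    """Local stand-in for the repo's Blackbox class (assumed interface)."""

    @abstractmethod
    def query(self, samples: Sequence[bytes]) -> List[float]:
        """Return one metric per sample, in the same order as the input."""


class PayloadLengthBlackbox(Blackbox):
    """Toy metric for illustration: the byte length of each sample."""

    def query(self, samples):
        return [float(len(sample)) for sample in samples]


class RepeatedQueryBlackbox(Blackbox):
    """Wrapper that queries an inner blackbox several times and averages.

    Averaging repeated measurements is useful when the metric is noisy,
    e.g. a processing time measured on a live system.
    """

    def __init__(self, inner: Blackbox, repeats: int = 3):
        if repeats < 1:
            raise ValueError("repeats must be at least 1")
        self.inner = inner
        self.repeats = repeats

    def query(self, samples):
        # One list of metrics per run; zip(*runs) regroups them per sample.
        runs = [self.inner.query(samples) for _ in range(self.repeats)]
        return [mean(per_sample) for per_sample in zip(*runs)]


blackbox = RepeatedQueryBlackbox(PayloadLengthBlackbox(), repeats=3)
metrics = blackbox.query([b"abc", b"hello"])
```

Keeping wrappers as Blackbox subclasses means they can be stacked around any metric without changing the training code that calls query.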

Results

The code generates the following result files/folders:

<code folder>/results/<hyper-parameters>/worker.log: Standard output and error from the code.
<code folder>/results/<hyper-parameters>/generated_data/data.npz: Generated data from the rare class.
<code folder>/results/<hyper-parameters>/sample/*.png (for image experiments only): Generated images during training.
<code folder>/results/<hyper-parameters>/checkpoint/*: TensorFlow checkpoints and customized checkpoints.
<code folder>/results/<hyper-parameters>/time.txt: Training iteration timestamps.
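To inspect generated_data/data.npz without assuming what it contains, you can list the arrays stored in it with NumPy. The sketch below is a generic way to summarize any .npz file; the helper name summarize_npz is hypothetical, and the demo writes a placeholder file because the array names inside the real data.npz are not documented here.

```python
import os
import tempfile

import numpy as np


def summarize_npz(path):
    """Return {array_name: (shape, dtype)} for every array in an .npz file."""
    with np.load(path) as archive:
        return {
            name: (archive[name].shape, str(archive[name].dtype))
            for name in archive.files
        }


# Demo on a placeholder file; for real results, point the path at
# <code folder>/results/<hyper-parameters>/generated_data/data.npz instead.
demo_path = os.path.join(tempfile.mkdtemp(), "data.npz")
np.savez(demo_path, placeholder=np.zeros((4, 2), dtype=np.float32))
summary = summarize_npz(demo_path)
```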
