fjxmlzn/RareGAN

RareGAN: Generating Samples for Rare Classes

[paper (AAAI 2022)] [paper (arXiv)] [code]

Authors: Zinan Lin, Hao Liang, Giulia Fanti, Vyas Sekar

Abstract: We study the problem of learning generative adversarial networks (GANs) for a rare class of an unlabeled dataset subject to a labeling budget. This problem is motivated from practical applications in domains including security (e.g., synthesizing packets for DNS amplification attacks), systems and networking (e.g., synthesizing workloads that trigger high resource usage), and machine learning (e.g., generating images from a rare class). Existing approaches are unsuitable, either requiring fully-labeled datasets or sacrificing the fidelity of the rare class for that of the common classes. We propose RareGAN, a novel synthesis of three key ideas: (1) extending conditional GANs to use labelled and unlabelled data for better generalization; (2) an active learning approach that requests the most useful labels; and (3) a weighted loss function to favor learning the rare class. We show that RareGAN achieves a better fidelity-diversity tradeoff on the rare class than prior work across different applications, budgets, rare class fractions, GAN losses, and architectures.


This repository contains the code for reproducing the RareGAN experiments in the paper. The code was tested with Python 3.6.9 + TensorFlow 1.15.2 and with Python 3.7.13 + TensorFlow 2.8.2.

The code can be easily extended to your own applications, such as synthesizing images from rare classes, or synthesizing data in more general formats (e.g., network packets, text) for rare events (e.g., attacks).

Prerequisites

The code is built on the GPUTaskScheduler library, which automatically schedules jobs across GPU nodes. Please install it first. You may need to change the GPU configuration to match your devices. The configuration is set in config_generate_data.py in each directory. Please refer to GPUTaskScheduler's GitHub page for details on how to configure it.

To run with TensorFlow 2, please install TensorFlow-Slim by pip install tf-slim.
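Before launching an experiment, it can help to confirm that the prerequisites are importable. The following is a minimal sketch, not part of the repository: the helper name check_prerequisites is hypothetical, and the module name gpu_task_scheduler is an assumption about how GPUTaskScheduler is imported. tf_slim is only needed when running with TensorFlow 2.

```python
import importlib.util
import sys


def check_prerequisites(modules=("tensorflow", "tf_slim", "gpu_task_scheduler")):
    """Return a dict mapping each module name to whether it can be found.

    The default module names are assumptions; adjust them to your setup.
    """
    status = {}
    for name in modules:
        # find_spec locates a top-level module without importing it.
        status[name] = importlib.util.find_spec(name) is not None
    return status


if __name__ == "__main__":
    print("Python", sys.version.split()[0])
    for name, found in check_prerequisites().items():
        print("{}: {}".format(name, "found" if found else "MISSING"))
```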

Image Experiments: Generating Rare Samples for CIFAR10 and MNIST

CIFAR10

  • Prepare the data according to the instructions here.
  • Run
cd for_images
python -m scripts.CIFAR10.main_generate_data

MNIST

  • Prepare the data according to the instructions here.
  • Run
cd for_images
python -m scripts.MNIST.main_generate_data

Your Own Image Dataset

Simply add the data loading logic for your dataset here, and modify the training configuration file accordingly (example).

System Experiments: Generating Network Packets for DNS Amplification Attacks and Packet Classifier Attacks

DNS Amplification Attacks

cd for_systems
python -m scripts.DNS.main_generate_data

WARNING: During training, the code will generate a large number of DNS queries to the specified DNS server. Please make sure to use your own DNS servers in a sandboxed environment to avoid harming the public Internet.

Generating Packets that Trigger Long Processing Time for Packet Classifiers

To get an accurate evaluation of the packet processing time, we used separate servers for running RareGAN training and evaluating the processing time.

  • On the server for evaluation, run
cd for_systems
python3 -m blackboxes.main_start_rpc_runner_server
  • On the server for RareGAN training, run
cd for_systems
python -m scripts.PC.main_generate_data

Your Own Dataset or Application

The code supports a general data format and can be extended to any application that seeks samples with a large value of some metric (e.g., packet amplification ratio in amplification attacks, or processing time of a system).
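One way to picture the "large metric" setting is to treat a sample as belonging to the rare class when its metric reaches a chosen threshold. The sketch below only illustrates that idea and is an assumption, not the repository's interface: label_rare, the threshold value, and the example ratios are all hypothetical.

```python
import numpy as np


def label_rare(metric_values, threshold):
    """Mark a sample as rare (1) when its metric reaches the threshold, else 0."""
    metric_values = np.asarray(metric_values, dtype=float)
    return (metric_values >= threshold).astype(np.int64)


# Hypothetical metric values, e.g. amplification ratios of five packets.
ratios = [1.0, 0.5, 12.0, 3.0, 25.0]
labels = label_rare(ratios, threshold=10.0)
print(labels)         # [0 0 1 0 1]
print(labels.mean())  # 0.4, the fraction of samples in the rare class
```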

The following is all you need to do:

Results

The code generates the following result files/folders:

  • <code folder>/results/<hyper-parameters>/worker.log: Standard output and error from the code.
  • <code folder>/results/<hyper-parameters>/generated_data/data.npz: Generated data from the rare class.
  • <code folder>/results/<hyper-parameters>/sample/*.png (for image experiments only): Generated images during training.
  • <code folder>/results/<hyper-parameters>/checkpoint/*: TensorFlow checkpoints and customized checkpoints.
  • <code folder>/results/<hyper-parameters>/time.txt: Training iteration timestamps.
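The generated data file is a NumPy .npz archive, so it can be inspected with np.load. The array names stored inside are not documented here, so this minimal sketch makes no assumption about them and simply lists whatever the archive contains. The helper name summarize_npz is hypothetical.

```python
import numpy as np


def summarize_npz(path):
    """Return {array_name: (shape, dtype)} for every array in an .npz file."""
    with np.load(path) as archive:
        return {
            name: (archive[name].shape, str(archive[name].dtype))
            for name in archive.files
        }
```

For example, calling summarize_npz on <code folder>/results/<hyper-parameters>/generated_data/data.npz returns the name, shape, and dtype of each array the run saved.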