CoAutoGen: Cloud-Edge Collaboration Platform for Automated Synthetic Dataset Generation

CoAutoGen addresses the shrinking supply of training data by combining synthetic dataset generation methods with cloud-edge collaboration. A feedback loop built on refined prompts, privacy-protected data, and offline or online APIs lets it score synthetic datasets automatically and evolve them over successive iterations.

You can access free online APIs for text, image, video, audio, and more at SiliconFlow or explore affordable options at getimg.ai.

AI's growth has relied on scaling neural networks and training on massive datasets, enabling LLMs like ChatGPT to handle conversations and reasoning (see this Nature paper). However, experts warn of limits as energy demands rise and training data dwindles (see this Nature paper). Epoch AI predicts that AI training data will be exhausted by 2028 (see this Nature paper).

Features

- Self-correcting synthetic dataset creation through feedback with refined prompts or privacy-protected data.
- Intuitive, user-friendly code style.
- Customizable rater design for rating, selecting, and filtering synthetic data.
- Automated synthetic dataset generation with large generative model APIs across various modalities.
- Automatic filtering and iterative refinement to obtain high-quality synthetic datasets.
- Flexible interfaces for seamless extension and customization.

How to Use

Prepare the Required Large Model APIs (Skip If Using Online APIs)
- Make the large model APIs (captioner, generator, LLM, etc.) available, either by deploying them locally, which requires downloading the model weights, or by using online APIs.

Set Up the Environment

Install CUDA.
Install the latest Conda and activate it.

Create the Conda environment:

conda env create -f env_cuda_latest.yaml  
# You may need to downgrade PyTorch using pip to match the CUDA version
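
Before downgrading PyTorch, it can help to see which CUDA version the installed build targets. The snippet below is a minimal sketch and not part of CoAutoGen; the helper name `cuda_report` is hypothetical. It relies only on `torch.__version__`, `torch.version.cuda`, and `torch.cuda.is_available()`, and it degrades gracefully if PyTorch is not installed.

```python
import importlib.util


def cuda_report():
    """Return PyTorch/CUDA build info, or a note that PyTorch is missing."""
    if importlib.util.find_spec("torch") is None:
        return {"torch_installed": False}

    import torch  # imported lazily so the check works without PyTorch

    return {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # CUDA version the PyTorch build was compiled for (None on CPU-only builds)
        "built_for_cuda": torch.version.cuda,
        # True only if a usable GPU and driver are visible at runtime
        "cuda_available": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    print(cuda_report())
```

If `built_for_cuda` differs from the CUDA toolkit installed on the machine, or `cuda_available` is false on a GPU host, that is a hint that the PyTorch version may need to be changed.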

Run the Script
- Execute python main.py with your configurations.
- Available Frameworks:
  - Generate with prompts: --framework Gen
  - Generate with LLM-enhanced prompts: --framework GenLLM
  - Iteratively generate, filter, and accumulate: --framework Filter
  - Iteratively generate, rate, and provide feedback with privacy protection: --framework Feedback
- Available Raters (when --framework Filter or Feedback):
  - Histogram rating in Private Evolution (PE, ICLR'24): --rater PE
  - Real data rating in Real Filter (RF, ICLR'23): --rater RF
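
The framework and rater options above can be assembled programmatically. The following is a small sketch, assuming only the `--framework` and `--rater` flags listed in this README; the helper `build_command` is hypothetical and not part of the repository, and its choice to reject a rater for the `Gen` and `GenLLM` frameworks is an assumption about intended usage rather than documented behavior.

```python
import shlex

FRAMEWORKS = ("Gen", "GenLLM", "Filter", "Feedback")
RATERS = ("PE", "RF")
# Frameworks for which the README lists raters as available
RATED_FRAMEWORKS = ("Filter", "Feedback")


def build_command(framework, rater=None, extra_args=()):
    """Build an argv list for main.py, validating framework and rater names."""
    if framework not in FRAMEWORKS:
        raise ValueError(f"unknown framework {framework!r}; expected one of {FRAMEWORKS}")
    cmd = ["python", "-u", "main.py", "--framework", framework]
    if rater is not None:
        if framework not in RATED_FRAMEWORKS:
            # Assumption: a rater is only meaningful for Filter and Feedback
            raise ValueError(f"--rater applies only to {RATED_FRAMEWORKS}")
        if rater not in RATERS:
            raise ValueError(f"unknown rater {rater!r}; expected one of {RATERS}")
        cmd += ["--rater", rater]
    cmd += list(extra_args)
    return cmd


if __name__ == "__main__":
    # Print a shell-ready command line instead of running it
    print(shlex.join(build_command("Feedback", rater="PE")))
```

Returning a list rather than a string keeps the result usable with `subprocess.run` without shell quoting concerns.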

Example

For a COVID-19 pneumonia detection task, generate 100 synthetic images per class from 10 real, private chest radiography (X-ray) images held on the edge, using the Stable Diffusion API. The edge device runs a ResNet-18, Private Evolution (PE) serves as the rater, and feedback is provided with privacy protection:

# Flags are collected in a Bash array so that each one can carry a comment
args=(
  -tt syn                  # Task type: use only the synthetic dataset for the downstream task
  -tm I2I                  # Task mode: image to image
  -f Feedback              # Framework: feedback mechanism
  -did 1                   # GPU device ID
  -eps 0.2                 # Privacy budget epsilon per iteration
  -rvpl 1                  # Real and private volume per label
  -vpl 2                   # Generated volume per label
  -oa 1                    # Use online API
  -sgen StableDiffusionXL  # Generative model: StableDiffusionXL
  -cret 1                  # Other hyperparameter
  -cue ResNet18            # Edge client embedding model
  -cmodel ResNet18         # Edge client model
  -cmp 1                   # Other hyperparameter
  -cef 1                   # Other hyperparameter
  -cdata COVIDx            # Private dataset
  -r PE                    # Rater: Private Evolution
)
python -u main.py "${args[@]}"
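
To give a feel for what histogram rating in Private Evolution does, here is a minimal sketch of the general idea: each private embedding votes for its nearest synthetic embedding, Gaussian noise is added to the vote counts, and the noisy counts are thresholded to produce scores. This is an illustration under stated assumptions and not CoAutoGen's implementation. All function names are hypothetical, embeddings are plain tuples of floats, and `noise_std` is a free parameter; how the per-iteration epsilon maps to a noise scale is not shown here.

```python
import random


def nearest_index(point, candidates):
    """Index of the candidate closest to point (squared Euclidean distance)."""
    return min(
        range(len(candidates)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(point, candidates[i])),
    )


def noisy_vote_histogram(private_emb, synthetic_emb, noise_std, threshold=0.0, rng=None):
    """Score synthetic samples by noisy nearest-neighbor votes from private samples."""
    rng = rng or random.Random()
    votes = [0.0] * len(synthetic_emb)
    for p in private_emb:
        votes[nearest_index(p, synthetic_emb)] += 1.0
    # Gaussian noise hides the contribution of any single private sample
    noisy = [v + rng.gauss(0.0, noise_std) for v in votes]
    # Thresholding suppresses counts that are mostly noise
    return [max(v - threshold, 0.0) for v in noisy]


def select_top(scores, k):
    """Indices of the k highest-scoring synthetic samples."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


if __name__ == "__main__":
    private = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
    synthetic = [(0.0, 0.0), (5.0, 5.0), (10.0, 10.0)]
    scores = noisy_vote_histogram(private, synthetic, noise_std=1.0, rng=random.Random(0))
    print(scores, select_top(scores, 2))
```

Only the noisy scores need to leave the edge device in such a scheme, which is why the private images themselves never have to be sent to the cloud-side generator.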
