Cloud-Edge Collaboration Platform for Automated Synthetic Dataset Generation

TsingZ0/CoAutoGen

CoAutoGen: Cloud-Edge Collaboration Platform for Automated Synthetic Dataset Generation

CoAutoGen addresses the shrinking supply of training data by combining dataset generation methods with cloud-edge collaboration. A feedback loop built on refined prompts, privacy-protected data, and offline or online APIs lets synthetic datasets be scored and evolved automatically.

You can access free online APIs for text, image, video, audio, and more at SiliconFlow or explore affordable options at getimg.ai.

Running Out of Data

AI's growth has relied on scaling neural networks and training them on massive datasets, which has enabled LLMs like ChatGPT to handle conversation and reasoning (see this Nature paper). However, experts warn of limits as energy demands rise and training data dwindles (see this Nature paper). Epoch AI predicts that AI training data will be exhausted by 2028 (see this Nature paper).

Features

  • Self-correcting synthetic dataset creation through feedback with refined prompts or privacy-protected data.
  • Intuitive, user-friendly code style.
  • Customizable rater design for rating, selecting, and filtering synthetic data.
  • Automated synthetic dataset generation with large generative model APIs across various modalities.
  • Automatic filtering and iterative refinement to obtain high-quality synthetic datasets.
  • Flexible interfaces for seamless extension and customization.
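
To make the idea of a customizable rater concrete, here is a minimal sketch of rating, filtering, and selection. It is not CoAutoGen's actual interface: `Scored`, `rate_and_select`, and `length_rater` are hypothetical names, and the rater is assumed to be any callable that maps a sample to a score where higher is better.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Scored:
    sample: str   # stand-in for any generated item (text, image path, ...)
    score: float


def rate_and_select(
    samples: Sequence[str],
    rater: Callable[[str], float],  # hypothetical rater: higher is better
    top_k: int,
    min_score: float = 0.0,
) -> List[Scored]:
    """Score every sample, drop those below min_score, keep the top_k best."""
    scored = [Scored(s, rater(s)) for s in samples]
    kept = [x for x in scored if x.score >= min_score]
    kept.sort(key=lambda x: x.score, reverse=True)
    return kept[:top_k]


def length_rater(text: str) -> float:
    """Toy rater for illustration only: prefers longer captions."""
    return float(len(text.split()))


candidates = [
    "an x-ray",
    "a frontal chest x-ray",
    "a frontal chest x-ray showing opacities",
]
selected = rate_and_select(candidates, length_rater, top_k=2, min_score=3.0)
```

Swapping in a different rater only requires passing another callable with the same shape, which is one simple way to keep rating logic separate from generation.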

How to Use

  1. Prepare the Required Large Model APIs (Skip If Using Online APIs).

    • For local deployment, download the model weights (captioner, generator, LLM, etc.) and serve them as APIs. Otherwise, use online-accessible APIs.
  2. Set Up the Environment

    • Install CUDA.
    • Install the latest Conda and activate it.
    • Create the Conda environment:
      conda env create -f env_cuda_latest.yaml  
      # You may need to downgrade PyTorch using pip to match the CUDA version  
  3. Run the Script

    • Execute python main.py with your configurations.
    • Available Frameworks:
      • Generate with prompts: --framework Gen
      • Generate with LLM-enhanced prompts: --framework GenLLM
      • Iteratively generate, filter, and accumulate: --framework Filter
      • Iteratively generate, rate, and provide feedback with privacy protection: --framework Feedback
    • Available Raters (when --framework Filter or Feedback):
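
As an illustration of the Filter framework listed above (iteratively generate, filter, and accumulate), the loop can be sketched as follows. This is a simplified sketch, not the code behind `--framework Filter`: `filter_loop`, `generate`, and `keep` are hypothetical names, and random numbers stand in for generated samples.

```python
import random
from typing import Callable, List


def filter_loop(
    generate: Callable[[int], List[float]],  # hypothetical: returns n new candidates
    keep: Callable[[float], bool],           # hypothetical rater decision
    target: int,
    batch: int = 8,
    max_iters: int = 50,
) -> List[float]:
    """Generate in batches, keep accepted samples, stop at target or max_iters."""
    accepted: List[float] = []
    for _ in range(max_iters):
        if len(accepted) >= target:
            break
        accepted.extend(x for x in generate(batch) if keep(x))
    return accepted[:target]


rng = random.Random(0)
dataset = filter_loop(
    generate=lambda n: [rng.random() for _ in range(n)],  # stand-in for an API call
    keep=lambda x: x > 0.5,                               # stand-in for a rater
    target=10,
)
```

The `max_iters` cap matters in practice: if the rater rejects almost everything, the loop still terminates and returns whatever was accepted.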

Example

For a COVID-19 pneumonia detection task, the goal is to generate 100 synthetic images per class from 10 real, private chest radiography (X-ray) images held on the edge, using the Stable Diffusion API. The edge device uses a ResNet-18, and Private Evolution (PE) rates the samples and returns privacy-protected feedback. The command as written sets -rvpl 1 and -vpl 2 (real and generated volume per label); adjust these values to the volumes you need:

# A comment cannot follow a line-continuation backslash in shell,
# so the flags are collected in a Bash array (requires bash or zsh).
args=(
  -tt syn                  # Task type: use only the synthetic dataset for the downstream task
  -tm I2I                  # Task mode: image to image
  -f Feedback              # Framework: feedback mechanism
  -did 1                   # GPU device ID
  -eps 0.2                 # Privacy budget epsilon per iteration
  -rvpl 1                  # Real and private volume per label
  -vpl 2                   # Generated volume per label
  -oa 1                    # Use online API
  -sgen StableDiffusionXL  # Select StableDiffusionXL as the generative model
  -cret 1                  # Other hyperparameter
  -cue ResNet18            # Edge client embedding model
  -cmodel ResNet18         # Edge client model
  -cmp 1                   # Other hyperparameter
  -cef 1                   # Other hyperparameter
  -cdata COVIDx            # Private dataset
  -r PE                    # Rater: Private Evolution
)
python -u main.py "${args[@]}"
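
For readers unfamiliar with Private Evolution, the rating step is usually described as a noisy nearest-neighbor vote: each private sample votes for its closest synthetic sample in embedding space, Gaussian noise is added to the vote counts, and low counts are thresholded away. The sketch below shows only that general idea with NumPy. It is not CoAutoGen's PE rater, `dp_nn_histogram` is a hypothetical name, and the calibration of `sigma` to a privacy budget such as `-eps` is deliberately left out.

```python
import numpy as np


def dp_nn_histogram(private_emb, synthetic_emb, sigma, threshold=0.0, rng=None):
    """Noisy nearest-neighbor vote counts, one per synthetic sample."""
    rng = np.random.default_rng() if rng is None else rng
    # Pairwise squared Euclidean distances, shape (n_private, n_synthetic).
    diff = private_emb[:, None, :] - synthetic_emb[None, :, :]
    d2 = (diff ** 2).sum(axis=-1)
    # Each private sample votes for its nearest synthetic sample.
    nearest = d2.argmin(axis=1)
    votes = np.bincount(nearest, minlength=len(synthetic_emb)).astype(float)
    # Gaussian noise on the counts, then subtract the threshold and clip at zero.
    noisy = votes + rng.normal(0.0, sigma, size=votes.shape)
    return np.clip(noisy - threshold, 0.0, None)


# Toy 2-D embeddings; a real pipeline would use a model's features instead.
private = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
synthetic = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 10.0]])

scores = dp_nn_histogram(private, synthetic, sigma=1.0, rng=np.random.default_rng(0))
# Turn scores into resampling probabilities (uniform if every score was clipped to zero).
total = scores.sum()
probs = scores / total if total > 0 else np.full(len(scores), 1.0 / len(scores))
```

Synthetic samples with higher probability would then be the ones passed back to the generator as seeds for the next round, so only noisy aggregate counts, not the private images themselves, influence generation.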
