SoloAudio

Paper HuggingFace Models Colab Demo page

Official PyTorch implementation of the ICASSP 2025 paper: SoloAudio: Target Sound Extraction with Language-oriented Audio Diffusion Transformer.

Try our Hugging Face Space!

TODO

  • Release model weights
  • Release data
  • HuggingFace Spaces demo
  • VAE training code
  • arxiv paper

Environment setup

conda env create -f env.yml
conda activate soloaudio

Pretrained Models

Download our pretrained models from Hugging Face.

After downloading the files, place them inside this repo so that the layout looks like this:

SoloAudio/
    -config/
    -demo/
    -pretrained_models/
    ....
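As a quick sanity check before running inference, the layout above can be verified with a few lines of Python. This is a minimal sketch: the directory names come from the listing above, while the helper `missing_dirs` is hypothetical and not part of the repository.

```python
from pathlib import Path

# Directory names taken from the layout listing; the helper itself is hypothetical.
EXPECTED_DIRS = ("config", "demo", "pretrained_models")


def missing_dirs(repo_root, expected=EXPECTED_DIRS):
    """Return the expected sub-directories that are absent under repo_root."""
    root = Path(repo_root)
    return [name for name in expected if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_dirs(".")
    if missing:
        print("Missing directories:", ", ".join(missing))
    else:
        print("Layout looks complete.")
```

Run it from the repository root; an empty result means the three listed directories are present. It does not check the contents of pretrained_models/.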

Inference examples

For audio-oriented TSE, please run:

python tse_audioTSE.py --output_dir './output-audioTSE/' --mixture './demo/1_mix.wav' --enrollment './demo/1_enrollment.wav'

For language-oriented TSE, please run:

python tse_languageTSE.py --output_dir './output-languageTSE/' --mixture './demo/1_mix.wav' --enrollment 'Acoustic guitar'
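To try several text prompts on the same mixture, the language-oriented command above can be wrapped in a small driver. The sketch below only uses the flags shown above (`--output_dir`, `--mixture`, `--enrollment`); the function names, the per-prompt output folder naming, and the second prompt are assumptions made for illustration.

```python
import shlex
import subprocess
import sys


def build_language_tse_cmd(mixture, prompt, output_dir):
    """Build the argument list for one tse_languageTSE.py run."""
    return [
        sys.executable, "tse_languageTSE.py",
        "--output_dir", output_dir,
        "--mixture", mixture,
        "--enrollment", prompt,
    ]


def run_prompts(mixture, prompts, output_root="./output-languageTSE", dry_run=True):
    """Run (or, with dry_run=True, just print) one extraction per text prompt."""
    cmds = []
    for prompt in prompts:
        # Assumption: one output folder per prompt, to keep results separate.
        out_dir = f"{output_root}/{prompt.lower().replace(' ', '_')}/"
        cmd = build_language_tse_cmd(mixture, prompt, out_dir)
        cmds.append(cmd)
        if dry_run:
            print(shlex.join(cmd))
        else:
            subprocess.run(cmd, check=True)
    return cmds


if __name__ == "__main__":
    # "Dog bark" is a hypothetical prompt, not taken from the demo files.
    run_prompts("./demo/1_mix.wav", ["Acoustic guitar", "Dog bark"])
```

With `dry_run=True` (the default) the commands are only printed, which makes it easy to inspect them before launching the actual extraction.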

Data Preparation

To train a SoloAudio model, you need to prepare the following parts:

  1. To prepare the FSD-Mix dataset, please run:
cd data_preparating/
python create_filenames.py
python create_fsdmix.py

You can also use our simulated data for training, validation, and testing.

  2. To prepare the TangoSyn dataset, please run:
cd tango/
sh gen.sh

  3. Prepare the TangoSyn-Mix dataset in the same way as step 1.

  4. To extract the VAE features, please run:
python extract_vae.py --data_dir "YOUR_DATA_DIR" --output_dir "YOUR_OUTPUT_DIR"

  5. To extract the CLAP features, please run:
python extract_clap_audio.py --input_base_dir "YOUR_DATA_DIR" --output_base_dir "YOUR_OUTPUT_DIR"
python extract_clap_text.py --input_base_dir "YOUR_DATA_DIR" --output_base_dir "YOUR_OUTPUT_DIR" --split 1
python extract_clap_text.py --input_base_dir "YOUR_DATA_DIR" --output_base_dir "YOUR_OUTPUT_DIR" --split 2
python extract_clap_text.py --input_base_dir "YOUR_DATA_DIR" --output_base_dir "YOUR_OUTPUT_DIR" --split 3
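The VAE and CLAP feature-extraction commands above can be generated in order from a single pair of paths, which avoids retyping the same directories five times. This is a sketch under assumptions: the script names and flags are copied from the commands above, the function name `feature_extraction_cmds` is hypothetical, and the meaning of `--split` is not documented here, so the values 1 to 3 are simply reproduced as listed.

```python
import subprocess
import sys


def feature_extraction_cmds(data_dir, output_dir, text_splits=(1, 2, 3)):
    """Return the VAE, CLAP-audio, and CLAP-text commands as argument lists."""
    py = sys.executable
    cmds = [
        [py, "extract_vae.py", "--data_dir", data_dir, "--output_dir", output_dir],
        [py, "extract_clap_audio.py",
         "--input_base_dir", data_dir, "--output_base_dir", output_dir],
    ]
    for split in text_splits:
        # One CLAP-text run per split value, mirroring the commands above.
        cmds.append([py, "extract_clap_text.py",
                     "--input_base_dir", data_dir,
                     "--output_base_dir", output_dir,
                     "--split", str(split)])
    return cmds


if __name__ == "__main__":
    for cmd in feature_extraction_cmds("YOUR_DATA_DIR", "YOUR_OUTPUT_DIR"):
        print(" ".join(cmd))
        # Uncomment to execute instead of only printing:
        # subprocess.run(cmd, check=True)
```

Replace the placeholder paths with your own before enabling the `subprocess.run` line.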

Training

Now, you are good to start training!

  1. To train with a single GPU, please run:
python train.py

  2. To train with multiple GPUs, please run:
accelerate launch train.py

Test

To test a folder of audio files, please run:

python test_audioTSE.py --output_dir './test-audioTSE/' --test_dir '/YOUR_PATH_TO_TEST/'

OR

python test_languageTSE.py --output_dir './test-languageTSE/' --test_dir '/YOUR_PATH_TO_TEST/'

To calculate the metrics used in the paper, please run:

cd metircs/
python main.py

VAE Training

We provide code to train an audio waveform VAE model, based on stable-audio-tools.

  1. Set the data path in stable_audio_vae/configs/vae_data.txt (any folder that contains audio files).

  2. Adjust the model config in stable_audio_vae/configs/vae_16k_mono_v2.config.

The provided config is for audio with a 16 kHz sampling rate; change the settings if you want to use other sampling rates.

  3. Set the batch size and training settings in stable_audio_vae/defaults.ini.

  4. Run:

cd stable_audio_vae/
bash train_bash.sh
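Because the provided config targets a 16k sampling rate, it can help to confirm that the training folder actually matches before launching a run. The following is a minimal sketch using only the Python standard library; the function name is hypothetical, and the `wave` module reads only PCM-encoded .wav files, so other formats or encodings would need a different reader.

```python
import wave
from pathlib import Path


def wav_files_with_wrong_rate(folder, expected_rate=16000):
    """Return (path, rate) pairs for .wav files whose sample rate differs."""
    mismatches = []
    for path in sorted(Path(folder).rglob("*.wav")):
        # wave.open raises wave.Error for non-PCM files; that is not handled here.
        with wave.open(str(path), "rb") as wav:
            rate = wav.getframerate()
        if rate != expected_rate:
            mismatches.append((str(path), rate))
    return mismatches


if __name__ == "__main__":
    # "YOUR_DATA_DIR" is a placeholder for the folder listed in vae_data.txt.
    for path, rate in wav_files_with_wrong_rate("YOUR_DATA_DIR"):
        print(f"{path}: {rate} Hz")
```

Files reported by this check would need to be resampled, or the config adjusted, before training.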

License

The codebase is released under the MIT License.

Citations

@article{helin2024soloaudio,
  author    = {Wang, Helin and Hai, Jiarui and Lu, Yen-Ju and Thakkar, Karan and Elhilali, Mounya and Dehak, Najim},
  title     = {SoloAudio: Target Sound Extraction with Language-oriented Audio Diffusion Transformer},
  journal   = {arXiv},
  year      = {2024},
}

@inproceedings{jiarui2024dpmtse,
  author    = {Hai, Jiarui and Wang, Helin and Yang, Dongchao and Thakkar, Karan and Dehak, Najim and Elhilali, Mounya},
  booktitle = {ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title     = {DPM-TSE: A Diffusion Probabilistic Model for Target Sound Extraction},
  year      = {2024},
  pages     = {1196-1200},
}