vec2wav 2.0: Advancing Voice Conversion via Discrete Token Vocoders


Important

The pretrained model (with vq-wav2vec tokens as input) and the training procedure are now released!

  • Comparison of our model with several well-known VC models:

[Figure: comparison of vec2wav 2.0 and other VC models]

Environment

Please refer to the environment directory for a requirements.txt and a Dockerfile.

For convenience, we also provide a Linux Docker image, so you can run everything inside a container:

docker pull cantabilekwok511/vec2wav2.0:v0.2
docker run -it -v /path/to/vec2wav2.0:/workspace cantabilekwok511/vec2wav2.0:v0.2
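Once inside the container, you can quickly verify that PyTorch sees the GPU. This is a minimal sanity check, assuming only that PyTorch is installed as specified in environment/requirements.txt:

# Quick environment sanity check (illustrative).
import torch

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())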

Voice Conversion with Pretrained Model

We provide a simple VC interface.

First, please make sure the required models are downloaded into the pretrained/ directory:

  1. vq-wav2vec model from this url
  2. WavLM-Large from this url
  3. Pre-trained vec2wav 2.0 (on vq-wav2vec tokens) from 🤗Huggingface

The resulting directory should look like this:

pretrained/
    - vq-wav2vec_kmeans.pt 
    - WavLM-Large.pt 
    - generator.ckpt
    - config.yml
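If you prefer scripting the downloads, the vec2wav 2.0 checkpoint can be fetched with huggingface_hub. This is a sketch: the repo id below is an assumption, so replace it with the actual 🤗Huggingface repository; the filenames follow the directory layout above.

# pip install huggingface_hub
from huggingface_hub import hf_hub_download

REPO_ID = "cantabile-kwok/vec2wav2.0"  # assumed repo id; adjust to the actual one
for filename in ["generator.ckpt", "config.yml"]:
    hf_hub_download(repo_id=REPO_ID, filename=filename, local_dir="pretrained")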

Then VC can be performed with:

source path.sh
vc.py -s $source_wav -t $speaker_prompt -o $output_wav

where $source_wav and $speaker_prompt should both be mono-channel audio, preferably .wav files. By default, this script loads pretrained/generator.ckpt and the corresponding config.yml; you can provide --expdir to change this path.

If you have trained your own model under $expdir, please specify the checkpoint filename:

vc.py -s $source_wav -t $speaker_prompt -o $output_wav \
      --expdir $expdir --checkpoint /path/to/checkpoint.pkl
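To convert a whole directory of utterances with one speaker prompt, vc.py can also be driven from Python. This is a minimal sketch built on subprocess; it assumes path.sh has been sourced so that vc.py is on $PATH, and the paths are placeholders:

import subprocess
from pathlib import Path

PROMPT = "prompts/target_speaker.wav"  # placeholder speaker prompt
out_dir = Path("converted")
out_dir.mkdir(exist_ok=True)

for src in sorted(Path("sources").glob("*.wav")):
    # Same flags as the single-file command above.
    subprocess.run(
        ["vc.py", "-s", str(src), "-t", PROMPT, "-o", str(out_dir / src.name)],
        check=True,
    )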

Web Interface

We also provide a Gradio web interface for VC. To try our online interactive demo, visit 🤗HuggingFace.

To launch it locally:

# Make sure gradio is installed first
pip install gradio
python vec2wav2/bin/gradio_app.py

This will start a local web server and open the interface in your browser. You can:

  1. Upload source audio (the voice you want to convert)
  2. Upload target speaker audio (the voice you want to convert to)
  3. Click "Convert Voice" to perform the conversion
  4. Listen to or download the converted audio

The web interface uses the same models and settings as the command-line tool.
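Under the hood, the app is essentially Gradio wiring around the same conversion call. The following is a stripped-down sketch, not the repository's actual gradio_app.py; here convert_voice shells out to vc.py as a stand-in for the real in-process model call:

import subprocess
import tempfile

import gradio as gr

def convert_voice(source_path, prompt_path):
    # Stand-in for the real inference: reuse the CLI shown above.
    out_path = tempfile.NamedTemporaryFile(suffix=".wav", delete=False).name
    subprocess.run(["vc.py", "-s", source_path, "-t", prompt_path, "-o", out_path],
                   check=True)
    return out_path

demo = gr.Interface(
    fn=convert_voice,
    inputs=[gr.Audio(type="filepath", label="Source audio"),
            gr.Audio(type="filepath", label="Target speaker audio")],
    outputs=gr.Audio(label="Converted audio"),
    title="vec2wav 2.0 Voice Conversion",
)
demo.launch()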

Training

First, we need to set up data manifests and features. Please refer to ./data_prep.md for a guide on the LibriTTS dataset.

Then, please refer to ./train.sh for training. It automatically launches PyTorch DDP training on all devices listed in CUDA_VISIBLE_DEVICES. If the default port clashes with another job, change os.environ["MASTER_PORT"] in vec2wav2/bin/train.py (see the snippet below).
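For reference, the port override mentioned above is a one-line change inside train.py; 29501 below is just an example of a free port:

# In vec2wav2/bin/train.py (illustrative): override the DDP rendezvous port.
import os
os.environ["MASTER_PORT"] = "29501"  # any free TCP port on this machine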

Decoding (VQ tokens to wav)

If you want to decode the VQ features in an existing feats.scp into waveforms, use

decode.py --feats-scp /path/to/feats.scp --prompt-scp /path/to/prompt.scp \
          --checkpoint /path/to/checkpoint.pkl --config /path/to/config.yml \
          --outdir /path/to/output_dir

Here, prompt.scp maps every utterance (whose content VQ tokens are in feats.scp) to its prompt (WavLM features). It is organized in the same style as feats.scp, as sketched below.
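Both files follow the Kaldi scp convention: one line per utterance, mapping an utterance id to where its matrix is stored, so they can be written and inspected with kaldiio. This is a sketch with placeholder ids and random arrays, assuming 1024-dimensional WavLM-Large features:

# pip install kaldiio numpy
import numpy as np
from kaldiio import WriteHelper, load_scp

# Write prompt (WavLM) features for two utterances; the ids must match feats.scp.
with WriteHelper("ark,scp:prompt.ark,prompt.scp") as writer:
    writer("utt1", np.random.randn(120, 1024).astype(np.float32))
    writer("utt2", np.random.randn(98, 1024).astype(np.float32))

# Lazy-load by utterance id, mirroring how decode.py pairs each utterance
# in feats.scp with its prompt in prompt.scp.
prompts = load_scp("prompt.scp")
print(prompts["utt1"].shape)  # (120, 1024)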

Citation

@article{guo2024vec2wav,
  title={vec2wav 2.0: Advancing Voice Conversion via Discrete Token Vocoders},
  author={Guo, Yiwei and Li, Zhihan and Li, Junjie and Du, Chenpeng and Wang, Hankun and Wang, Shuai and Chen, Xie and Yu, Kai},
  journal={arXiv preprint arXiv:2409.01995},
  year={2024}
}

🔍 See Also: The vec2wav family

  • [paper] vec2wav in VQTTS. Single-speaker.
  • [paper][code] CTX-vec2wav in UniCATS. Multi-speaker with acoustic prompts; much of the code here is borrowed from that repository.
  • 🌟(This) vec2wav 2.0. Enhanced timbre controllability; best suited for VC!

Acknowledgements
