This repository is a fork of Real Time Voice Cloning (RTVC) with a synthesizer that works for the Spanish language. You can check my paper for a more detailed explanation. You can listen to demo audio from all the Spanish models we trained (and a sample from RacoonML's trained model, too) here.
arXiv ID | Designation | Title | Implementation source |
---|---|---|---|
1806.04558 | SV2TTS | Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis | This repo |
1802.08435 | WaveRNN (vocoder) | Efficient Neural Audio Synthesis | fatchord/WaveRNN |
1703.10135 | Tacotron (synthesizer) | Tacotron: Towards End-to-End Speech Synthesis | fatchord/WaveRNN |
1710.10467 | GE2E (encoder) | Generalized End-To-End Loss for Speaker Verification | This repo |
Training data: Mozilla's Common Voice Spanish dataset.
- Both Windows and Linux are supported. A GPU is recommended for training and for inference speed, but is not mandatory.
- Python 3.7 is recommended. Python 3.5 or greater should work, but you will probably have to tweak the dependencies' versions. I recommend setting up a virtual environment with `venv`, but this is optional.
- Install ffmpeg. It is required for reading audio files.
- Install PyTorch. Pick the latest stable version, your operating system, and your package manager (pip by default); if you have a GPU, pick one of the proposed CUDA versions, otherwise pick CPU. Run the given command.
- Install the remaining requirements with `pip install -r requirements.txt`.
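After completing the steps above, a quick sanity check can save debugging time later. The sketch below is not part of this repository (the function name and messages are my own); it simply reports which of the prerequisites listed above appear to be missing:

```python
# Hypothetical environment check -- not part of this repo.
import shutil
import sys


def check_setup(min_python=(3, 5)):
    """Return a list of human-readable problems with the environment (empty if OK)."""
    problems = []
    if sys.version_info[:2] < min_python:
        problems.append("Python %d.%d or greater is required" % min_python)
    if shutil.which("ffmpeg") is None:
        problems.append("ffmpeg not found on PATH")
    try:
        import torch  # noqa: F401  -- only checks that PyTorch is importable
    except ImportError:
        problems.append("PyTorch is not installed")
    return problems


if __name__ == "__main__":
    for problem in check_setup():
        print("WARNING:", problem)
```

If the script prints nothing, the environment matches the requirements above.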
Python 3.6 or 3.7 is needed to run the toolbox.
- Install PyTorch (>=1.1.0).
- Install ffmpeg.
- Run `pip install -r requirements.txt` to install the remaining necessary packages.
Download the latest pretrained models here.
```
python demo_cli.py
```
If all tests pass, you're good to go.
You can then try the toolbox:
```
python demo_toolbox.py
```