SNAC : Speaker-normalized Affine Coupling Layer in Flow-based Architecture for Zero-Shot Multi-Speaker Text-to-Speech
This is an unofficial PyTorch implementation of SNAC. The code is built on top of VITS.
- VCTK dataset is used.
- LibriTTS dataset (train-clean-100 and train-clean-360) is also supported.
- This is the implementation of `Proposed + REF + FLOW` in the paper.
- The major modifications are in `modules.py` (SN/SDN transformation) and `losses.py` (revised log determinant); a hedged sketch of such a coupling step is given after the table below.
- We follow the same flow setting as VITS, using a volume-preserving transformation whose Jacobian determinant is one.
Speaker conditioning applied to each module:

| Text Encoder | Duration Predictor | Flow | Vocoder |
|---|---|---|---|
| None | Input addition | SNAC | None |
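For orientation only, the sketch below shows one way a speaker-normalized coupling step with a revised log determinant could look in PyTorch: a speaker-dependent mean and log-scale are predicted from the speaker embedding, the transformed half of the input is normalized by them before a volume-preserving shift, and the normalization's scaling is what contributes the extra log-determinant term. All class, variable, and parameter names are illustrative assumptions; this is not the actual code in `modules.py` or `losses.py`.

```python
import torch
import torch.nn as nn


class SpeakerNormalizedCoupling(nn.Module):
    """Illustrative speaker-normalized coupling step (not this repo's modules.py).

    x is split into (x0, x1); x1 is normalized with speaker-dependent statistics
    predicted from the speaker embedding g, then shifted by a function of x0.
    The shift itself is volume-preserving, so the only log-determinant
    contribution comes from the speaker normalization's scaling.
    Masks/lengths are omitted for brevity.
    """

    def __init__(self, channels: int, gin_channels: int, hidden_channels: int = 192):
        super().__init__()
        self.half_channels = channels // 2
        # predicts per-channel speaker mean m and log-scale logs
        self.spk_proj = nn.Linear(gin_channels, 2 * self.half_channels)
        # maps the conditioning half x0 to a shift for x1 (no scaling -> volume-preserving)
        self.enc = nn.Sequential(
            nn.Conv1d(self.half_channels, hidden_channels, 5, padding=2),
            nn.ReLU(),
            nn.Conv1d(hidden_channels, self.half_channels, 5, padding=2),
        )

    def forward(self, x, g, reverse: bool = False):
        # x: [B, C, T], g: [B, gin_channels]
        x0, x1 = torch.split(x, [self.half_channels] * 2, dim=1)
        m, logs = torch.chunk(self.spk_proj(g).unsqueeze(-1), 2, dim=1)  # [B, C/2, 1]
        shift = self.enc(x0)
        if not reverse:
            # speaker normalization followed by a volume-preserving shift
            z1 = (x1 - m) * torch.exp(-logs) + shift
            # "revised" log determinant: only the normalization scales the input
            logdet = (-logs).sum(dim=1).squeeze(-1) * x.size(2)
            return torch.cat([x0, z1], dim=1), logdet
        else:
            # exact inverse: undo the shift, then the speaker normalization
            x1 = (x1 - shift) * torch.exp(logs) + m
            return torch.cat([x0, x1], dim=1), None
```

Whether shift-only coupling and per-utterance speaker statistics match the paper's exact parameterization should be checked against `modules.py`; the sketch only illustrates the normalize-then-couple pattern and where the revised log determinant comes from.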
- Clone this repository.
- Install Python requirements; please refer to `requirements.txt`.
    - You may need to install espeak first:
      ```sh
      apt-get install espeak
      ```
- Download datasets
    - Download and extract the VCTK dataset, and downsample the wav files to 22050 Hz (a minimal downsampling sketch follows this download step). Then rename or create a link to the dataset folder:
      ```sh
      ln -s /path/to/VCTK-Corpus/downsampled_wavs DUMMY3
      ```
    - For the LibriTTS dataset, downsample the wav files to 22050 Hz and link to the dataset folder:
      ```sh
      ln -s /path/to/LibriTTS DUMMY2
      ```
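The repository does not prescribe a particular downsampling tool. Below is a minimal sketch assuming `librosa` and `soundfile` are installed; `SRC_DIR` and `DST_DIR` are placeholder paths and this helper is not part of the repository:

```python
# Minimal resampling sketch (assumed helper, not part of this repository).
# Walks SRC_DIR, resamples every .wav to 22050 Hz, and mirrors the tree under DST_DIR.
import os

import librosa
import soundfile as sf

SRC_DIR = "/path/to/VCTK-Corpus/wav48"             # placeholder: original wavs
DST_DIR = "/path/to/VCTK-Corpus/downsampled_wavs"  # placeholder: output folder
TARGET_SR = 22050

for root, _, files in os.walk(SRC_DIR):
    for name in files:
        if not name.endswith(".wav"):
            continue
        out_dir = os.path.join(DST_DIR, os.path.relpath(root, SRC_DIR))
        os.makedirs(out_dir, exist_ok=True)
        # librosa resamples while loading when sr is given explicitly
        wav, _ = librosa.load(os.path.join(root, name), sr=TARGET_SR)
        sf.write(os.path.join(out_dir, name), wav, TARGET_SR)
```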
- Build Monotonic Alignment Search and run preprocessing if you use your own datasets.
  ```sh
  # Cython-version Monotonic Alignment Search
  cd monotonic_align
  python setup.py build_ext --inplace
  ```
- Train the model:
  ```sh
  python train.py -c configs/vctk_base.json -m vctk_base
  ```
- For inference, see `inference.ipynb`.