SNAC : Speaker-normalized Affine Coupling Layer in Flow-based Architecture for Zero-Shot Multi-Speaker Text-to-Speech

This is an unofficial PyTorch implementation of SNAC. The code is built on VITS.

  1. VCTK dataset is used.
  2. LibriTTS dataset (train-clean-100 and train-clean-360) is also supported.
  3. This is the implementation of Proposed + REF + FLOW in the paper.
  4. The major modifications are in modules.py (SN/SDN transformation) and losses.py (revised log determinant).
  5. We follow the same flow setting as VITS, using a volume-preserving transformation whose Jacobian determinant is one.

| Text Encoder | Duration Predictor | Flow | Vocoder |
| --- | --- | --- | --- |
| None | Input addition | SNAC | None |
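
As a rough illustration of the flow setting described above, here is a minimal sketch of a volume-preserving (mean-only) coupling step whose shift is computed from a speaker-normalized input. It is not the code in modules.py. The form of SN/SDN (subtract a speaker-dependent mean, divide by the exponential of a speaker-dependent log standard deviation, and the reverse), the choice to normalize only the conditioning half, and all names (`speaker_normalize`, `speaker_denormalize`, `shift_net`, `coupling_forward`, `coupling_inverse`) are assumptions made for this example.

```python
import math

def speaker_normalize(x, mean, log_std):
    """SN (assumed form): remove a speaker-dependent mean and scale."""
    return [(xi - m) / math.exp(v) for xi, m, v in zip(x, mean, log_std)]

def speaker_denormalize(x, mean, log_std):
    """SDN (assumed form): exact inverse of speaker_normalize."""
    return [xi * math.exp(v) + m for xi, m, v in zip(x, mean, log_std)]

def shift_net(x_a):
    """Hypothetical stand-in for the network that predicts the shift."""
    return [0.5 * xi + 1.0 for xi in x_a]

def coupling_forward(x, mean, log_std):
    """Mean-only coupling: shift the second half using the normalized first half."""
    half = len(x) // 2
    x_a, x_b = x[:half], x[half:]
    shift = shift_net(speaker_normalize(x_a, mean[:half], log_std[:half]))
    y_b = [b + s for b, s in zip(x_b, shift)]
    log_det = 0.0  # a pure shift has Jacobian determinant one
    return x_a + y_b, log_det

def coupling_inverse(y, mean, log_std):
    """Undo coupling_forward by recomputing the same shift and subtracting it."""
    half = len(y) // 2
    y_a, y_b = y[:half], y[half:]
    shift = shift_net(speaker_normalize(y_a, mean[:half], log_std[:half]))
    x_b = [b - s for b, s in zip(y_b, shift)]
    return y_a + x_b

# Toy values; in a real model the speaker statistics would be predicted
# from a reference utterance rather than written by hand.
x = [1.0, 2.0, 3.0, 4.0]
mean = [0.5, -0.5, 0.0, 0.0]
log_std = [0.1, -0.2, 0.0, 0.0]

y, log_det = coupling_forward(x, mean, log_std)
x_rec = coupling_inverse(y, mean, log_std)
```

Because the first half passes through unchanged and the second half is only shifted, the step is invertible with a log-determinant of zero, whatever the speaker statistics are.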

Prerequisites

  1. Clone this repository.
  2. Install the Python requirements. Please refer to requirements.txt
    1. You may need to install espeak first: apt-get install espeak
  3. Download datasets
    1. Download and extract the VCTK dataset, and downsample wav files to 22050 Hz. Then rename or create a link to the dataset folder: ln -s /path/to/VCTK-Corpus/downsampled_wavs DUMMY3
    2. For LibriTTS dataset, downsample wav files to 22050 Hz and link to the dataset folder: ln -s /path/to/LibriTTS DUMMY2
  4. Build Monotonic Alignment Search and run preprocessing if you use your own datasets.
# Cython-version Monotonic Alignment Search
cd monotonic_align
python setup.py build_ext --inplace
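
The downsampling in step 3 is not scripted in this README. One possible way to do it is sketched below, assuming ffmpeg is installed and on the PATH. The helper names `build_resample_cmd` and `resample_tree` are hypothetical and not part of this repository.

```python
import subprocess
from pathlib import Path

TARGET_SR = 22050  # sample rate required by the steps above

def build_resample_cmd(src, dst, sample_rate=TARGET_SR):
    """Return an ffmpeg command that writes src to dst at sample_rate Hz."""
    # -y overwrites existing output, -ar sets the audio sample rate.
    return ["ffmpeg", "-y", "-i", str(src), "-ar", str(sample_rate), str(dst)]

def resample_tree(src_root, dst_root, sample_rate=TARGET_SR, dry_run=True):
    """Mirror every .wav under src_root into dst_root at the target rate.

    With dry_run=True the commands are only collected, not executed.
    """
    src_root, dst_root = Path(src_root), Path(dst_root)
    cmds = []
    for src in sorted(src_root.rglob("*.wav")):
        dst = dst_root / src.relative_to(src_root)
        cmd = build_resample_cmd(src, dst, sample_rate)
        cmds.append(cmd)
        if not dry_run:
            dst.parent.mkdir(parents=True, exist_ok=True)
            subprocess.run(cmd, check=True)
    return cmds
```

Calling `resample_tree` with your original wav folder as `src_root`, `/path/to/VCTK-Corpus/downsampled_wavs` as `dst_root`, and `dry_run=False` would produce the folder that the `ln -s` command in step 3 links to. Running it first with the default `dry_run=True` lets you inspect the commands before any files are written.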

Training Example

python train.py -c configs/vctk_base.json -m vctk_base

Inference Example

See inference.ipynb