(Tested on a Quadro RTX 5000 with NVIDIA driver version 535.104.05 and CUDA version 12.2 on Ubuntu 22.04.)
-
Build the Docker image from the Dockerfile
docker build -t whispervits-svc .
-
Enter the Docker container
-
Download the Timbre Encoder (Speaker-Encoder by @mueller91) and put best_model.pth.tar into speaker_pretrain/.
-
Download the Whisper model whisper-large-v2. Make sure to download large-v2.pt and put it into whisper_pretrain/.
-
Download the hubert_soft model and put hubert-soft-0d54a1f4.pt into hubert_pretrain/.
-
Download the pitch extractor crepe full and put full.pth into crepe/assets.
Note: the crepe full.pth file is 84.9 MB, not 6 kB.
-
Download the trained model lesd5_100.pretrain.pth and put it into vits_pretrain/.
-
Make sure you have downloaded the wav_spk_1 folder from the Benchmarking-SGDD repository, then run the script:
python convert-TWH-spk1.py /path/to/wav_spk_1
The output is a folder containing all the conversions used in the evaluation, the same ones found on this Google Drive.
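
Before running the conversion, it may help to confirm that every pretrained file from the download steps is in place. The following is a minimal sketch, not part of the repository: the function name find_problems and the 1 MB size threshold are assumptions made for illustration, and the paths are the ones listed in the steps, relative to the repository root. The size check is there mainly to catch the crepe case mentioned in the note, where full.pth ends up as a few kB instead of 84.9 MB.

```python
from pathlib import Path

# Relative paths taken from the download steps in this README.
EXPECTED_FILES = [
    "speaker_pretrain/best_model.pth.tar",
    "whisper_pretrain/large-v2.pt",
    "hubert_pretrain/hubert-soft-0d54a1f4.pt",
    "crepe/assets/full.pth",
    "vits_pretrain/lesd5_100.pretrain.pth",
]

# Assumption: a checkpoint smaller than 1 MB is treated as a failed download
# (e.g. the ~6 kB crepe file instead of the real 84.9 MB one).
MIN_BYTES = 1_000_000


def find_problems(root="."):
    """Return a list of human-readable problems; an empty list means all files look fine."""
    problems = []
    for rel in EXPECTED_FILES:
        path = Path(root) / rel
        if not path.is_file():
            problems.append(f"missing: {rel}")
        elif path.stat().st_size < MIN_BYTES:
            size = path.stat().st_size
            problems.append(f"suspiciously small ({size} bytes): {rel}")
    return problems


if __name__ == "__main__":
    # Run from the repository root (inside the container).
    for line in find_problems() or ["all pretrained files look OK"]:
        print(line)
```

Saved under a name of your choice and run from the repository root, this prints one line per missing or undersized file. It only checks presence and size, not whether the weights are the correct version.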