This repository is a guide to preparing a dataset and running fine-tuning with StyleTTS2: https://github.com/yl4579/StyleTTS2
- A gradio webui / text-generation-webui extension providing TTS with your fine-tuned model will be finished soon.
- 12/6/23: I noticed segmentation from the whisperx .json was unacceptable, so I created a segmentation script that uses the .srt file the whisperx command generates. From what I can tell this is significantly more accurate, though that could be dataset specific. Use the .json segmenter if needed.
- 12/5/23: Fixed a missing "else" in the Segmentation script.
- 12/4/23: A working config_ft.yml file is available in the tools folder.
- 12/2/23: Rewrote Segmentation and Transcription scripts.
The scripts are compatible with WSL2 and Linux. Windows requires additional dependencies and might not be worth the effort.
- Install conda and create an environment with Python 3.10, then activate it:
- conda create --name dataset python=3.10
- conda activate dataset
- pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 -U
- pip install git+https://github.com/m-bain/whisperx.git
- pip install phonemizer pydub
- Place a single 24kHz .wav file in the /StyleGuide/makeDataset folder.
- Run the whisperx command on the wav file from the command line (if your GPU can't handle large-v2, there are other models you can use):
- whisperx /StyleGuide/makeDataset/wavfile.wav --model large-v2 --align_model WAV2VEC2_ASR_LARGE_LV60K_960H
- The above command will generate a set of transcript files (including the .srt and .json used below). Save the resulting files.
- Navigate to the tools directory:
- Open srtsegmenter.py and fill out all the file paths.
- Run the segmentation script:
The above steps will generate a set of segmented audio files, a folder of rejected audio it didn't like, and an output.txt file. I have it set to throw out segments under one second and over 11.6 seconds; you can adjust these thresholds as needed.
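For reference, here is a minimal sketch of what SRT-based segmentation like this involves, using pydub (installed above). The paths, segment naming, and the `name|transcript` output format are assumptions for illustration; srtsegmenter.py in the tools folder is the authoritative version.

```python
# Hypothetical sketch; the real logic lives in tools/srtsegmenter.py.
import os
import re
from pydub import AudioSegment

SRT_PATH = "wavfile.srt"        # placeholder: the .srt whisperx produced
WAV_PATH = "wavfile.wav"        # placeholder: your source 24kHz wav
OUT_DIR, BAD_DIR = "segments", "bad_segments"
MIN_MS, MAX_MS = 1_000, 11_600  # the 1s / 11.6s thresholds mentioned above

def to_ms(stamp: str) -> int:
    """Convert an SRT timestamp 'HH:MM:SS,mmm' to milliseconds."""
    hms, ms = stamp.split(",")
    h, m, s = hms.split(":")
    return ((int(h) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms)

audio = AudioSegment.from_wav(WAV_PATH)
os.makedirs(OUT_DIR, exist_ok=True)
os.makedirs(BAD_DIR, exist_ok=True)

# An SRT cue is: index line, "start --> end" line, then one or more text lines.
cue = re.compile(
    r"(\d\d:\d\d:\d\d,\d\d\d) --> (\d\d:\d\d:\d\d,\d\d\d)\s*\n(.*?)(?:\n\s*\n|\Z)",
    re.S,
)

with open("output.txt", "w", encoding="utf-8") as out:
    srt_text = open(SRT_PATH, encoding="utf-8").read()
    for i, (start, end, text) in enumerate(cue.findall(srt_text)):
        name = f"segment_{i:04d}.wav"
        start_ms, end_ms = to_ms(start), to_ms(end)
        clip = audio[start_ms:end_ms]
        if MIN_MS <= end_ms - start_ms <= MAX_MS:
            clip.export(os.path.join(OUT_DIR, name), format="wav")
            out.write(f"{name}|{' '.join(text.split())}\n")
        else:  # too short or too long: park it in the reject folder
            clip.export(os.path.join(BAD_DIR, name), format="wav")
```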
- Open phonemized.py and fill out the file paths.
- Run the script.
- This script will create the train_list.txt and val_list.txt files.
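Roughly, the phonemization step looks like the sketch below. It assumes the `name|transcript` format from output.txt above, the espeak backend of the phonemizer package installed earlier, StyleTTS2's `wav|phonemes|speaker` list format, and a 90/10 train/val split; the real phonemized.py may differ in the details.

```python
# Hypothetical sketch of the phonemization step (see tools/phonemized.py).
import random
from phonemizer import phonemize

SPEAKER = "0"  # assumed single-speaker label for a fine-tune
pairs = [line.split("|", 1)
         for line in open("output.txt", encoding="utf-8").read().splitlines()
         if line]

# Batch-phonemize the transcripts; punctuation is preserved because the
# model trains on punctuated IPA text.
ipa = phonemize([text for _, text in pairs], language="en-us", backend="espeak",
                preserve_punctuation=True, with_stress=True)

entries = [f"{name}|{ph}|{SPEAKER}" for (name, _), ph in zip(pairs, ipa)]
random.shuffle(entries)
split = int(len(entries) * 0.9)  # assumed 90/10 train/val split

with open("train_list.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(entries[:split]) + "\n")
with open("val_list.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(entries[split:]) + "\n")
```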
- OOD_list.txt comes from the LibriTTS dataset. The following points are taken from the notes at yl4579/StyleTTS2#81; there is a lot of good information there, and I suggest looking it over.
- The LibriTTS dataset has poor punctuation and a mismatch between spoken/unspoken pauses and the transcripts. This is a common oversight in many datasets.
- It also lacks variety in punctuation. In the field, you may encounter texts with creative use of dashes, pauses, and combinations of quotes and punctuation; LibriTTS lacks those cases, but the model can learn them!
- Additionally, LibriTTS has stray quotes in some texts, or begins a sentence with a quote. These reduce quality a little (or sometimes a lot), so you will want to filter them out; a simple filter sketch follows this list.
- Creating your own OOD_list.txt is an option. I need to play around with it more, but the only real requirements should be good punctuation and text the model has not seen. I'm not sure what the ideal size for this list is, though.
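If you do build your own list, a simple screen like the sketch below could catch the quote problems described above. It assumes one utterance per line; adjust the parsing if your file carries extra |-separated fields.

```python
# Hypothetical filter for cleaning a homemade OOD list.
def looks_clean(text: str) -> bool:
    text = text.strip()
    if not text:
        return False
    if text.startswith(('"', "'", "“")):  # sentence begins with a quote
        return False
    if text.count('"') % 2 == 1:          # stray/unbalanced quote
        return False
    return True

with open("OOD_list.txt", encoding="utf-8") as src, \
     open("OOD_list_filtered.txt", "w", encoding="utf-8") as dst:
    for line in src:
        if looks_clean(line):
            dst.write(line)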
- At this point, you should have everything you need to fine-tune.
- Clone the StyleTTS2 repository and navigate to its directory:
- git clone https://github.com/yl4579/StyleTTS2.git
- cd StyleTTS2
- Install the required packages:
- pip install -r requirements.txt
- sudo apt-get install espeak-ng
- Prepare the data and model:
- Clear the wavs folder in the Data directory and replace its contents with your segmented wav files.
- Replace the val_list and train_list files in the Data folder with yours. Keep the OOD_list.txt file.
- Adjust the parameters in the config_ft.yml file in the Configs folder to suit your needs; an illustrative excerpt follows.
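For orientation, these are the kinds of fields you will be adjusting. The field names follow the upstream StyleTTS2 config, but the values here are assumptions; the working config_ft.yml in the tools folder is authoritative.

```yaml
# Illustrative excerpt only; check it against the config_ft.yml in tools.
epochs: 50                # assumed; scale to your dataset size
batch_size: 2             # reduce if you run out of VRAM
max_len: 400              # max segment length in frames; lower to save memory
pretrained_model: "Models/LibriTTS/epochs_2nd_00020.pth"
load_only_params: true    # start from the LibriTTS weights, not its optimizer state
data_params:
  train_data: "Data/train_list.txt"
  val_data: "Data/val_list.txt"
  OOD_data: "Data/OOD_texts.txt"   # or your own OOD list
  root_path: "Data/wavs"
```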
- Download the StyleTTS2-LibriTTS model and place it in the Models/LibriTTS directory.
Finally, you can start the fine-tuning process with the following command:
- accelerate launch --mixed_precision=fp16 --num_processes=1 train_finetune_accelerate.py --config_path ./Configs/config_ft.yml