Poor audio quality after fine-tuning #49

Closed
danielmsu opened this issue Nov 21, 2023 · 3 comments

Comments

@danielmsu
Contributor

I'm trying to fine-tune the LibriTTS checkpoint on ~1 hour of LJSpeech but get poor results. Could you please give me some direction or help me spot the issue?

How I fine-tuned:

  1. Pulled the latest changes from the repo
  2. Replaced Data/train_list.txt with a copy that only has the first 1000 lines (~1 hour for training)
  3. Changed batch_size to 4 and max_len to 100, since otherwise training doesn't fit into the memory of my 4090 (24 GB); a config sketch follows this list.
  4. After training for 50-100 epochs, I tested the new checkpoints with both the Inference_LibriTTS.ipynb and Inference_LJSpeech.ipynb notebooks by setting the multispeaker parameter in the config to true or false.
  5. Inference_LJSpeech.ipynb produces very noisy results with poor pronunciation.
  6. Inference_LibriTTS.ipynb with a reference audio from LJSpeech has good pronunciation, but there is noticeable noise (example: https://voca.ro/1nQ8Ltjhsh9y).
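
To be concrete about step 3, here is roughly how I apply the change; a minimal sketch that assumes the fine-tuning config lives at Configs/config_ft.yml with top-level batch_size and max_len keys (adjust the path and key names if your checkout differs):

```python
# Sketch only (my local tweak for a 24 GB card), assuming the fine-tuning
# config lives at Configs/config_ft.yml with top-level batch_size / max_len keys.
import yaml

cfg_path = "Configs/config_ft.yml"
with open(cfg_path) as f:
    cfg = yaml.safe_load(f)

cfg["batch_size"] = 4   # down from the default 16
cfg["max_len"] = 100    # down from the default 400

with open(cfg_path, "w") as f:
    yaml.safe_dump(cfg, f, sort_keys=False)
```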

Thank you again for the awesome project!

@yl4579
Owner

yl4579 commented Nov 21, 2023

For 4, did you change multispeaker to true or false? The default is true, and the default settings produce better results than what you got. The only other difference I can see is batch_size (from 16 to 4), but that shouldn't make this big a difference. max_len going from 400 to 100 is probably the cause. This is what I got by fine-tuning with one hour of data using the default settings: https://voca.ro/1aC4vr4jErDL
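
For intuition on why that matters, here is a rough back-of-the-envelope conversion of max_len to audio length; this sketch assumes max_len counts mel frames and that preprocessing uses a 24 kHz sampling rate with a 300-sample hop (both are assumptions to double-check against your config):

```python
# Rough sketch: audio covered by one training segment of max_len frames.
# Assumes max_len is counted in mel frames, 24 kHz audio, 300-sample hop
# (assumptions; check the mel-spectrogram settings in your config).
SR = 24_000   # samples per second (assumed)
HOP = 300     # samples per mel frame (assumed)

for max_len in (100, 400):
    print(f"max_len={max_len} -> ~{max_len * HOP / SR:.2f} s per segment")
# max_len=100 -> ~1.25 s, max_len=400 -> ~5.00 s: very short segments give
# the model little prosodic context to learn from.
```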

@danielmsu
Contributor Author

danielmsu commented Nov 21, 2023

For 4, did you change multispeaker to true or false?

I fine-tuned the model with multispeaker:true and then tried inference with both true and false. It definitely works better with true; the example I attached was also generated with multispeaker:true. I didn't try fine-tuning with false, but I assume a model fine-tuned with true in the config should produce better results anyway, is that correct?

max_len from 400 to 100 is probably the cause

Do you know what the minimum value is for decent results? Unfortunately, I cannot use 400, but maybe I could set it somewhat higher than 100 if I reduce batch_size even further. Training speed is not a concern for me.
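
To make the trade-off concrete, here is a tiny sketch of the heuristic I have in mind: assuming activation memory scales roughly with batch_size × max_len (an assumption, not a measurement), these combinations should have a footprint comparable to my current batch_size=4 / max_len=100 setting:

```python
# Heuristic sketch (assumption: activation memory scales ~ batch_size * max_len).
# Keep the product constant to stay near the footprint of batch_size=4, max_len=100.
budget = 4 * 100
for batch_size in (4, 2):
    print(f"batch_size={batch_size} -> max_len~{budget // batch_size}")
# batch_size=4 -> max_len~100
# batch_size=2 -> max_len~200
```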

This is what I got by finetuning with one hour of data: https://voca.ro/1aC4vr4jErDL using the default setting.

Yes, that sounds much better. Could you please share your inference parameters? It would be awesome if you still have the alpha/beta values and the name of the reference clip, so I can compare my results using the same settings.

Thanks!

@yl4579
Owner

yl4579 commented Nov 21, 2023

Yes, you can leave the multispeaker setting set to true. I used the same inference code as in the Colab notebook: https://colab.research.google.com/github/yl4579/StyleTTS2/blob/main/Colab/StyleTTS2_Finetune_Demo.ipynb
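
Regarding the alpha/beta question, here is a minimal sketch of how those parameters are passed to the inference helper in the notebooks; it assumes the notebook's setup cells have already been run and that the helper keeps an inference(text, ref_s, alpha, beta, diffusion_steps, embedding_scale) signature (treat the exact names, the example values, and the reference path as assumptions and check the notebook itself):

```python
# Sketch of an inference call in the style of the demo notebooks. Assumes the
# notebook's setup cells have run, so compute_style() and inference() exist.
ref_s = compute_style("Data/reference.wav")  # hypothetical reference clip path

wav = inference(
    "This is a quick fine-tuning check.",
    ref_s,
    alpha=0.3,          # assumed meaning: lower keeps timbre closer to the reference
    beta=0.7,           # assumed meaning: higher lets prosody follow the text more
    diffusion_steps=5,
    embedding_scale=1,
)
```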

I haven't really tested different max_len values, but try to increase it as much as you can while keeping the batch size at least 2, and also do the SLM adversarial training run if you can (though it is very memory-consuming). I know the code is not very friendly to low-memory GPUs right now because of the DP implementation. You can wait for the fixed DDP implementation with mixed-precision training.
