Poor audio quality after fine-tuning #49

Closed
danielmsu opened this issue Nov 21, 2023 · 3 comments

Comments

@danielmsu
Contributor

I'm trying to fine-tune the LibriTTS checkpoint on ~1 hour of LJSpeech but get poor results. Could you please give me some direction or help me spot the issue?

How I fine-tuned:

  1. Pulled the latest changes from the repo
  2. Replaced Data/train_list.txt with a copy that only has the first 1000 lines (~1 hour for training)
  3. Changed batch_size to 4 and max_len to 100, since otherwise training doesn't fit into the memory of my 4090 (24 GB); a config sketch follows this list.
  4. After training for 50-100 epochs, I tested the new checkpoints with both the Inference_LibriTTS.ipynb and Inference_LJSpeech.ipynb notebooks by setting the multispeaker parameter in the config to true or false.
  5. Inference_LJSpeech.ipynb produces very noisy results with poor pronunciation.
  6. Inference_LibriTTS.ipynb with a reference audio from LJSpeech has good pronunciation, but there is noticeable noise (example: https://voca.ro/1nQ8Ltjhsh9y).
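
To be concrete about step 3, here is roughly how I apply the change; a minimal sketch that assumes the fine-tuning config lives at Configs/config_ft.yml with top-level batch_size and max_len keys (adjust the path and key names if your checkout differs):

```python
# Sketch only (my local tweak for a 24 GB card), assuming the fine-tuning
# config lives at Configs/config_ft.yml with top-level batch_size / max_len keys.
import yaml

cfg_path = "Configs/config_ft.yml"
with open(cfg_path) as f:
    cfg = yaml.safe_load(f)

cfg["batch_size"] = 4   # down from the default 16
cfg["max_len"] = 100    # down from the default 400

with open(cfg_path, "w") as f:
    yaml.safe_dump(cfg, f, sort_keys=False)
```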

Thank you again for the awesome project!

@yl4579
Owner

yl4579 commented Nov 21, 2023

For 4, did you change multispeaker to true or false? The default is true, and the default settings produce better results than what you got. The only other difference I can see is batch_size (from 16 to 4), but that shouldn't make this big a difference. max_len going from 400 to 100 is probably the cause. This is what I got by fine-tuning with one hour of data using the default settings: https://voca.ro/1aC4vr4jErDL
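
For intuition on why that matters, here is a rough back-of-the-envelope conversion of max_len to audio length; this sketch assumes max_len counts mel frames and that preprocessing uses a 24 kHz sampling rate with a 300-sample hop (both are assumptions to double-check against your config):

```python
# Rough sketch: audio covered by one training segment of max_len frames.
# Assumes max_len is counted in mel frames, 24 kHz audio, 300-sample hop
# (assumptions; check the mel-spectrogram settings in your config).
SR = 24_000   # samples per second (assumed)
HOP = 300     # samples per mel frame (assumed)

for max_len in (100, 400):
    print(f"max_len={max_len} -> ~{max_len * HOP / SR:.2f} s per segment")
# max_len=100 -> ~1.25 s, max_len=400 -> ~5.00 s: very short segments give
# the model little prosodic context to learn from.
```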

@danielmsu
Contributor Author

danielmsu commented Nov 21, 2023

For 4, did you change multispeaker to true or false?

I fine-tuned the model with multispeaker:true and then tried inference with both true and false. It definitely works better with true; the example I attached was also generated with multispeaker:true. I didn't try fine-tuning with false, but I assume a model fine-tuned with true in the config should produce better results anyway, is that correct?

max_len from 400 to 100 is probably the cause

Do you know what the minimum value is for decent results? Unfortunately, I cannot use 400, but maybe I could set it somewhat higher than 100 if I reduce batch_size even further. Training speed is not a concern for me.
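
To make the trade-off concrete, here is a tiny sketch of the heuristic I have in mind: assuming activation memory scales roughly with batch_size × max_len (an assumption, not a measurement), these combinations should have a footprint comparable to my current batch_size=4 / max_len=100 setting:

```python
# Heuristic sketch (assumption: activation memory scales ~ batch_size * max_len).
# Keep the product constant to stay near the footprint of batch_size=4, max_len=100.
budget = 4 * 100
for batch_size in (4, 2):
    print(f"batch_size={batch_size} -> max_len~{budget // batch_size}")
# batch_size=4 -> max_len~100
# batch_size=2 -> max_len~200
```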

This is what I got by finetuning with one hour of data: https://voca.ro/1aC4vr4jErDL using the default setting.

Yes, that sounds much better. Could you please share your inference parameters? It would be awesome if you still have the alpha/beta values and the name of the reference clip, so I can compare my results using the same settings.

Thanks!

@yl4579
Owner

yl4579 commented Nov 21, 2023

Yes, you can leave the multispeaker setting set to true. I used the same inference code as in the Colab notebook: https://colab.research.google.com/github/yl4579/StyleTTS2/blob/main/Colab/StyleTTS2_Finetune_Demo.ipynb
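
Regarding the alpha/beta question, here is a minimal sketch of how those parameters are passed to the inference helper in the notebooks; it assumes the notebook's setup cells have already been run and that the helper keeps an inference(text, ref_s, alpha, beta, diffusion_steps, embedding_scale) signature (treat the exact names, the example values, and the reference path as assumptions and check the notebook itself):

```python
# Sketch of an inference call in the style of the demo notebooks. Assumes the
# notebook's setup cells have run, so compute_style() and inference() exist.
ref_s = compute_style("Data/reference.wav")  # hypothetical reference clip path

wav = inference(
    "This is a quick fine-tuning check.",
    ref_s,
    alpha=0.3,          # assumed meaning: lower keeps timbre closer to the reference
    beta=0.7,           # assumed meaning: higher lets prosody follow the text more
    diffusion_steps=5,
    embedding_scale=1,
)
```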

I haven't really tested different max_len values, but try to increase it as much as you can while keeping the batch size at least 2, and also do the SLM adversarial training run if you can (though it is very memory-consuming). I know the code is not very friendly to low-memory GPUs right now because of the DP implementation. You can wait for the fixed DDP implementation with mixed-precision training.
