After following the installation instructions (plus replacing phonemizer with https://github.com/justinjohn0306/phonemizer to make it work on Win 11), and using the same examples from the demo page, I was unable to replicate the quality of the examples. For example, the whispering voice always outputs something between a whisper and a normal voice. I tried both the inference script and the Gradio app, with the same result. Additionally, duration calculation seems to be broken for Chinese: when set to auto, it makes the output twice as fast as it should be.
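As a possible workaround for the Chinese duration problem, one could derive a fixed target duration from the prompt's own speaking rate instead of relying on auto mode, and pass that to inference wherever a fixed duration is accepted. This is only a sketch under my own assumptions; `estimate_target_duration` is a hypothetical helper, not part of the repo:

```python
def estimate_target_duration(prompt_duration_s: float,
                             prompt_text: str,
                             target_text: str) -> float:
    """Estimate how long the synthesized target should be, assuming the
    target is spoken at the same characters-per-second rate as the
    prompt audio. A crude heuristic, but avoids the auto duration mode."""
    chars_per_second = len(prompt_text) / prompt_duration_s
    return len(target_text) / chars_per_second

# e.g. a 5 s prompt containing 20 characters runs at 4 chars/s,
# so a 40-character target should take about 10 s.
print(estimate_target_duration(5.0, "x" * 20, "x" * 40))
```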
Steps to reproduce the behavior:
Follow the instructions to install on Win 11 with special phonemizer and generate audio
Expected behavior
Quality should be the same as the examples
Screenshots
Environment Information
Operating System: Windows 11
Python Version: Python 3.10.15
Driver & CUDA Version: Driver 546.92 & CUDA 12.4
Error Messages and Logs: posted above; here is the quoted version:
./models/tts/maskgct/g2p\sources\g2p_chinese_model\poly_bert_model.onnx
Start loading: facebook/w2v-bert-2.0
D:\AIMaskGCTTTS\Amphion\models\tts\maskgct\gradio_demo.py:103: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
stat_mean_var = torch.load("./models/tts/maskgct/ckpt/wav2vec2bert_stats.pt")
D:\AIMaskGCTTTS\venv\lib\site-packages\torch\nn\utils\weight_norm.py:143: FutureWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
WeightNorm.apply(module, name, dim)
Models built successfully.
Checkpoints downloaded successfully.
Checkpoints loaded successfully.
To create a public link, set share=True in launch().
===== New task submitted =====
Start inference...
Audio loaded.
D:\AIMaskGCTTTS\venv\lib\site-packages\whisper\__init__.py:150: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
checkpoint = torch.load(fp, map_location=device)
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\Ozkan\AppData\Local\Temp\jieba.cache
Loading model cost 0.321 seconds.
Prefix dict has been built successfully.
D:\AIMaskGCTTTS\Amphion\models\tts\maskgct\g2p\g2p\chinese_model_g2p.py:100: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\utils\tensor_new.cpp:281.)
batch_label_starts = torch.tensor(batch_label_starts, dtype=torch.long)
Saved: ./output/output_0.wav
===== New task submitted =====
Start inference...
Audio loaded.
Saved: ./output/output_1.wav
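The FutureWarning and UserWarning lines in the log above are unrelated to the output quality, but they can be addressed exactly as the warnings suggest. Here is a minimal, self-contained sketch of both fixes (using stand-in data rather than the real checkpoint and g2p labels):

```python
import numpy as np
import torch

# Fix 1 — the torch.load FutureWarning: pass weights_only=True when the
# file only contains tensors, as wav2vec2bert_stats.pt should. Shown here
# with a stand-in stats file so the snippet runs on its own.
torch.save({"mean": torch.zeros(3), "var": torch.ones(3)}, "stats_demo.pt")
stat_mean_var = torch.load("stats_demo.pt", weights_only=True)

# Fix 2 — the UserWarning in chinese_model_g2p.py: convert the list of
# numpy arrays into one ndarray before handing it to torch.tensor().
batch_label_starts = [np.zeros(4), np.ones(4)]  # stand-in data
batch_label_starts = torch.tensor(np.array(batch_label_starts),
                                  dtype=torch.long)
print(batch_label_starts.shape)  # torch.Size([2, 4])
```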
Additional context
Thank you very much for this project
Hi, to me it sounds like the generated speech is trying to speak in a whisper style. You may need to run inference multiple times to get the best result. Beyond that, this could be improved by fine-tuning on high-quality whispered speech, or by adding more whispered speech in the training stage.
MaskGCT was not designed for whispered speech generation; we only discovered this capability while testing the model. That is why it cannot produce whispered speech reliably every time.
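The "run multiple times" advice above can be automated as a best-of-N loop: generate several candidates and keep the one with the highest score under some quality metric (e.g. speaker similarity to the prompt). This is only a sketch — `synthesize` and `score` are placeholders for your own inference call and metric, not APIs from this repo:

```python
import random

def best_of_n(synthesize, score, n=5, seed=0):
    """Run a TTS call n times and keep the highest-scoring sample.

    synthesize: zero-argument callable returning one audio candidate.
    score:      callable mapping a candidate to a quality score.
    """
    random.seed(seed)  # make the toy demo below reproducible
    best_audio, best_score = None, float("-inf")
    for _ in range(n):
        audio = synthesize()
        s = score(audio)
        if s > best_score:
            best_audio, best_score = audio, s
    return best_audio, best_score

# Toy stand-ins: a "candidate" is a random float and its score is itself.
audio, s = best_of_n(lambda: random.random(), lambda a: a, n=10)
```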
TKsavy and yuantuo666, I can run the project on my Windows 11 machine, but I couldn't reproduce the demo page examples. What could be the problem? Take the whisper voice example on the demo page: I downloaded the prompt sample from there and generated the same text, but it always outputs something between a whisper and a low voice, whereas the demo page examples are successful clones. My generations are generally of lower quality no matter what I try; I've gone up to 100 inference steps.
I've also tried every version, including this one, the Windows fork, and Google Colab (to test in a Linux environment), but all of them produce inferior results compared to your examples. Are the released models from an earlier training checkpoint, by any chance? Are you able to reproduce those results with the currently shared models?
This was my issue for this matter with detailed logs and outputs: #334
Since I was not involved in training MaskGCT or generating the demos, I don't know the details. Could @HeCheng0625 help with this?
This is the demo page result:
https://vocaroo.com/15JxVNPRScwD
This is mine:
https://vocaroo.com/13b14dZCkNau