After following the installation instructions (plus replacing phonemizer with https://github.com/justinjohn0306/phonemizer to make it work on Win 11), and using the same examples from the demo page, I was unable to replicate the quality of the examples. For example, the whispering voice always outputs something between a whisper and a normal voice. I tried both the inference script and the Gradio app, with the same result. Additionally, duration calculation seems to be broken for Chinese: when set to auto, it makes the output twice as fast as it should be.
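As a possible workaround for the Chinese duration problem, one could derive a fixed target duration from the prompt's own speaking rate instead of relying on auto mode, and pass that to inference wherever a fixed duration is accepted. This is only a sketch under my own assumptions; `estimate_target_duration` is a hypothetical helper, not part of the repo:

```python
def estimate_target_duration(prompt_duration_s: float,
                             prompt_text: str,
                             target_text: str) -> float:
    """Estimate how long the synthesized target should be, assuming the
    target is spoken at the same characters-per-second rate as the
    prompt audio. A crude heuristic, but avoids the auto duration mode."""
    chars_per_second = len(prompt_text) / prompt_duration_s
    return len(target_text) / chars_per_second

# e.g. a 5 s prompt containing 20 characters runs at 4 chars/s,
# so a 40-character target should take about 10 s.
print(estimate_target_duration(5.0, "x" * 20, "x" * 40))
```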
Steps to reproduce the behavior:
Follow the instructions to install on Win 11 with special phonemizer and generate audio
Expected behavior
Quality should be the same as the examples
Screenshots
Environment Information
Operating System: Windows 11
Python Version: Python 3.10.15
Driver & CUDA Version: Driver 546.92 & CUDA 12.4
Error Messages and Logs: posted above; here is the quoted version:
./models/tts/maskgct/g2p\sources\g2p_chinese_model\poly_bert_model.onnx
Start loading: facebook/w2v-bert-2.0
D:\AIMaskGCTTTS\Amphion\models\tts\maskgct\gradio_demo.py:103: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
stat_mean_var = torch.load("./models/tts/maskgct/ckpt/wav2vec2bert_stats.pt")
D:\AIMaskGCTTTS\venv\lib\site-packages\torch\nn\utils\weight_norm.py:143: FutureWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
WeightNorm.apply(module, name, dim)
Models built successfully.
Checkpoints downloaded successfully.
Checkpoints loaded successfully.
To create a public link, set share=True in launch().
===== New task submitted =====
Start inference...
Audio loaded.
D:\AIMaskGCTTTS\venv\lib\site-packages\whisper\__init__.py:150: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
checkpoint = torch.load(fp, map_location=device)
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\Ozkan\AppData\Local\Temp\jieba.cache
Loading model cost 0.321 seconds.
Prefix dict has been built successfully.
D:\AIMaskGCTTTS\Amphion\models\tts\maskgct\g2p\g2p\chinese_model_g2p.py:100: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\utils\tensor_new.cpp:281.)
batch_label_starts = torch.tensor(batch_label_starts, dtype=torch.long)
Saved: ./output/output_0.wav
===== New task submitted =====
Start inference...
Audio loaded.
Saved: ./output/output_1.wav
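The FutureWarning and UserWarning lines in the log above are unrelated to the output quality, but they can be addressed exactly as the warnings suggest. Here is a minimal, self-contained sketch of both fixes (using stand-in data rather than the real checkpoint and g2p labels):

```python
import numpy as np
import torch

# Fix 1 — the torch.load FutureWarning: pass weights_only=True when the
# file only contains tensors, as wav2vec2bert_stats.pt should. Shown here
# with a stand-in stats file so the snippet runs on its own.
torch.save({"mean": torch.zeros(3), "var": torch.ones(3)}, "stats_demo.pt")
stat_mean_var = torch.load("stats_demo.pt", weights_only=True)

# Fix 2 — the UserWarning in chinese_model_g2p.py: convert the list of
# numpy arrays into one ndarray before handing it to torch.tensor().
batch_label_starts = [np.zeros(4), np.ones(4)]  # stand-in data
batch_label_starts = torch.tensor(np.array(batch_label_starts),
                                  dtype=torch.long)
print(batch_label_starts.shape)  # torch.Size([2, 4])
```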
Additional context
Thank you very much for this project
Hi, to me it sounds like the generated speech is trying to speak in a whisper style. You may need to run inference multiple times to get the best result. Beyond that, this could be improved by fine-tuning on high-quality whispered speech, or by adding more whispered speech in the training stage.
MaskGCT was not designed for whispered speech generation; we only discovered this capability while testing the model. That is why it cannot produce whispered speech reliably every time.
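The "run multiple times" advice above can be automated as a best-of-N loop: generate several candidates and keep the one with the highest score under some quality metric (e.g. speaker similarity to the prompt). This is only a sketch — `synthesize` and `score` are placeholders for your own inference call and metric, not APIs from this repo:

```python
import random

def best_of_n(synthesize, score, n=5, seed=0):
    """Run a TTS call n times and keep the highest-scoring sample.

    synthesize: zero-argument callable returning one audio candidate.
    score:      callable mapping a candidate to a quality score.
    """
    random.seed(seed)  # make the toy demo below reproducible
    best_audio, best_score = None, float("-inf")
    for _ in range(n):
        audio = synthesize()
        s = score(audio)
        if s > best_score:
            best_audio, best_score = audio, s
    return best_audio, best_score

# Toy stand-ins: a "candidate" is a random float and its score is itself.
audio, s = best_of_n(lambda: random.random(), lambda a: a, n=10)
```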
TKsavy and yuantuo666, I can run the project on my Windows 11 machine, but I couldn't reproduce the demo page examples. What could be the problem? Take the whisper voice example on the demo page: I downloaded the prompt sample from there and generated the same text, but it always outputs something between a whisper and a low voice, whereas the demo page examples are successful clones. My generations are generally of lower quality no matter what I try; I've gone up to 100 inference steps.
I've also tried every version, including this one, the Windows fork, and Google Colab (to test in a Linux environment), but all of them produce inferior results compared to your examples. Are the released models from an earlier training checkpoint, by any chance? Are you able to reproduce those results with the currently shared models?
This was my issue for this matter with detailed logs and outputs: #334
Since I was not involved in training MaskGCT or generating the demos, I don't know the details. Could @HeCheng0625 help with this?
This is the demo page result:
https://vocaroo.com/15JxVNPRScwD
This is mine:
https://vocaroo.com/13b14dZCkNau