Add FreeVC implementation #201
Here are some examples of our results: 1_src.mp4, 1_dst.mp4, 1_output.mp4; 2_src.mp4, 2_dst.mp4, 2_output.mp4; 3_src.mp4, 3_dst.mp4, 3_output.mp4
The quality of the samples sounds good. @Adorable-Qin Please check the code and document carefully.
Here are some examples of our results using the checkpoint from 183 epochs (120k steps) of training (the examples above are from the pretrained checkpoint): 1_src.mp4, 1_tgt.mp4, 1_output.mp4; 2_src.mp4, 2_tgt.mp4, 2_output.mp4; 3_src.mp4, 3_tgt.mp4, 3_output.mp4
Our AutoDL server will expire tomorrow. Here is a demo video recording the training status: demo-video.mp4
@torch.no_grad()
def load_sample(self, filename):
    filepath = os.path.join(self.vctk_16k_dir, filename)
Why is this line hard-coded? Is it possible to select datasets in config?
The original implementation trains only on the VCTK dataset.
- Data preprocessing relies on the VCTK file structure to retrieve speaker tags.
- When splitting into train/val/test sets, each speaker's samples are split randomly, which ensures that every speaker appears in the train, val, and test sets.
Other datasets can be supported if the same operations can be performed on them.
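For reference, the two operations above can be sketched roughly as follows. This is only an illustration, not the code in this PR: it assumes a VCTK-style layout in which each file sits in a directory named after its speaker (e.g. `p225/p225_001.wav`), and the function names, the `n_val`/`n_test` counts, and the seed are hypothetical.

```python
import random
from collections import defaultdict
from pathlib import PurePosixPath


def speaker_of(path):
    """Speaker tag = name of the parent directory (assumed VCTK-style layout)."""
    return PurePosixPath(path).parent.name


def split_per_speaker(paths, n_val=2, n_test=2, seed=1234):
    """Randomly split each speaker's files so every speaker is in all three sets.

    n_val, n_test and seed are illustrative values, not the PR's settings.
    """
    by_speaker = defaultdict(list)
    for p in paths:
        by_speaker[speaker_of(p)].append(p)

    rng = random.Random(seed)  # local RNG keeps the split reproducible
    splits = {"train": [], "val": [], "test": []}
    for speaker in sorted(by_speaker):
        files = sorted(by_speaker[speaker])
        if len(files) < n_val + n_test + 1:
            raise ValueError(f"speaker {speaker} has too few samples to split")
        rng.shuffle(files)
        splits["val"] += files[:n_val]
        splits["test"] += files[n_val:n_val + n_test]
        splits["train"] += files[n_val + n_test:]
    return splits
```

Any dataset whose paths can be mapped to a speaker tag (by replacing `speaker_of`) could go through the same split.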
Thank you for your explanation!
@RMSnow For this implementation, do we expect a universal model that can be trained on any dataset?
Yes, a universal FreeVC for any dataset is welcome. I think only FreeVC's model part needs to be integrated into Amphion.
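One possible way to avoid the hard-coded dataset directory is to look it up from the experiment config. The sketch below is only an assumption about how that could look: the `dataset` and `dataset_path` keys, the example path, and the `resolve_dataset_dir` helper are hypothetical and are not taken from Amphion's actual config schema or from this PR.

```python
import json
import os


def resolve_dataset_dir(cfg, dataset=None):
    """Return (name, root_dir) for the requested dataset.

    Falls back to the first entry of cfg["dataset"] when no name is given.
    Both config keys are hypothetical names used for illustration.
    """
    name = dataset or cfg["dataset"][0]
    try:
        return name, cfg["dataset_path"][name]
    except KeyError:
        raise ValueError(f"no path configured for dataset '{name}'") from None


# Hypothetical config fragment; in practice this would be read from a file.
cfg = json.loads("""
{
    "dataset": ["vctk"],
    "dataset_path": {"vctk": "data/vctk-16k"}
}
""")

name, root = resolve_dataset_dir(cfg)
# The loader would then join against the configured root
# instead of a fixed self.vctk_16k_dir attribute.
filepath = os.path.join(root, "p225", "p225_001.wav")
```

Keeping the lookup in one helper means a new dataset only needs a config entry, provided its preprocessing can produce speaker tags in the same way.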
What's the purpose of this file?
The whole directory (models/vc/FreeVC/speaker_encoder) is copied from
- https://github.com/OlaWod/FreeVC/tree/81c169cdbfc97ff07ee2f501e9b88d543fc46126/speaker_encoder (MIT license)
- https://github.com/liusongxiang/ppg-vc/tree/b59cb9862cf4b82a3bdb589950e25cab85fc9b03/speaker_encoder (Apache-2.0 license)
We keep it unchanged to match the original implementation.
However, copying this much code and a pretrained checkpoint from other repos may be a problem. I'm not sure what the best practice is.
Thank you for your explanation.
@RMSnow Any advice about this?
@Adorable-Qin I think introducing such a pretrained speaker encoder is acceptable. It is just like WeNet. However, please add an acknowledgement in our main README before integrating it.
BTW, I think the .pt.txt file is strange. If it is a pretrained model, we can integrate it by following how our other pretrained models are handled.
models/vc/FreeVC/speaker_encoder/data_objects/speaker_verification_dataset.py
models/vc/FreeVC/train.py
Is it possible to support multi-GPU training using an external library such as Accelerate, which Amphion already uses?
We have tried multi-GPU training in another repo. We used the Lightning framework to enable DDP training automatically, but it exited with an error soon after starting. Single-GPU training works well.
✨ Description
FreeVC: Towards High-Quality Text-Free One-Shot Voice Conversion
This PR is part of the AIR6063 final project.
FYI, we also have another repo which refactors the training pipeline. Both the PR code and the custom code can produce good checkpoints.
Here are our checkpoints trained with the PR code on a single NVIDIA RTX 4090
🚧 Related Issues
During the project, we have opened some issues and another PR to help improve Amphion.
preprocessors/popbutfy.py may be incorrect #196
👨‍💻 Changes Proposed
🧑‍🤝‍🧑 Who Can Review?
[Please use the '@' symbol to mention any community member who is free to review the PR once the tests have passed. Feel free to tag members or contributors who might be interested in your PR.]
@zhizhengwu @RMSnow @Adorable-Qin
✅ Checklist