Quick (Singing) Voice Conversion #200

Closed
CuiLvYing wants to merge 3 commits

CuiLvYing

✨ Description

This PR implements a simple WebUI that provides quick, text-free, one-shot voice conversion for newcomers. Theoretically, the user only needs two short audio clips (source and target) and a few minutes to get the VC result.
It is intended to use a base model (checkpoint) trained on the VCTK and M4Singer datasets (or other supported datasets) as a foundation, and then fine-tune that base model on the input source audio to perform voice conversion and produce the output. It currently supports MultipleContentSVC and VITS.
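
As a rough illustration of the workflow described above, the steps could be laid out as follows. This is a minimal sketch, not the code in this PR: `QuickVCRequest`, `plan_quick_vc`, and the step wording are hypothetical names, and the use of the target audio in the final step is an assumption.

```python
from dataclasses import dataclass

# Hypothetical names for illustration only; this is not Amphion's API.
SUPPORTED_MODELS = ("MultipleContentSVC", "VITS")


@dataclass(frozen=True)
class QuickVCRequest:
    source_audio: str  # path to the short source clip
    target_audio: str  # path to the short target clip
    model: str = "MultipleContentSVC"


def plan_quick_vc(request: QuickVCRequest) -> list[str]:
    """Return the ordered steps of the workflow as plain-text descriptions."""
    if request.model not in SUPPORTED_MODELS:
        raise ValueError(f"unsupported model: {request.model!r}")
    return [
        # Step 1: start from a base checkpoint trained on a supported dataset.
        f"load base checkpoint for {request.model}",
        # Step 2: fine-tune on the input source audio, as the description states.
        f"fine-tune base model on {request.source_audio}",
        # Step 3 (assumption): run conversion with the other clip and write the result.
        f"run voice conversion with {request.target_audio} and write the output",
    ]
```

For example, `plan_quick_vc(QuickVCRequest("source.wav", "target.wav"))` returns the three steps in order for the default model.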

🚧 Related Issues

None

👨‍💻 Changes Proposed

Please refer to the commits, if any.

🧑‍🤝‍🧑 Who Can Review?

[Please use the '@' symbol to mention any community member who is free to review the PR once the tests have passed. Feel free to tag members or contributors who might be interested in your PR.]
@zhizhengwu @RMSnow @Adorable-Qin

🛠 TODO

✅ Checklist

  • Code has been reviewed
  • Code complies with the project's code standards and best practices
  • Code has passed all tests
  • Code does not affect the normal use of existing features
  • Code has been commented properly
  • Documentation has been updated (if applicable)
  • Demo/checkpoint has been attached (if applicable)

@RMSnow
Collaborator

RMSnow commented May 7, 2024

Hi @CuiLvYing, thanks for your efforts! Would you please attach some demos (such as the generated voices or your WebUI's video) like PR #56?

@CuiLvYing
Author

Of course! Here are some demo videos and audio clips from our tests.

1.mp4
2.mp4
source.mp4
result.5.mp4

https://github.com/open-mmlab/Amphion/assets/166400963/f752ea9d-a950-4831-bd30-ffd9fb6fd6f5

You can also try our running demo WebUI now: https://24a8ca30d15dff216c.gradio.live
This test uses MultipleContentSVC and takes at least 200 seconds to produce output. However, I think our pre-trained checkpoint has some flaws (it is not trained enough), so the results may not be good. Sorry about that.

@CuiLvYing
Author

Sorry, I found that the target audio we used was not uploaded. Here it is:

target.mp4

@RMSnow
Collaborator

RMSnow commented May 7, 2024

Hi @CuiLvYing, I'm confused about your samples. For VC, the converted audio will speak the source's content with the target's timbre. Please use your model to convert the samples of PR: #201. Then we can compare yours :)
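
To make the convention above concrete, here is a minimal sketch of the role assignment in voice conversion. The names `ConversionRoles` and `vc_roles` are hypothetical and are not part of Amphion or of this PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConversionRoles:
    """Which clip supplies which property of the converted audio."""
    content_from: str  # clip whose spoken or sung content is kept
    timbre_from: str   # clip whose voice timbre is applied


def vc_roles(source: str, target: str) -> ConversionRoles:
    # Standard VC convention: keep the source's content, use the target's timbre.
    return ConversionRoles(content_from=source, timbre_from=target)
```

Under this convention, `vc_roles("source.wav", "target.wav")` keeps the content of `source.wav` and applies the timbre of `target.wav`; passing the clips in the opposite order yields the reversed conversion.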

@CuiLvYing
Author

I think we were trying to make the speaker from "Infsource" speak the content of the "target", which is the opposite of your definition. We will fix this soon.
Here are some audio samples produced after correcting the WebUI:

source1.mp4
target1.mp4
result1.mp4
source2.mp4
target2.mp4
result2.mp4
source3.mp4
target3.mp4
result3.mp4

@RMSnow
Collaborator

RMSnow commented May 8, 2024

The naturalness, and especially the intelligibility, sounds poor to me. So I recommend not merging this PR unless there is a substantial improvement. @Adorable-Qin Please review the code and documentation carefully.

@RMSnow RMSnow closed this Jul 15, 2024