[Help]: How to use MaskGCT to generate translated audio from an English video? #335

Open
KylinMountain opened this issue Nov 6, 2024 · 4 comments

Comments

@KylinMountain

Problem Overview

I have a video in English, and I want it to speak Chinese at the same speed, keeping the video and audio synchronized.

How can I do that? Are there any instructions? Thank you.


@synthere

synthere commented Nov 8, 2024

The general steps might be:

  1. Transcribe the audio in the English video to text using a tool such as Whisper or FunASR;
  2. Translate the text into Chinese;
  3. Generate Chinese audio from the translated text in step 2 using a TTS tool such as MaskGCT;
  4. Resynchronize the audio with the original video.
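A minimal Python sketch of how steps 1 to 3 can carry timing information forward so that step 4 is possible. It assumes the transcription step yields timed segments as dicts with `start`, `end` (seconds) and `text` keys, which is the shape of the `segments` list returned by openai-whisper's `transcribe()`. The `translate` callable, the `DubSegment` type and `plan_dubbing` are hypothetical placeholders; a real setup would plug in a translation service and a TTS model such as MaskGCT.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class DubSegment:
    """One hypothetical unit of dubbing work: when it starts, how long it may last, what to say."""
    start: float            # seconds from the beginning of the video
    target_duration: float  # seconds available for the dubbed speech
    text: str               # translated text to synthesize


def plan_dubbing(
    segments: Iterable[dict],
    translate: Callable[[str], str],
) -> list[DubSegment]:
    """Turn timed source segments (step 1) into translated, time-boxed TTS jobs (steps 2-3).

    Each segment is assumed to be a dict with 'start', 'end' and 'text' keys,
    e.g. an entry of whisper's transcribe(...)["segments"].
    """
    plan = []
    for seg in segments:
        text = seg["text"].strip()
        if not text:
            continue  # skip segments with no speech text
        plan.append(
            DubSegment(
                start=seg["start"],
                target_duration=seg["end"] - seg["start"],
                text=translate(text),
            )
        )
    return plan


if __name__ == "__main__":
    # Hypothetical transcription output and a dictionary standing in for a translator.
    source = [
        {"start": 0.0, "end": 2.5, "text": " Hello everyone."},
        {"start": 2.5, "end": 3.0, "text": " "},
        {"start": 3.0, "end": 6.0, "text": " Welcome to the demo."},
    ]
    fake_zh = {"Hello everyone.": "大家好。", "Welcome to the demo.": "欢迎观看演示。"}
    for job in plan_dubbing(source, fake_zh.__getitem__):
        print(job)
```

Each job's `target_duration` is what would be handed to a duration-controllable TTS, and its `start` is where the generated clip belongs when the audio is put back onto the video.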

@KylinMountain
Author

@synthere If we try it this way, we can't copy the accent of the original audio or make the TTS speed match the original.

@synthere

synthere commented Nov 8, 2024

The accent can be cloned using the voice-cloning function, and the TTS speed can also be adjusted. Actually, I just created a video dubbing tool the other day, which you can try here: syntheredub
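One common way to adjust speed after synthesis is to time-stretch each generated clip by the ratio of its length to the original slot. A minimal sketch, assuming the factor is later handed to a tempo-changing tool such as FFmpeg's `atempo` audio filter (where values above 1.0 speed audio up). The function name `tempo_factor` and the clamp limits 0.8 and 1.3 are illustrative assumptions, not values from MaskGCT or any other tool.

```python
def tempo_factor(generated_s: float, slot_s: float,
                 min_factor: float = 0.8, max_factor: float = 1.3) -> float:
    """Speed-up factor that makes a generated clip fit the original slot.

    A factor > 1.0 means the clip is too long and must be played faster;
    < 1.0 means it is too short and can be slowed down. The result is
    clamped to [min_factor, max_factor] because extreme stretching tends
    to sound unnatural; any remaining mismatch is left for resynchronization.
    """
    if generated_s <= 0 or slot_s <= 0:
        raise ValueError("durations must be positive")
    factor = generated_s / slot_s
    return max(min_factor, min(max_factor, factor))
```

For example, a 3.0 s clip that must fit a 2.5 s slot gets a factor of about 1.2, while a clip twice as long as its slot is capped at 1.3 and will still overrun slightly.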

@synthere

synthere commented Nov 8, 2024

I also tried MaskGCT, which can control the target duration. However, the resulting audio is not exactly aligned with the original, as shown below (top: original audio; bottom: generated audio).
[Image: waveform comparison; top: original audio, bottom: MaskGCT-generated audio]

So precise alignment and resynchronization are sometimes necessary.
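One simple form of resynchronization is to lay each generated clip onto a silent track at its original start time, so a duration error in one segment does not push every later segment out of sync. A minimal sketch, assuming mono audio already decoded to lists of float samples at a shared sample rate, with non-negative start offsets in samples. The name `assemble_track` is hypothetical, and truncating a clip where the next one begins is only one possible policy (shifting or further time-stretching are alternatives).

```python
from typing import Sequence


def assemble_track(
    total_samples: int,
    clips: Sequence[tuple[int, Sequence[float]]],
) -> list[float]:
    """Place clips given as (start_sample, samples) onto a silent mono track.

    Clips are processed in order of start time. Each clip is cut off where
    the next clip begins (or at the end of the track), so clips never
    overlap and each starts exactly where the original speech started.
    """
    track = [0.0] * total_samples
    ordered = sorted(clips, key=lambda c: c[0])
    for i, (start, samples) in enumerate(ordered):
        if start >= total_samples:
            break  # this and all later clips start past the end of the track
        # The next clip's start (or the track end) bounds this clip.
        limit = ordered[i + 1][0] if i + 1 < len(ordered) else total_samples
        limit = min(limit, total_samples)
        end = min(start + len(samples), limit)
        track[start:end] = samples[: end - start]
    return track
```

Start times from the transcription step convert to sample offsets with `int(round(start_s * sample_rate))`; the assembled track can then be written out and muxed back onto the original video.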
