RVC (Retrieval‐based Voice Conversion)

erew123 edited this page Oct 5, 2024 · 9 revisions

RVC enhances TTS by replicating the voice characteristics of a target speaker, which adds depth to character or narrator speech. It runs as a conversion step on the audio produced by the TTS engine, so it can be used with any TTS engine/model. For best results, pair it with a voice cloning TTS engine such as Coqui XTTS and use voice samples.

Setup

When you first enable RVC on the Global Settings > RVC Settings tab and click the Update RVC Settings button, AllTalk will create the necessary folders and download any missing model files required for RVC to work.

Voice Model Files

  • Store each voice model in its own subfolder under /models/rvc_voices/, i.e. /models/rvc_voices/{subfolder}. The rvc_voices folder is created when RVC is enabled in the Gradio interface.
  • A voice model typically includes a PTH file and potentially an index file.
  • If an index file is present, AllTalk will automatically select and use it.
  • If multiple index files are found, none will be used, and a message will be output to the console.
  • You can find pre-generated RVC voice models on sites like voice-models.com and Hugging Face.

/models
└── /rvc_voices
        ├── /voice_model_1
        │      ├── model.pth
        │      └── index.json
        └── /voice_model_2
               ├── model.pth
               └── index.json
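
The index-file rules listed above can be sketched in a few lines of Python. This is only an illustration, not AllTalk's actual code: the function name `find_voice_model` and the set of extensions treated as index files (`INDEX_SUFFIXES`) are assumptions made for the example.

```python
from pathlib import Path
from typing import Optional, Tuple

# Assumption: the text does not say which extensions count as an index file,
# so this hypothetical default accepts both ".index" and ".json".
INDEX_SUFFIXES = (".index", ".json")


def find_voice_model(
    folder: Path, index_suffixes: Tuple[str, ...] = INDEX_SUFFIXES
) -> Tuple[Optional[Path], Optional[Path]]:
    """Return (pth_file, index_file) for one voice model subfolder."""
    files = sorted(p for p in folder.iterdir() if p.is_file())
    pth_files = [p for p in files if p.suffix.lower() == ".pth"]
    index_files = [p for p in files if p.suffix.lower() in index_suffixes]

    pth_file = pth_files[0] if pth_files else None

    if len(index_files) == 1:
        # Exactly one index file: select it automatically.
        index_file = index_files[0]
    else:
        # Zero or several index files: use none, and say so if ambiguous.
        index_file = None
        if len(index_files) > 1:
            print(f"Multiple index files found in {folder}; none will be used.")
    return pth_file, index_file
```

Calling `find_voice_model(Path("models/rvc_voices/voice_model_1"))` on a folder laid out like the tree above would return the PTH file together with the single index file.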

Purpose of the Index File

The index file helps improve the quality of the generated audio by providing a reference during the conversion process. The FAISS index enables faster and more accurate retrieval of voice characteristics, leading to more natural and high-quality voice synthesis.

RVC Settings

Default Character Voice Model

  • Selects the voice model used for character conversion.
  • If "Disabled" is selected, RVC will not be applied to character voices.
  • This option is used only if RVC is enabled and no other voice is specified in the API request.

Default Narrator Voice Model

  • Selects the voice model used for narrator conversion.
  • If "Disabled" is selected, RVC will not be applied to the narrator voice.
  • This option is used only if RVC is enabled and no other voice is specified in the API request.
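
The fallback behaviour described for both default voice models can be written as a small helper. This is a sketch under stated assumptions: `resolve_rvc_voice` is a hypothetical name, not an AllTalk function, and treating a request value of "Disabled" the same way as a disabled default is an assumption.

```python
from typing import Optional

DISABLED = "Disabled"


def resolve_rvc_voice(
    rvc_enabled: bool, request_voice: Optional[str], default_voice: str
) -> Optional[str]:
    """Return the voice model to apply, or None when RVC should be skipped."""
    if not rvc_enabled:
        return None
    # A voice named in the API request takes priority over the default.
    voice = request_voice if request_voice else default_voice
    # "Disabled" means no conversion is applied to this voice.
    return None if voice == DISABLED else voice
```

The same helper would be called once with the character default and once with the narrator default.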

Index Influence Ratio

  • Sets the influence exerted by the index file on the final output.
  • A higher value increases the impact of the index, potentially enhancing detail but also increasing the risk of artifacts.
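
To make the ratio concrete, the following NumPy sketch blends each input feature vector with its nearest neighbours from an index. It is an illustration only: a brute-force search stands in for FAISS, and the neighbour count `k` and the inverse-squared-distance weighting are assumptions modelled on common RVC implementations, not confirmed AllTalk behaviour.

```python
import numpy as np


def blend_with_index(features, index_vectors, index_ratio, k=8):
    """Blend (T, D) features with neighbours retrieved from (N, D) index_vectors."""
    features = np.asarray(features, dtype=float)
    index_vectors = np.asarray(index_vectors, dtype=float)
    k = min(k, len(index_vectors))

    # Squared Euclidean distance from every frame to every index vector: (T, N).
    diff = features[:, None, :] - index_vectors[None, :, :]
    dist2 = np.sum(diff * diff, axis=2)

    # Indices and distances of the k closest index vectors per frame: (T, k).
    nearest = np.argsort(dist2, axis=1)[:, :k]
    near_d2 = np.take_along_axis(dist2, nearest, axis=1)

    # Closer neighbours get larger weights; weights sum to 1 per frame.
    weights = 1.0 / np.maximum(near_d2, 1e-12)
    weights /= weights.sum(axis=1, keepdims=True)
    retrieved = np.sum(index_vectors[nearest] * weights[:, :, None], axis=1)

    # index_ratio = 0 keeps the input features, 1 uses only retrieved ones.
    return index_ratio * retrieved + (1.0 - index_ratio) * features
```

With a ratio of 0 the index has no effect, and with a ratio of 1 the features are replaced entirely by what was retrieved, which is where both the extra detail and the risk of artifacts come from.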

Pitch

  • Sets the pitch of the audio output.
  • Increasing the value raises the pitch, while decreasing the value lowers it.
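
As an illustration of what a pitch offset does to the extracted pitch curve, the sketch below assumes the value is expressed in semitones, as is common in RVC tools; the text above does not state the unit, and `shift_pitch` is a hypothetical helper.

```python
import numpy as np


def shift_pitch(f0_hz, semitones):
    """Shift an F0 contour (Hz) by a number of semitones.

    Unvoiced frames, stored as 0 Hz, stay at 0 because of the multiplication.
    """
    f0 = np.asarray(f0_hz, dtype=float)
    # Each semitone multiplies the frequency by 2 ** (1 / 12).
    return f0 * 2.0 ** (semitones / 12.0)
```

Under this assumption, a value of 12 doubles every frequency (one octave up) and a value of -12 halves it.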

Volume Envelope

  • Controls how much the volume envelope of the input audio replaces or is blended into the volume envelope of the output.
  • The closer the ratio is to 1, the more the output's own envelope is used.
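
One way to picture this blend is the NumPy sketch below, modelled on the RMS-mixing approach used in common RVC implementations. It is an approximation under stated assumptions: the frame length, the linear interpolation of the envelope, and the names `frame_rms`, `to_samples`, and `mix_envelope` are all choices made for the example.

```python
import numpy as np


def frame_rms(signal, frame):
    """RMS level of each fixed-length frame (the last frame is zero-padded)."""
    n_frames = int(np.ceil(len(signal) / frame))
    padded = np.pad(signal, (0, n_frames * frame - len(signal)))
    return np.sqrt(np.mean(padded.reshape(n_frames, frame) ** 2, axis=1))


def to_samples(envelope, n_samples):
    """Stretch a per-frame envelope to one value per sample."""
    src_pos = np.linspace(0.0, 1.0, len(envelope))
    dst_pos = np.linspace(0.0, 1.0, n_samples)
    return np.interp(dst_pos, src_pos, envelope)


def mix_envelope(source, converted, ratio, frame=512, eps=1e-6):
    """Blend the input's loudness envelope into the converted audio."""
    source = np.asarray(source, dtype=float)
    converted = np.asarray(converted, dtype=float)
    n = len(converted)
    rms_src = np.maximum(to_samples(frame_rms(source, frame), n), eps)
    rms_out = np.maximum(to_samples(frame_rms(converted, frame), n), eps)
    # ratio = 1 gives a gain of 1 (output envelope kept as is);
    # ratio = 0 rescales the output to follow the input envelope.
    gain = rms_src ** (1.0 - ratio) * rms_out ** (ratio - 1.0)
    return converted * gain
```

At a ratio of 1 the converted audio is returned unchanged, while lower ratios pull its loudness contour toward that of the input.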

Protect Voiceless Consonants/Breath Sounds

  • Prevents artifacts in voiceless consonants and breath sounds.
  • Higher values (up to 0.5) provide stronger protection but might affect indexing.

AutoTune

  • Enables or disables auto-tune for the generated audio.
  • Recommended for singing conversions to ensure the output remains in tune.

Filter Radius

  • When set to 3 or higher, median filtering is applied to the extracted pitch results, which can reduce breathiness in the output.
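
The effect of median filtering on a pitch contour can be shown with a short NumPy sketch. How the setting maps to the filter window is an assumption here (the value is used as the window width, rounded up to an odd number), and `smooth_f0` is a hypothetical helper rather than AllTalk code.

```python
import numpy as np


def smooth_f0(f0, filter_radius):
    """Median-filter an F0 contour when filter_radius is 3 or higher."""
    f0 = np.asarray(f0, dtype=float)
    if filter_radius < 3:
        # Below the threshold no filtering is applied.
        return f0.copy()
    # Median filters need an odd window so each frame has a centre.
    kernel = filter_radius if filter_radius % 2 == 1 else filter_radius + 1
    half = kernel // 2
    # Repeat the edge values so the output has the same length as the input.
    padded = np.pad(f0, half, mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, kernel)
    return np.median(windows, axis=1)
```

A single-frame spike such as `[100, 100, 400, 100, 100]` is flattened by a window of 3, which is the kind of pitch jitter that tends to be heard as breathiness.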

Training Data Size (AllTalk Specific)

  • Determines the number of training data points used to train the FAISS index.
  • Increasing the size may improve the quality of the output but can also increase computation time.
  • Different index files have different sizes. This setting limits the maximum amount of the index used.

Embedder Model

  • Selects the model used for learning the speaker embedding.
  • Options:
    • hubert: Focuses on capturing phonetic and linguistic content.
    • contentvec: Captures more detailed voice characteristics and nuances.

Split Audio

  • Splits the audio into chunks for inference to obtain better results in some cases.
  • Can improve the quality of conversion, especially for longer audio inputs.
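
The idea of chunked inference can be sketched as follows. Fixed-length chunks are an assumption made to keep the example short (a real implementation may split on silence instead), and `convert` stands for whatever per-chunk conversion function is used.

```python
import numpy as np


def convert_in_chunks(audio, chunk_size, convert):
    """Run `convert` on consecutive chunks of `audio` and rejoin the results."""
    audio = np.asarray(audio)
    chunks = [audio[i:i + chunk_size] for i in range(0, len(audio), chunk_size)]
    if not chunks:
        # Nothing to convert for empty input.
        return audio.copy()
    return np.concatenate([convert(chunk) for chunk in chunks])
```

Each chunk is converted independently, so memory use is bounded by the chunk size rather than by the length of the whole recording.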

Pitch Extraction Algorithm

  • Choose the algorithm used for extracting the pitch (F0) during audio conversion.
  • Options include:
    • crepe: High accuracy, robust against noise.
    • crepe-tiny: Smaller, faster version of crepe with slightly reduced accuracy.
    • dio: Fast, less accurate, suitable for real-time applications.
    • fcpe: Focuses on precise pitch extraction.
    • harvest: Produces smooth and natural pitch contours.
    • hybrid[rmvpe+fcpe]: Combines strengths of rmvpe and fcpe.
    • pm: Robust algorithm with a balance of speed and accuracy.
    • rmvpe: Recommended for most cases, especially in TTS applications.