- EnCodec SOTA deep learning based audio codec supporting both mono 24 kHz audio and stereo 48 kHz audio
- audio-webui A web-based UI for various audio-related Neural Networks with features like text-to-audio, voice cloning, and automatic-speech-recognition using Bark, AudioLDM, AudioCraft, RVC, coqui-ai and Whisper
- tts-generation-webui for all things TTS, currently supports Bark v2, MusicGen, Tortoise, Vocos
- Speechbrain A PyTorch-based Speech Toolkit for TTS, STT, etc
- Nvidia NeMo TTS, LLM, Audio Synthesis framework
- speech-rest-api for Speech-To-Text and Text-To-Speech with Whisper and Speechbrain
- LangHelper language learning through text-to-speech + ChatGPT + speech-to-text to practise speaking assessments, memorizing words and listening tests
- Silero-models pre-trained speech-to-text, text-to-speech and text-enhancement for ONNX, PyTorch, TensorFlow, SSML
- AI-Waifu-Vtuber AI Waifu Vtuber is a virtual streamer. Supports multiple languages and uses VoiceVox, DeepL, Whisper, Silero TTS, and VtubeStudio, and now also supports Twitch streaming.
- Voicebox large-scale text-guided generative speech model using non-autoregressive flow-matching, paper, demo, pytorch implementation, implementation
- Auto-Synced-Translated-Dubs Automatic YouTube video speech to text, translation, text to speech in order to dub a whole video
- SeamlessM4T Foundational Models for SOTA Speech and Text Translation
- Whisper SOTA local open-source speech recognition in many languages and translation into English
- Whisper JAX implementation runs around 70x faster on CPU, GPU and TPU
- whisper.cpp C/C++ port for Intel and ARM based macOS, Android, iOS, Linux, WebAssembly, Windows, Raspberry Pi
- faster-whisper-livestream-translator A buggy proof of concept for real-time translation of livestreams using Whisper models, with suggestions for improvements including noise reduction and dual language subtitles
- Buzz Mac GUI for Whisper
- whisperX Fast automatic speech recognition (70x realtime with large-v2) using OpenAI's Whisper, word-level timestamps, speaker diarization, and voice activity detection
- ermine-ai | Whisper in the browser using transformers.js
- wav2vec2 dimensional emotion model
- MeetingSummarizer using Whisper and GPT3.5
- Facebook MMS: Speech recognition of over 1000 languages
- Bark transformer-based text-to-audio model by Suno. Can generate highly realistic, multilingual speech and other audio like music, background noise and simple effects
- Bark-Voice-Clones
- Bark WebUI colab notebooks
- bark-with-voice-clone
- Bark Infinity for longer audio
- Bark WebUI
- bark-voice-cloning-HuBERT-quantizer Voice cloning with Bark in high quality using Python 3.10 and Hugging Face models.
- bark-gui Gradio Web UI for an extended Bark version, with long generation, cloning, SSML, for Windows and other platforms, supporting NVIDIA/Apple GPU/CPU
- bark-voice-cloning for Chinese speech, based on bark-gui by C0untFloyd
- Barkify unofficial training implementation of Bark TTS by suno-ai
- Bark-RVC Multilingual Speech Synthesis Voice Conversion using Bark + RVC
- Coqui TTS | deep learning toolkit for Text-to-Speech
- Tutorial for Coqui VITS and Whisper to automate voice cloning and Colab notebook
- StyleTTS implementation
- StyleTTS-VC One-Shot Voice Conversion by Knowledge Transfer from Style-Based TTS Models
- Vall-E and Vall-E X, paper, code. Zero Shot TTS preserving emotion, expression, similarity and allows language transfer
- Vall-e PyTorch Implementation of Vall-E based on EnCodec tokenizer
- Vall-E PyTorch implementation
- Vall-E X open source implementation of Microsoft's VALL-E X zero-shot TTS model
- NaturalSpeech implementation
- naturalspeech2-pytorch Implementation of Natural Speech 2, Zero-shot Speech and Singing Synthesizer, in Pytorch
- IMS Toucan, TTS Toolkit from University of Stuttgart
- YourTTS | Zero Shot Multi Speaker TTS and Voice Conversion for everyone
- PaddleSpeech | Easy to use Speech Toolkit with Self Supervised learning, SOTA Streaming with punctuation, TTS, Translation etc
- Tortoise TTS | Open source multi voice TTS system
- finetune guide using DLAS DL-Art-School, Master Deep Voice Cloning in Minutes
- DL-Art-School fine tuning tortoise with DLAS GUI
- tortoise-tts-fast fast Tortoise TTS inference up to 5x. Video tutorial
- Tortoise mrq fork for voice cloning
- piper A fast, local neural text to speech system that sounds great and is optimized for the Raspberry Pi 4. Using VITS and onnxruntime
- PITS PyTorch implementation of Variational Pitch Inference for End-to-end Pitch-controllable TTS. hf demo, samples
- VoiceCloning Implementing the YourTTS paper for Zero-Shot multi-speaker Attention-Based TTS using VITS approaches
- VITS-Umamusume-voice-synthesizer (Multilingual Anime TTS) Including Japanese TTS, Chinese and English TTS, speakers are all anime characters.
- Parallel WaveGAN implementation in PyTorch for high quality text to speech synthesis paper
- real-time-voice SV2TTS voice cloning TTS implementation using WaveRNN, Tacotron, GE2E
- voicebox-pytorch Implementation of Voicebox, new SOTA Text-to-speech network from MetaAI, in Pytorch
- voicepaw/so-vits-svc-fork SoftVC VITS Singing Voice Conversion Fork with realtime support and greatly improved interface. Based on so-vits-svc 4.0 (v1)
- Video tutorial by Nerdy Rodent
- nateraw/so-vits-svc-fork gradio app for inference of so-vits-svc-fork voice models + (training in colab with yt downloader and audio splitter, hf space demo)
- so-vits-svc-5.0
- LoRa svc singing voice conversion based on Whisper and LoRA
- RVC-Project simple and easy-to-use voice transformation (voice changer) web GUI based on VITS
- w-okada/voice-changer supports MMVC, so-vits-svc, RVC, DDSP-SVC, processing offloading over LAN, real time conversion
- DDSP-SVC Real-time singing voice conversion based on DDSP, training and inference uses lower requirements than diff-svc and so-vits-svc
- Leader board of SOTA models for stem separation using model ensembles in UVR
- VITS GUI to load VITS text to speech models
- Vits-fast-fine-tuning pipeline of VITS finetuning for fast speaker adaptation TTS, and many-to-many voice conversion
- AI-Cover-Song a Google Colab to do singing voice conversion with so-vits-svc-fork
- hf-rvc a package for RVC implementation using HuggingFace's transformers with the capability to convert from original unsafe models to HF models and voice conversion tasks
- VitsServer A VITS ONNX server designed for fast inference
- jax-so-vits-svc-5.0 Rewrite so-vits-svc-5.0 in jax
- w-okada/voice-changer | real time voice conversion using various models like MMVC, so-vits-svc, RVC, DDSP-SVC
- Diff-svc Singing Voice Conversion via Diffusion model
- FastDiff implementation | Fast Conditional Diffusion Model for High-Quality Speech Synthesis
- Fish Diffusion easy to understand TTS / SVS / SVC framework, can convert Diff models
- Real-Time-Voice-Cloning abandoned project
- Real-Time-Voice-Cloning v2 active fork of the original for Google Colab
- Raven with voice cloning 2.0 by Kevin676
- CoMoSpeech consistency model distilled from a diffusion-based teacher model, enabling high-quality one-step speech and singing voice synthesis
- NS2VC WIP Unofficial implementation of NaturalSpeech2 for Voice Conversion
- vc-lm train any-to-one voice conversion models, referencing Vall-E, using EnCodec to create tokens and building a transformer language model on the tokens
- knn-vc official implementation of Voice Conversion With Just Nearest Neighbors (kNN-VC) contains training and inference for any-to-any voice conversion model, paper, examples
- FreeVC High-Quality Text-Free One-Shot Voice Conversion including pretrained models HF demo, examples
- TriAAN-VC a PyTorch deep learning model for any-to-any voice conversion, with SOTA performance achieved by using an attention-based adaptive normalization block to extract target speaker representations while minimizing the loss of the source content. demo, paper
- EasyVC toolkit supporting various encoders and decoders, focusing on challenging VC scenarios such as one-shot, emotional, singing, and real-time. demo
- MoeVoiceStudio GUI supporting JOKE, SoVits, DiffSvc, DiffSinger, RVC, FishDiffusion
- MockingBird Clone a voice in 5 seconds to generate arbitrary speech in real-time
- weeablind dub multi lingual media using modern AI speech synthesis, diarization, and language identification
- Auto-synced-translated-dubs YouTube audio translation and dubbing pipeline using Whisper speech-to-text, Google/DeepL Translate, Azure/Google TTS
- videodubber dub video using GCP TTS, Translate, Whisper, Spacy tokenization and syllable counting
- TranslatorYouTuber Takes a YouTube video, clones the voice and re-creates that video in a different language
- global-video-dubbing Using Google Cloud Video Intelligence API with Cloud Translation API and Cloud Text to Speech API to generate voice dubbing and translations in many languages automatically
- wav2lip Lip Syncing from audio
- Wav2Lip-GFPGAN High quality Lip sync with wav2lip + Tencent GFPGAN
- audiocraft library for audio processing and generation with deep learning using EnCodec compressor / tokenizer and MusicGen support
- audiocraft-infinity-webui webui supporting generation longer than 30 seconds, song continuation, seed option, load local models from chavinlo's training repo, macOS/Linux support, running on CPU/GPU
- musicgen_trainer simple trainer for musicgen/audiocraft
- audiocraft-webui basic webui with support for long audio, segmented audio and processing queue
- audiocraft-webui another basic webui, unknown feature set
- MusicGeneration a Streamlit GUI for audiocraft and musicgen
- audiocraftgui with wxPython supporting continuous generation by using chunks and overlaps
- MusicGen a simple and controllable model for music generation using a Transformer model examples, colab, colab collection
- audiocraft-infinity-webui generation length over 30 seconds, ability to continue songs, seeds, allows to load local models
- AudioCraft Plus an all-in-one WebUI for the original AudioCraft, adding multiband diffusion, continuation, custom model support, mono to stereo and more
- AudioLDM Generate speech, sound effects, music and beyond, with text code, paper, HF demo
- Separate Anything You Describe Describe what you want to isolate from audio, Language-queried audio source separation (LASS), paper
- Vocos Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis
- WavJourney Compositional Audio Creation with LLMs github
- PromptingWhisper Audio-Visual Speech Recognition, Code-Switched Speech Recognition, and Zero-Shot Speech Translation for Whisper