Audio

Compression

EnCodec SOTA deep learning based audio codec supporting both mono 24 kHz audio and stereo 48 kHz audio
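For scale, the two formats named above can be compared by their uncompressed size. This is a minimal sketch, assuming 16-bit PCM samples (the bit depth is an assumption, not stated in this list); `pcm_bitrate_kbps` is a hypothetical helper and not part of EnCodec.

```python
def pcm_bitrate_kbps(sample_rate_hz: int, channels: int, bits_per_sample: int = 16) -> float:
    """Bitrate of uncompressed PCM audio in kilobits per second.

    bits_per_sample defaults to 16, which is an assumption made for illustration.
    """
    return sample_rate_hz * channels * bits_per_sample / 1000


# The two input formats listed for EnCodec
mono_24k = pcm_bitrate_kbps(24_000, channels=1)
stereo_48k = pcm_bitrate_kbps(48_000, channels=2)

print(f"24 kHz mono:   {mono_24k} kbps")
print(f"48 kHz stereo: {stereo_48k} kbps")
```

These figures are the raw input rates a codec starts from, not the rates EnCodec produces.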

Multiple Tasks

audio-webui A web-based UI for various audio-related Neural Networks with features like text-to-audio, voice cloning, and automatic-speech-recognition using Bark, AudioLDM, AudioCraft, RVC, coqui-ai and Whisper
tts-generation-webui for all things TTS, currently supports Bark v2, MusicGen, Tortoise, Vocos
Speechbrain A PyTorch-based Speech Toolkit for TTS, STT, etc
Nvidia NeMo TTS, LLM, Audio Synthesis framework
speech-rest-api for Speech-To-Text and Text-To-Speech with Whisper and Speechbrain
LangHelper language learning through Text-to-speech + chatGPT + speech-to-text to practise speaking assessments, memorizing words and listening tests
Silero-models pre-trained speech-to-text, text-to-speech and text-enhancement for ONNX, PyTorch, TensorFlow, SSML
AI-Waifu-Vtuber AI Waifu Vtuber is a virtual streamer. Supports multiple languages, uses VoiceVox, DeepL, Whisper, Silero TTS, and VtubeStudio, and now also supports Twitch streaming.
Voicebox large-scale text-guided generative speech model using non-autoregressive flow-matching, paper, demo, pytorch implementation, implementation
Auto-Synced-Translated-Dubs Automatic YouTube video speech to text, translation, text to speech in order to dub a whole video
SeamlessM4T Foundational Models for SOTA Speech and Text Translation

Speech Recognition

Whisper SOTA local open-source speech recognition in many languages and translation into English
- Whisper JAX implementation runs around 70x faster on CPU, GPU and TPU
- whisper.cpp C/C++ port for Intel and ARM based macOS, Android, iOS, Linux, WebAssembly, Windows, Raspberry Pi
- faster-whisper-livestream-translator A buggy proof of concept for real-time translation of livestreams using Whisper models, with suggestions for improvements including noise reduction and dual language subtitles
- Buzz Mac GUI for Whisper
- whisperX Fast automatic speech recognition (70x realtime with large-v2) using OpenAI's Whisper, word-level timestamps, speaker diarization, and voice activity detection
ermine-ai | Whisper in the browser using transformers.js
wav2vec2 dimensional emotion model
MeetingSummarizer using Whisper and GPT-3.5
Facebook MMS: Speech recognition of over 1000 languages
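A note on reading the "70x realtime" figure quoted for whisperX: a real-time factor of N is usually taken to mean N seconds of audio processed per second of compute. The sketch below is plain arithmetic under that reading; `processing_seconds` is a hypothetical helper, not a whisperX function, and actual speed depends on hardware and model.

```python
def processing_seconds(audio_seconds: float, realtime_factor: float) -> float:
    """Estimated compute time for a clip, given a speed expressed as a multiple of realtime."""
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return audio_seconds / realtime_factor


# One hour of audio at the quoted 70x realtime speed
one_hour = processing_seconds(3600, 70)
print(f"{one_hour:.1f} s")
```

Under this reading, an hour of audio takes a little under a minute to transcribe.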

TextToSpeech

Bark transformer-based text-to-audio model by Suno. Can generate highly realistic, multilingual speech and other audio like music, background noise and simple effects
- Bark-Voice-Clones
- Bark WebUI colab notebooks
- bark-with-voice-clone
- Bark Infinity for longer audio
- Bark WebUI
- bark-voice-cloning-HuBERT-quantizer Voice cloning with Bark in high quality using Python 3.10 and Hugging Face models.
- bark-gui Gradio Web UI for an extended Bark version, with long generation, cloning, SSML, for Windows and other platforms, supporting NVIDIA/Apple GPU/CPU
- bark-voice-cloning for Chinese speech, based on bark-gui by C0untFloyd
- Barkify unofficial training implementation of Bark TTS by suno-ai
- Bark-RVC Multilingual Speech Synthesis Voice Conversion using Bark + RVC
Coqui TTS | deep learning toolkit for Text-to-Speech
- Tutorial for Coqui VITS and Whisper to automate voice cloning and Colab notebook
StyleTTS implementation
- StyleTTS-VC One-Shot Voice Conversion by Knowledge Transfer from Style-Based TTS Models
Vall-E and Vall-E X, paper, code. Zero-shot TTS that preserves emotion, expression and speaker similarity and allows language transfer
- Vall-e PyTorch Implementation of Vall-E based on EnCodec tokenizer
- Vall-E PyTorch implementation
- Vall-E X open source implementation of Microsoft's VALL-E X zero-shot TTS model
NaturalSpeech implementation
- naturalspeech2-pytorch Implementation of Natural Speech 2, Zero-shot Speech and Singing Synthesizer, in Pytorch
IMS Toucan, TTS Toolkit from University of Stuttgart
YourTTS | Zero Shot Multi Speaker TTS and Voice Conversion for everyone
PaddleSpeech | Easy to use Speech Toolkit with Self Supervised learning, SOTA Streaming with punctuation, TTS, Translation etc
Tortoise TTS | Open source multi voice TTS system
- finetune guide using DLAS DL-Art-School, Master Deep Voice Cloning in Minutes
- DL-Art-School fine tuning tortoise with DLAS GUI
- tortoise-tts-fast fast Tortoise TTS inference up to 5x. Video tutorial
- Tortoise mrq fork for voice cloning
piper A fast, local neural text to speech system that sounds great and is optimized for the Raspberry Pi 4. Using VITS and onnxruntime
PITS PyTorch implementation of Variational Pitch Inference for End-to-end Pitch-controllable TTS. hf demo, samples
VoiceCloning Implementing the YourTTS paper for Zero-Shot multi-speaker Attention-Based TTS using VITS approaches
- VITS-Umamusume-voice-synthesizer (Multilingual Anime TTS) Including Japanese TTS, Chinese and English TTS, speakers are all anime characters.
Parallel WaveGAN implementation in PyTorch for high quality text to speech synthesis paper
real-time-voice SV2TTS voice cloning TTS implementation using WaveRNN, Tacotron, GE2E
voicebox-pytorch Implementation of Voicebox, new SOTA Text-to-speech network from MetaAI, in Pytorch

Voice Conversion

voicepaw/so-vits-svc-fork SoftVC VITS Singing Voice Conversion Fork with realtime support and greatly improved interface. Based on so-vits-svc 4.0 (v1)
- Video tutorial by Nerdy Rodent
- nateraw/so-vits-svc-fork gradio app for inference of so-vits-svc-fork voice models + (training in colab with yt downloader and audio splitter, hf space demo)
- so-vits-svc-5.0
- LoRa svc singing voice conversion based on Whisper and LoRA
- RVC-Project simple and easy-to-use voice transformation (voice changer) web GUI based on VITS
  - rvc-webui Win/Mac/Linux installer and Guide for RVC-Project
  - RVC-GUI fork of RVC for easy audio file voice conversion locally, only inference, no training
- w-okada/voice-changer supports MMVC, so-vits-svc, RVC, DDSP-SVC, processing offloading over LAN, real time conversion
- DDSP-SVC Real-time singing voice conversion based on DDSP, training and inference uses lower requirements than diff-svc and so-vits-svc
- Leader board of SOTA models for stem separation using model ensembles in UVR
- VITS GUI to load VITS text to speech models
- Vits-fast-fine-tuning pipeline of VITS finetuning for fast speaker adaptation TTS, and many-to-many voice conversion
- AI-Cover-Song a Google Colab notebook for singing voice conversion with so-vits-svc-fork
- hf-rvc a package for RVC implementation using HuggingFace's transformers with the capability to convert from original unsafe models to HF models and voice conversion tasks
- VitsServer A VITS ONNX server designed for fast inference
- jax-so-vits-svc-5.0 Rewrite so-vits-svc-5.0 in jax
w-okada/voice-changer | real time voice conversion using various models like MMVC, so-vits-svc, RVC, DDSP-SVC
Diff-svc Singing Voice Conversion via Diffusion model
FastDiff implementation | Fast Conditional Diffusion Model for High-Quality Speech Synthesis
Fish Diffusion easy to understand TTS / SVS / SVC framework, can convert Diff models
Real-Time-Voice-Cloning abandoned project
- Real-Time-Voice-Cloning v2 active fork of the original for Google Colab
Raven with voice cloning 2.0 by Kevin676
CoMoSpeech consistency model distilled from a diffusion-based teacher model, enabling high-quality one-step speech and singing voice synthesis
NS2VC WIP Unofficial implementation of NaturalSpeech2 for Voice Conversion
vc-lm train any-to-one voice conversion models, referencing Vall-E, using EnCodec to create tokens and building a transformer language model on those tokens
knn-vc official implementation of Voice Conversion With Just Nearest Neighbors (kNN-VC), containing training and inference code for an any-to-any voice conversion model, paper, examples
FreeVC High-Quality Text-Free One-Shot Voice Conversion including pretrained models HF demo, examples
TriAAN-VC a Pytorch deep learning model for any-to-any voice conversion, with SOTA performance achieved by using an attention-based adaptive normalization block to extract target speaker representations while minimizing the loss of the source content. demo, paper
EasyVC toolkit supporting various encoders and decoders, focusing on challenging VC scenarios such as one-shot, emotional, singing, and real-time. demo
MoeVoiceStudio GUI supporting JOKE, SoVits, DiffSvc, DiffSinger, RVC, FishDiffusion
MockingBird Clone a voice in 5 seconds to generate arbitrary speech in real-time

Video Voice Dubbing

weeablind dub multi lingual media using modern AI speech synthesis, diarization, and language identification
Auto-synced-translated-dubs YouTube audio translation and dubbing pipeline using Whisper speech-to-text, Google/DeepL Translate, Azure/Google TTS
videodubber dub video using GCP TTS, Translate, Whisper, Spacy tokenization and syllable counting
TranslatorYouTuber Takes a YouTube video, clones the voice and re-creates that video in a different language
global-video-dubbing Using Google Cloud Video Intelligence API with Cloud Translation API and Cloud Text to Speech API to generate voice dubbing and translations in many languages automatically
wav2lip Lip Syncing from audio
Wav2Lip-GFPGAN High quality Lip sync with wav2lip + Tencent GFPGAN

Music Generation

audiocraft library for audio processing and generation with deep learning using EnCodec compressor / tokenizer and MusicGen support
- audiocraft-infinity-webui webui supporting generation longer than 30 seconds, song continuation, seed option, loading local models from chavinlo's training repo, macOS/Linux support, running on CPU/GPU
- musicgen_trainer simple trainer for musicgen/audiocraft
- audiocraft-webui basic webui with support for long audio, segmented audio and processing queue
- audiocraft-webui another basic webui, unknown feature set
- MusicGeneration a streamlit gui for audiocraft and musicgen
- audiocraftgui with wxPython supporting continuous generation by using chunks and overlaps
- MusicGen a simple and controllable model for music generation using a Transformer model examples, colab, colab collection
- audiocraft-infinity-webui generation length over 30 seconds, ability to continue songs, seeds, allows to load local models
- AudioCraft Plus an all-in-one WebUI for the original AudioCraft, adding multiband diffusion, continuation, custom model support, mono to stereo and more
AudioLDM Generate speech, sound effects, music and beyond, with text code, paper, HF demo
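Several of the entries above mention generating audio longer than a single pass by working in chunks with overlaps. One common way to join such chunks is a linear crossfade over the overlapping samples. The following is a minimal sketch of that idea in plain Python, assuming each chunk is a list of samples and consecutive chunks share `overlap` samples; `crossfade_join` is a hypothetical helper and is not taken from audiocraftgui or any other project listed here.

```python
from typing import List


def crossfade_join(chunks: List[List[float]], overlap: int) -> List[float]:
    """Join audio chunks, linearly crossfading `overlap` samples at each seam."""
    if not chunks:
        return []
    out = list(chunks[0])
    for chunk in chunks[1:]:
        if overlap <= 0:
            # No overlap requested: plain concatenation
            out.extend(chunk)
            continue
        if overlap > len(out) or overlap > len(chunk):
            raise ValueError("overlap is longer than one of the chunks")
        tail_start = len(out) - overlap
        for i in range(overlap):
            # Weight ramps up for the incoming chunk and down for the outgoing one
            w = (i + 1) / (overlap + 1)
            out[tail_start + i] = (1 - w) * out[tail_start + i] + w * chunk[i]
        # Append the part of the new chunk that lies beyond the overlap
        out.extend(chunk[overlap:])
    return out


# Two 4-sample chunks sharing 2 samples give 6 samples in total
joined = crossfade_join([[0.0, 0.0, 0.0, 0.0], [3.0, 3.0, 3.0, 3.0]], overlap=2)
print(joined)
```

A linear ramp is the simplest choice; an equal-power curve is often preferred for uncorrelated material, and real tools may condition each new chunk on the previous one instead of blending.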

Audio Source Separation

Separate Anything You Describe Describe what you want to isolate from audio, Language-queried audio source separation (LASS), paper

Research

Vocos Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis
WavJourney Compositional Audio Creation with LLMs github
PromptingWhisper Audio-Visual Speech Recognition, Code-Switched Speech Recognition, and Zero-Shot Speech Translation for Whisper
