Realtime Voice AI using Open Source models only

Like OpenAI's realtime voice AI, using open source models only.

vid.mp4

Set Up

Overview

There are four components involved: a Pipecat server, found in this repository, which orchestrates the end-to-end pipeline, and three models, one for each step in the pipeline: a speech-to-text (STT) model, an instruction-tuned text completion model, and a text-to-speech (TTS) model. In this case, Whisper, SEA-LIONv2 and XTTS are used respectively. Other models can be substituted, such as Parler-TTS for TTS, with some adjustments to the current code.
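As a rough mental model of one conversational turn, the three stages chain together as shown below. This is a simplified sketch, not the code in this repository: the real pipeline streams audio frames through Pipecat and each stage is a call to a GPU-hosted model server. The names `fake_stt`, `fake_llm`, `fake_tts` and `run_turn` are hypothetical stand-ins.

```python
# Hypothetical stand-ins for the three model servers. In the real setup each
# stage is a network call to a hosted model, and audio is streamed, not batched.

def fake_stt(audio: bytes) -> str:
    """Pretend to transcribe audio (the role Whisper plays here)."""
    return audio.decode("utf-8")


def fake_llm(prompt: str) -> str:
    """Pretend to generate a reply (the role SEA-LIONv2 plays here)."""
    return f"You said: {prompt}"


def fake_tts(text: str) -> bytes:
    """Pretend to synthesize speech (the role XTTS plays here)."""
    return text.encode("utf-8")


def run_turn(audio: bytes, stt=fake_stt, llm=fake_llm, tts=fake_tts) -> bytes:
    """Run one turn: audio in -> transcript -> reply text -> audio out."""
    transcript = stt(audio)
    reply = llm(transcript)
    return tts(reply)


if __name__ == "__main__":
    print(run_turn(b"hello"))
```

Because each stage is passed in as a plain callable, swapping a model (for example Parler-TTS for XTTS) only means supplying a different function with the same input and output types.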

Realtime performance requires GPU acceleration, so you will need to host each model on a GPU; in my experience, an L40S is the minimum. Each hosted server then has to be integrated with the interface provided by Pipecat, which may take additional effort depending on how the substitute model encodes its input and decodes its output. This part can be tricky and model dependent, especially for the STT and TTS steps. See the section Configuring STT and TTS Servers below for more details, especially on a known concurrency issue with the TTS server.

Pipecat Orchestration Server

Install dependencies

python -m venv venv
source venv/bin/activate # or OS equivalent
pip install -r requirements.txt

Set up .env

cp env.example .env

Alternatively, you can configure your Modal app to use secrets.

Test the app locally

modal serve app.py # run the server
curl -X POST https://{modal_dev_url} # POST request to create Daily Room

Deploy to production

modal deploy app.py

Configuration options

This app sets some sensible defaults for reducing cold starts, such as minkeep_warm=1, which will keep at least 1 warm instance ready for your bot function.

It has been configured to only allow a concurrency of 1 (max_inputs=1) as each user will require their own running function.

Configuring STT and TTS Servers

Speech-to-Text (STT)

My Whisper streaming server implementation can be found here and is run using Docker. Clone the repository onto a GPU machine with Docker set up and run docker compose. Multithreaded streaming should not be an issue for this STT server.

Text-to-Speech (TTS)

The XTTS streaming server is run from here using Docker. Clone it onto a GPU machine separate from the STT server and it should run. You could explore serving both STT and TTS from a shared group of GPUs, but you will need to provide the batching logic for this.

Multithreaded inference for the TTS server is not implemented yet, which prevents the voice bot from serving multiple users at once. Implementing concurrent streaming decoding, which is required for realtime capabilities, seems to be non-trivial, and it needs to be patched before this demo can go live and serve multiple users.
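Until that is patched, one possible stopgap is to serialize requests on the client side so that only one synthesis call reaches the TTS server at a time. The sketch below is not part of this repository and does not provide real concurrency: later callers simply wait their turn, so latency grows with the number of users. It assumes the TTS call can be wrapped in an async function; `SerializedTTS` and `fake_synthesize` are hypothetical names.

```python
import asyncio


class SerializedTTS:
    """Let only one synthesis call run at a time against a single TTS server."""

    def __init__(self, synthesize):
        # `synthesize` is any async callable mapping text -> audio bytes
        # (hypothetical; the real call would hit the XTTS server).
        self._synthesize = synthesize
        self._lock = asyncio.Lock()

    async def speak(self, text: str) -> bytes:
        # Callers queue on the lock instead of hitting the server concurrently.
        async with self._lock:
            return await self._synthesize(text)


async def fake_synthesize(text: str) -> bytes:
    """Stand-in for a real TTS request; sleeps briefly to simulate work."""
    await asyncio.sleep(0.01)
    return text.encode("utf-8")


async def demo() -> list:
    tts = SerializedTTS(fake_synthesize)
    # Both requests are issued together, but they run one after the other.
    return await asyncio.gather(tts.speak("hello"), tts.speak("world"))


if __name__ == "__main__":
    print(asyncio.run(demo()))
```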

Cloud Hosting Set Up

Having tried model hosting services like Replicate, Modal, RunPod and LambdaLabs, I find the simplest way to deploy the inference servers is a canonical AWS GPU server. You can use my Packer image to easily set up the CUDA and Docker dependencies. The same image is used to set up both the Whisper and XTTS servers above.

Hosting inference servers can be quite expensive: a single L40S running 24 hours a day will cost about $5000/month, which makes API services more attractive.
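To see how much of that bill comes from idle hours, the arithmetic can be sketched as follows. The hourly rate is simply back-calculated from the roughly $5000/month figure above, and the 10-hour day and 30-day month are assumptions for illustration, not quoted AWS prices.

```python
def monthly_cost(hourly_rate: float, hours_per_day: float, days: int = 30) -> float:
    """Cost of running one server for `hours_per_day` hours on each of `days` days."""
    return hourly_rate * hours_per_day * days


# Hourly rate implied by ~$5000/month of 24-hour uptime (assumes a 30-day month).
implied_hourly_rate = 5000 / (24 * 30)

always_on = monthly_cost(implied_hourly_rate, hours_per_day=24)
# Hypothetical development schedule: server only up 10 hours a day.
dev_hours_only = monthly_cost(implied_hourly_rate, hours_per_day=10)

if __name__ == "__main__":
    print(f"implied hourly rate: ${implied_hourly_rate:.2f}")
    print(f"always on:           ${always_on:.2f}/month")
    print(f"10 hours per day:    ${dev_hours_only:.2f}/month")
```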

Another option is to explore serverless inference for perpetual uptime, or to use AWS's native scheduler to trigger a Lambda that shuts down the server overnight if it is only used for development.
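A minimal sketch of such a shutdown Lambda is shown below, assuming the function is triggered on a schedule and has IAM permission to stop EC2 instances. It is not part of this repository. The `INSTANCE_IDS` environment variable and the `stop_instances` helper are hypothetical names; the underlying boto3 call is the EC2 client's `stop_instances`.

```python
import os


def stop_instances(ec2_client, instance_ids):
    """Ask EC2 to stop the given instances; return the IDs EC2 reports as stopping."""
    if not instance_ids:
        return []
    response = ec2_client.stop_instances(InstanceIds=list(instance_ids))
    return [item["InstanceId"] for item in response.get("StoppingInstances", [])]


def lambda_handler(event, context):
    """Lambda entry point, intended to be invoked by a nightly schedule."""
    import boto3  # imported here so the helper above can be tested without AWS

    # INSTANCE_IDS is an assumed comma-separated list, e.g. the STT and TTS servers.
    raw_ids = os.environ.get("INSTANCE_IDS", "")
    instance_ids = [value.strip() for value in raw_ids.split(",") if value.strip()]
    return {"stopped": stop_instances(boto3.client("ec2"), instance_ids)}
```

Passing the client into the helper keeps the AWS dependency at the edge, so the stop logic can be exercised locally with a fake client before it is deployed.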

User Interface

Code for the user interface used in the video can be found here.

Alternatively, you can call the Modal endpoint using curl -X POST {MODAL_URL} and receive the Daily room_url.

Example curl response:

{"room_url":"https://ob1-aisg.daily.co/pcJMY4YkWMrKuNdvHSA5","eyJhbGciOiJIUzI1NiIsIn...":"eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJyI..."}
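If you would rather make the same request from Python than from curl, a standard-library sketch might look like this. It assumes only what the example response shows, namely a JSON body with a `room_url` key; `create_room` and `extract_room_url` are hypothetical helper names, and the URL you pass in is your own Modal endpoint.

```python
import json
import urllib.request


def create_room(modal_url: str, timeout: float = 30.0) -> dict:
    """POST to the Modal endpoint and return the decoded JSON body."""
    request = urllib.request.Request(modal_url, data=b"", method="POST")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def extract_room_url(payload: dict) -> str:
    """Return the Daily room URL, failing loudly if the key is missing."""
    room_url = payload.get("room_url")
    if not room_url:
        raise ValueError("response did not contain a room_url")
    return room_url
```

Usage would be along the lines of `extract_room_url(create_room(MODAL_URL))`, where `MODAL_URL` is the address printed by modal serve or modal deploy.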

