Like OpenAI's realtime voice AI, using open source models only.
vid.mp4
There are four components involved, mainly a Pipecat server, found in this repository which orchestrates the end-to-end pipeline, and three distinct models for each step in the pipeline: speech-to-text (STT) model, instruction-tuned text completion model and text-to-speech model (TTS). In this case, Whisper, SEA-LIONv2 and XTTS is used correspondingly. Other models can be substituted, such as Parler-TTS for TTS with some adjustments to the current code.
To support realtime capabilities, GPU-acceleration is required for running models, so you will need to host each model on a L40s GPU minimally based on my experience, to achieve realtime performance. The hosted server should then be integrated to the interface provided by Pipecat, which might require additional effort to do so depending on how the substitute model encodes input and decodes output. This part can be tricky and model dependent, especially for the STT and TTS step. See section Configuring STT and TTS Servers
below for more details, especially on known concurrency issue for TTS server.
- Install dependencies
python -m venv venv
source venv/bin/active # or OS equivalent
pip install -r requirements.txt
- Setup .env
cp env.example .env
Alternatively, you can configure your Modal app to use secrets
- Test the app locally
modal serve app.py # run the server
curl -X POST https://{modal_dev_url} # POST request to create Daily Room
- Deploy to production
modal deploy app.py
This app sets some sensible defaults for reducing cold starts, such as minkeep_warm=1
, which will keep at least 1 warm instance ready for your bot function.
It has been configured to only allow a concurrency of 1 (max_inputs=1
) as each user will require their own running function.
My Whisper streaming server implementation can be found here, which is ran using docker. Clone the repository to a GPU machine with Docker set up and run docker compose. Multithreaded streaming for this STT server should not be an issue.
XTTS streaming server is ran here using docker. Just clone into a GPU machine separate from the STT server and it should run. You could explore having both STT/TTS with a group of GPUs but will need to provide the batching logic for this.
**Currently multithreaded inference for TTS here is not implemented yet, which prevents the voice bot from serving multiple users at once. Implementing concurrent streaming decoding, required for realtime capabilities, seems to be non-trivial, and need to be patched for this demo to go live and serve multiple users.
Having tried model hosting services like Replicate, Modal, RunPod, LambdaLabs, I find the simplest way to deploy the inference servers to be a canonical AWS GPU server. You can use my Packer image to easily set up CUDA and Docker dependencies. The same image is used to set up both Whisper and XTTS servers above.
Hosting inference servers can be quite expensive, a single L40S up for 24 hours will cost about ~$5000/month which make API services more attractive.
Another option is to explore serverless inference for perpetual uptime or a using AWS's native scheduler to trigger a Lambda to shut down the server overnight if only used for development.
Code for the user interface used in the video can be found here.
Alternatively, you can call the modal endpoint using curl -X POST {MODAL_URL}
and recieve the daily room_url.
Example curl reponse:
{"room_url":"https://ob1-aisg.daily.co/pcJMY4YkWMrKuNdvHSA5","eyJhbGciOiJIUzI1NiIsIn...":"eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJyI..."}
- Modal Seamless M4T, code