Voi is a free and open-source backend for realtime voice agents. Check out the JS client, voi-js-client.
- 9 GB+ of GPU memory. I recommend a GeForce RTX 3090 or better for a single worker.
- An 8-core CPU and 32 GB of RAM are enough.
- 10 GB of disk space.
- Ubuntu 22.04 or higher.
- Recent Nvidia drivers (tested on driver versions 545+).
- Docker with Nvidia runtime support (a quick check is sketched below).
- Caddy server.
- LiteLLM server.
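A quick way to confirm that Docker can see the GPU is to run `nvidia-smi` inside a throwaway CUDA container; the image tag below is just an example, pick any CUDA base image that matches your driver.

```bash
# Should print the same GPU table as nvidia-smi on the host.
docker run --rm --gpus all nvidia/cuda:12.2.2-base-ubuntu22.04 nvidia-smi
```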
Voi uses Docker Compose to run the server. Docker is used mostly as a runtime, while the source code, Python packages and model weights stay on the host file system. This is intentional, to allow fast development.
There are two Docker environments for running the Voi server, production and development. They are basically the same, except that the production config starts the Voi server automatically and uses a different port.
Get the sources.
git clone https://github.com/alievk/voi-core.git
cd voi-core
Copy your `id_rsa.pub` into the `docker` folder to be able to ssh directly into the container.
cp ~/.ssh/id_rsa.pub docker/
Make a copy of `docker/docker-compose-dev.example.yml`.
cp docker/docker-compose-dev.example.yml docker/docker-compose-dev.yml
In `docker-compose-dev.yml`, edit the `environment`, `ports` and `volumes` sections as you need (a hypothetical sketch follows below). If you need a Jupyter server, set your token in the `JUPYTER_TOKEN` variable; otherwise it won't run, for safety reasons.
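For orientation, the relevant part of `docker-compose-dev.yml` might look roughly like the sketch below. The service name, ports and paths are made up for illustration; the real keys and defaults come from `docker-compose-dev.example.yml`.

```yaml
services:
  voi-dev:                                  # hypothetical service name
    environment:
      - JUPYTER_TOKEN=your_jupyter_token    # leave unset to disable Jupyter
    ports:
      - "2222:22"                           # ssh into the container
      - "8888:8888"                         # Jupyter
    volumes:
      - ..:/home/user/voi-core              # source code stays on the host
      - ./.local:/home/user/.local          # keeps pip packages across restarts
```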
Build and run the development container.
cd docker
./up-dev.sh
When the container is created, you will see `voice-agent-core-container-dev` in the `docker ps` output; otherwise check `docker-compose logs` for errors. If there were no errors, the ssh daemon and the Jupyter server will be listening on the ports defined in `docker-compose-dev.yml`.
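If something looks off, these two commands (run from the `docker` folder, assuming the file names above) usually pinpoint the problem:

```bash
docker ps --filter "name=voice-agent-core-container-dev"    # is the container up?
docker-compose -f docker-compose-dev.yml logs --tail=100    # build/startup errors
```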
Connect to the container via ssh from, e.g., your laptop:
ssh user@<host> -p <port>
where `<host>` is the address of your host machine and `<port>` is the port specified in `docker-compose-dev.yml`. You will see a bash prompt like `user@8846788f5e9c:~$`.
My personal recommendation is to add an entry to your `~/.ssh/config` file to easily connect to the container:
Host voi_docker_dev
    Hostname your_host_address
    AddKeysToAgent yes
    UseKeychain yes
    User user
    Port port_from_the_above
Then you can get into the container with just:
ssh voi_docker_dev
In the container, install the Python dependencies:
cd voi-core
./install.sh
This step is intentionally not incorporated into the Dockerfile because during active development you often change the requirements and don't want to rebuild the container each time. You won't need to repeat it every time the container restarts if you have mapped the `.local` directory properly in `docker-compose-dev.yml`.
The Voi server uses a secure WebSocket connection and relies on Caddy, which nicely manages SSL certificates for us. Follow the Caddy docs to install it.
On your host machine, make sure you have a proper config in the Caddyfile (usually `/etc/caddy/Caddyfile`):
your_domain.com:8774 {
    reverse_proxy localhost:8775
}
This makes Caddy terminate TLS on port `8774` and proxy the secure WebSocket traffic to the Voi server listening on local port `8775`, so clients connect to `wss://your_domain.com:8774`.
LiteLLM allows calling all LLM APIs using OpenAI format, which is neat.
If you run a Voi server in a country restricted by OpenAI (like Russia or China), you will need to run a remote LiteLLM server in the closest unrestricted country. You can do this for just $10/mo using AWS Lightsail. These are the minimal specs you need:
- 2 GB RAM, 2 vCPUs, 60 GB SSD
- Ubuntu
If you use AWS Lightsail, do not forget to add a custom TCP rule for port 4000.
If you are not in the restricted region, you can run LiteLLM server locally on your host machine.
For the details of setting up LiteLLM, visit its repo, but basically you need to follow these steps.
Get the code.
git clone https://github.com/BerriAI/litellm
cd litellm
Add the master key - you can change this after setup.
echo 'LITELLM_MASTER_KEY="sk-1234"' > .env
source .env
Create models configuration file.
vim litellm_config.yaml
Example configuration:
model_list:
  - model_name: gemini-1.5-flash
    litellm_params:
      model: openai/gemini-1.5-flash
      api_key: your_googleapi_key
      api_base: https://generativelanguage.googleapis.com/v1beta/openai
  - model_name: meta-llama-3.1-70b-instruct-turbo
    litellm_params:
      model: openai/meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo
      api_key: your_deepinfra_key
      api_base: https://api.deepinfra.com/v1/openai
The `model` format is `{API format}/{model name}`, where `{API format}` is `openai` or `anthropic`, and `{model name}` is the model name in the provider's format (`gpt-4o-mini` for OpenAI or `meta-llama/Meta-Llama-3.1-8B-Instruct` for DeepInfra). Look at `litellm_config.example.yaml` for more examples.
Start the LiteLLM server.
docker-compose up
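Once the proxy is up, you can sanity-check it with an OpenAI-style request; the model name and key must match your `litellm_config.yaml` and `.env`, and the prompt is arbitrary.

```bash
curl http://localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer sk-1234" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemini-1.5-flash",
    "messages": [{"role": "user", "content": "Say hi in one word."}]
  }'
```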
Before running the server, we need to set the environment variables and create the agents config.
My typical workflow is to run the development environment and ssh into the container using Cursor (`Connect to Host` -> `voi_docker_dev`). This way, I can edit source code and run the scripts in one place.
Make a copy of `.env.example`.
# Assuming you are in the Voi root
cp .env.example .env
- `LITELLM_API_BASE` is the address of your LiteLLM server, like `http://111.1.1.1:4000` or `http://localhost:4000`.
- `LITELLM_API_KEY` is `LITELLM_MASTER_KEY` from LiteLLM's `.env` file.
- `TOKEN_SECRET_KEY` is a secret key for generating access tokens for the websocket endpoint. You should not reveal this key to a client.
- `API_KEY` is the HTTPS API access key. You need to share it with a client. (A filled-in example follows below.)
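A filled-in `.env` might look like this; all values below are placeholders, so generate your own random secrets.

```bash
LITELLM_API_BASE=http://localhost:4000
LITELLM_API_KEY=sk-1234
TOKEN_SECRET_KEY=replace_with_a_long_random_string
API_KEY=replace_with_another_long_random_string
```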
Voi relies on Whisper for speech transcription and adds realtime (transcribe-as-you-speak) processing on top of it. The model weights are downloaded automatically on the first launch.
Voi uses the XTTS-v2 model to generate speech. It gives the best tradeoff between quality and speed.
To test your agents, you can download the pre-trained multi-speaker model from HuggingFace. Download these files and put them in a directory of your choice (e.g., `models/xtts_v2`):
- `model.pth`
- `config.json`
- `vocab.json`
- `speakers_xtts.pth`
Then make a copy of `tts_models.example.json` and fix the paths in `multispeaker_original` so that they point to the model files above.
cp tts_models.example.json tts_models.json
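The exact schema comes from `tts_models.example.json`; purely to illustrate the idea (the field names here are hypothetical), the `multispeaker_original` entry ends up pointing at the four files you downloaded:

```json
{
  "multispeaker_original": {
    "model_path": "models/xtts_v2/model.pth",
    "config_path": "models/xtts_v2/config.json",
    "vocab_path": "models/xtts_v2/vocab.json",
    "speakers_path": "models/xtts_v2/speakers_xtts.pth"
  }
}
```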
Voi allows changing the agent's voice tone dynamically during the conversation (e.g., neutral or excited), but the pre-trained model that comes with XTTS doesn't support this. I have a custom pipeline for fine-tuning text-to-speech models on audio datasets and enabling dynamic tone changing, which I'm not open sourcing today. If you need a custom model, please DM me on X.
Agents are defined in JSON files in the `agents` directory. The control agents are defined in `agents/control_agents.json`. To add a new agent, simply create a JSON file with agent configurations in the `agents` directory and it will be loaded when the server starts. A client can also send an agent config when opening a new connection using the `agent_config` field.
An example of agent configurations can be found in the voi-js-client repository.
Each agent configuration has the following structure (an illustrative sketch follows the field lists below):
- `llm_model`: The language model to use (must match a model in `litellm_config.yaml`)
- `control_agent` (optional): Name of an agent that filters/controls the main agent's responses
- `voices`: Configuration for speech synthesis
  - `character`: Main voice settings
    - `model`: TTS model name from `tts_models.json`
    - `voice`: Voice identifier for the model
    - `speed` (optional): Speech speed multiplier
  - `narrator` (optional): Voice for narrative comments. Same settings as `character`, plus:
    - `leading_silence`: Silence before narration
    - `trailing_silence`: Silence after narration
- `system_prompt`: Array of strings defining the agent's personality and behavior. Can include special templates:
  - `{character_agent_message_format_voice_tone}`: Adds instructions for voice tone control (neutral/warm/excited/sad)
  - `{character_agent_message_format_narrator_comments}`: Adds instructions for the narrator comments format (actions in third person)
- `examples` (optional): List of conversation examples for few-shot learning
- `greetings`: Initial messages configuration
  - `choices`: List of greeting messages (can include pre-cached voice files)
  - `voice_tone`: Emotional tone for the greeting (must match a tone in `tts_models.json`)
Special agents like `control_agent` can have additional fields:
- `model`: Processing type (e.g. "pattern_matching")
- `denial_phrases`: Phrases to filter out
- `giveup_after`: Number of retries before giving up
- `giveup_response`: Fallback responses
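Putting these fields together, a minimal agent definition might look like the sketch below. The agent name, voice identifier and prompt text are invented, and the overall file shape (agent names as top-level keys) is an assumption; the authoritative examples are in the `agents` directory and the voi-js-client repository.

```json
{
  "my_coach": {
    "llm_model": "gemini-1.5-flash",
    "voices": {
      "character": {
        "model": "multispeaker_original",
        "voice": "some_voice_id",
        "speed": 1.0
      }
    },
    "system_prompt": [
      "You are a calm, supportive coach.",
      "{character_agent_message_format_voice_tone}"
    ],
    "greetings": {
      "voice_tone": "neutral",
      "choices": ["Hi! What would you like to talk about today?"]
    }
  }
}
```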
SSH into the container and run:
python3 ws_server.py
Note that the first time a client connects to an agent, it may take some time to load the text-to-speech models.
Clients and agents communicate via the websocket. A client must receive its personal token to access the websocket endpoint. This can be done in two ways:
- Through the API:
curl -I -X POST "https://your_host_address:port/integrations/your_app" \
  -H "API-Key: your_api_key"
where `your_host_address:port` is the address of the host running the Voi server and `port` is the port where the server is listening. `your_app` is the name of your app, like `relationships_coach`. `your_api_key` is `API_KEY` from `.env`.
Note that this will generate a token which will expire after 1 day.
- Manually:
python3 token_generator.py your_app --expire n_days
Here you can set `n_days` to any number of days after which your token will expire.
- Make it open source
- Incoming calls
- Context gathering: understand the user's problem
- Function calling: add external actuators like DB inquiry
- Turn detection: detect the moment when the agent can start speaking
- Add a call-center-like voice
- WebRTC support
- VoIP support
- Outgoing calls
Realtime conversation with a human is a really complex task, as it requires empathy, competence and speed from the agent. If you lack a single one of these, your agent is useless. That's why making a good voice agent is not just stacking a bunch of APIs together. You have to develop it very carefully: make a small step, then test, make a small step, then test...
Two main factors enabled me to run this project. First, the emergence of smart, fast and cheap LLMs, which provide the intelligence agents need. Second, the advancement of code copilots. Though I have a deep learning background, building a good voice agent requires many topics beyond my competence.
While open sourcing Voi, I realized many people could use it to learn software engineering. Yes, that is still relevant, because this project is basically many pieces of AI-generated code carefully stitched together by a human engineer.
You are welcome to open PRs with bug fixes, new features and documentation improvements.
Or you can just buy me a coffee and I will convert it to code!
Voi uses the MIT license, which basically means you can do anything with it, free of charge. However, the dependencies may have different licenses. Check them if you care.