Voi is a free and open-source backend for realtime voice agents. Check out the JS client, voi-js-client.
- 9 GB+ of GPU memory. I recommend a GeForce RTX 3090 or better for a single worker.
- An 8-core CPU and 32 GB of RAM are enough.
- 10 GB of disk space.
- Ubuntu 22.04 or higher.
- Recent Nvidia drivers (tested on driver versions 545+).
- Docker with Nvidia runtime support (a quick check is sketched below).
- Caddy server.
- LiteLLM server.
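A quick way to confirm that Docker can see the GPU is to run `nvidia-smi` inside a throwaway CUDA container; the image tag below is just an example, pick any CUDA base image that matches your driver.

```bash
# Should print the same GPU table as nvidia-smi on the host.
docker run --rm --gpus all nvidia/cuda:12.2.2-base-ubuntu22.04 nvidia-smi
```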
Voi uses Docker Compose to run the server. Docker is used mostly as a runtime, while the source code, Python packages and model weights stay on the host file system. This is intentional, to allow fast development.
There are two Docker environments for running the Voi server, production and development. They are basically the same, except that the production config starts the Voi server automatically and uses a different port.
Get the sources.
git clone https://github.com/alievk/voi-core.git
cd voi-core
Copy your `id_rsa.pub` into the `docker` folder to be able to ssh directly into the container.
cp ~/.ssh/id_rsa.pub docker/
Make a copy of `docker/docker-compose-dev.example.yml`.
cp docker/docker-compose-dev.example.yml docker/docker-compose-dev.yml
In `docker-compose-dev.yml`, edit the `environment`, `ports` and `volumes` sections as you need (a hypothetical sketch follows below). If you need a Jupyter server, set your token in the `JUPYTER_TOKEN` variable; otherwise it won't run, for safety reasons.
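For orientation, the relevant part of `docker-compose-dev.yml` might look roughly like the sketch below. The service name, ports and paths are made up for illustration; the real keys and defaults come from `docker-compose-dev.example.yml`.

```yaml
services:
  voi-dev:                                  # hypothetical service name
    environment:
      - JUPYTER_TOKEN=your_jupyter_token    # leave unset to disable Jupyter
    ports:
      - "2222:22"                           # ssh into the container
      - "8888:8888"                         # Jupyter
    volumes:
      - ..:/home/user/voi-core              # source code stays on the host
      - ./.local:/home/user/.local          # keeps pip packages across restarts
```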
Build and run the development container.
cd docker
./up-dev.sh
When the container is created, you will see `voice-agent-core-container-dev` in the `docker ps` output; otherwise check `docker-compose logs` for errors. If there were no errors, the ssh daemon and the Jupyter server will be listening on the ports defined in `docker-compose-dev.yml`.
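If something looks off, these two commands (run from the `docker` folder, assuming the file names above) usually pinpoint the problem:

```bash
docker ps --filter "name=voice-agent-core-container-dev"    # is the container up?
docker-compose -f docker-compose-dev.yml logs --tail=100    # build/startup errors
```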
Connect to the container via ssh from, e.g., your laptop:
ssh user@<host> -p <port>
where `<host>` is the address of your host machine and `<port>` is the port specified in `docker-compose-dev.yml`. You will see a bash prompt like `user@8846788f5e9c:~$`.
My personal recommendation is to add an entry to your `~/.ssh/config` file to easily connect to the container:
Host voi_docker_dev
    Hostname your_host_address
    AddKeysToAgent yes
    UseKeychain yes
    User user
    Port port_from_the_above
Then you can get into the container with just:
ssh voi_docker_dev
In the container, install the Python dependencies:
cd voi-core
./install.sh
This step is intentionally not incorporated into the Dockerfile because during active development you often change the requirements and don't want to rebuild the container each time. You won't need to repeat it every time the container restarts if you have mapped the `.local` directory properly in `docker-compose-dev.yml`.
The Voi server uses a secure WebSocket connection and relies on Caddy, which nicely manages SSL certificates for us. Follow the Caddy docs to install it.
On your host machine, make sure you have a proper config in the Caddyfile (usually `/etc/caddy/Caddyfile`):
your_domain.com:8774 {
    reverse_proxy localhost:8775
}
This makes Caddy terminate TLS on port `8774` and proxy the secure WebSocket traffic to the Voi server listening on local port `8775`, so clients connect to `wss://your_domain.com:8774`.
LiteLLM allows calling all LLM APIs using OpenAI format, which is neat.
If you run a Voi server in a country restricted by OpenAI (like Russia or China), you will need to run a remote LiteLLM server in the closest unrestricted country. You can do this for just $10/mo using AWS Lightsail. These are the minimal specs you need:
- 2 GB RAM, 2 vCPUs, 60 GB SSD
- Ubuntu
If you use AWS Lightsail, do not forget to add a custom TCP rule for port 4000.
If you are not in the restricted region, you can run LiteLLM server locally on your host machine.
For the details of setting up LiteLLM, visit its repo, but basically you need to follow these steps.
Get the code.
git clone https://github.com/BerriAI/litellm
cd litellm
Add the master key - you can change this after setup.
echo 'LITELLM_MASTER_KEY="sk-1234"' > .env
source .env
Create models configuration file.
vim litellm_config.yaml
Example configuration:
model_list:
  - model_name: gemini-1.5-flash
    litellm_params:
      model: openai/gemini-1.5-flash
      api_key: your_googleapi_key
      api_base: https://generativelanguage.googleapis.com/v1beta/openai
  - model_name: meta-llama-3.1-70b-instruct-turbo
    litellm_params:
      model: openai/meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo
      api_key: your_deepinfra_key
      api_base: https://api.deepinfra.com/v1/openai
The `model` format is `{API format}/{model name}`, where `{API format}` is `openai` or `anthropic`, and `{model name}` is the model name in the provider's format (`gpt-4o-mini` for OpenAI or `meta-llama/Meta-Llama-3.1-8B-Instruct` for DeepInfra). Look at `litellm_config.example.yaml` for more examples.
Start the LiteLLM server.
docker-compose up
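Once the proxy is up, you can sanity-check it with an OpenAI-style request; the model name and key must match your `litellm_config.yaml` and `.env`, and the prompt is arbitrary.

```bash
curl http://localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer sk-1234" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemini-1.5-flash",
    "messages": [{"role": "user", "content": "Say hi in one word."}]
  }'
```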
Before running the server, we need to set the environment variables and create the agents config.
My typical workflow is to run the development environment and ssh into the container using Cursor (`Connect to Host` -> `voi_docker_dev`). This way, I can edit source code and run the scripts in one place.
Make a copy of `.env.example`.
# Assuming you are in the Voi root
cp .env.example .env
- `LITELLM_API_BASE` is the address of your LiteLLM server, like `http://111.1.1.1:4000` or `http://localhost:4000`.
- `LITELLM_API_KEY` is `LITELLM_MASTER_KEY` from LiteLLM's `.env` file.
- `TOKEN_SECRET_KEY` is a secret key for generating access tokens for the websocket endpoint. You should not reveal this key to a client.
- `API_KEY` is the HTTPS API access key. You need to share it with a client. (A filled-in example follows below.)
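A filled-in `.env` might look like this; all values below are placeholders, so generate your own random secrets.

```bash
LITELLM_API_BASE=http://localhost:4000
LITELLM_API_KEY=sk-1234
TOKEN_SECRET_KEY=replace_with_a_long_random_string
API_KEY=replace_with_another_long_random_string
```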
Voi relies on Whisper for speech transcription and adds realtime (transcribe-as-you-speak) processing on top of it. The model weights are downloaded automatically on the first launch.
Voi uses the XTTS-v2 model to generate speech. It gives the best tradeoff between quality and speed.
To test your agents, you can download the pre-trained multi-speaker model from HuggingFace. Download these files and put them in a directory of your choice (e.g., `models/xtts_v2`):
- `model.pth`
- `config.json`
- `vocab.json`
- `speakers_xtts.pth`
Then make a copy of `tts_models.example.json` and fix the paths in `multispeaker_original` so that they point to the model files above.
cp tts_models.example.json tts_models.json
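The exact schema comes from `tts_models.example.json`; purely to illustrate the idea (the field names here are hypothetical), the `multispeaker_original` entry ends up pointing at the four files you downloaded:

```json
{
  "multispeaker_original": {
    "model_path": "models/xtts_v2/model.pth",
    "config_path": "models/xtts_v2/config.json",
    "vocab_path": "models/xtts_v2/vocab.json",
    "speakers_path": "models/xtts_v2/speakers_xtts.pth"
  }
}
```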
Voi allows changing the agent's voice tone dynamically during the conversation (e.g., neutral or excited), but the pre-trained model that comes with XTTS doesn't support this. I have a custom pipeline for fine-tuning text-to-speech models on audio datasets and enabling dynamic tone changing, which I'm not open sourcing today. If you need a custom model, please DM me on X.
Agents are defined in JSON files in the `agents` directory. The control agents are defined in `agents/control_agents.json`. To add a new agent, simply create a JSON file with agent configurations in the `agents` directory and it will be loaded when the server starts. A client can also send an agent config when opening a new connection using the `agent_config` field.
An example of agent configurations can be found in the voi-js-client repository.
Each agent configuration has the following structure (an illustrative sketch follows the field lists below):
- `llm_model`: The language model to use (must match a model in `litellm_config.yaml`)
- `control_agent` (optional): Name of an agent that filters/controls the main agent's responses
- `voices`: Configuration for speech synthesis
  - `character`: Main voice settings
    - `model`: TTS model name from `tts_models.json`
    - `voice`: Voice identifier for the model
    - `speed` (optional): Speech speed multiplier
  - `narrator` (optional): Voice for narrative comments. Same settings as `character`, plus:
    - `leading_silence`: Silence before narration
    - `trailing_silence`: Silence after narration
- `system_prompt`: Array of strings defining the agent's personality and behavior. Can include special templates:
  - `{character_agent_message_format_voice_tone}`: Adds instructions for voice tone control (neutral/warm/excited/sad)
  - `{character_agent_message_format_narrator_comments}`: Adds instructions for the narrator comments format (actions in third person)
- `examples` (optional): List of conversation examples for few-shot learning
- `greetings`: Initial messages configuration
  - `choices`: List of greeting messages (can include pre-cached voice files)
  - `voice_tone`: Emotional tone for the greeting (must match a tone in `tts_models.json`)
Special agents like `control_agent` can have additional fields:
- `model`: Processing type (e.g. "pattern_matching")
- `denial_phrases`: Phrases to filter out
- `giveup_after`: Number of retries before giving up
- `giveup_response`: Fallback responses
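Putting these fields together, a minimal agent definition might look like the sketch below. The agent name, voice identifier and prompt text are invented, and the overall file shape (agent names as top-level keys) is an assumption; the authoritative examples are in the `agents` directory and the voi-js-client repository.

```json
{
  "my_coach": {
    "llm_model": "gemini-1.5-flash",
    "voices": {
      "character": {
        "model": "multispeaker_original",
        "voice": "some_voice_id",
        "speed": 1.0
      }
    },
    "system_prompt": [
      "You are a calm, supportive coach.",
      "{character_agent_message_format_voice_tone}"
    ],
    "greetings": {
      "voice_tone": "neutral",
      "choices": ["Hi! What would you like to talk about today?"]
    }
  }
}
```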
SSH into the container and run:
python3 ws_server.py
Note that the first time a client connects to an agent, it may take some time to load the text-to-speech models.
Clients and agents communicate via the websocket. A client must receive its personal token to access the websocket endpoint. This can be done in two ways:
- Through the API:
curl -I -X POST "https://your_host_address:port/integrations/your_app" \
  -H "API-Key: your_api_key"
where `your_host_address:port` is the address of the host running the Voi server and `port` is the port where the server is listening. `your_app` is the name of your app, like `relationships_coach`. `your_api_key` is `API_KEY` from `.env`.
Note that this will generate a token which will expire after 1 day.
- Manually:
python3 token_generator.py your_app --expire n_days
Here you can set `n_days` to any number of days after which your token will expire.
- Make it open source
- Incoming calls
- Context gathering: understand the user's problem
- Function calling: add external actuators like DB inquiry
- Turn detection: detect the moment when the agent can start speaking
- Add a call-center-like voice
- WebRTC support
- VoIP support
- Outgoing calls
Realtime conversation with a human is a really complex task, as it requires empathy, competence and speed from the agent. If you lack a single one of these, your agent is useless. That's why making a good voice agent is not just stacking a bunch of APIs together. You have to develop it very carefully: make a small step, then test, make a small step, then test...
Two main factors enabled me to run this project. First, the emergence of smart, fast and cheap LLMs, which provide the intelligence agents need. Second, the advancement of code copilots. Though I have a deep learning background, building a good voice agent requires many topics beyond my competence.
While open sourcing Voi, I realized many people could use it to learn software engineering. Yes, that is still relevant, because this project is basically many pieces of AI-generated code carefully stitched together by a human engineer.
You are welcome to open PRs with bug fixes, new features and documentation improvements.
Or you can just buy me a coffee and I will convert it to code!
Voi uses the MIT license, which basically means you can do anything with it, free of charge. However, the dependencies may have different licenses. Check them if you care.