
voice-fn - Real-time Voice AI Pipeline Framework

voice-fn is a Clojure framework for building real-time voice AI applications using a data-driven, functional approach. Built on top of clojure.core.async.flow, it provides a composable pipeline architecture for processing audio, text, and AI interactions with built-in support for major AI providers.

This project is experimental; expect breaking changes.


Core Features

  • Flow-Based Architecture: Built on core.async.flow for robust concurrent processing
  • Data-First Design: Define AI pipelines as data structures for easy configuration and modification
  • Streaming Architecture: Efficient real-time audio and text processing
  • Extensible Processors: Simple protocol-based system for adding new processing components
  • Flexible Frame System: Type-safe message passing between pipeline components
  • Built-in Services: Ready-to-use integrations with major AI providers

Quick Start: Local example

First, create a resources/secrets.edn:

{:deepgram {:api-key ""}
 :elevenlabs {:api-key ""
              :voice-id ""}
 :groq {:api-key ""}
 :openai {:new-api-sk ""}}

Obtain the API keys from the respective providers and fill in the blank values.
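Processors read these credentials through voice-fn.secrets/secret, as seen in the pipeline below. A minimal sketch of such a helper, assuming it simply reads resources/secrets.edn off the classpath (the actual implementation may differ):

(ns voice-fn.secrets
  (:require
   [clojure.edn :as edn]
   [clojure.java.io :as io]))

(defn secret
  "Look up a credential by its path, e.g. (secret [:deepgram :api-key])."
  [path]
  (-> (io/resource "secrets.edn")
      slurp
      edn/read-string
      (get-in path)))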

Start a REPL and evaluate the snippets in the (comment ...) blocks to start the flows. Allow microphone access when prompted.

(ns voice-fn-examples.local
  (:require
   [clojure.core.async :as a]
   [clojure.core.async.flow :as flow]
   [taoensso.telemere :as t]
   [voice-fn.processors.deepgram :as asr]
   [voice-fn.processors.elevenlabs :as tts]
   [voice-fn.processors.llm-context-aggregator :as context]
   [voice-fn.processors.openai :as llm]
   [voice-fn.secrets :refer [secret]]
   [voice-fn.transport :as transport]
   [voice-fn.utils.core :as u]))

(defn make-local-flow
  "Create a voice AI agent flow that runs on the local computer. Audio is
  encoded as signed PCM at a 16kHz sample rate, mono (1 channel)."
  ([] (make-local-flow {}))
  ([{:keys [llm-context extra-procs extra-conns encoding debug?
            sample-rate language sample-size-bits channels chunk-duration-ms]
     :or {llm-context {:messages [{:role "system"
                                   :content "You are a helpful assistant "}]}
          encoding :pcm-signed
          sample-rate 16000
          sample-size-bits 16
          channels 1
          chunk-duration-ms 20
          language :en
          debug? false
          extra-procs {}
          extra-conns []}}]

   (flow/create-flow
     {:procs
      (u/deep-merge
        {;; Capture audio from microphone and send raw-audio-input frames further in the pipeline
         :transport-in {:proc transport/microphone-transport-in
                        :args {:audio-in/sample-rate sample-rate
                               :audio-in/channels channels
                               :audio-in/sample-size-bits sample-size-bits}}
         ;; raw-audio-input -> transcription frames
         :transcriptor {:proc asr/deepgram-processor
                        :args {:transcription/api-key (secret [:deepgram :api-key])
                               :transcription/interim-results? true
                               :transcription/punctuate? false
                               :transcription/vad-events? true
                               :transcription/smart-format? true
                               :transcription/model :nova-2
                               :transcription/utterance-end-ms 1000
                               :transcription/language language
                               :transcription/encoding encoding
                               :transcription/sample-rate sample-rate}}

         ;; user transcription & llm message frames -> llm-context frames
         ;; responsible for keeping the full conversation history
         :context-aggregator  {:proc context/context-aggregator
                               :args {:llm/context llm-context
                                      :aggregator/debug? debug?}}

         ;; Takes llm-context frames and produces new llm-text-chunk & llm-tool-call-chunk frames
         :llm {:proc llm/openai-llm-process
               :args {:openai/api-key (secret [:openai :new-api-sk])
                      :llm/model "gpt-4o-mini"}}

         ;; llm-text-chunk & llm-tool-call-chunk -> llm-context-messages-append frames
         :assistant-context-assembler {:proc context/assistant-context-assembler
                                       :args {:debug? debug?}}

         ;; llm-text-chunk -> sentence speak frames (faster for text to speech)
         :llm-sentence-assembler {:proc context/llm-sentence-assembler}

         ;; speak-frames -> audio-output-raw frames
         :tts {:proc tts/elevenlabs-tts-process
               :args {:elevenlabs/api-key (secret [:elevenlabs :api-key])
                      :elevenlabs/model-id "eleven_flash_v2_5"
                      :elevenlabs/voice-id (secret [:elevenlabs :voice-id])
                      :voice/stability 0.5
                      :voice/similarity-boost 0.8
                      :voice/use-speaker-boost? true
                      :flow/language language
                      :audio.out/encoding encoding
                      :audio.out/sample-rate sample-rate}}

         ;; audio-output-raw -> smaller audio-output-raw frames (used for sending audio in realtime)
         :audio-splitter {:proc transport/audio-splitter
                          :args {:audio.out/sample-rate sample-rate
                                 :audio.out/sample-size-bits sample-size-bits
                                 :audio.out/channels channels
                                 :audio.out/duration-ms chunk-duration-ms}}

         ;; speakers out
         :transport-out {:proc transport/realtime-speakers-out-processor
                         :args {:audio.out/sample-rate sample-rate
                                :audio.out/sample-size-bits sample-size-bits
                                :audio.out/channels channels
                                :audio.out/duration-ms chunk-duration-ms}}}
        extra-procs)
      :conns (concat
               [[[:transport-in :out] [:transcriptor :in]]

                [[:transcriptor :out] [:context-aggregator :in]]
                [[:context-aggregator :out] [:llm :in]]

                ;; Aggregate full context
                [[:llm :out] [:assistant-context-assembler :in]]
                [[:assistant-context-assembler :out] [:context-aggregator :in]]

                ;; Assemble sentence by sentence for fast speech
                [[:llm :out] [:llm-sentence-assembler :in]]
                [[:llm-sentence-assembler :out] [:tts :in]]

                [[:tts :out] [:audio-splitter :in]]
                [[:audio-splitter :out] [:transport-out :in]]]
               extra-conns)})))

(def local-ai (make-local-flow))

(comment

  ;; Start local ai flow - starts paused
  (let [{:keys [report-chan error-chan]} (flow/start local-ai)]
    (a/go-loop []
      (when-let [[msg c] (a/alts! [report-chan error-chan])]
        (when (map? msg)
          (t/log! {:level :debug :id (if (= c error-chan) :error :report)} msg))
        (recur))))

  ;; Resume local ai -> you can now speak with the AI
  (flow/resume local-ai)

  ;; Stop the conversation
  (flow/stop local-ai)

  ,)

This roughly translates to:

Flow Diagram

See the examples for more use cases.
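Because the flow definition is plain data, you can extend the pipeline without touching its core: make-local-flow deep-merges :extra-procs into :procs and concatenates :extra-conns onto :conns. A sketch, where :transcript-logger is a hypothetical processor you would define yourself:

(def local-ai-with-logging
  (make-local-flow
    ;; transcript-logger is a hypothetical processor; see "Adding Custom
    ;; Processes" below for how to define one with flow/process
    {:extra-procs {:transcript-logger {:proc transcript-logger}}
     :extra-conns [[[:transcriptor :out] [:transcript-logger :in]]]}))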

Supported Providers

Text-to-Speech (TTS)

  • ElevenLabs
    • Models: eleven_multilingual_v2, eleven_turbo_v2, eleven_flash_v2 and more.
    • Features: Real-time streaming, multiple voices, multilingual support

Speech-to-Text (STT)

  • Deepgram
    • Models: nova-2, nova-2-general, nova-2-meeting and more.
    • Features: Real-time transcription, punctuation, smart formatting

Text Based Large Language Models (LLM)

  • OpenAI
    • Models: gpt-4o-mini (fastest, cheapest), gpt-4, gpt-3.5-turbo, and more
    • Features: Function calling, streaming responses

Key Concepts

Flows

The core building block of voice-fn pipelines (a minimal sketch follows this list):

  • Composed of processes connected by channels
  • Processes can be:
    • Input/output handlers
    • AI service integrations
    • Data transformers
  • Managed by core.async.flow for lifecycle control
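For illustration, here is a toy two-process flow built directly with core.async.flow. The processors are stand-ins, not part of voice-fn, and flow/inject is used to feed messages in by hand:

(require '[clojure.core.async.flow :as flow]
         '[clojure.string :as str])

(def upcaser
  (flow/process
    {:describe (fn [] {:ins {:in "strings to transform"}
                       :outs {:out "upper-cased strings"}})
     :init identity
     :transform (fn [state _in msg]
                  [state {:out [(str/upper-case msg)]}])}))

(def printer
  (flow/process
    {:describe (fn [] {:ins {:in "strings to print"}})
     :init identity
     :transform (fn [state _in msg]
                  (println msg)
                  [state nil])}))

(def toy-flow
  (flow/create-flow
    {:procs {:upcaser {:proc upcaser}
             :printer {:proc printer}}
     :conns [[[:upcaser :out] [:printer :in]]]}))

(comment
  (flow/start toy-flow)  ;; flows start paused
  (flow/resume toy-flow)
  (flow/inject toy-flow [:upcaser :in] ["hello"])
  (flow/stop toy-flow))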

Transport

The modality through which audio enters and leaves the voice AI pipeline. Example transport modalities:

  • local (microphone + speakers)
  • telephony (Twilio through WebSocket)
  • WebRTC (browser support) - TODO
  • async (through in & out core.async channels)

You will see processors named :transport-in and :transport-out.

Frames

The basic unit of data flow, representing typed messages like:

  • :audio/input-raw - Raw audio data
  • :transcription/result - Transcribed text
  • :llm/text-chunk - LLM response chunks
  • :system/start, :system/stop - Control signals

Each frame has a type and optionally a schema for the data contained in it.

See frame.clj for all possible frames.
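For intuition only, a frame can be pictured as a map pairing a type keyword with a payload. The exact representation shown here is an assumption, not the library's actual structure; frame.clj holds the real constructors and schemas:

;; Hypothetical frame shapes -- consult frame.clj for the actual definitions
{:frame/type :audio/input-raw
 :frame/data (byte-array 320)}

{:frame/type :transcription/result
 :frame/data "Hello there!"}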

Processes

Components that transform frames:

  • Define input/output requirements
  • Can maintain state
  • Use core.async for async processing
  • Implement the flow/process protocol

Adding Custom Processes

    (defn custom-processor []
      (flow/process
        {:describe (fn [] {:ins {:in "Input channel"}
                           :outs {:out "Output channel"}})
         ;; :init receives the proc's args map and returns the initial state
         :init identity
         ;; :transform receives [state input-port msg] and returns
         ;; [new-state {output-port [frames]}]
         :transform (fn [state _in msg]
                      ;; process-message is a placeholder for your frame logic
                      [state {:out [(process-message msg)]}])}))

Read the core.async.flow docs for more information about flow processes.

Acknowledgements

Voice-fn takes heavy inspiration from pipecat. Differences:

  • voice-fn uses a graph instead of a bidirectional queue for frame transport
  • voice-fn has a data-centric implementation: the processors in voice-fn are pure functions in the core.async.flow transform syntax

License

MIT
