A real-time audio transcription and diarization system that can capture both system audio and microphone input simultaneously.
## Features

- Real-time audio transcription using OpenAI's Whisper large-v3 model
- Speaker diarization using pyannote.audio 3.1
- Automatic summarization using Llama 3.2 (via Ollama)
- Supports multiple audio sources simultaneously (e.g., system audio + microphone)
- Real-time display of transcriptions with speaker identification
- Exports transcripts in markdown format with timestamps
- Progress tracking and status updates during processing
## Requirements

- Python 3.10 (required for compatibility with pyannote.audio)
- NVIDIA GPU with CUDA support (recommended)
- PipeWire audio system (for Linux)
- Node.js and npm (for frontend)
- Hugging Face account and API token (for diarization model)
- Ollama with Llama 3.2 model (for summarization)
## Installation

- Install Ollama and Llama 3.2:

```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull the Llama 3.2 model
ollama pull llama3.2
```
- Create and activate a Python virtual environment:

```bash
python3.10 -m venv venv
source venv/bin/activate
```
- Install PyTorch and torchaudio (CUDA 11.8 builds; a quick verification sketch follows this step):

```bash
pip install --upgrade pip wheel setuptools
pip install torch --index-url https://download.pytorch.org/whl/cu118
pip install torchaudio --index-url https://download.pytorch.org/whl/cu118
```
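Optionally, confirm that the CUDA-enabled build is active before continuing. A minimal check using only standard PyTorch calls:

```python
import torch
import torchaudio

# Confirm the CUDA build of PyTorch is installed and a GPU is visible.
print("torch:", torch.__version__, "| torchaudio:", torchaudio.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```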
- Install the remaining Python dependencies:

```bash
pip install Cython
pip install "numpy>=1.22,<1.24"
pip install -r backend/requirements.txt
```
- Install frontend dependencies:

```bash
cd frontend
npm install
```
## Configuration

- Create a `.env` file in the root directory:

```env
HUGGINGFACE_TOKEN=your_token_here
OLLAMA_HOST=http://localhost:11434  # Ollama API endpoint
```
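For reference, a minimal sketch of how the backend might read these variables, assuming it uses `python-dotenv` (the variable names match the `.env` above; the loading code itself is illustrative):

```python
import os
from dotenv import load_dotenv

# Load HUGGINGFACE_TOKEN and OLLAMA_HOST from the .env file.
load_dotenv()
hf_token = os.environ["HUGGINGFACE_TOKEN"]
ollama_host = os.getenv("OLLAMA_HOST", "http://localhost:11434")
```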
- Hugging Face setup:
  - Go to https://huggingface.co/settings/tokens
  - Create a new token with read access
  - Copy the token and paste it into your `.env` file
- Accept the terms for these models:
  - https://huggingface.co/pyannote/speaker-diarization-3.1
  - https://huggingface.co/pyannote/segmentation-3.0
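Once access is granted, the diarization pipeline can be loaded with your token. A minimal sketch against the pyannote.audio 3.1 API (the token placeholder mirrors the `.env` example above):

```python
from pyannote.audio import Pipeline

# Load the gated diarization pipeline; requires the accepted terms above.
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="your_token_here",  # or read it from your .env file
)
```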
- Verify the Ollama setup:

```bash
# Check Ollama is running
curl http://localhost:11434/api/tags

# Verify Llama 3.2 is available
ollama list
```
- Start Ollama (if not running):

```bash
systemctl start ollama
```
## Usage

- Start the backend:

```bash
cd backend
PYTHONPATH=. uvicorn src.api.main:app --reload
```
- Start the frontend (in a new terminal):

```bash
cd frontend
npm start
```
- Open http://localhost:3000 in your browser
- Select audio sources:
  - Choose system audio source(s) to capture desktop audio
  - Choose microphone input to capture your voice
  - You can select multiple sources to capture simultaneously
- Click "Start Recording" to begin transcription
  - The system records audio from the selected sources
  - Audio is saved as a WAV file in the recordings directory
- Click "Stop Recording" when finished
  - The system processes the recording
  - Status updates show progress
  - The transcription appears with speaker identification
  - A summary is generated using Llama 3.2
  - Files are saved in the transcripts directory
## Output

The system generates several files:

- `recordings/recording_[timestamp].wav` - the recorded audio file
- `transcripts/transcript_[timestamp].md` - the transcript, containing:
  - Speaker identification
  - Timestamps
  - Full transcript
  - Generated summary
## Technical Details

### Audio Capture

- Multiple source recording
- Proper audio mixing (see the sketch below)
- 16-bit WAV format
- Automatic gain control
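Purely as an illustration of the mixing and gain-control idea (this is not the project's capture code; the NumPy arrays stand in for real captured audio):

```python
import numpy as np
import soundfile as sf

# Stand-in data: two mono float32 sources at 16 kHz.
system_audio = 0.1 * np.random.randn(16000).astype(np.float32)
microphone = 0.1 * np.random.randn(16000).astype(np.float32)

# Average the sources, then apply simple gain control to avoid clipping.
mix = (system_audio + microphone) / 2.0
peak = float(np.max(np.abs(mix)))
if peak > 1.0:
    mix /= peak

# Write the result as 16-bit PCM WAV.
sf.write("recordings/mixed.wav", mix, 16000, subtype="PCM_16")
```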
### Transcription

- Uses the Whisper large-v3 model
- Word-level timestamps
- High accuracy across multiple languages
- Optimized for GPU processing
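Since the acknowledgments credit Faster Whisper for optimized inference, a minimal transcription sketch with the `faster-whisper` API might look like this (the file path and settings are illustrative):

```python
from faster_whisper import WhisperModel

# Load Whisper large-v3 on the GPU with half-precision inference.
model = WhisperModel("large-v3", device="cuda", compute_type="float16")

# Transcribe with word-level timestamps enabled.
segments, info = model.transcribe("recordings/example.wav", word_timestamps=True)
for segment in segments:
    for word in segment.words:
        print(f"[{word.start:6.2f}s - {word.end:6.2f}s] {word.word}")
```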
### Speaker Diarization

- Uses pyannote.audio 3.1
- Advanced speaker separation
- Handles multiple speakers
- Optimized clustering parameters
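Continuing the pipeline-loading sketch from the Hugging Face setup above, speaker turns can be read out like this (the file path is illustrative):

```python
# Run diarization on a recording; `pipeline` is the object loaded earlier.
diarization = pipeline("recordings/example.wav")

# Iterate over speaker turns with their labels.
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:6.1f}s - {turn.end:6.1f}s: {speaker}")
```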
### Summarization

- Uses Llama 3.2 via Ollama
- Context-aware summaries
- Maintains speaker attribution
- Handles long conversations
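A minimal sketch of a summarization request against Ollama's REST API (the prompt wording and sample transcript are illustrative, not the project's actual prompt):

```python
import requests

transcript_text = "SPEAKER_00: Hello everyone.\nSPEAKER_01: Hi, let's get started."

# Ask the local Ollama server to summarize the transcript.
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.2",
        "prompt": "Summarize this conversation, keeping speaker attribution:\n\n"
        + transcript_text,
        "stream": False,
    },
    timeout=300,
)
print(response.json()["response"])
```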
### User Interface

- Real-time status updates
- Clear speaker identification
- Timestamp display
- Processing progress indicators
- Device selection interface
## Troubleshooting

- No system audio sources available:
  - Make sure PipeWire is running: `systemctl --user status pipewire`
  - Check available sources: `pw-cli list-objects | grep -A 3 "Monitor"`
- GPU memory errors:
  - Free up GPU memory by closing other applications
  - Monitor GPU usage with `nvidia-smi` (see the Python sketch below)
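For checking memory from inside the Python process itself, a small sketch using standard PyTorch calls:

```python
import torch

# Report free vs. total memory on the current GPU.
free, total = torch.cuda.mem_get_info()
print(f"GPU memory: {free / 1e9:.1f} GB free of {total / 1e9:.1f} GB")

# Release cached allocator blocks back to the driver.
torch.cuda.empty_cache()
```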
- Installation errors:
  - Make sure you're using Python 3.10
  - Install PyTorch before the other dependencies
  - Check CUDA compatibility with `nvidia-smi`
- Audio quality issues:
  - Check input device levels
  - Verify proper device selection
  - Monitor audio peaks during recording
- Summarization issues:
  - Verify Ollama is running: `systemctl status ollama`
  - Check that the Llama 3.2 model is installed: `ollama list`
  - Verify the API endpoint in the `.env` file
## Getting Help

If you encounter issues:

- Check the console output for error messages
- Look for similar issues in the project's issue tracker
- Include relevant error messages and system information when reporting issues
## License

[Insert License Information]
## Acknowledgments

This project uses:
- OpenAI Whisper for transcription
- Pyannote Audio for speaker diarization
- Faster Whisper for optimized inference
- CTC Forced Aligner for timestamp alignment
- Ollama for running Llama 3.2
- Llama 3.2 for summarization