A comprehensive audio transcription system that automatically adapts to hardware capabilities, featuring speaker diarization, real-time monitoring, and advanced analytics.
Smart Whisper Transcriber is an advanced implementation combining OpenAI's Whisper model with speaker diarization capabilities. The system automatically detects your hardware configuration, selects the most appropriate processing approach, and provides detailed real-time analytics of the transcription process.
```mermaid
graph TB
    subgraph Input
        A[Audio Files] --> B[Audio Preprocessor]
        B --> C[Audio Normalizer]
    end
    subgraph Core Processing
        D[Hardware Manager] --> E[Resource Monitor]
        F[Transcription Engine] --> G[Speaker Diarization]
        E --> F
        C --> F
    end
    subgraph Output Generation
        G --> H[SRT Generator]
        G --> I[TXT Generator]
        G --> J[JSON Generator]
    end
    subgraph Monitoring
        E --> K[Performance Analyzer]
        K --> L[Reports Generator]
        K --> M[Real-time Monitor]
    end
```
```mermaid
graph LR
    subgraph Input Data
        A[Raw Audio] --> B[Preprocessed Audio]
        B --> C[Normalized Audio]
    end
    subgraph Processing Data
        C --> D[Audio Segments]
        D --> E[Transcribed Text]
        E --> F[Speaker Labels]
    end
    subgraph Output Data
        F --> G[SRT Files]
        F --> H[TXT Files]
        F --> I[JSON Data]
        F --> J[Analytics Reports]
    end
    subgraph Metadata
        K[Hardware Stats] --> J
        L[Performance Metrics] --> J
        M[System Config] --> J
    end
```
```mermaid
graph TB
    subgraph User Interface
        A[CLI Interface] --> D[Input Handler]
        B[API Endpoint] --> D
        C[Monitoring UI] --> E[Monitor Handler]
    end
    subgraph Core Functions
        D --> F[Audio Processor]
        F --> G[Transcription Manager]
        G --> H[Diarization Engine]
        H --> I[Output Generator]
    end
    subgraph System Services
        E --> J[Resource Monitor]
        J --> K[Performance Analyzer]
        K --> L[Report Generator]
    end
    subgraph Hardware Layer
        M[Hardware Manager] --> N[GPU Manager]
        M --> O[CPU Manager]
        M --> P[Memory Manager]
    end
```
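The Hardware Layer in the diagram above centers on a Hardware Manager that chooses where the model runs before anything is loaded. The project's actual logic lives in `modules/hardwareManager.py` and is not reproduced here; the snippet below is only a minimal sketch of the idea, assuming PyTorch, with illustrative names (`select_device`, `suggest_model_size`) and VRAM thresholds that are assumptions rather than the project's real values.

```python
# Illustrative sketch only -- the real logic lives in modules/hardwareManager.py.
import torch

def select_device() -> torch.device:
    """Pick the best available backend: CUDA GPU, Apple Silicon (MPS), or CPU."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    # MPS covers Apple Silicon GPUs when PyTorch is built with Metal support.
    if getattr(torch.backends, "mps", None) and torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

def suggest_model_size(device: torch.device) -> str:
    """Map available memory to a Whisper model size (thresholds are assumptions)."""
    if device.type == "cuda":
        vram_gb = torch.cuda.get_device_properties(device).total_memory / 1e9
        if vram_gb >= 10:
            return "large"
        if vram_gb >= 5:
            return "medium"
        return "small"
    return "base"  # conservative default for CPU or MPS
```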
Use the following command to view the project structure:
```bash
tree . -I "venv|__pycache__"
```
Expected structure:
```
.
├── Dockerfile
├── Dockerfile.monitoring
├── Readme.MD
├── audio_files/
│   └── .gitkeep
├── data/
│   └── .gitkeep
├── docker-compose.yml
├── main.py
├── models/
│   └── .gitkeep
├── modules/
│   ├── __init__.py
│   ├── apiStructure.py
│   ├── hardwareManager.py
│   ├── monitoring.py
│   ├── systemConfig.py
│   └── transcriptionEngine.py
├── output/
│   ├── reports/
│   └── .gitkeep
└── requirements.txt
```
- 🗣️ Speaker diarization for multi-speaker transcription
- 🧠 Intelligent hardware detection and optimization
- 🚀 Multi-device support (NVIDIA, AMD, Intel, Apple Silicon)
- 📊 Real-time performance monitoring and analytics
- 🌡️ Hardware health monitoring (temperature, power, memory)
- 💪 Automatic failover and fallback mechanisms
- 🔄 Support for multiple Whisper models
- 🌐 RESTful API with webhook support
- 📈 Detailed HTML performance reports
- 👥 Speaker separation in outputs
- 📝 Multiple output formats (SRT, TXT, JSON)
- 📁 Batch processing capabilities
- 📈 Progress tracking and ETA estimation
- 🔄 Automatic audio preprocessing
- 🎯 Optimal model selection based on hardware
- 📊 Real-time resource utilization displays (see the monitoring sketch after this list)
- 📈 Performance analytics and reporting
- 🌡️ Temperature and power monitoring
- 💾 Memory usage tracking
- 📉 Processing speed analysis
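The monitoring features above amount to periodically sampling CPU, memory, and (when present) GPU counters and feeding them into reports. As a rough illustration only, not the project's actual `modules/monitoring.py`, a `psutil`-based sampler might look like this:

```python
# Hedged illustration of periodic resource sampling; not the actual monitoring.py.
import time
import psutil

try:
    import torch
    HAS_CUDA = torch.cuda.is_available()
except ImportError:
    HAS_CUDA = False

def sample_resources() -> dict:
    """Take one snapshot of CPU, RAM, and (optionally) GPU memory usage."""
    snapshot = {
        "timestamp": time.time(),
        "cpu_percent": psutil.cpu_percent(interval=None),
        "ram_percent": psutil.virtual_memory().percent,
    }
    if HAS_CUDA:
        snapshot["gpu_mem_allocated_mb"] = torch.cuda.memory_allocated() / 1e6
    return snapshot

if __name__ == "__main__":
    # Print a sample every second; a real monitor would feed these into reports.
    while True:
        print(sample_resources())
        time.sleep(1)
```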
- Python 3.10 or higher
- FFmpeg installed on your system
- CUDA-compatible GPU (optional)
- HuggingFace account for speaker diarization
```bash
# Clone the repository
git clone https://github.com/yourusername/smart-whisper-transcriber.git
cd smart-whisper-transcriber

# Create virtual environment
python -m venv venv
source venv/bin/activate   # Linux/Mac
# or
.\venv\Scripts\activate    # Windows

# Install required packages
pip install -r requirements.txt
```
For speaker diarization functionality, you need to authenticate with HuggingFace:
1. Create a HuggingFace account at https://huggingface.co
2. Generate an access token at https://huggingface.co/settings/tokens
3. Set up authentication in one of two ways:

   a. Using environment variables:

   ```bash
   # Create .env file
   echo "HUGGINGFACE_TOKEN=your_token_here" > .env
   ```

   b. Using local token file:

   ```bash
   # Create token file
   mkdir -p ~/.huggingface
   echo "your_token_here" > ~/.huggingface/token
   ```
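Both options end up supplying the same raw token string to the application. Assuming the `.env` file has already been loaded into the environment (for example via python-dotenv), a minimal lookup sketch, using the hypothetical helper name `resolve_hf_token`, could look like this; the project's actual code may differ:

```python
# Illustrative token lookup; the project's real implementation may differ.
import os
from pathlib import Path

def resolve_hf_token() -> str | None:
    """Return the HuggingFace token from the environment or ~/.huggingface/token."""
    token = os.environ.get("HUGGINGFACE_TOKEN")
    if token:
        return token.strip()
    token_file = Path.home() / ".huggingface" / "token"
    if token_file.exists():
        return token_file.read_text().strip()
    return None  # triggers the "No Hugging Face token found" error path
```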
```bash
# Basic usage
python main.py audio_files/your_audio.wav

# With specific model selection
python main.py --model medium audio_files/your_audio.mp3

# Start API server
python main.py --server
```
The system exposes a RESTful API on port 8000:
```bash
# Start transcription job
curl -X POST http://localhost:8000/transcribe \
  -H "Content-Type: application/json" \
  -d '{
    "input_url": "https://example.com/audio.mp3",
    "config": {
      "model_size": "medium",
      "language": "en"
    }
  }'
```
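The same request can be sent from Python with the `requests` library; the snippet simply mirrors the curl call above, and the printed response shape depends on the API:

```python
# Mirrors the curl example above using the requests library.
import requests

payload = {
    "input_url": "https://example.com/audio.mp3",
    "config": {"model_size": "medium", "language": "en"},
}

response = requests.post("http://localhost:8000/transcribe", json=payload, timeout=30)
response.raise_for_status()
print(response.json())  # job details or transcription result, depending on the API
```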
The system generates multiple output files:
```
output/
├── [filename]_[timestamp]/
│   ├── filename.json                 # Complete transcription data
│   ├── filename.txt                  # Combined transcript
│   ├── filename_male_speaker.srt     # Male speaker subtitles
│   ├── filename_male_speaker.txt     # Male speaker transcript
│   ├── filename_female_speaker.srt   # Female speaker subtitles
│   └── filename_female_speaker.txt   # Female speaker transcript
└── reports/
    ├── processing_report.json
    ├── report.html
    ├── resource_plot.png
    └── speed_plot.png
```
- Speaker Diarization Authentication:
  - Error: "No Hugging Face token found in environment"
  - Solution: Ensure that either the `.env` file or `~/.huggingface/token` is properly set up
- CUDA Issues:
  - Error: "Insufficient VRAM"
  - Solution: Try using a smaller model size or CPU fallback
- FFmpeg Missing:
  - Error: "FFmpeg not found"
  - Solution: Install FFmpeg using your system's package manager
Contributions are welcome! Please feel free to submit a Pull Request. When contributing, please:
- Fork the repository
- Create a new branch for your feature
- Add appropriate tests
- Update documentation as needed
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
If you encounter any issues or have questions:
- Check the Issues page
- Review the documentation
- Create a new issue if needed
- OpenAI for the Whisper model
- HuggingFace for speaker diarization models
- PyTorch team for the deep learning framework
- FFmpeg project for audio processing capabilities
Made with ❤️ by Yassine Boumiza