A FastAPI-based REST API wrapper for the Fish-Speech voice cloning model. This API allows you to clone voices and generate new speech with custom text using the Fish-Speech model.
This project only wraps the Fish-Speech model. All credit for the underlying voice cloning technology goes to:
- Original Repository: Fish-Speech
- License: Original License
- Authors: Fish-Speech Team
- Upload reference audio file
- Provide reference and target text
- Generate cloned voice speaking the target text
- CPU/CUDA support
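As an illustration of the CPU/CUDA support, the snippet below sketches one common way to choose a device with PyTorch. The helper name `pick_device` is hypothetical, and how this project actually selects its device is not specified here, so treat this as an assumption rather than the app's real logic.

```python
def pick_device():
    """Return "cuda" when a CUDA-capable GPU is usable, otherwise "cpu".

    Hypothetical helper: the real service may choose its device differently.
    """
    try:
        import torch  # imported lazily so the sketch also runs without PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


print(f"Inference would run on: {pick_device()}")
```

Falling back to the CPU keeps the API usable on machines without a GPU, although inference will likely be slower there.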
- Python 3.8+
- FastAPI
- PyTorch
- Fish-Speech model checkpoints
- Clone the repository:
git clone [your-repo-url]
cd voice-cloning-app
- Create and activate virtual environment:
python -m venv venv
# On Windows
.\venv\Scripts\activate
# On Linux/Mac
source venv/bin/activate
- Install dependencies:
pip install -r requirements.txt
- Download the Fish-Speech model:
huggingface-cli download fishaudio/fish-speech-1.5 --local-dir checkpoints/fish-speech-1.5
- Start the FastAPI server:
uvicorn app.main:app --reload
- Access the API documentation at http://localhost:8000/docs
- Upload Reference Audio and Generate New Speech: open the Swagger UI at http://localhost:8000/docs and provide:
- Audio File: Your reference voice recording
- Reference Text: The exact words spoken in your reference audio
- Target Text: The new text you want to generate
- Example input:
- Audio File: your_voice.wav
- Reference Text: "Welcome to the podcast. Let’s dive into today’s topic." (reference text must match exactly what's said in the input audio)
- Target Text: "Today's episode will focus on AI and its impact on society."
- A successful response looks like this:
{
  "status": "success",
  "message": "Voice cloning successful",
  "output_path": "fake.wav"
}
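The same request can also be sent from a script instead of the Swagger UI. The sketch below uses only the Python standard library to build a multipart upload and read the JSON response. The endpoint path `/clone-voice` and the form-field names `audio_file`, `reference_text`, and `target_text` are assumptions made for illustration; check http://localhost:8000/docs for the names the API actually uses.

```python
import io
import json
import os
import urllib.request
import uuid

# Assumed endpoint path; confirm the real one in the Swagger UI.
API_URL = "http://localhost:8000/clone-voice"


def build_multipart(fields, file_field, filename, file_bytes, content_type="audio/wav"):
    """Encode text fields plus one file as a multipart/form-data body.

    Returns (body_bytes, content_type_header).
    """
    boundary = uuid.uuid4().hex
    buf = io.BytesIO()
    for name, value in fields.items():
        buf.write(f"--{boundary}\r\n".encode())
        buf.write(f'Content-Disposition: form-data; name="{name}"\r\n\r\n'.encode())
        buf.write(value.encode("utf-8"))
        buf.write(b"\r\n")
    buf.write(f"--{boundary}\r\n".encode())
    buf.write(
        f'Content-Disposition: form-data; name="{file_field}"; '
        f'filename="{filename}"\r\n'.encode()
    )
    buf.write(f"Content-Type: {content_type}\r\n\r\n".encode())
    buf.write(file_bytes)
    buf.write(f"\r\n--{boundary}--\r\n".encode())
    return buf.getvalue(), f"multipart/form-data; boundary={boundary}"


def clone_voice(audio_path, reference_text, target_text, url=API_URL):
    """POST the reference audio and both texts, then return the parsed JSON."""
    with open(audio_path, "rb") as fh:
        audio = fh.read()
    body, header = build_multipart(
        # Field names are assumptions, not confirmed by the API.
        {"reference_text": reference_text, "target_text": target_text},
        "audio_file",
        os.path.basename(audio_path),
        audio,
    )
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": header}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With the server running, calling `clone_voice("your_voice.wav", reference_text, target_text)` should return a dictionary shaped like the response above, so a caller can check `result["status"] == "success"` before reading `result["output_path"]`.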
voice-cloning-app/
├── app/
│   ├── main.py               # FastAPI application
│   └── services/
│       ├── voice_clone.py    # Voice cloning service
│       └── audio_service.py  # Audio handling service
├── tools/                    # Fish-Speech inference tools
├── checkpoints/              # Model checkpoints
└── uploads/                  # Temporary upload directory
- The reference text must match exactly what is being said in the input audio file
- Input audio should be clear and around 10 seconds long for best results
- The API currently saves the output as 'fake.wav' in the project directory
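To check the length recommendation before uploading, a short script can measure a WAV file with Python's built-in `wave` module. This is a minimal sketch: the helper names and the 5-second tolerance around the 10-second target are assumptions chosen for illustration, and the `wave` module reads uncompressed PCM WAV files only.

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_reference_clip(path, target=10.0, tolerance=5.0):
    """Return (ok, duration), where ok means the clip is near the target length.

    The tolerance is an assumed value, not a limit enforced by the API.
    """
    duration = wav_duration_seconds(path)
    return abs(duration - target) <= tolerance, duration
```

For example, `check_reference_clip("your_voice.wav")` returns a flag and the measured duration, which makes it easy to warn about clips that are much shorter or longer than recommended.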