OpenAI-like voice generation server based on fish-speech-1.5.
Supports text-to-speech and voice style transfer via reference audio samples.
- NVIDIA GPU
- For the Docker setup
  - NVIDIA Docker Runtime
  - Docker
  - Docker Compose
- For the manual setup
  - Python 3.12
  - Python venv
Clone the repo first:
git clone --recurse-submodules git@github.com:EvilFreelancer/fish-speech-api.git
cd fish-speech-api
For the Docker setup, copy the compose file and start the services:
cp docker-compose.dist.yml docker-compose.yml
docker compose up -d
Enter the container:
docker compose exec api bash
Download the model:
huggingface-cli download fishaudio/fish-speech-1.5 --local-dir models/fish-speech-1.5/
For the manual setup, install the system dependencies first:
apt install cmake portaudio19-dev
Set up a virtual environment and install dependencies:
python3.12 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
Download the model:
huggingface-cli download fishaudio/fish-speech-1.5 --local-dir models/fish-speech-1.5/
Run the API server:
python main.py
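Before sending requests, it can help to confirm that the server is accepting connections. The following is a minimal sketch, not part of the project: the helper name `is_port_open` is hypothetical, and port 8000 is taken from the curl examples in this README.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError on refusal or timeout.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example call (port 8000 is an assumption based on the curl examples):
# is_port_open("localhost", 8000)
```

An open port only shows that the process is listening; it does not guarantee that the model has finished loading.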
Generate speech from text:
curl http://localhost:8000/audio/speech \
-X POST \
-F model="fish-speech-1.5" \
-F input="Hello, this is a test of Fish Speech API" \
--output "speech.wav"
In JSON format:
curl http://localhost:8000/audio/speech \
-H "Content-Type: application/json" \
-d '{
"model": "fish-speech-1.5",
"input": "Hello, this is a test of Fish Speech API"
}' \
--output "speech.wav"
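The same JSON request can be sent from Python using only the standard library. This is a rough sketch under the assumption that the server returns the raw audio bytes in the response body, as the `--output "speech.wav"` flag suggests; the function names `build_speech_request` and `synthesize` are hypothetical.

```python
import json
import urllib.request

# Endpoint taken from the curl examples above; adjust for your deployment.
API_URL = "http://localhost:8000/audio/speech"


def build_speech_request(text: str, model: str = "fish-speech-1.5") -> urllib.request.Request:
    """Build a JSON POST request equivalent to the curl call above."""
    body = json.dumps({"model": model, "input": text}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text: str, out_path: str = "speech.wav") -> None:
    """Send the request and write the returned bytes to out_path.

    Assumption: the response body is the audio file itself.
    """
    with urllib.request.urlopen(build_speech_request(text)) as response:
        audio = response.read()
    with open(out_path, "wb") as f:
        f.write(audio)
```

Calling `synthesize("Hello, this is a test of Fish Speech API")` with the server running should produce `speech.wav` in the current directory, if the assumption about the response body holds.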
Generate speech with a named voice:
curl http://localhost:8000/audio/speech \
-X POST \
-F model="fish-speech-1.5" \
-F voice="english-nice" \
-F input="Dr. Eleanor Whitaker, a quantum physicist from Edinburgh, surreptitiously analyzed the enigmatic hieroglyphs while humming Für Elise —her quizzical expression mirrored the cryptic symbols perplexing arrangement, yet she remained determined to decipher their archaic secrets." \
--output "speech.wav"
In JSON format:
curl http://localhost:8000/audio/speech \
-H "Content-Type: application/json" \
-d '{
"model": "fish-speech-1.5",
"voice": "english-nice",
"input": "Dr. Eleanor Whitaker, a quantum physicist from Edinburgh, surreptitiously analyzed the enigmatic hieroglyphs while humming Für Elise —her quizzical expression mirrored the cryptic symbols perplexing arrangement, yet she remained determined to decipher their archaic secrets."
}' \
--output "speech.wav"
Generate speech in the style of a reference audio sample:
curl http://localhost:8000/audio/speech \
-X POST \
-H 'Content-Type: multipart/form-data' \
-F model="fish-speech-1.5" \
-F input="Dr. Eleanor Whitaker, a quantum physicist from Edinburgh, surreptitiously analyzed the enigmatic hieroglyphs while humming Für Elise —her quizzical expression mirrored the cryptic symbols perplexing arrangement, yet she remained determined to decipher their archaic secrets." \
-F reference_audio="@voice-viola.wav" \
--output "speech.wav"
In JSON format:
curl http://localhost:8000/audio/speech \
-H "Content-Type: application/json" \
-d '{
"model": "fish-speech-1.5",
"input": "Dr. Eleanor Whitaker, a quantum physicist from Edinburgh, surreptitiously analyzed the enigmatic hieroglyphs while humming Für Elise —her quizzical expression mirrored the cryptic symbols perplexing arrangement, yet she remained determined to decipher their archaic secrets.",
"reference_audio": "=base64..."
}' \
--output "speech.wav"
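The JSON variant needs the reference audio as a string rather than a file upload. The placeholder above does not spell out the exact encoding, so the sketch below assumes standard base64 with no prefix, which may not match what the server expects. The helper names are hypothetical.

```python
import base64


def encode_reference_audio(audio_bytes: bytes) -> str:
    """Return the audio bytes as a standard base64 ASCII string."""
    return base64.b64encode(audio_bytes).decode("ascii")


def build_clone_payload(text: str, audio_bytes: bytes) -> dict:
    """Build the JSON body for a request with a reference audio sample."""
    return {
        "model": "fish-speech-1.5",
        "input": text,
        # Assumption: the field holds plain base64 with no prefix.
        "reference_audio": encode_reference_audio(audio_bytes),
    }
```

To use it, read the sample with `open("voice-viola.wav", "rb").read()` and pass the bytes to `build_clone_payload`, then serialize the result with `json.dumps`.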
Tune the generation parameters:
curl http://localhost:8000/audio/speech \
-X POST \
-H 'Content-Type: multipart/form-data' \
-F model="fish-speech-1.5" \
-F input="Dr. Eleanor Whitaker, a quantum physicist from Edinburgh, surreptitiously analyzed the enigmatic hieroglyphs while humming Für Elise —her quizzical expression mirrored the cryptic symbols perplexing arrangement, yet she remained determined to decipher their archaic secrets." \
-F top_p="0.1" \
-F repetition_penalty="1.3" \
-F temperature="0.75" \
-F chunk_length="150" \
-F max_new_tokens="768" \
-F seed="42" \
-F reference_audio="@voice-viola.wav" \
--output "speech.wav"
In JSON format:
curl http://localhost:8000/audio/speech \
-H "Content-Type: application/json" \
-d '{
"model": "fish-speech-1.5",
"input": "Dr. Eleanor Whitaker, a quantum physicist from Edinburgh, surreptitiously analyzed the enigmatic hieroglyphs while humming Für Elise —her quizzical expression mirrored the cryptic symbols perplexing arrangement, yet she remained determined to decipher their archaic secrets.",
"top_p": "0.1",
"repetition_penalty": "1.3",
"temperature": "0.75",
"chunk_length": "150",
"max_new_tokens": "768",
"seed": "42",
"reference_audio": "=base64..."
}' \
--output "speech.wav"
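When building such requests in code, it may be convenient to include only the parameters that were explicitly set. The sketch below is one possible way to do that; `build_tuned_payload` is a hypothetical helper, the parameter names come from the example above, and the values are sent as strings to mirror that example. The assumption is that omitted parameters fall back to server defaults, as the simpler requests earlier suggest.

```python
from typing import Any, Optional


def build_tuned_payload(
    text: str,
    *,
    top_p: Optional[float] = None,
    repetition_penalty: Optional[float] = None,
    temperature: Optional[float] = None,
    chunk_length: Optional[int] = None,
    max_new_tokens: Optional[int] = None,
    seed: Optional[int] = None,
) -> dict[str, Any]:
    """Build a request body containing only the parameters that were set."""
    payload: dict[str, Any] = {"model": "fish-speech-1.5", "input": text}
    options = {
        "top_p": top_p,
        "repetition_penalty": repetition_penalty,
        "temperature": temperature,
        "chunk_length": chunk_length,
        "max_new_tokens": max_new_tokens,
        "seed": seed,
    }
    # Values are converted to strings to match the curl example above;
    # unset parameters are left out entirely.
    payload.update({k: str(v) for k, v in options.items() if v is not None})
    return payload
```

For example, `build_tuned_payload("Hello", temperature=0.75, seed=42)` yields a body with only `model`, `input`, `temperature`, and `seed`.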