A high-performance text-to-speech system built around the Kokoro TTS engine. The system keeps TTS models loaded in a persistent daemon, so they are not reloaded between invocations, which makes repeated use much faster. Different voices can speak concurrently, but a new request for a voice that is already speaking interrupts that voice's current playback. For example, Jack and Jill can both talk at the same time, but Jill can only say one thing at a time.
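The per-voice interruption rule can be illustrated with a minimal sketch. This is not the project's actual implementation: `VoiceChannel` and `VoiceRegistry` are hypothetical names, and the sketch assumes each utterance is given a cancel flag that its playback loop checks between audio chunks.

```python
import threading


class VoiceChannel:
    """Tracks the utterance currently playing for one voice (hypothetical)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._cancel = None  # cancel flag of the utterance now playing, if any

    def start(self):
        """Cancel any utterance in progress and return a fresh cancel flag."""
        with self._lock:
            if self._cancel is not None:
                self._cancel.set()  # interrupt the previous utterance
            self._cancel = threading.Event()
            return self._cancel


class VoiceRegistry:
    """One channel per voice, so different voices never interrupt each other."""

    def __init__(self):
        self._lock = threading.Lock()
        self._channels = {}

    def start(self, voice):
        with self._lock:
            channel = self._channels.setdefault(voice, VoiceChannel())
        return channel.start()


registry = VoiceRegistry()
jack = registry.start("jack")
jill_first = registry.start("jill")
jill_second = registry.start("jill")  # interrupts only Jill's first utterance
```

In this sketch a playback loop would stop as soon as its flag reports `is_set()`, which is why Jill's second request silences her first while Jack keeps talking.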
- Persistent daemon maintains models in memory for fast synthesis
- Multiple voices with automatic downloading
- Multi-language support (US English, UK English)
- Two interfaces: CLI tool (`say.py`) and REST API (`api.py`)
- Efficient socket communication between client and server
- File output support for saving speech as WAV files
- Speed control for speech rate adjustment
- Python 3.8+
- Required packages: `kokoro-onnx`, `sounddevice`, `soundfile`, `numpy`, `requests`, `fastapi`, `uvicorn`
- Clone the repository:
git clone https://github.com/iplayfast/kokoro-say.git
cd kokoro-say
- Create a virtual environment:
python -m venv venv
source venv/bin/activate
- Install the dependencies:
pip install -r requirements.txt
The `say.py` script provides a convenient command-line interface for text-to-speech synthesis.
# Basic usage with default voice
python say.py "Hello world"
# Specify a voice by name
python say.py --voice af_bella "Hello world"
# Specify a voice by number
python say.py --voice 1 "Hello world"
# Adjust speech speed
python say.py --speed 1.2 "Hello world"
# Save to WAV file
python say.py --output hello.wav "Hello world"
# Use pipe input
echo "Hello world" | python say.py -
# Interactive mode for multiple inputs
python say.py --interactive
# List available voices and languages
python say.py --list
# Update available voices list
python say.py --update-voices
# Kill running servers
python say.py --kill
# Set log level
python say.py --log-level DEBUG "Hello world"
| Argument | Description |
|---|---|
| `--voice` | Voice to use (name or number) |
| `--lang` | Language code or number |
| `--speed` | Speech speed multiplier (default: 1.0) |
| `--output` | Save audio to the specified WAV file |
| `--interactive` | Read lines from stdin and speak each one |
| `--list` | List available voices and languages |
| `--update-voices` | Force an update of the available voices list |
| `--kill` | Send the kill command to running servers |
| `--log-level` | Set logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL) |
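To call the CLI from another Python program, one option is a thin `subprocess` wrapper. The sketch below uses only the flags documented above; the helper names `build_say_command` and `say` are hypothetical, and it assumes the script is run from the repository root so that `say.py` resolves.

```python
import subprocess
import sys


def build_say_command(text, voice=None, lang=None, speed=None, output=None):
    """Build the argument list for one say.py invocation."""
    cmd = [sys.executable, "say.py"]  # assumes the repo root is the working dir
    if voice is not None:
        cmd += ["--voice", str(voice)]
    if lang is not None:
        cmd += ["--lang", str(lang)]
    if speed is not None:
        cmd += ["--speed", str(speed)]
    if output is not None:
        cmd += ["--output", str(output)]
    cmd.append(text)  # the text to speak is the final positional argument
    return cmd


def say(text, **options):
    """Run say.py and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_say_command(text, **options), check=True)
```

For example, `say("Hello world", voice="af_bella", speed=1.2)` runs the same command as the voice and speed examples above combined. Passing the arguments as a list avoids shell quoting problems with arbitrary text.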
The `api.py` script provides a RESTful API for text-to-speech synthesis, ideal for integrating TTS capabilities into web applications or services.
# Start API server on default port (8000)
python api.py
# Specify host and port
python api.py --host 0.0.0.0 --port 8080
# Enable auto-reload for development
python api.py --reload
| Endpoint | Method | Description |
|---|---|---|
| `/` | GET | Get system information |
| `/health` | GET | Health check |
| `/voices` | GET | Get list of available voices |
| `/languages` | GET | Get list of available languages |
| `/synthesize` | POST | Synthesize speech (audio plays on server) |
| `/synthesize-file` | POST | Synthesize speech and return WAV file |
curl http://localhost:8000/
curl http://localhost:8000/voices
curl http://localhost:8000/languages
curl -X POST http://localhost:8000/synthesize \
-H "Content-Type: application/json" \
-d '{"text": "Hello world", "voice": "af_bella", "language": "en-us", "speed": 1.0}'
curl -X POST http://localhost:8000/synthesize-file \
-H "Content-Type: application/json" \
-d '{"text": "Hello world", "voice": "af_bella", "language": "en-us", "speed": 1.0}' \
--output speech.wav
For the `/synthesize` and `/synthesize-file` endpoints:
| Parameter | Type | Description | Default |
|---|---|---|---|
| `text` | string | Text to synthesize | (required) |
| `voice` | string | Voice to use | `"af_bella"` |
| `language` | string | Language code | `"en-us"` |
| `speed` | float | Speech speed multiplier | `1.0` |
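The same requests can be made from Python. The following is a minimal sketch using only the standard library (the `requests` package listed in the requirements would work equally well); the function names are hypothetical, and it assumes the API server is running on the default port and that `/synthesize-file` returns the WAV bytes as the response body.

```python
import json
import urllib.request

API_BASE = "http://localhost:8000"  # default host and port from this README


def build_synthesis_request(text, voice="af_bella", language="en-us",
                            speed=1.0, endpoint="/synthesize-file"):
    """Return a POST request carrying the documented JSON parameters."""
    if not text:
        raise ValueError("text is required")
    payload = {"text": text, "voice": voice, "language": language, "speed": speed}
    return urllib.request.Request(
        API_BASE + endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def save_speech(text, path, **params):
    """POST to /synthesize-file and write the returned bytes to path."""
    request = build_synthesis_request(text, **params)
    with urllib.request.urlopen(request) as response:
        audio = response.read()
    with open(path, "wb") as handle:
        handle.write(audio)
```

Calling `save_speech("Hello world", "speech.wav")` mirrors the second curl example above. Passing `endpoint="/synthesize"` to `build_synthesis_request` targets the endpoint that plays audio on the server instead.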
The system consists of several components:
- Model Server: Central server that manages the TTS model
- Voice Servers: Per-voice servers that handle synthesis for specific voices
- CLI Client: Command-line interface (`say.py`)
- API Server: RESTful API interface (`api.py`)
- Shared TTS Client Library: Common functionality (`tts_client.py`)
Communication between components happens via Unix domain sockets and TCP sockets for efficient IPC.
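The README does not specify the wire format used on these sockets. As an illustration of how a client and server might exchange requests over a stream socket, here is a sketch that frames each JSON message with a 4-byte length prefix. The framing, the field names, and the helper functions are all assumptions for the example, not the project's protocol.

```python
import json
import socket
import struct


def send_message(sock, message):
    """Send one JSON message with a 4-byte big-endian length prefix."""
    body = json.dumps(message).encode("utf-8")
    sock.sendall(struct.pack(">I", len(body)) + body)


def _recv_exactly(sock, count):
    """Read exactly count bytes, or raise if the peer closes early."""
    chunks = []
    while count > 0:
        chunk = sock.recv(count)
        if not chunk:
            raise ConnectionError("socket closed mid-message")
        chunks.append(chunk)
        count -= len(chunk)
    return b"".join(chunks)


def recv_message(sock):
    """Receive one length-prefixed JSON message."""
    (length,) = struct.unpack(">I", _recv_exactly(sock, 4))
    return json.loads(_recv_exactly(sock, length).decode("utf-8"))


# socketpair() returns two connected sockets, standing in for client and server.
client, server = socket.socketpair()
send_message(client, {"text": "Hello world", "voice": "af_bella", "speed": 1.0})
request = recv_message(server)
client.close()
server.close()
```

A length prefix lets the receiver know where one request ends and the next begins, which matters because stream sockets do not preserve message boundaries.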
The system automatically downloads voices as needed. Available voices include:
- af_heart, af_bella, af_nicole, af_sarah, af_sky (US English, female)
- am_adam, am_michael (US English, male)
- bf_emma, bf_isabella (UK English, female)
- bm_george, bm_lewis (UK English, male)
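The voice IDs above follow a two-letter prefix pattern: the first letter marks the accent (`a` for US English, `b` for UK English) and the second marks the gender. A small helper can decode an ID into its parts. The function name is hypothetical, and `"en-gb"` is an assumed language code, since this README only shows `"en-us"`.

```python
LANGUAGES = {"a": "en-us", "b": "en-gb"}  # "en-gb" is an assumed code
GENDERS = {"f": "female", "m": "male"}


def describe_voice(voice_id):
    """Split an ID such as 'bm_george' into language, gender, and name."""
    prefix, _, name = voice_id.partition("_")
    if len(prefix) != 2 or not name:
        raise ValueError(f"unexpected voice ID: {voice_id!r}")
    try:
        language = LANGUAGES[prefix[0]]
        gender = GENDERS[prefix[1]]
    except KeyError:
        raise ValueError(f"unknown voice prefix: {prefix!r}") from None
    return {"language": language, "gender": gender, "name": name}
```

For instance, `describe_voice("af_bella")` yields the language `"en-us"`, the gender `"female"`, and the name `"bella"`.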
Configuration constants are defined in `src/constants.py` and can be modified if needed:
- Server host and port
- Socket paths
- Cache directories
- Audio processing parameters
- Logging options
- Server won't start: Check the log file at `/tmp/tts_daemon.log`
- No audio: Ensure your system audio is working properly
- Voice not found: Run `python say.py --update-voices`
- Hanging processes: Run `python say.py --kill` to terminate all servers
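When checking the daemon log, the most recent entries are usually the relevant ones. The helper below is a small convenience sketch, not part of the project; `tail_log` is a hypothetical name, and the default path is the log location given above.

```python
from collections import deque
from pathlib import Path


def tail_log(path="/tmp/tts_daemon.log", lines=20):
    """Return the last few lines of the log, or [] if the file is missing."""
    log = Path(path)
    if not log.exists():
        return []
    with log.open(errors="replace") as handle:
        # A bounded deque keeps only the final `lines` entries in memory.
        return [line.rstrip("\n") for line in deque(handle, maxlen=lines)]
```

Running `print("\n".join(tail_log()))` after a failed start shows the latest messages without opening the whole file.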
- Based on the Kokoro TTS engine
- Voices from Kokoro-82M