A local implementation of the Kokoro Text-to-Speech model, featuring dynamic module loading, automatic dependency management, and a web interface.
- Local text-to-speech synthesis using the Kokoro-82M model
- Multiple voice support with easy voice selection (54 voices available across 8 languages)
- Automatic model and voice downloading from Hugging Face
- Phoneme output support and visualization
- Interactive CLI and web interface
- Voice listing functionality
- Cross-platform support (Windows, Linux, macOS)
- Real-time generation progress display
- Multiple output formats (WAV, MP3, AAC)
- Python 3.8 or higher
- FFmpeg (optional, for MP3/AAC conversion)
- CUDA-compatible GPU (optional, for faster generation)
- Git (for version control and package management)
- Clone the repository and create a Python virtual environment:
# Windows
python -m venv venv
.\venv\Scripts\activate
# Linux/macOS
python3 -m venv venv
source venv/bin/activate
- Install dependencies:
pip install -r requirements.txt
Alternative Installation (Simplified): For a simpler setup, you can also install the official Kokoro package directly:
pip install kokoro>=0.9.2 soundfile
apt-get install espeak-ng # On Linux
# or brew install espeak # On macOS
- (Optional) For GPU acceleration, install PyTorch with CUDA support:
# For CUDA 11.8
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# For CUDA 12.1
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
# For CUDA 12.6
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
# For CUDA 12.8 (for RTX 50-series cards)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
You can verify CUDA support is enabled with:
import torch
print(torch.cuda.is_available()) # Should print True if CUDA is available
The system will automatically download required models and voice files on first run.
You can use either the command-line interface or the web interface:
Run the interactive CLI:
python tts_demo.py
The CLI provides an interactive menu with the following options:
- List available voices - Shows all available voice options
- Generate speech - Interactive process to:
- Select a voice from the numbered list
- Enter text to convert to speech
- Adjust speech speed (0.5-2.0)
- Exit - Quit the program
Example session:
=== Kokoro TTS Menu ===
1. List available voices
2. Generate speech
3. Exit
Select an option (1-3): 2
Available voices:
1. af_alloy
2. af_aoede
3. af_bella
...
Select a voice number (or press Enter for default 'af_bella'): 3
Enter the text you want to convert to speech
(or press Enter for default text)
> Hello, world!
Enter speech speed (0.5-2.0, default 1.0): 1.2
Generating speech for: 'Hello, world!'
Using voice: af_bella
Speed: 1.2x
...
For a more user-friendly experience, launch the web interface:
python gradio_interface.py
Then open your browser to the URL shown in the console (typically http://localhost:7860).
The web interface provides:
- Easy voice selection from a dropdown menu
- Text input field with examples
- Real-time generation progress
- Audio playback in the browser
- Multiple output format options (WAV, MP3, AAC)
- Download options for generated audio
The system includes 54 different voices across 8 languages:
Language code: 'a'
Female voices (af_*):
- af_heart: โค๏ธ Premium quality voice (Grade A)
- af_alloy: Clear and professional (Grade C)
- af_aoede: Smooth and melodic (Grade C+)
- af_bella: ๐ฅ Warm and friendly (Grade A-)
- af_jessica: Natural and engaging (Grade D)
- af_kore: Bright and energetic (Grade C+)
- af_nicole: ๐ง Professional and articulate (Grade B-)
- af_nova: Modern and dynamic (Grade C)
- af_river: Soft and flowing (Grade D)
- af_sarah: Casual and approachable (Grade C+)
- af_sky: Light and airy (Grade C-)
Male voices (am_*):
- am_adam: Strong and confident (Grade F+)
- am_echo: Resonant and clear (Grade D)
- am_eric: Professional and authoritative (Grade D)
- am_fenrir: Deep and powerful (Grade C+)
- am_liam: Friendly and conversational (Grade D)
- am_michael: Warm and trustworthy (Grade C+)
- am_onyx: Rich and sophisticated (Grade D)
- am_puck: Playful and energetic (Grade C+)
- am_santa: Holiday-themed voice (Grade D-)
Language code: 'b'
Female voices (bf_*):
- bf_alice: Refined and elegant (Grade D)
- bf_emma: Warm and professional (Grade B-)
- bf_isabella: Sophisticated and clear (Grade C)
- bf_lily: Sweet and gentle (Grade D)
Male voices (bm_*):
- bm_daniel: Polished and professional (Grade D)
- bm_fable: Storytelling and engaging (Grade C)
- bm_george: Classic British accent (Grade C)
- bm_lewis: Modern British accent (Grade D+)
Language code: 'j'
Female voices (jf_*):
- jf_alpha: Standard Japanese female (Grade C+)
- jf_gongitsune: Based on classic tale (Grade C)
- jf_nezumi: Mouse bride tale voice (Grade C-)
- jf_tebukuro: Glove story voice (Grade C)
Male voices (jm_*):
- jm_kumo: Spider thread tale voice (Grade C-)
Language code: 'z'
Female voices (zf_*):
- zf_xiaobei: Chinese female voice (Grade D)
- zf_xiaoni: Chinese female voice (Grade D)
- zf_xiaoxiao: Chinese female voice (Grade D)
- zf_xiaoyi: Chinese female voice (Grade D)
Male voices (zm_*):
- zm_yunjian: Chinese male voice (Grade D)
- zm_yunxi: Chinese male voice (Grade D)
- zm_yunxia: Chinese male voice (Grade D)
- zm_yunyang: Chinese male voice (Grade D)
Language code: 'e'
Female voices (ef_*):
- ef_dora: Spanish female voice
Male voices (em_*):
- em_alex: Spanish male voice
- em_santa: Spanish holiday voice
Language code: 'f'
Female voices (ff_*):
- ff_siwis: French female voice (Grade B-)
Language code: 'h'
Female voices (hf_*):
- hf_alpha: Hindi female voice (Grade C)
- hf_beta: Hindi female voice (Grade C)
Male voices (hm_*):
- hm_omega: Hindi male voice (Grade C)
- hm_psi: Hindi male voice (Grade C)
Language code: 'i'
Female voices (if_*):
- if_sara: Italian female voice (Grade C)
Male voices (im_*):
- im_nicola: Italian male voice (Grade C)
Language code: 'p'
Female voices (pf_*):
- pf_dora: Portuguese female voice
Male voices (pm_*):
- pm_alex: Portuguese male voice
- pm_santa: Portuguese holiday voice
Note: Quality grades (A to F) indicate the overall quality based on training data quality and duration. Higher grades generally produce better speech quality.
.
โโโ .cache/ # Cache directory for downloaded models
โ โโโ huggingface/ # Hugging Face model cache
โโโ .git/ # Git repository data
โโโ .gitignore # Git ignore rules
โโโ __pycache__/ # Python cache files
โโโ voices/ # Voice model files (downloaded on demand)
โ โโโ *.pt # Individual voice files
โโโ venv/ # Python virtual environment
โโโ outputs/ # Generated audio files directory
โโโ LICENSE # Apache 2.0 License file
โโโ README.md # Project documentation
โโโ models.py # Core TTS model implementation
โโโ gradio_interface.py # Web interface implementation
โโโ config.json # Model configuration file
โโโ requirements.txt # Python dependencies
โโโ tts_demo.py # CLI implementation
The project uses the latest Kokoro model from Hugging Face:
- Repository: hexgrad/Kokoro-82M
- Model file:
kokoro-v1_0.pth
(downloaded automatically) - Sample rate: 24kHz
- Voice files: Located in the
voices/
directory (downloaded automatically) - Available voices: 54 voices across 8 languages
- Languages: American English ('a'), British English ('b'), Japanese ('j'), Mandarin Chinese ('z'), Spanish ('e'), French ('f'), Hindi ('h'), Italian ('i'), Brazilian Portuguese ('p')
- Model size: 82M parameters
Common issues and solutions:
-
Model Download Issues
- Ensure stable internet connection
- Check Hugging Face is accessible
- Verify sufficient disk space
- Try clearing the
.cache/huggingface
directory
-
CUDA/GPU Issues
- Verify CUDA installation with
nvidia-smi
- Update GPU drivers
- Install PyTorch with CUDA support using the appropriate command:
# For CUDA 11.8 pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 # For CUDA 12.1 pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121 # For CUDA 12.6 pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126 # For CUDA 12.8 (for RTX 50-series cards) pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
- Verify CUDA is available in PyTorch:
import torch print(torch.cuda.is_available()) # Should print True
- Fall back to CPU if needed
- Verify CUDA installation with
-
Audio Output Issues
- Check system audio settings
- Verify output directory permissions
- Install FFmpeg for MP3/AAC support
- Try different output formats
-
Voice File Issues
- Delete and let system redownload voice files
- Check
voices/
directory permissions - Verify voice file integrity
- Try using a different voice
-
Web Interface Issues
- Check port 7860 availability
- Try different browser
- Clear browser cache
- Check network firewall settings
For any other issues:
- Check the console output for error messages
- Verify all prerequisites are installed
- Ensure virtual environment is activated
- Check system resource usage
- Try reinstalling dependencies
Feel free to contribute by:
- Opening issues for bugs or feature requests
- Submitting pull requests with improvements
- Helping with documentation
- Testing different voices and reporting issues
- Suggesting new features or optimizations
- Testing on different platforms and reporting results
Apache 2.0 - See LICENSE file for details