Real-time conversational AI with voice cloning and emotion detection. Analyses conversation context to deliver dramatically expressive responses using your cloned voice. Built with FastRTC and Chatterbox TTS for natural, emotionally-aware voice interactions.
- 🎭 Voice Cloning: Use any voice from a single reference audio file
- 🎯 Natural Emotion Detection: Analyses conversation context to detect emotions automatically
- 🎪 Dramatic Expression: Dynamic voice synthesis with exaggeration, temperature, and cfg_weight adjustments
- ⚡ Real-time Streaming: Low-latency audio generation and playback
- 💬 Dual Interface: WebSocket text chat and Gradio voice chat
- 🧠 Smart Context: Maintains conversation history with emotional awareness
- 🎵 12 Preset Emotions: Excited, happy, sad, angry, surprised, confused, tired, worried, calm, frustrated, enthusiastic, neutral
- Python 3.10+
- CUDA-compatible GPU (RTX 4090 recommended for real-time performance)
- Ollama with Gemma 3 4B model
- Clone the repository

  ```bash
  git clone https://github.com/dwain-barnes/chatterbox-fastrtc-realtime-emotion.git
  cd chatterbox-fastrtc-realtime-emotion
  ```

- Install PyTorch for your system

  ```bash
  # For CUDA 11.8 (check pytorch.org for your specific setup)
  pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
  ```

- Install requirements

  ```bash
  pip install -r requirements.txt
  ```

- Install Chatterbox TTS (avoiding numpy conflicts)

  ```bash
  # Important: install without dependencies to avoid numpy==1.26.0 conflicts
  pip install --no-deps chatterbox-tts
  ```

- Install and run Ollama with Gemma 3 4B

  ```bash
  # Install Ollama from https://ollama.ai
  ollama pull gemma3:4b
  ollama serve
  ```

- Add your voice reference (optional)

  ```bash
  # Place your reference voice file in the project directory
  cp /path/to/your/voice.wav reference_voice.wav
  ```
```bash
python realtime_emotion.py
```
- Text Chat: http://localhost:8000/
- Voice Chat: http://localhost:8000/gradio
- Record a 10-30 second clear audio sample of the target voice
- Save it as `reference_voice.wav` in the project directory
- Restart the application
- The cloned voice will be used for all emotional responses
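Under the hood, the reference file is passed to the TTS model as an audio prompt. A minimal sketch using Chatterbox TTS's published API (the app wires this up automatically; the text and output filename here are arbitrary):

```python
import torchaudio
from chatterbox.tts import ChatterboxTTS

# Load the TTS model (assumes a CUDA device, per the prerequisites above)
model = ChatterboxTTS.from_pretrained(device="cuda")

# Clone the voice from the single reference file in the project directory
wav = model.generate(
    "Hello in your cloned voice!",
    audio_prompt_path="reference_voice.wav",
)
torchaudio.save("sample.wav", wav, model.sr)
```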
Each emotion uses carefully tuned parameters for dramatic expression:
- Exaggeration: 0.05 (tired) to 0.95 (excited)
- CFG Weight: 0.2 (angry) to 0.95 (tired)
- Temperature: 0.3 (tired) to 1.3 (excited)
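For instance, the two extremes above correspond to entries like these (an illustrative slice; the full table covering all 12 emotions lives in the code):

```python
EMOTION_PARAMETERS = {
    "excited": {"exaggeration": 0.95, "cfg_weight": 0.2,  "temperature": 1.3},
    "tired":   {"exaggeration": 0.05, "cfg_weight": 0.95, "temperature": 0.3},
}
```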
- Recommended: RTX 4090 GPU for real-time generation
- Minimum: RTX 3070 or equivalent
- Model: Gemma 3 4B for optimal speed/quality balance
- RAM: 16GB+ recommended
- Frontend: FastAPI + WebSocket + HTML/CSS/JS
- Voice Interface: Gradio + FastRTC
- TTS: Chatterbox TTS with voice cloning
- STT: FastRTC STT model
- LLM: Ollama (Gemma 3 4B)
- Emotion Detection: Context-based pattern matching
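On the voice side, FastRTC's documented `Stream`/`ReplyOnPause` pattern is the standard way to wire up such a handler. A hedged sketch (the handler body is a placeholder, not the project's actual pipeline):

```python
import numpy as np
from fastrtc import ReplyOnPause, Stream

def voice_handler(audio: tuple[int, np.ndarray]):
    # The real app runs STT -> Gemma 3 -> emotion detection -> Chatterbox TTS;
    # this placeholder just echoes the caller's audio to show the contract.
    yield audio

stream = Stream(ReplyOnPause(voice_handler), modality="audio", mode="send-receive")
stream.ui.launch()  # serves the Gradio voice interface
```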
- Input Processing: Text or voice input is received
- LLM Response: Gemma 3 generates contextual response
- Emotion Detection: Analyses response text for emotional patterns
- Voice Synthesis: Applies dramatic parameters based on detected emotion
- Real-time Streaming: Audio chunks streamed as they're generated
- Playback: Client receives and plays audio with minimal latency
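The emotion detection and synthesis hand-off boils down to a dictionary lookup. A minimal, self-contained sketch (function names, the reduced pattern set, and the `neutral` values are illustrative, not the project's actual code):

```python
EMOTION_PARAMETERS = {
    "excited": {"exaggeration": 0.95, "cfg_weight": 0.2, "temperature": 1.3},
    "neutral": {"exaggeration": 0.5, "cfg_weight": 0.5, "temperature": 0.8},
}

def detect_emotion(text: str) -> str:
    # Context-based pattern matching, reduced here to simple keyword spotting
    if any(word in text.lower() for word in ("amazing", "fantastic", "wow")):
        return "excited"
    return "neutral"

def synthesize(response_text: str) -> None:
    # Look up the dramatic parameters for the detected emotion; the real app
    # feeds these to Chatterbox TTS and streams the resulting audio chunks
    params = EMOTION_PARAMETERS[detect_emotion(response_text)]
    print(f"Synthesizing with {params}")

synthesize("Wow, that is fantastic news!")  # selects the 'excited' parameters
```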
Modify `EMOTION_PARAMETERS` in the code to adjust emotional expression:

```python
"excited": {
    "exaggeration": 0.95,  # Higher = more expressive
    "cfg_weight": 0.2,     # Lower = more variation
    "temperature": 1.3     # Higher = more dynamic
}
```
- Change the LLM model in the `init_chat_model` call (see the sketch below)
- Adjust chunk duration for latency vs. quality trade-offs
- Modify sample rates for different audio quality
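For example, assuming the project uses LangChain's standard `init_chat_model` helper (requires the `langchain-ollama` integration; the variable name is illustrative):

```python
from langchain.chat_models import init_chat_model

# Point at any other model you have pulled into Ollama by changing the string
llm = init_chat_model("gemma3:4b", model_provider="ollama")
```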
Key dependencies include:
- `fastapi` - Web framework
- `fastrtc` - Real-time communication
- `chatterbox-tts` - Voice synthesis and cloning
- `langchain` - LLM integration
- `gradio` - Voice interface
- `torch` - Deep learning framework
- `numpy` - Numerical computing
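A quick sanity check that the stack resolved after installation (note that the `chatterbox-tts` package imports as `chatterbox`):

```python
import fastapi, fastrtc, gradio, langchain, numpy, torch
from chatterbox.tts import ChatterboxTTS  # pip package: chatterbox-tts

print("CUDA available:", torch.cuda.is_available())  # True for real-time use
```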
- Fork the repository
- Create a feature branch
- Make your changes
- Test with different emotions and voices
- Submit a pull request
MIT License - see LICENSE file for details.
- Chatterbox TTS Streaming for TTS
- FastRTC for real-time communication
- Ollama for local LLM serving
Experience emotional conversations with your own cloned voice! 🎭🎤