
Chatterbox FastRTC Realtime Emotion (Local)

Real-time conversational AI with voice cloning and emotion detection. Analyses conversation context to deliver dramatically expressive responses using your cloned voice. Built with FastRTC and Chatterbox TTS for natural, emotionally-aware voice interactions.

✨ Features

  • 🎭 Voice Cloning: Use any voice from a single reference audio file
  • 🎯 Natural Emotion Detection: Analyses conversation context to detect emotions automatically
  • 🎪 Dramatic Expression: Dynamic voice synthesis with exaggeration, temperature, and cfg_weight adjustments
  • ⚡ Real-time Streaming: Low-latency audio generation and playback
  • 💬 Dual Interface: WebSocket text chat and Gradio voice chat
  • 🧠 Smart Context: Maintains conversation history with emotional awareness
  • 🎵 12 Preset Emotions: Excited, happy, sad, angry, surprised, confused, tired, worried, calm, frustrated, enthusiastic, neutral

YouTube Demo: Watch the video

🛠️ Installation

Prerequisites

  • Python 3.10+
  • CUDA-compatible GPU (RTX 4090 recommended for real-time performance)
  • Ollama with Gemma 3 4B model

Setup

  1. Clone the repository

    git clone https://github.com/dwain-barnes/chatterbox-fastrtc-realtime-emotion.git
    cd chatterbox-fastrtc-realtime-emotion
  2. Install PyTorch for your system

    # For CUDA 11.8 (check pytorch.org for your specific setup)
    pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
  3. Install requirements

    pip install -r requirements.txt
  4. Install Chatterbox TTS (avoiding numpy conflicts)

    # Important: Install without dependencies to avoid numpy==1.26.0 conflicts
    pip install --no-deps chatterbox-tts
  5. Install and run Ollama with Gemma 3 4B

    # Install Ollama from https://ollama.ai
    ollama pull gemma3:4b
    ollama serve
  6. Add your voice reference (optional)

    # Place your reference voice file in the project directory
    cp /path/to/your/voice.wav reference_voice.wav
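
Before launching the app, it can be worth confirming that the CUDA build of PyTorch is actually active; a quick optional check (not one of the original steps):

# optional sanity check that the GPU build of PyTorch is installed
import torch
print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())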

🎮 Usage

Start the Application

python realtime_emotion.py

Access the Interfaces

Once the server is running, two interfaces are available: the WebSocket text chat served by FastAPI and the Gradio voice chat. The local URLs for both are printed to the console on startup.

Voice Cloning Setup

  1. Record a 10-30 second clear audio sample of the target voice
  2. Save it as reference_voice.wav in the project directory
  3. Restart the application
  4. The cloned voice will be used for all emotional responses
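
Under the hood, Chatterbox TTS takes the reference clip as an audio prompt. A minimal standalone sketch of the cloning call using the public chatterbox-tts API (the repo wires this into its real-time streaming pipeline):

# synthesize one utterance with the cloned voice
import torchaudio
from chatterbox.tts import ChatterboxTTS

model = ChatterboxTTS.from_pretrained(device="cuda")
wav = model.generate(
    "Hello! This is my cloned voice.",
    audio_prompt_path="reference_voice.wav",  # your 10-30 second sample
)
torchaudio.save("cloned_test.wav", wav, model.sr)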

⚙️ Technical Details

Emotion Parameters

Each emotion uses carefully tuned parameters for dramatic expression:

  • Exaggeration: 0.05 (tired) to 0.95 (excited)
  • CFG Weight: 0.2 (angry) to 0.95 (tired)
  • Temperature: 0.3 (tired) to 1.3 (excited)
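
For illustration, the two extremes of those ranges would look like this as EMOTION_PARAMETERS entries (values taken from the ranges above; the full 12-emotion table lives in the code):

EMOTION_PARAMETERS = {
    "excited": {"exaggeration": 0.95, "cfg_weight": 0.2, "temperature": 1.3},
    "tired": {"exaggeration": 0.05, "cfg_weight": 0.95, "temperature": 0.3},
}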

Performance Requirements

  • Recommended: RTX 4090 GPU for real-time generation
  • Minimum: RTX 3070 or equivalent
  • Model: Gemma 3 4B for optimal speed/quality balance
  • RAM: 16GB+ recommended

Architecture

  • Frontend: FastAPI + WebSocket + HTML/CSS/JS
  • Voice Interface: Gradio + FastRTC
  • TTS: Chatterbox TTS with voice cloning
  • STT: FastRTC STT model
  • LLM: Ollama (Gemma 3 4B)
  • Emotion Detection: Context-based pattern matching
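
On the voice side, FastRTC's standard pattern is a Stream wrapping a reply handler; a bare-bones echo sketch of that wiring (the repo's actual handler runs STT, the LLM, and Chatterbox TTS instead of echoing):

from fastrtc import ReplyOnPause, Stream

def echo(audio):
    # repo's handler: STT -> Gemma 3 -> emotion detection -> Chatterbox TTS
    yield audio

stream = Stream(handler=ReplyOnPause(echo), modality="audio", mode="send-receive")
stream.ui.launch()  # serves the Gradio voice interface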

🎯 How It Works

  1. Input Processing: Text or voice input is received
  2. LLM Response: Gemma 3 generates contextual response
  3. Emotion Detection: Analyses response text for emotional patterns
  4. Voice Synthesis: Applies dramatic parameters based on detected emotion
  5. Real-time Streaming: Audio chunks streamed as they're generated
  6. Playback: Client receives and plays audio with minimal latency
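
Condensed into code, steps 3-4 are a lookup plus a parameterized synthesis call. A hypothetical sketch, where detect_emotion stands in for the repo's pattern matcher and emotion_params is the EMOTION_PARAMETERS table shown under Configuration:

from chatterbox.tts import ChatterboxTTS

def detect_emotion(text: str) -> str:
    # stand-in for the repo's context-based pattern matching
    return "excited" if "!" in text else "neutral"

def synthesize(model: ChatterboxTTS, reply: str, emotion_params: dict):
    params = emotion_params[detect_emotion(reply)]
    # returns one clip here; the real app streams chunks as they are generated
    return model.generate(reply, audio_prompt_path="reference_voice.wav", **params)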

🔧 Configuration

Emotion Tuning

Modify EMOTION_PARAMETERS in the code to adjust emotional expression:

"excited": {
    "exaggeration": 0.95,    # Higher = more expressive
    "cfg_weight": 0.2,       # Lower = more variation
    "temperature": 1.3       # Higher = more dynamic
}

Model Settings

  • Change LLM model in the init_chat_model call (example below)
  • Adjust chunk duration for latency vs quality trade-offs
  • Modify sample rates for different audio quality
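
For instance, assuming the repo uses LangChain's init_chat_model (as the call name suggests), pointing it at a different Ollama model might look like:

# hypothetical example; requires the langchain-ollama package
from langchain.chat_models import init_chat_model

llm = init_chat_model("gemma3:4b", model_provider="ollama")
print(llm.invoke("Say hello in one short, excited sentence!").content)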

📝 Requirements

Key dependencies include:

  • fastapi - Web framework
  • fastrtc - Real-time communication
  • chatterbox-tts - Voice synthesis and cloning
  • langchain - LLM integration
  • gradio - Voice interface
  • torch - Deep learning framework
  • numpy - Numerical computing

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Test with different emotions and voices
  5. Submit a pull request

📄 License

MIT License - see LICENSE file for details.

🙏 Acknowledgments

Experience emotional conversations with your own cloned voice! 🎭🎤
