A production-ready voice agent implementation using LiveKit and Python, featuring advanced conversational AI capabilities and optional telephony integration. It provides intelligent turn detection, function calling, comprehensive logging, telephony integration, and audio enhancement.

danieladdisonorg/livekit-voice-agent

LiveKit Voice Agent

A production-ready voice agent implementation using LiveKit and Python, featuring advanced conversational AI capabilities and optional telephony integration.

Architecture Overview

┌──────────────────────────────────────────────────────────────────────────────────┐
│                           LiveKit Voice Agent Architecture                       │
└──────────────────────────────────────────────────────────────────────────────────┘

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Web Client    │    │  Phone System   │    │  Mobile App     │
│   (Next.js)     │    │   (Twilio)      │    │   (React)       │
└─────────┬───────┘    └─────────┬───────┘    └─────────┬───────┘
          │                      │                      │
          └──────────────────────┼──────────────────────┘
                                 │
                    ┌────────────▼────────────┐
                    │     LiveKit Server      │
                    │   (WebRTC Gateway)      │
                    └────────────┬────────────┘
                                 │
                    ┌────────────▼────────────┐
                    │   Voice Pipeline Agent  │
                    │                         │
                    │  ┌─────────────────┐    │
                    │  │ Turn Detection  │    │
                    │  │   (Silero)      │    │
                    │  └─────────────────┘    │
                    │                         │
                    │  ┌─────────────────┐    │
                    │  │ Audio Pipeline  │    │
                    │  │ ┌─────────────┐ │    │
                    │  │ │   Krisp     │ │    │
                    │  │ │ (Noise      │ │    │
                    │  │ │ Cancel)     │ │    │
                    │  │ └─────────────┘ │    │
                    │  └─────────────────┘    │
                    └────────────┬────────────┘
                                 │
        ┌────────────────────────┼────────────────────────┐
        │                        │                        │
        ▼                        ▼                        ▼
┌──────────────┐        ┌──────────────┐        ┌──────────────┐
│ Speech-to-   │        │ Language     │        │ Text-to-     │
│ Text (STT)   │        │ Model (LLM)  │        │ Speech (TTS) │
│              │        │              │        │              │
│ Deepgram API │        │  OpenAI API  │        │ ElevenLabs   │
│              │        │              │        │ API          │
└──────┬───────┘        └──────┬───────┘        └──────┬───────┘
       │                       │                       │
       │              ┌────────▼─────────┐             │
       │              │ Function Calling │             │
       │              │                  │             │
       │              │ ┌──────────────┐ │             │
       │              │ │   Weather    │ │             │
       │              │ │   Service    │ │             │
       │              │ └──────────────┘ │             │
       │              │                  │             │
       │              │ ┌──────────────┐ │             │
       │              │ │   Clock      │ │             │
       │              │ │   Service    │ │             │
       │              │ └──────────────┘ │             │
       │              │                  │             │
       │              │ ┌──────────────┐ │             │
       │              │ │   Custom     │ │             │
       │              │ │   Tools      │ │             │
       │              │ └──────────────┘ │             │
       │              └──────────────────┘             │
       │                                               │
       └───────────────────┐     ┌─────────────────────┘
                           │     │
                    ┌──────▼─────▼──────┐
                    │   Logging &       │
                    │   Analytics       │
                    │                   │
                    │ • Usage Metrics   │
                    │ • Conversation    │
                    │   Summaries       │
                    │ • Performance     │
                    │   Monitoring      │
                    └───────────────────┘

┌──────────────────────────────────────────────────────────────────────────────────┐
│                              Data Flow Process                                   │
└──────────────────────────────────────────────────────────────────────────────────┘

1. Audio Input → 2. Noise Cancellation → 3. Speech Detection → 4. STT Processing
                                                    ↓
8. Audio Output ← 7. TTS Generation ← 6. Response Generation ← 5. LLM Processing
                                                    ↓
                                            Function Execution
                                          (Weather, Clock, etc.)

┌──────────────────────────────────────────────────────────────────────────────────┐
│                            Telephony Integration                                 │
└──────────────────────────────────────────────────────────────────────────────────┘

Phone Call → Twilio SIP → LiveKit SIP Gateway → Voice Agent → Response Pipeline
     ↑                                                              ↓
     └──────────────── Audio Response ←─────────────────────────────┘

Regional SIP Configuration:
• US East: 54.172.60.0, 54.244.51.0
• US West: 54.171.127.192, 35.156.191.128
• Europe: 54.171.127.200, 35.156.191.140
• Asia Pacific: 54.169.127.128, 52.65.191.64

How It Works

1. Connection Establishment

  • Users connect via web browsers, mobile apps, or phone calls
  • LiveKit server handles WebRTC connections and SIP integration
  • The agent detects the connection type (web, mobile, or phone) and selects models suited to it

2. Audio Processing Pipeline

  • Input: Raw audio from user's microphone or phone
  • Noise Cancellation: Krisp AI removes background noise
  • Turn Detection: Silero VAD detects when user starts/stops speaking
  • Speech-to-Text: Deepgram converts speech to text in real-time
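
The turn-detection step can be illustrated with a much simpler stand-in. The sketch below is not Silero VAD (which is a neural model); it is an energy-threshold detector that only shows the idea of marking where speech starts and stops. The function names, the `threshold` value, and the `hangover` parameter are all assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple


def frame_rms(frame: Sequence[float]) -> float:
    """Root-mean-square energy of one frame of samples in [-1.0, 1.0]."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def detect_turns(
    frames: Sequence[Sequence[float]],
    threshold: float = 0.1,  # hypothetical energy threshold
    hangover: int = 2,       # quiet frames tolerated before a turn ends
) -> List[Tuple[int, int]]:
    """Return (start, end) frame indices of speech segments, end exclusive."""
    turns: List[Tuple[int, int]] = []
    start = None
    last_loud = -1
    quiet = 0
    for i, frame in enumerate(frames):
        if frame_rms(frame) >= threshold:
            if start is None:
                start = i  # speech begins
            last_loud = i
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet >= hangover:
                # Enough silence: close the turn at the last loud frame.
                turns.append((start, last_loud + 1))
                start = None
                quiet = 0
    if start is not None:
        # Audio ended while the user was still speaking.
        turns.append((start, last_loud + 1))
    return turns
```

The `hangover` count keeps a short pause inside a sentence from being treated as the end of the user's turn, which is the same problem a production VAD has to solve with far better accuracy.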

3. Intelligent Processing

  • Language Understanding: OpenAI processes user intent
  • Function Calling: Agent can execute tools (weather, time, custom functions)
  • Context Management: Maintains conversation history and state
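
Function calling can be pictured as a name-to-callable registry that the agent consults when the LLM asks for a tool. The following is a minimal sketch under that assumption; `ToolRegistry`, `get_time`, and `get_weather` are hypothetical names, and the weather tool is a placeholder rather than a real API call. It does not reproduce the LiveKit function-calling API.

```python
from datetime import datetime, timezone
from typing import Any, Callable, Dict


class ToolRegistry:
    """Hypothetical registry mapping tool names to Python callables."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable[..., Any]] = {}

    def register(self, name: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
        """Decorator that stores a function under the given tool name."""
        def decorator(fn: Callable[..., Any]) -> Callable[..., Any]:
            self._tools[name] = fn
            return fn
        return decorator

    def call(self, name: str, **kwargs: Any) -> Dict[str, Any]:
        """Run a tool and wrap the outcome so the caller never sees a raise."""
        if name not in self._tools:
            return {"error": f"unknown tool: {name}"}
        try:
            return {"result": self._tools[name](**kwargs)}
        except TypeError as exc:
            # Missing or unexpected arguments (or a TypeError from the tool).
            return {"error": str(exc)}


registry = ToolRegistry()


@registry.register("get_time")
def get_time() -> str:
    """Current UTC time as an ISO 8601 string."""
    return datetime.now(timezone.utc).isoformat()


@registry.register("get_weather")
def get_weather(city: str) -> str:
    """Placeholder: a real tool would query a weather service here."""
    return f"Weather lookup for {city} is not wired up in this sketch."
```

Returning an error dictionary instead of raising lets the agent feed the failure back to the LLM, which can then apologize or ask the user for the missing detail.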

4. Response Generation

  • Text Generation: LLM creates appropriate responses
  • Text-to-Speech: ElevenLabs converts text to natural speech
  • Audio Delivery: Processed audio sent back to user

5. Monitoring & Analytics

  • Real-time performance metrics
  • Conversation logging and summaries
  • Usage analytics and optimization insights
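
As a rough illustration of what usage logging might collect, the sketch below accumulates per-turn figures and produces a summary. The `UsageLog` class and its field names are assumptions, not the repository's actual logging schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class UsageLog:
    """Hypothetical per-session usage accumulator."""

    stt_audio_seconds: float = 0.0
    llm_tokens: int = 0
    tts_characters: int = 0
    turn_latencies_ms: List[float] = field(default_factory=list)

    def record_turn(
        self, audio_seconds: float, tokens: int, tts_text: str, latency_ms: float
    ) -> None:
        """Add one conversational turn to the running totals."""
        self.stt_audio_seconds += audio_seconds
        self.llm_tokens += tokens
        self.tts_characters += len(tts_text)
        self.turn_latencies_ms.append(latency_ms)

    def summary(self) -> Dict[str, Union[int, float]]:
        """Totals plus the mean turn latency (0.0 when no turns were recorded)."""
        turns = len(self.turn_latencies_ms)
        avg = sum(self.turn_latencies_ms) / turns if turns else 0.0
        return {
            "turns": turns,
            "stt_audio_seconds": self.stt_audio_seconds,
            "llm_tokens": self.llm_tokens,
            "tts_characters": self.tts_characters,
            "avg_latency_ms": avg,
        }
```

Tracking audio seconds, tokens, and characters separately mirrors how STT, LLM, and TTS providers typically bill, so one summary can feed both cost and latency monitoring.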

Features

  • Intelligent Turn Detection - Natural conversation flow with automatic speech detection
  • Function Calling - Extensible tool integration including:
    • Weather information retrieval
    • Real-time clock functionality
  • Comprehensive Logging - Usage analytics and conversation summaries
  • Telephony Integration - Inbound call support via Twilio SIP trunking
  • Audio Enhancement - Krisp noise cancellation for crystal-clear communication
  • Optimized Models - Automatic model switching for telephony vs. web-based interactions
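
The model-switching feature can be sketched as a lookup keyed on the connection kind. This is an assumption about how such a switch might look, not the repository's code: the `"sip"` label and the function name are hypothetical, and while `nova-2-phonecall` and `nova-2-general` are Deepgram model names, this README does not confirm which models the agent uses.

```python
# Hypothetical mapping from connection kind to a speech-to-text model.
# Phone audio is narrowband, so a telephony-tuned model is a common choice.
STT_MODEL_BY_KIND = {"sip": "nova-2-phonecall"}
DEFAULT_STT_MODEL = "nova-2-general"


def select_stt_model(connection_kind: str) -> str:
    """Pick an STT model name for the given connection kind.

    Unknown kinds fall back to the general-purpose model.
    """
    key = connection_kind.strip().lower()
    return STT_MODEL_BY_KIND.get(key, DEFAULT_STT_MODEL)
```

For example, `select_stt_model("sip")` would pick the phone-call model, while a browser or mobile participant falls through to the general one.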

Prerequisites

  • Python 3.8 or higher
  • LiveKit Cloud account or self-hosted LiveKit server
  • API keys for required services (OpenAI, ElevenLabs, Deepgram)
  • Optional: Twilio account for telephony features

Installation

Quick Start

  1. Clone and navigate to the repository:
git clone https://github.com/danieladdisonorg/livekit-voice-agent.git
cd livekit-voice-agent
  2. Set up Python environment:

Linux/macOS:

python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
python3 agent.py download-files

Windows:

python -m venv venv
venv\Scripts\activate
pip install -r requirements.txt
python agent.py download-files

Configuration

  1. Environment Setup: Copy the example environment file and configure your API credentials:
cp .env.example .env.local
  2. Required Environment Variables:

    LIVEKIT_URL=your_livekit_server_url
    LIVEKIT_API_KEY=your_api_key
    LIVEKIT_API_SECRET=your_api_secret
    OPENAI_API_KEY=your_openai_key
    ELEVEN_API_KEY=your_elevenlabs_key
    DEEPGRAM_API_KEY=your_deepgram_key

  3. Automated Configuration (Optional): If using LiveKit Cloud, you can auto-configure using the CLI:

lk app env
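
Before starting the agent, it can help to confirm that every required variable is set. The sketch below is one possible check, not part of the repository: `missing_env_vars` is a hypothetical helper, and treating values that still begin with `your_` as unset is an assumption based on the placeholder values shown above. It reads an already-loaded mapping and does not parse `.env.local` itself.

```python
import os
from typing import List, Mapping, Optional

# Variable names taken from the Required Environment Variables list.
REQUIRED_VARS = (
    "LIVEKIT_URL",
    "LIVEKIT_API_KEY",
    "LIVEKIT_API_SECRET",
    "OPENAI_API_KEY",
    "ELEVEN_API_KEY",
    "DEEPGRAM_API_KEY",
)


def missing_env_vars(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return required variables that are unset, blank, or still placeholders."""
    source = os.environ if env is None else env
    missing = []
    for name in REQUIRED_VARS:
        value = source.get(name, "").strip()
        # "your_..." matches the sample values from .env.example (assumption).
        if not value or value.startswith("your_"):
            missing.append(name)
    return missing


if __name__ == "__main__":
    problems = missing_env_vars()
    if problems:
        print("Missing or placeholder variables: " + ", ".join(problems))
    else:
        print("All required variables are set.")
```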

Usage

Development Mode

Start the agent in development mode:

python3 agent.py dev

Frontend Integration

This agent requires a compatible frontend application. We recommend using the LiveKit Next.js Voice Agent Interface for a complete solution.

Telephony Integration (Optional)

Enable inbound phone calls through Twilio SIP integration.

Prerequisites

  • LiveKit CLI installed and authenticated
  • Twilio account with phone number
  • SIP trunk configuration

Installation Steps

  1. Install LiveKit CLI (macOS):
brew update && brew install livekit-cli
  2. Authenticate with LiveKit Cloud:
lk cloud auth

Twilio Configuration

  1. Create Twilio Resources:

    • Sign up for a Twilio account
    • Purchase a phone number
    • Create a new SIP trunk in the Twilio Console
  2. Configure SIP Trunk:

    • Navigate to: Elastic SIP Trunking → SIP Trunks → Create
    • Add Origination URI: <YOUR_LIVEKIT_SIP_URI>;transport=tcp
    • Associate your phone number with priority 1, weight 1
  3. Deploy LiveKit SIP Configuration:

    Create Inbound Trunk:

lk sip inbound create inbound-trunk.json

Create Dispatch Rule:

lk sip dispatch create dispatch-rule.json

Regional Configuration

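Update inbound-trunk.json with the Twilio SIP signaling IP addresses for your region. The default configuration includes US IP addresses. Twilio publishes its signaling ranges per region, so confirm the addresses against its current Elastic SIP Trunking documentation before deploying.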
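
A typo in an address silently blocks inbound calls, so a quick sanity check before editing the trunk file may be worthwhile. The helper below is a hypothetical sketch using Python's standard `ipaddress` module; it only verifies that each entry parses as an IP address or CIDR range and makes no claim about the layout of inbound-trunk.json or about which addresses Twilio actually uses.

```python
import ipaddress
from typing import Iterable, List


def invalid_addresses(entries: Iterable[str]) -> List[str]:
    """Return the entries that are not valid IP addresses or CIDR ranges."""
    bad = []
    for entry in entries:
        try:
            # strict=False accepts ranges whose host bits are set,
            # and a bare address is treated as a single-host network.
            ipaddress.ip_network(entry.strip(), strict=False)
        except ValueError:
            bad.append(entry)
    return bad


if __name__ == "__main__":
    candidates = ["54.172.60.0", "54.244.51.0/30", "54.172.60.256"]
    print(invalid_addresses(candidates))
```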

Architecture

  • Agent Core - Main conversation logic and state management
  • Function Registry - Extensible tool calling system
  • Audio Pipeline - Real-time audio processing with noise cancellation
  • SIP Integration - Telephony gateway for inbound calls
  • Logging System - Comprehensive usage and performance analytics

Support

For issues and questions:

  • Check the LiveKit Documentation
  • Review existing GitHub issues
  • Contact support through your LiveKit Cloud dashboard
