Build your own AI voice assistant that can handle inbound calls using OpenAI's GPT for conversation, Deepgram for speech processing, and Twilio for telephony - all for around 1 cent per minute!
This repository is Part 1 of a series demonstrating how to build production-ready AI voice assistants. In this first part, we focus on handling inbound calls and basic FAQ responses, achieving:
- ~1 second latency
- ~$0.01 per minute cost
- Natural conversation flow with interruption handling
- Scalable architecture for future expansion
Coming in Part 2 (stay tuned!):
- Function calling capabilities
- Outbound call handling
- Enhanced text-to-speech with ElevenLabs
- And more!
Key Components:
- Twilio: Handles inbound calls and audio streaming
- Deepgram:
  - Speech-to-Text: Real-time transcription
  - Text-to-Speech: Response generation
- OpenAI GPT: Natural language processing and response generation
- WebSocket Server: Real-time audio streaming and service orchestration
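To make the WebSocket server's role more concrete, here is a minimal sketch of how the JSON messages from a Twilio Media Stream could be classified before the audio is handed to speech-to-text. The `start`, `media`, and `stop` event names and the `start.streamSid` and `media.payload` fields follow Twilio's Media Streams message format; the function name, the `state` object, and the returned action objects are hypothetical and not taken from this repository.

```javascript
// Sketch only: handleTwilioMessage, `state`, and the action shapes are
// illustrative names, not this repository's actual code.
function handleTwilioMessage(raw, state) {
  const msg = JSON.parse(raw);

  switch (msg.event) {
    case 'start':
      // Twilio sends the stream identifier once, when the stream opens.
      state.streamSid = msg.start.streamSid;
      return { type: 'started', streamSid: state.streamSid };

    case 'media':
      // Audio arrives base64-encoded; decode it before forwarding it on.
      return { type: 'audio', chunk: Buffer.from(msg.media.payload, 'base64') };

    case 'stop':
      return { type: 'stopped' };

    default:
      // 'connected', 'mark', and anything else are ignored in this sketch.
      return { type: 'ignored' };
  }
}
```

In a real server this function would be called from the WebSocket `message` handler, with `audio` actions forwarded to the transcription service.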
The system is built with a modular architecture:
- app.js: Main server and WebSocket handling
- services/:
  - gpt-service.js: OpenAI integration and conversation management
  - stream-service.js: Audio streaming and buffer management
  - transcription-service.js: Speech-to-text processing
  - tts-service.js: Text-to-speech conversion
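One way such modules can be connected is through Node.js events, with each service emitting its result and the main server file wiring them together. The sketch below uses stub classes in place of the real services; the class names, method names, and event names (`transcription`, `gptreply`, `speech`) are assumptions for illustration and may not match the files in this repository.

```javascript
const { EventEmitter } = require('events');

// Stub services: each stands in for one file under services/ and only
// emits events. Real services would call Deepgram and OpenAI instead.
class StubTranscription extends EventEmitter {
  send(text) { this.emit('transcription', text); }
}
class StubGpt extends EventEmitter {
  completion(text) { this.emit('gptreply', `You said: ${text}`); }
}
class StubTts extends EventEmitter {
  generate(reply) { this.emit('speech', Buffer.from(reply).toString('base64')); }
}
class StubStream {
  constructor() { this.sent = []; }
  buffer(audio) { this.sent.push(audio); }
}

// Wiring that a main server file might perform once per call.
function wireCall() {
  const transcription = new StubTranscription();
  const gpt = new StubGpt();
  const tts = new StubTts();
  const stream = new StubStream();

  transcription.on('transcription', (text) => gpt.completion(text));
  gpt.on('gptreply', (reply) => tts.generate(reply));
  tts.on('speech', (audio) => stream.buffer(audio));

  return { transcription, stream };
}

// Usage: one transcribed utterance flows through all three stages.
const { transcription, stream } = wireCall();
transcription.send('hello');
console.log(stream.sent.length);
```

Keeping each stage behind its own emitter means a service can be swapped (for example, a different text-to-speech provider) without touching the others.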
- Node.js (v14+)
- npm/yarn
- Accounts with:
  - Twilio
  - Deepgram
  - OpenAI
- Clone the repository:
git clone https://github.com/Barty-Bart/ai-voice-assistant-openai-deepgram.git
cd ai-voice-assistant-openai-deepgram
- Install dependencies:
npm install
- Create a .env file:
SERVER=your-server-domain
DEEPGRAM_API_KEY=your-deepgram-api-key
VOICE_MODEL=your-preferred-voice-model
OPENAI_API_KEY=your-openai-api-key
- Configure Twilio:
  - Set up a Twilio phone number
  - Configure the webhook to point to your /incoming endpoint
  - Ensure your server has HTTPS (required for Twilio)
- Start the server:
npm start
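For the .env and Twilio webhook steps above, the following sketch shows two small helpers: one that reports which of the four variables are missing, and one that builds the TwiML an /incoming webhook could return so that Twilio opens a media stream to the server. The `<Connect>` and `<Stream>` elements are standard TwiML; the helper names and the `/connection` WebSocket path are assumptions, so check app.js for the path the server actually listens on.

```javascript
// The four variable names come from the .env step above.
const REQUIRED_ENV = ['SERVER', 'DEEPGRAM_API_KEY', 'VOICE_MODEL', 'OPENAI_API_KEY'];

// Return the names of required variables that are unset or empty.
function missingEnv(env) {
  return REQUIRED_ENV.filter((name) => !env[name]);
}

// Build the TwiML response for the /incoming webhook.
// The default "/connection" path is an assumption for illustration.
function buildIncomingTwiml(serverDomain, wsPath = '/connection') {
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<Response>',
    '  <Connect>',
    `    <Stream url="wss://${serverDomain}${wsPath}" />`,
    '  </Connect>',
    '</Response>',
  ].join('\n');
}

// Usage: warn early instead of failing on the first call.
const missing = missingEnv(process.env);
if (missing.length > 0) {
  console.warn(`Missing environment variables: ${missing.join(', ')}`);
}
```

Note that SERVER is used here as a bare domain without a scheme, since the sketch prepends `wss://` itself.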
- Call Initiation:
  - Customer calls Twilio number
  - Twilio establishes WebSocket connection with server
- Real-time Processing:
  - Speech-to-Text: Customer audio → Deepgram → Text
  - Processing: Text → OpenAI GPT → Response
  - Text-to-Speech: Response → Deepgram → Audio
  - Audio streamed back to caller
- Key Features:
  - Real-time transcription and response
  - Natural conversation handling
  - Interruption detection
  - Ordered message queuing
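Ordered message queuing and interruption handling can be sketched as follows. When a response is split into several pieces and each is converted to speech separately, the audio can come back out of order, so early arrivals are held until every lower index has been sent. The class and function names are hypothetical and this is not necessarily how stream-service.js implements it; the `clear` message, however, is part of Twilio's Media Streams protocol and empties audio that Twilio has buffered for playback.

```javascript
// Sketch of ordered message queuing; names are illustrative only.
class OrderedAudioQueue {
  constructor(send) {
    this.send = send;          // callback that delivers one chunk to the caller
    this.nextIndex = 0;        // index of the chunk that must go out next
    this.pending = new Map();  // chunks that arrived early, keyed by index
  }

  push(index, audio) {
    this.pending.set(index, audio);
    // Flush every chunk that is now contiguous with what was already sent.
    while (this.pending.has(this.nextIndex)) {
      this.send(this.pending.get(this.nextIndex));
      this.pending.delete(this.nextIndex);
      this.nextIndex += 1;
    }
  }

  // On interruption, drop anything not yet sent and start a fresh sequence.
  reset() {
    this.pending.clear();
    this.nextIndex = 0;
  }
}

// Message that asks Twilio to discard audio it has queued for playback,
// useful when the caller starts talking over the assistant.
function clearMessage(streamSid) {
  return JSON.stringify({ event: 'clear', streamSid });
}

// Usage: chunk 1 arrives before chunk 0 but is played second.
const played = [];
const queue = new OrderedAudioQueue((audio) => played.push(audio));
queue.push(1, 'second');
queue.push(0, 'first');
console.log(played);
```

On an interruption, the server would call `reset()` and send `clearMessage(streamSid)` over the WebSocket so stale audio is neither sent nor played.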
- Implement streaming TTS API from Deepgram for reduced latency
- Integrate ElevenLabs for enhanced voice quality
- Add outbound calling capabilities
- Implement function calling for complex tasks
- Add more sophisticated conversation handling
- Cost: Approximately 1 cent per minute
  - Significantly lower than commercial alternatives (5-10 cents/min)
- Latency: ~1 second response time
  - Can be further optimized with streaming TTS
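To put the per-minute figures in perspective, the short calculation below compares the approximate 1 cent per minute against the 5-10 cents per minute range quoted for commercial alternatives. The monthly volume of 10,000 minutes is a hypothetical number chosen only for illustration, and the rates are the rough figures stated above rather than measured pricing.

```javascript
// Work in whole cents to avoid floating-point drift.
function costInCents(minutes, centsPerMinute) {
  return minutes * centsPerMinute;
}

function formatDollars(cents) {
  return `$${(cents / 100).toFixed(2)}`;
}

const monthlyMinutes = 10000; // hypothetical call volume

console.log(formatDollars(costInCents(monthlyMinutes, 1)));  // this stack, ~1 cent/min
console.log(formatDollars(costInCents(monthlyMinutes, 5)));  // low end of quoted range
console.log(formatDollars(costInCents(monthlyMinutes, 10))); // high end of quoted range
```

At that volume the difference is about $100 per month versus $500 to $1,000 per month.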
Contributions are welcome! Please feel free to submit a Pull Request.
This project is based on Twilio's Call-GPT.
ai voice assistant, openai gpt, deepgram, twilio, voice ai, chatbot, conversational ai, speech recognition, text to speech, websocket, nodejs, real-time audio, low-cost ai, inbound calls
Star ⭐ this repository if you find it helpful!