A sophisticated voice-based AI assistant that operates entirely locally, integrating speech-to-text, text-to-speech, and large language model capabilities without relying on cloud services.
- Speech Recognition: Powered by OpenAI's Whisper model for accurate speech-to-text conversion
- Voice Synthesis: Implements Coqui TTS for natural-sounding text-to-speech responses
- Language Processing: Connects with Ollama to run large language models locally
- User Interface: Features an intuitive Gradio-based interface
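These components chain together in a simple record → transcribe → generate → speak loop. The sketch below shows one way the pieces could be wired up, assuming Ollama's default port and illustrative model names; the actual logic lives in `main.py` and may differ.

```python
import requests
import whisper
from TTS.api import TTS

# Load the local models once at startup (model choices here are illustrative)
stt = whisper.load_model("base")
tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")

def respond(audio_path: str) -> str:
    """Transcribe a recording, query the local LLM, and speak the reply."""
    # 1. Speech-to-text with Whisper
    text = stt.transcribe(audio_path)["text"]

    # 2. Text generation via Ollama's local HTTP API (port 11434 by default)
    reply = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "gemma3:1b", "prompt": text, "stream": False},
        timeout=120,
    ).json()["response"]

    # 3. Text-to-speech with Coqui TTS
    tts.tts_to_file(text=reply, file_path="reply.wav")
    return "reply.wav"
```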
- Python 3.12 or newer
- 8GB RAM minimum (16GB recommended)
- 2GB of free storage for base models
- NVIDIA GPU recommended for optimal performance
Create and activate a Python virtual environment:
# Windows
python -m venv venv
.\venv\Scripts\activate
# macOS/Linux (including Ubuntu)
python -m venv venv
source venv/bin/activate
# Core dependencies
pip install -U openai-whisper coqui-tts sounddevice soundfile gradio
# For NVIDIA GPU acceleration
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
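After installing the CUDA build, you can confirm that PyTorch actually sees your GPU before running the assistant:

```python
import torch

# Prints True when the CUDA build is active and models can run on the GPU
print(torch.cuda.is_available())
```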
Ensure Ollama is installed on your system for local LLM functionality.
# Pull a recommended model
ollama pull gemma3:1b
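The assistant talks to Ollama over its local HTTP API, which listens on port 11434 by default. A quick check that the server is running and the model has been pulled:

```python
import requests

# Lists locally pulled models; fails if the Ollama server is not running
print(requests.get("http://localhost:11434/api/tags", timeout=5).json())
```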
Launch the application:
python main.py
The Gradio interface starts locally and can be opened in your web browser, by default at http://127.0.0.1:7860.
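`main.py` builds the UI with Gradio. A stripped-down sketch of such an interface, with a placeholder standing in for the real speech pipeline, looks roughly like this:

```python
import gradio as gr

def respond(audio_path: str) -> str:
    # Placeholder: the real function runs STT -> LLM -> TTS and returns a reply wav
    return audio_path

demo = gr.Interface(
    fn=respond,
    inputs=gr.Audio(sources=["microphone"], type="filepath"),
    outputs=gr.Audio(type="filepath", label="Assistant reply"),
    title="Local Voice Assistant",
)

demo.launch()  # served at http://127.0.0.1:7860 by default
```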
- Offline Speech Recognition: Transcribe voice input without internet connectivity
- Natural Voice Output: Generate human-like speech responses
- Voice Customization: Multiple voice options available through Coqui TTS models
- Contextual Understanding: Maintains conversation history for coherent interactions (see the sketch after this list)
- Local Processing: All data remains on your device for enhanced privacy
- Extensible Architecture: Easily integrate additional models or functionality
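Conversation memory, for example, can be implemented by passing a growing message list to Ollama's chat endpoint on each turn; a minimal sketch, assuming the same gemma3:1b model as above:

```python
import requests

history = []  # grows across turns so the model sees prior context

def chat(user_text: str) -> str:
    history.append({"role": "user", "content": user_text})
    reply = requests.post(
        "http://localhost:11434/api/chat",
        json={"model": "gemma3:1b", "messages": history, "stream": False},
        timeout=120,
    ).json()["message"]["content"]
    history.append({"role": "assistant", "content": reply})
    return reply
```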
assistant-example.mp4
For common issues, see our troubleshooting guide.
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
This project is licensed under the MIT License - see the LICENSE file for details.