Edge-based voice assistant using Gemma LLM with real-time Speech-to-Text and Text-to-Speech capabilities
- Real-time Streaming STT/TTS: Replaced traditional STT/TTS with RealtimeSTT and RealtimeTTS for lower latency
- Live Transcription: See what you're saying in real-time as you speak
- Streaming TTS: Faster response times with streaming audio synthesis
- Improved Performance: Better voice activity detection and faster processing
- Configurable TTS Speed: Adjust speech speed with the `--tts-speed` parameter
- Real-time Speech Recognition (STT): Live transcription using RealtimeSTT with Whisper
- Conversational AI (LLM): Local LLM inference via llama.cpp (Gemma 3 12B)
- Streaming Speech Synthesis (TTS): Real-time voice synthesis with RealtimeTTS using Coqui XTTS v2
- Complete Offline Operation: All processing is done locally, ensuring privacy
- Voice Cloning: Clone any voice with a short audio sample
AgentVox is proud to be developed in collaboration with STOIC Entertainment, a pioneering tech company specializing in AI and XR cross-platform solutions. This partnership integrates STOIC's expertise in creating immersive, interactive experiences with AgentVox's advanced voice technology.
STOIC's key contributions include:
- Advanced Dialogue Management: Leveraging their patented Query Doc technology, which enhances the AI's ability to have context-aware conversations by using an efficient, on-device RAG (Retrieval-Augmented Generation) system. This allows for more natural, memory-driven interactions, making the AI feel more like a personal companion that remembers past conversations.
- Cross-Platform XR Integration: STOIC's extensive experience in developing content for Mobile, PC, and XR platforms ensures that AgentVox can be seamlessly integrated into various immersive environments, including upcoming AR glasses from major tech companies.
- AI-Powered Content: Insights from STOIC's work on projects like the AI language learning platform Fluent help shape the conversational agent's ability to engage users in meaningful, goal-oriented dialogue.
pip install agentvox
Or install from source:
git clone https://github.com/yourusername/agentvox.git
cd agentvox
pip install -e .
If you have an NVIDIA GPU and want to use CUDA acceleration, you need to rebuild llama-cpp-python with CUDA support:
# Rebuild llama-cpp-python with CUDA support
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --upgrade --force-reinstall --no-cache-dir
This will significantly improve LLM inference performance on NVIDIA GPUs.
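To verify that the rebuilt wheel actually offloads to the GPU, you can load the model directly with llama-cpp-python and check the load log. A minimal sketch, assuming the default model location used by `agentvox --download-model`:

```python
from pathlib import Path
from llama_cpp import Llama

# Default download location (see below); adjust if you store the model elsewhere.
model_path = Path.home() / ".agentvox" / "models" / "gemma-3-12b-it-Q4_K_M.gguf"

# n_gpu_layers=-1 requests offloading all layers to the GPU; with
# verbose=True the load log reports whether CUDA offload succeeded.
llm = Llama(model_path=str(model_path), n_gpu_layers=-1, verbose=True)
```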
# Automatically download Gemma model (~7GB)
agentvox --download-model
The model will be saved in the `~/.agentvox/models/` directory.
# Start voice conversation
agentvox
Speak into your microphone and the AI will respond with voice.
# Voice cloning with Coqui-TTS (default engine)
agentvox --speaker-wav speaker_sample.wav
# Record your own voice sample for cloning
agentvox --record-speaker
# Then use the recorded sample
agentvox --speaker-wav speaker_ko.wav
# Adjust TTS speed (1.0 is normal, higher is faster)
agentvox --tts-speed 1.5 # 50% faster
agentvox --tts-speed 1.3 # 30% faster (default)
agentvox --tts-speed 1.0 # Normal speed
agentvox --tts-speed 0.8 # 20% slower
# Recognize speech in different languages
agentvox --stt-language en
# Increase beam size for more accurate recognition (default: 5)
agentvox --stt-beam-size 10
# Adjust VAD sensitivity (default: 0.5)
agentvox --stt-vad-threshold 0.3
# Adjust minimum speech duration in ms (default: 250)
agentvox --stt-vad-min-speech-duration 200
# Adjust minimum silence duration in ms (default: 1000)
agentvox --stt-vad-min-silence-duration 800
# Change Whisper model size (tiny, base, small, medium, large)
agentvox --stt-model small
# Generate longer responses (default: 512)
agentvox --llm-max-tokens 1024
# More creative responses (higher temperature, default: 0.7)
agentvox --llm-temperature 0.9
# More conservative responses (lower temperature)
agentvox --llm-temperature 0.3
# Adjust context size (default: 4096)
agentvox --llm-context-size 8192
# Adjust top-p sampling (default: 0.95)
agentvox --llm-top-p 0.9
# Auto-detect best available device (default)
agentvox
# Explicitly use CPU
agentvox --device cpu
# Explicitly use CUDA GPU
agentvox --device cuda
# Explicitly use Apple Silicon MPS
agentvox --device mps
The system automatically detects the best available device, as sketched below:
- NVIDIA GPU with CUDA → `cuda`
- Apple Silicon → `mps`
- Otherwise → `cpu`
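This fallback order can be reproduced with plain PyTorch calls. A minimal sketch of the same logic (not AgentVox's internal code):

```python
import torch

def detect_device() -> str:
    """Mirror the fallback order above: CUDA, then Apple Silicon MPS, then CPU."""
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"

print(detect_device())  # e.g. "cuda" on an NVIDIA machine
```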
# English recognition + fast speech + longer responses
agentvox --stt-language en --tts-speed 1.5 --llm-max-tokens 1024
# High accuracy STT + creative responses + voice cloning
agentvox --stt-beam-size 10 --llm-temperature 0.9 --speaker-wav voice_sample.wav
# Use custom model path with fast TTS
agentvox --model /path/to/your/model.gguf --tts-speed 1.4
from agentvox import VoiceAssistant, ModelConfig, AudioConfig
# Configuration
model_config = ModelConfig(
    stt_model="base",
    llm_temperature=0.7,
    tts_speed=1.0,  # Adjust TTS speed
    speaker_wav="voice_sample.wav"  # Optional: voice cloning
)
audio_config = AudioConfig()
# Initialize voice assistant
assistant = VoiceAssistant(model_config, audio_config)
# Start conversation
assistant.run_conversation_loop()
from agentvox import STTModule, LLMModule, TTSModule, ModelConfig
config = ModelConfig()
# STT (Speech to Text)
stt = STTModule(config)
text = stt.transcribe("audio.wav")
# LLM (Generate text response)
llm = LLMModule(config)
response = llm.generate_response(text)
# TTS (Text to Speech)
tts = TTSModule(config)
tts.speak(response)
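The modules compose naturally into a pipeline. For example, a single-turn helper built only from the calls shown above (a sketch, not part of the package API):

```python
def answer(audio_path: str) -> str:
    """Run one STT -> LLM -> TTS turn over a recorded audio file."""
    text = stt.transcribe(audio_path)    # speech -> text
    reply = llm.generate_response(text)  # text -> response
    tts.speak(reply)                     # response -> spoken audio
    return reply

print(answer("audio.wav"))
```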
- "exit" or "종료": Exit the program
- "reset" or "초기화": Reset conversation history
- "history" or "대화 내역": View conversation history
- Python 3.8 or higher
- macOS (with MPS support), Linux, Windows
- Minimum 8GB RAM (16GB recommended)
- Approximately 7GB disk space (for model storage)
- torch >= 2.0.0
- realtimestt (Real-time speech-to-text)
- realtimetts[coqui] (Real-time text-to-speech with Coqui engine)
- llama-cpp-python
- numpy
- pygame
- sounddevice
- soundfile
- pyaudio
- hangul-romanize (for Korean language support)
agentvox/
├── agentvox/ # Package directory
│ ├── __init__.py # Package initialization
│ ├── voice_assistant.py # Main module
│ ├── cli.py # CLI interface
│ └── record_speaker_wav.py # Voice recording module
├── setup.py # Package setup
├── pyproject.toml # Build configuration
├── requirements.txt # Dependencies
├── README.md # Documentation
└── .gitignore # Git ignore file
On first run, the Coqui TTS model (XTTS v2, ~1.86GB) will be automatically downloaded. This only happens once.
If you encounter multiprocessing errors, ensure your script uses:
if __name__ == "__main__":
    # Your code here
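For example, a complete guarded entry point using the Python API shown earlier:

```python
from agentvox import VoiceAssistant, ModelConfig, AudioConfig

def main():
    assistant = VoiceAssistant(ModelConfig(), AudioConfig())
    assistant.run_conversation_loop()

if __name__ == "__main__":
    # The guard keeps multiprocessing workers from re-executing
    # this module when they import it.
    main()
```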
If you get ctranslate2 compatibility errors:
pip install ctranslate2==4.4.0
macOS:
brew install portaudio
pip install pyaudio
Linux:
sudo apt-get install portaudio19-dev python3-pyaudio
pip install pyaudio
Windows:
# Visual Studio Build Tools required
pip install pipwin
pipwin install pyaudio
For large LLM models:
- Use smaller quantized models
- Reduce context size: `--llm-context-size 2048`
- Use CPU mode: `--device cpu`
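When starting the assistant from Python instead of the CLI, the same limit can be passed through ModelConfig. A sketch; the field name below is assumed to mirror the `--llm-context-size` flag and may differ in your installed version:

```python
from agentvox import VoiceAssistant, ModelConfig, AudioConfig

# Assumed field name mirroring --llm-context-size; check
# ModelConfig's signature in your installed version.
config = ModelConfig(llm_context_size=2048)

assistant = VoiceAssistant(config, AudioConfig())
assistant.run_conversation_loop()
```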
- Check microphone permissions in system settings
- Close other audio applications
- Adjust VAD threshold: `--stt-vad-threshold 0.3`
- Reduce silence duration for faster response: `--stt-vad-min-silence-duration 500`
If the TTS output is being picked up by the microphone:
- Use headphones instead of speakers
- Reduce speaker volume
- Increase the distance between microphone and speakers
- The system automatically pauses STT during TTS playback to minimize echo
# Download model
agentvox --download-model
# Or download directly
wget https://huggingface.co/tgisaturday/Docsray/resolve/main/gemma-3-12b-it-GGUF/gemma-3-12b-it-Q4_K_M.gguf \
-O ~/.agentvox/models/gemma-3-12b-it-Q4_K_M.gguf
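Equivalently, the file can be fetched from Python with huggingface_hub (a sketch; note that `hf_hub_download` preserves the repo's subfolder, so the file may need to be moved into `~/.agentvox/models/` afterwards):

```python
from pathlib import Path
from huggingface_hub import hf_hub_download

# Downloads into local_dir, preserving the "gemma-3-12b-it-GGUF/" subfolder;
# move the .gguf file up into ~/.agentvox/models/ if agentvox expects it there.
path = hf_hub_download(
    repo_id="tgisaturday/Docsray",
    filename="gemma-3-12b-it-GGUF/gemma-3-12b-it-Q4_K_M.gguf",
    local_dir=Path.home() / ".agentvox" / "models",
)
print(path)
```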
- Use a smaller STT model: `--stt-model tiny` or `--stt-model base`
- Limit LLM response length: `--llm-max-tokens 256`
- Reduce beam size: `--stt-beam-size 3`
- macOS: Automatic MPS support (`--device mps`)
- NVIDIA GPU: CUDA support (`--device cuda`)
- AMD GPU: Requires PyTorch with ROCm support
Developed by MimicLab at Sogang University and STOIC Entertainment, a leader in XR and AI-driven interactive content.
This project is licensed under the MIT License - see the LICENSE file for details.
This project uses several third-party libraries:
- RealtimeSTT: MIT License (for real-time speech-to-text)
- RealtimeTTS: MIT License (for real-time text-to-speech)
- coqui-tts: Mozilla Public License 2.0 (used by RealtimeTTS for voice synthesis)
- faster-whisper: MIT License (used by RealtimeSTT for speech recognition)
- llama-cpp-python: MIT License (for LLM inference)
- Gemma Model: Check the model provider's license terms
For complete third-party license information, see THIRD_PARTY_LICENSES.md.
Note on copyleft licenses:
- The coqui-tts library (MPL-2.0) is used as a dependency through RealtimeTTS. The MPL-2.0 license only requires that modifications to coqui-tts itself be shared, not your application code.
- The MPL-2.0 license of coqui-tts does not affect the MIT licensing of this project's source code.
Issues and Pull Requests are always welcome!
# Clone repository
git clone https://github.com/yourusername/agentvox.git
cd agentvox
# Install in development mode
pip install -e .
# Run tests
python -m pytest tests/
AgentVox supports multiple languages through both STT and TTS:
Set the language with `--stt-language`:
- Korean: ko (default)
- English: en
- Japanese: ja
- Chinese: zh
- Spanish: es
- French: fr
- German: de
- And many more...
The Coqui XTTS v2 model supports multiple languages automatically. For best results:
- Use voice cloning with a native speaker's voice sample
- The model will automatically detect and use the appropriate language
Example:
# English conversation with cloned voice
agentvox --stt-language en --speaker-wav english_voice.wav
# Japanese conversation
agentvox --stt-language ja --speaker-wav japanese_voice.wav