A fully local speech-to-speech AI pipeline combining real-time speech recognition, LLM reasoning, tool augmentation, and low-latency text-to-speech—built to run entirely on-device for faster, private, and extensible interactions.
User Speech → STT → LLM (Tool-Augmented) → TTS → Audio Response
- STT: Converts audio into text (RealtimeSTT)
- LLM + Tools: Text input is processed by a local LLM (in
models/
) which can invoke external tools (fromtools/
) - TTS: Streams the LLM response as audio (RealtimeTTS)
# Install uv
pip install uv
git clone https://github.com/ThePickleGawd/realtime-speech-agents.git
cd realtime-speech-agents
uv sync
uv run models/V1.py
V1.py
,V2.py
,V3.py
: Variants of the core speech agent models.- Includes LLM orchestration logic, response synthesis, and tool invocation.
- See paper for more details
Speak into the mic — your agent will respond in real time.