drk-m-s/on_device_asr-llm-tts

Real-Time Voice Conversation System (ASR + LLM + TTS)

Voice Conversation Pipeline (ASR → LLM → TTS): a low-latency, interruption-aware voice assistant pipeline.

Workflow:

Mic → RealtimeSTT → Prompt Build → llama.cpp /completion (SSE) → Token Buffer → Piper TTS → Speaker
          ^                                                           |
          | (pause during prefill)                                    v
    User speech interrupts  <----  Immediate TTS stop & queue flush (barge-in)
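
To keep first-audio latency low, tokens streamed from the LLM are buffered and flushed to TTS at sentence boundaries rather than word by word. A minimal Python sketch of that buffering idea (the speak callable and the punctuation set are illustrative assumptions, not the repo's actual code):

    import re

    SENTENCE_END = re.compile(r'[.!?。！？]')  # flush to TTS at these boundaries

    def stream_to_tts(token_iter, speak):
        """Buffer LLM tokens; hand complete sentences to the TTS engine."""
        buf = ""
        for tok in token_iter:
            buf += tok
            if SENTENCE_END.search(tok):   # a sentence just completed
                speak(buf.strip())
                buf = ""
        if buf.strip():                    # flush any trailing fragment
            speak(buf.strip())

    # usage, with print standing in for Piper:
    stream_to_tts(["Hi", " there", ".", " How are you", "?"], print)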

[ ] ASR: interruption, latency, and echo cancellation

[ ] LLM: a fine-tuned model with customized vocabulary style

[ ] TTS: voice customization

  • use OpenVoice to clone the assigned voice's tone color onto a dataset.
  • use PiperVoice to train one; a GPU-intensive task.

llm models

GGUFs

  • Llama-3.2-3B-Instruct-IQ3_M.gguf (t'was gud.)

  • llama_3_Base_adult.Q2_K.gguf

  • gemma-3-270m-it-F16.gguf

  • LFM2-1.2B-F16.gguf

  • TinyLlama-1.1B-Chat-v1.0-finetune.f16.gguf

  • TinyLlama-1.1B-Chat-v1.0.Q8_0.gguf

  • TinyLlama-1.1B-Chat-v1.0-finetune.Q2_K.gguf

  • TinyLlama-1.1B-Chat-v1.0-finetune.Q6_K.gguf

  • SmallThinker-3B-Preview.Q2_K.gguf

  • SmallThinker-3B-Preview.Q8_0.gguf

  • minichat-3b.q2_k.gguf

  • minichat-3b.q8_0.gguf

  • Qwen2.5_Uncensored_V2_Sexting.gguf (t'was gud.)

  • nsfw-3b-q4_k_m.gguf

  • NSFW-Ameba-3.2-1B.f16.gguf

  • NSFW_13B_sft.Q2_K.gguf

safetensors

  • Novaciano/SEX_ROLEPLAY-3.2-1B [gguf'd]
  • Novaciano/SENTIMENTAL_SEX-3.2-1B [gguf'd]


tts

models

  • zh_CN-huayan-medium.onnx / zh_CN-huayan-medium.onnx.json
  • en_US-hfc_female-medium.onnx / en_US-hfc_female-medium.onnx.json

  • Piper ONNX voice model (e.g. en_US-hfc_female-medium.onnx together with en_US-hfc_female-medium.onnx.json) placed in the repo root (gitignored)
    • e.g. download the model from https://huggingface.co/csukuangfj/vits-piper-en_US-hfc_female-medium.

how to customize your own voice from recordings via vits-piper

https://ssamjh.nz/create-custom-piper-tts-voice/

some collections of Piper voices where you can download the corresponding onnx-json pair by clicking download:

https://rhasspy.github.io/piper-samples/#en_GB-southern_english_female-low
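
Alternatively, the onnx/json pair can be fetched programmatically; a sketch using huggingface_hub (the file paths are assumptions based on the rhasspy/piper-voices repo layout; adjust if needed):

    from huggingface_hub import hf_hub_download

    # Assumed layout of the rhasspy/piper-voices repo.
    for fname in [
        "en/en_US/hfc_female/medium/en_US-hfc_female-medium.onnx",
        "en/en_US/hfc_female/medium/en_US-hfc_female-medium.onnx.json",
    ]:
        path = hf_hub_download(repo_id="rhasspy/piper-voices", filename=fname)
        print("downloaded to", path)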

Installation

pip packages

pip install -r requirements.txt
  • for macOS, if python==3.10, please brew install portaudio before executing the above so that pyaudio can be installed. Note that pyaudio here only serves RealtimeSTT, not this project directly, which is why sounddevice is introduced for playback.

llama.cpp

Ensure you have:

  • llama-server (from llama.cpp) running locally
    • e.g. 1

      ./llama-server --model ./models/your-model.Q4_K_M.gguf --host 0.0.0.0 --port 8080 \
        --ctx-size 4096 --parallel 2 --no-mmap

      Tune the args for your hardware (quantization, threads, ctx-size). For the fastest first token, prefer a smaller / more heavily quantized model.

    • e.g. 2

    llama-server -m Llama-3.2-3B-Instruct-IQ3_M.gguf

    The server can be verified by calling

    curl http://localhost:8080/completion -d '{
      "prompt": "Your prompt here",
      "n_predict": 128
    }'

    If you get a result like

    {"index":0,"content":" It's great to be here. I'm so excited ...

    then the server is working.
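
The pipeline itself consumes /completion as a stream rather than one blob. A minimal Python sketch of the SSE token streaming (the prompt is illustrative):

    import json

    import requests

    # With "stream": true, llama.cpp's /completion emits SSE lines of the
    # form `data: {...}` carrying incremental "content" tokens.
    resp = requests.post(
        "http://localhost:8080/completion",
        json={"prompt": "Your prompt here", "n_predict": 128, "stream": True},
        stream=True,
    )
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue
        chunk = json.loads(line[len(b"data: "):])
        print(chunk.get("content", ""), end="", flush=True)
        if chunk.get("stop"):
            break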

voiceprint

the voiceprint/ folder illustrates voiceprint registration from a human speaker (via recording) vs. from the onnx model, and the comparison between the two.

to run them, you need:

pip install resemblyzer librosa numpy
pip install soundfile
  • record_voice.py records a human reading a passage and saves the wav file.
  • tts_generate.py has the onnx model synthesize its wav file from the piece of text.
  • voiceprint_compare.py compares the voiceprints between wav files.
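
For reference, the core of such a comparison with resemblyzer boils down to embedding both wav files and taking a cosine similarity; a sketch, assuming two local wav files (not necessarily identical to voiceprint_compare.py):

    import numpy as np
    from resemblyzer import VoiceEncoder, preprocess_wav

    encoder = VoiceEncoder()

    # Embed both recordings into fixed-size speaker embeddings.
    emb_human = encoder.embed_utterance(preprocess_wav("human.wav"))
    emb_tts = encoder.embed_utterance(preprocess_wav("tts.wav"))

    # resemblyzer embeddings are L2-normalized, so a dot product is cosine similarity.
    print(f"voiceprint similarity: {float(np.dot(emb_human, emb_tts)):.3f}")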

vanilla_PiperVoice/

run:

python asr_llm_tts.py --llm-url http://localhost:8080 --tts-model en_US-hfc_female-medium.onnx
  • for macOS:

    pip install "httpx[http2]"
    pip install sounddevice

Make sure the installed numpy version is below 2.0.
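
One way to pin this with pip:

    pip install "numpy<2"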

Sometimes the dialogue stops because the LLM part is not functioning properly.


Let us disentangle asr_llm_tts.py's code structure.

  • Initializes ASR, LLM, and TTS.
  • Creates an audio playback worker thread that continuously monitors an audio queue
    • interrupt_tts : Event to signal TTS interruption
    • user_speaking : Event indicating user is currently speaking
    • ai_should_be_quiet : Event to prevent AI from speaking
    • stream_generation : Counter to cancel stale responses (see the sketch after this list)
  • ASR callback registration
    • _on_recording_start : Triggered when user starts speaking
    • _on_recording_stop : Triggered when user stops speaking
    • _on_transcription_start : Triggered when transcription begins
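
The stream_generation counter makes cancellation cheap: each response snapshots the counter when it starts and bails out as soon as the value changes. A minimal sketch (enqueue_for_tts is a hypothetical stand-in for the repo's TTS queue):

    stream_generation = 0  # bumped by _on_recording_start when the user barges in

    def enqueue_for_tts(token):            # hypothetical stand-in for the TTS queue
        print(token, end="", flush=True)

    def stream_llm_response(tokens):
        my_generation = stream_generation  # snapshot at response start
        for tok in tokens:
            if my_generation != stream_generation:
                return                     # a newer generation exists: response is stale
            enqueue_for_tts(tok)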

When the user starts speaking while AI is talking:

Voice Activity Detection (VAD)

    1. Immediate Detection : _on_recording_start callback fires instantly
    2. Generation Increment : Increments stream_generation counter, making any ongoing LLM/TTS responses "stale"
    3. Flag Setting : Sets multiple interruption flags:
    • user_speaking.set() : Indicates user is actively speaking
    • ai_should_be_quiet.set() : Prevents new AI speech
    4. Emergency Stop : Calls interrupt_tts_immediately which sets the interrupt_tts event
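
A condensed sketch of this sequence using threading.Event (mirrors the flow described above; abbreviated, not verbatim repo code):

    import threading

    interrupt_tts = threading.Event()
    user_speaking = threading.Event()
    ai_should_be_quiet = threading.Event()
    stream_generation = 0

    def interrupt_tts_immediately():
        interrupt_tts.set()            # the playback worker reacts to this flag

    def _on_recording_start():
        """VAD fired: the user started speaking while the AI may be talking."""
        global stream_generation
        stream_generation += 1         # 2. mark in-flight LLM/TTS output as stale
        user_speaking.set()            # 3. user is actively speaking
        ai_should_be_quiet.set()       #    prevent new AI speech
        interrupt_tts_immediately()    # 4. emergency-stop current playback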

Audio Interruption

The audio_playback_worker continuously monitors interruption flags:

  1. Queue Clearing : When interruption is detected, immediately clears all pending audio chunks from the queue
  2. Audio Stop : On macOS, calls sd.stop() to halt current audio playback
  3. Chunk Skipping : Any remaining audio chunks are discarded without playing
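
Put together, the worker loop looks roughly like this (a sketch assuming a queue.Queue of (samples, samplerate) chunks and sounddevice playback; not verbatim repo code):

    import queue
    import threading

    import sounddevice as sd

    interrupt_tts = threading.Event()  # same event the barge-in handler sets
    audio_queue = queue.Queue()        # holds (samples, samplerate) chunks

    def audio_playback_worker():
        while True:
            samples, samplerate = audio_queue.get()
            if interrupt_tts.is_set():          # barge-in detected:
                sd.stop()                       # 2. halt current playback
                while True:                     # 1. flush all pending chunks
                    try:
                        audio_queue.get_nowait()
                    except queue.Empty:
                        break
                interrupt_tts.clear()
                continue                        # 3. skip this chunk without playing
            sd.play(samples, samplerate)
            sd.wait()                           # block until this chunk finishes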

Speech Recognition

  1. Transcription : User's speech is transcribed in real-time
  2. Processing : When user stops speaking, _on_recording_stop clears interruption flags
  3. Callback : Recognized text is passed to process_speech_input

voiceprint/

The user's voice is pre-recorded in the voiceprint/ dir. The system can then measure how similar incoming audio is to the pre-recorded sample wav file.

livekit_appended/: LiveKit + WebRTC AEC

LiveKit is built on top of WebRTC, and WebRTC has battle-tested Acoustic Echo Cancellation (AEC)

  • rewrite asr_llm_tts.py on the foundation of WebRTC and LiveKit.
  • then apply AEC.

temp/: liveKit-based

The main idea is to use LiveKit to wrap up all the ASR-LLM-TTS stuff and let gpt-5 and claude-4o establish a system where:

  • the user gets responses from the chatbot

  • the user can interrupt

  • the robot never falls into the loop of answering its own output.

But it seems that no one has fixed these problems (especially the conversation-loop one) publicly. (We stand to be corrected.)

  • https://chatgpt.com/share/68b7e7a6-6cb0-800e-9af5-3402967aecb2

  • where

      go build -v -o livekit-server ./cmd/livekit-server

    is changed to

      go build -v -o livekit-server ./cmd/server

https://chatgpt.com/share/68b7effc-2580-800e-bb99-c67afb1ca8f0
