Real-time audio transcription and translation using Whisper and multilingual models (SeamlessM4T / NLLB‑200)
🌐 Languages: English | Español
Marvin4000 captures, transcribes, and translates system audio in real-time using local hardware.
⚠️ IMPORTANT:
- If you're on Windows, audio capture must be implemented manually using an alternative to `parec` that provides system audio data in `float32` format.
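For reference, on Linux this stream comes from `parec`; a Windows replacement only needs to hand the pipeline the same kind of buffer. The sketch below is illustrative rather than the project's actual capture code (sample rate, chunk size, and the device name are assumptions):

```python
# Hedged sketch: how the Linux capture path can feed float32 audio via parec.
# A Windows alternative only needs to deliver the same stream shape.
# Assumptions (not taken from the repo): 16 kHz mono, 30 ms chunks.
import subprocess
import numpy as np

DEVICE = "alsa_output.pci-0000_00_1f.3.analog-stereo.monitor"  # from detect_audio_devices.py
RATE, CHUNK_SEC = 16000, 0.03

proc = subprocess.Popen(
    ["parec", "-d", DEVICE, "--format=float32le",
     f"--rate={RATE}", "--channels=1"],
    stdout=subprocess.PIPE,
)

bytes_per_chunk = int(RATE * CHUNK_SEC) * 4  # 4 bytes per float32 sample
while True:
    raw = proc.stdout.read(bytes_per_chunk)
    if len(raw) < bytes_per_chunk:  # stream ended
        break
    samples = np.frombuffer(raw, dtype=np.float32)  # what the ASR buffer consumes
    # ... push `samples` into the audio queue / circular buffer here
```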
| GPU & Models Used | Latency (s) | WER | BLEU-1/4/Corpus | VRAM |
|---|---|---|---|---|
| RTX 4060 Ti 16GB · whisper-large-v3-turbo + nllb-200-3.3B | 2-3 | 6 % | 75/38/54 | 14.2 GB |
| RTX 4060 Ti 16GB · whisper-large-v3-turbo + seamless-m4t-v2-large | 2-3 | 6 % | 74/39/52 | 11.4 GB |
- Audio: 25 random audiobook fragments from LibriSpeech (avg: 5 min/fragment)
- Reference Transcription: Official LibriSpeech transcriptions
- Reference Translation: Generated with Claude & GPT and manually reviewed (English → Spanish)
- Total Evaluated: ~120 minutes of audio
- WER: Calculated with jiwer, normalized for punctuation
- BLEU: Corpus-level implementation with lowercase tokenization, n-gram clipping and brevity penalty
- BLEU-1/4/Corpus: 1-gram / 4-gram precision / full corpus score
- Latency: Measured under real conditions with RTX 4060 Ti 16GB and RTX 2060 6GB
While reference translations are high quality, we acknowledge they are not equivalent to professional human translations. However, they provide a consistent standard for comparing system performance, following methodologies similar to those employed in evaluations like FLEURS and CoVoST 2.
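As a rough guide, the WER and BLEU figures above can be reproduced along these lines. This is only a sketch, not the project's evaluation script; the regex-based punctuation normalization and the use of `sacrebleu` for the corpus score are assumptions:

```python
# Hedged sketch of the metric computation described above (not the actual
# evaluation script used for the table).
import re
import jiwer
import sacrebleu

def normalize(text: str) -> str:
    """Lowercase and strip punctuation before WER scoring."""
    return re.sub(r"[^\w\s]", " ", text.lower()).strip()

def score(refs_asr, hyps_asr, refs_nmt, hyps_nmt):
    """refs/hyps are lists of strings, one entry per audio fragment."""
    wer = jiwer.wer([normalize(r) for r in refs_asr],
                    [normalize(h) for h in hyps_asr])
    bleu = sacrebleu.corpus_bleu([h.lower() for h in hyps_nmt],
                                 [[r.lower() for r in refs_nmt]])
    return wer, bleu.score
```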
sudo apt install python3-pip pulseaudio-utils ffmpeg
git clone https://github.com/XOREngine/marvin4000.git
cd marvin4000
pip install -r requirements.txt
# 1. Play some audio content on your system
vlc example_video.mp4
# or: ffplay -nodisp -autoexit -ss 1 example.mp3
# or play audio from browser, etc.
# 2. Detect valid audio devices
python detect_audio_devices.py
# Example output:
# $ python marvin4000_seam.py --audio-device "alsa_output.pci-0000_00_1f.3.analog-stereo.monitor"
# 3. Start transcription/translation with appropriate monitor device
python marvin4000_seam.py --audio-device "alsa_output.pci-0000_00_1f.3.analog-stereo.monitor"
python marvin4000_nllb.py --audio-device "alsa_output.pci-0000_00_1f.3.analog-stereo.monitor" --asr-lang "de" --nmt-source "deu_Latn" --nmt-target "spa_Latn"
Marvin4000 uses Whisper for transcription and SeamlessM4T / NLLB‑200 for translation between 100+ languages, supporting real-time multilingual applications.
- Threading Separation: Audio capture | ASR | NMT. 68% latency reduction
- Int8 Quantization: int8 loading of both models via bitsandbytes
- Intelligent VAD: WebRTC + conservative segmentation (1.2s minimum silence) + linguistic validation
- Memory Efficient: Circular buffer + translation cache (0.95 similarity; sketched below)
- Hybrid Latency: Progressive partials (2-3 s perceived) with explicit `attention_mask` for enhanced ASR control
- Adaptive Segmentation: Avoids <0.5 s fragments, 2.5 s minimum cuts
- Forced Decoding: `forced_decoder_ids` indicate language and task to Whisper, improving transcription accuracy
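The translation cache mentioned above can be pictured roughly as follows. This is a sketch of the idea only; using `difflib.SequenceMatcher` for the 0.95 similarity check is an assumption, not necessarily how the repository computes it:

```python
# Hedged sketch of a similarity-gated translation cache (illustrative only).
from difflib import SequenceMatcher

REUSE_THRESHOLD = 0.95  # matches the feature description above

class TranslationCache:
    def __init__(self) -> None:
        self._entries: list[tuple[str, str]] = []  # (source, translation)

    def lookup(self, source: str) -> str | None:
        for cached_src, cached_tgt in self._entries:
            if SequenceMatcher(None, source, cached_src).ratio() >= REUSE_THRESHOLD:
                return cached_tgt  # near-duplicate segment: skip NMT inference
        return None

    def store(self, source: str, translation: str) -> None:
        self._entries.append((source, translation))
```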
Note: If you experience too much latency, you can reduce `num_beams` or shorten `max_new_tokens`. This will make inference faster at the cost of a slight quality loss.
Segmentation and Flow:
TIMEOUT_SEC = 12.0 # Maximum time without flush
MIN_SEGMENT_SEC = 0.5 # Minimum accepted segment duration
MIN_PARTIAL_WORDS = 5 # Minimum words to show partial
REUSE_THRESHOLD = 0.95 # Similarity threshold for cache
SILENCE_SEC = 0.8 # Silence required for segmentation
VAD_SILENCE_DURATION_SEC = 1.2
MIN_CUT_DURATION_SEC = 2.5
AUDIO_RMS_THRESHOLD = 0.0025 # Minimum accepted volume level
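A minimal sketch of how these constants could gate segmentation, assuming 30 ms frames fed to WebRTC VAD after an RMS pre-filter (illustrative; the actual flush logic in the code may differ):

```python
# Hedged sketch of RMS + WebRTC VAD gating using the constants above.
import numpy as np
import webrtcvad

vad = webrtcvad.Vad(2)   # aggressiveness 0-3; 2 is a moderate setting
RATE = 16000
FRAME_MS = 30            # WebRTC VAD accepts 10/20/30 ms frames

def frame_is_speech(frame_f32: np.ndarray) -> bool:
    """Gate a 30 ms float32 frame first by RMS level, then by WebRTC VAD."""
    if np.sqrt(np.mean(frame_f32 ** 2)) < AUDIO_RMS_THRESHOLD:
        return False
    pcm16 = (np.clip(frame_f32, -1.0, 1.0) * 32767).astype(np.int16)
    return vad.is_speech(pcm16.tobytes(), RATE)

def should_cut(silence_sec: float, segment_sec: float) -> bool:
    """Cut only after enough silence and only if the segment is long enough."""
    return (silence_sec >= VAD_SILENCE_DURATION_SEC
            and segment_sec >= MIN_CUT_DURATION_SEC)
```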
ASR Inference (Whisper):
gen = self.asr.generate(
feats,
attention_mask=attn,
forced_decoder_ids=forced,
max_length=448,
num_beams=3,
early_stopping=True,
temperature=0.0,
repetition_penalty=1.1,
no_repeat_ngram_size=3,
return_timestamps=False,
use_cache=True,
)
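For context, `feats`, `attn`, and `forced` might be prepared roughly as below. The model id, the int8 loading via `BitsAndBytesConfig`, and the processor calls are assumptions, not excerpts from the repository:

```python
# Hedged sketch of ASR input preparation (illustrative, not project code).
from transformers import (BitsAndBytesConfig, WhisperForConditionalGeneration,
                          WhisperProcessor)

model_id = "openai/whisper-large-v3-turbo"
processor = WhisperProcessor.from_pretrained(model_id)
asr = WhisperForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # int8 weights
    device_map="auto",
)

def prepare_asr_inputs(audio_f32, language="en"):
    """audio_f32: 16 kHz mono float32 numpy array from the capture buffer."""
    batch = processor(audio_f32, sampling_rate=16000,
                      return_tensors="pt", return_attention_mask=True)
    feats = batch.input_features.to(asr.device)
    attn = batch.attention_mask.to(asr.device)
    # Forces language and task tokens so Whisper does not have to guess them
    forced = processor.get_decoder_prompt_ids(language=language, task="transcribe")
    return feats, attn, forced
```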
NMT Inference (NLLB-200):
generated_tokens = self.nmt_model.generate(
**inputs,
forced_bos_token_id=forced_bos_token_id,
max_length=120,
min_length=8,
num_beams=4,
do_sample=False,
repetition_penalty=1.1,
no_repeat_ngram_size=2,
early_stopping=True,
)
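Similarly, `inputs` and `forced_bos_token_id` could be built as in this sketch (model id, language codes, and quantization settings are illustrative):

```python
# Hedged sketch of NLLB-200 input preparation (illustrative, not project code).
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, BitsAndBytesConfig

nmt_id = "facebook/nllb-200-3.3B"
tokenizer = AutoTokenizer.from_pretrained(nmt_id, src_lang="eng_Latn")
nmt_model = AutoModelForSeq2SeqLM.from_pretrained(
    nmt_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)

text = "This is a partial segment coming out of the ASR stage."
inputs = tokenizer(text, return_tensors="pt").to(nmt_model.device)
# NLLB expects the target language code as the first generated token
forced_bos_token_id = tokenizer.convert_tokens_to_ids("spa_Latn")
```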
For GPUs with >20GB VRAM (RTX 4090, A40, A100), CUDA streams can be implemented for ASR/NMT parallelization:
# Suggested modifications for high-end hardware:
asr_lock = threading.Lock() # Instead of shared gpu_lock
nmt_lock = threading.Lock() # Independent locks
stream_asr = torch.cuda.Stream()
stream_nmt = torch.cuda.Stream()
# Estimated potential improvement: +15-25% throughput
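One way those streams could wrap the two `generate` calls, assuming both models share one GPU and each keeps its own lock (a sketch, not a tested configuration):

```python
# Hedged sketch: overlapping ASR and NMT inference on separate CUDA streams.
# Assumes both models fit in VRAM at once (hence the >20GB requirement).
import threading
import torch

asr_lock, nmt_lock = threading.Lock(), threading.Lock()
stream_asr, stream_nmt = torch.cuda.Stream(), torch.cuda.Stream()

def run_asr(asr_model, feats, **gen_kwargs):
    with asr_lock, torch.cuda.stream(stream_asr):
        out = asr_model.generate(feats, **gen_kwargs)
    stream_asr.synchronize()  # wait only for the ASR stream, not for NMT
    return out

def run_nmt(nmt_model, inputs, **gen_kwargs):
    with nmt_lock, torch.cuda.stream(stream_nmt):
        out = nmt_model.generate(**inputs, **gen_kwargs)
    stream_nmt.synchronize()
    return out
```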
- Marvin4000 Code: MIT
- Whisper: MIT (OpenAI)
- SeamlessM4T: CC-BY-NC 4.0 (Meta AI)
- NLLB-200: CC-BY-NC 4.0 (Meta AI)
- ggerganov/whisper.cpp – real-time execution
- TimDettmers/bitsandbytes – quantization
- guillaumekln/faster-whisper – efficient buffering
- snakers4/silero-vad – optimized VAD
- Whisper: Robust Speech Recognition via Large-Scale Weak Supervision
- SeamlessM4T: Massively Multilingual & Multimodal Machine Translation
- NLLB-200: No Language Left Behind
- Efficient Low-Bit Quantization of Transformer-Based Language Models
This project is designed as a flexible foundation. If you want to modify it, use it creatively, improve it, or simply adapt it to your needs...
💪 Go for it.
If you also share improvements or mention us as a reference, it will always be welcome 🙌😜.
© XOREngine · Open source commitment