Real-time audio transcription and translation using Whisper and multilingual models (SeamlessM4T / NLLB‑200)
🌐 Languages: English | Español
Marvin4000 captures, transcribes, and translates system audio in real-time using local hardware.
⚠️ IMPORTANT:
- If you're on Windows, audio capture must be implemented manually using an alternative to `parec` that provides system audio data in `float32` format.
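For reference, on Linux this stream comes from `parec`; a Windows replacement only needs to hand the pipeline the same kind of buffer. The sketch below is illustrative rather than the project's actual capture code (sample rate, chunk size, and the device name are assumptions):

```python
# Hedged sketch: how the Linux capture path can feed float32 audio via parec.
# A Windows alternative only needs to deliver the same stream shape.
# Assumptions (not taken from the repo): 16 kHz mono, 30 ms chunks.
import subprocess
import numpy as np

DEVICE = "alsa_output.pci-0000_00_1f.3.analog-stereo.monitor"  # from detect_audio_devices.py
RATE, CHUNK_SEC = 16000, 0.03

proc = subprocess.Popen(
    ["parec", "-d", DEVICE, "--format=float32le",
     f"--rate={RATE}", "--channels=1"],
    stdout=subprocess.PIPE,
)

bytes_per_chunk = int(RATE * CHUNK_SEC) * 4  # 4 bytes per float32 sample
while True:
    raw = proc.stdout.read(bytes_per_chunk)
    if len(raw) < bytes_per_chunk:  # stream ended
        break
    samples = np.frombuffer(raw, dtype=np.float32)  # what the ASR buffer consumes
    # ... push `samples` into the audio queue / circular buffer here
```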
| GPU & Models Used | Latency (s) | WER | BLEU-1/4/Corpus | VRAM |
|---|---|---|---|---|
| RTX 4060 Ti 16GB · whisper-large-v3-turbo + nllb-200-3.3B | 2-3 | 6 % | 75/38/54 | 14.2 GB |
| RTX 4060 Ti 16GB · whisper-large-v3-turbo + seamless-m4t-v2-large | 2-3 | 6 % | 74/39/52 | 11.4 GB |
- Audio: 25 random audiobook fragments from LibriSpeech (avg: 5 min/fragment)
- Reference Transcription: Official LibriSpeech transcriptions
- Reference Translation: Generated with Claude & GPT and manually reviewed (English → Spanish)
- Total Evaluated: ~120 minutes of audio
- WER: Calculated with jiwer, normalized for punctuation
- BLEU: Corpus-level implementation with lowercase tokenization, n-gram clipping and brevity penalty
- BLEU-1/4/Corpus: 1-gram / 4-gram precision / full corpus score
- Latency: Measured under real conditions with RTX 4060 Ti 16GB and RTX 2060 6GB
While reference translations are high quality, we acknowledge they are not equivalent to professional human translations. However, they provide a consistent standard for comparing system performance, following methodologies similar to those employed in evaluations like FLEURS and CoVoST 2.
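As a rough guide, the WER and BLEU figures above can be reproduced along these lines. This is only a sketch, not the project's evaluation script; the regex-based punctuation normalization and the use of `sacrebleu` for the corpus score are assumptions:

```python
# Hedged sketch of the metric computation described above (not the actual
# evaluation script used for the table).
import re
import jiwer
import sacrebleu

def normalize(text: str) -> str:
    """Lowercase and strip punctuation before WER scoring."""
    return re.sub(r"[^\w\s]", " ", text.lower()).strip()

def score(refs_asr, hyps_asr, refs_nmt, hyps_nmt):
    """refs/hyps are lists of strings, one entry per audio fragment."""
    wer = jiwer.wer([normalize(r) for r in refs_asr],
                    [normalize(h) for h in hyps_asr])
    bleu = sacrebleu.corpus_bleu([h.lower() for h in hyps_nmt],
                                 [[r.lower() for r in refs_nmt]])
    return wer, bleu.score
```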
sudo apt install python3-pip pulseaudio-utils ffmpeg
git clone https://github.com/XOREngine/marvin4000.git
cd marvin4000
pip install -r requirements.txt
# 1. Play some audio content on your system
vlc example_video.mp4
# or: ffplay -nodisp -autoexit -ss 1 example.mp3
# or play audio from browser, etc.
# 2. Detect valid audio devices
python detect_audio_devices.py
# Example output:
# $ python marvin4000_seam.py --audio-device "alsa_output.pci-0000_00_1f.3.analog-stereo.monitor"
# 3. Start transcription/translation with appropriate monitor device
python marvin4000_seam.py --audio-device "alsa_output.pci-0000_00_1f.3.analog-stereo.monitor"
python marvin4000_nllb.py --audio-device "alsa_output.pci-0000_00_1f.3.analog-stereo.monitor" --asr-lang "de" --nmt-source "deu_Latn" --nmt-target "spa_Latn"
Marvin4000 uses Whisper for transcription and SeamlessM4T / NLLB‑200 for translation between 100+ languages, supporting real-time multilingual applications.
- Threading Separation: Audio capture | ASR | NMT. 68% latency reduction
- Int8 Quantization: int8 loading of both models via bitsandbytes
- Intelligent VAD: WebRTC + conservative segmentation (1.2s minimum silence) + linguistic validation
- Memory Efficient: Circular buffer + translation cache (0.95 similarity; sketched below)
- Hybrid Latency: Progressive partials (2-3 s perceived) with explicit `attention_mask` for enhanced ASR control
- Adaptive Segmentation: Avoids <0.5 s fragments, 2.5 s minimum cuts
- Forced Decoding: `forced_decoder_ids` indicate language and task to Whisper, improving transcription accuracy
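The translation cache mentioned above can be pictured roughly as follows. This is a sketch of the idea only; using `difflib.SequenceMatcher` for the 0.95 similarity check is an assumption, not necessarily how the repository computes it:

```python
# Hedged sketch of a similarity-gated translation cache (illustrative only).
from difflib import SequenceMatcher

REUSE_THRESHOLD = 0.95  # matches the feature description above

class TranslationCache:
    def __init__(self) -> None:
        self._entries: list[tuple[str, str]] = []  # (source, translation)

    def lookup(self, source: str) -> str | None:
        for cached_src, cached_tgt in self._entries:
            if SequenceMatcher(None, source, cached_src).ratio() >= REUSE_THRESHOLD:
                return cached_tgt  # near-duplicate segment: skip NMT inference
        return None

    def store(self, source: str, translation: str) -> None:
        self._entries.append((source, translation))
```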
Note: If you experience too much latency, you can reduce `num_beams` or shorten `max_new_tokens`. This will make inference faster at the cost of a slight quality loss.
Segmentation and Flow:
TIMEOUT_SEC = 12.0 # Maximum time without flush
MIN_SEGMENT_SEC = 0.5 # Minimum accepted segment duration
MIN_PARTIAL_WORDS = 5 # Minimum words to show partial
REUSE_THRESHOLD = 0.95 # Similarity threshold for cache
SILENCE_SEC = 0.8 # Silence required for segmentation
VAD_SILENCE_DURATION_SEC = 1.2
MIN_CUT_DURATION_SEC = 2.5
AUDIO_RMS_THRESHOLD = 0.0025 # Minimum accepted volume level
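A minimal sketch of how these constants could gate segmentation, assuming 30 ms frames fed to WebRTC VAD after an RMS pre-filter (illustrative; the actual flush logic in the code may differ):

```python
# Hedged sketch of RMS + WebRTC VAD gating using the constants above.
import numpy as np
import webrtcvad

vad = webrtcvad.Vad(2)   # aggressiveness 0-3; 2 is a moderate setting
RATE = 16000
FRAME_MS = 30            # WebRTC VAD accepts 10/20/30 ms frames

def frame_is_speech(frame_f32: np.ndarray) -> bool:
    """Gate a 30 ms float32 frame first by RMS level, then by WebRTC VAD."""
    if np.sqrt(np.mean(frame_f32 ** 2)) < AUDIO_RMS_THRESHOLD:
        return False
    pcm16 = (np.clip(frame_f32, -1.0, 1.0) * 32767).astype(np.int16)
    return vad.is_speech(pcm16.tobytes(), RATE)

def should_cut(silence_sec: float, segment_sec: float) -> bool:
    """Cut only after enough silence and only if the segment is long enough."""
    return (silence_sec >= VAD_SILENCE_DURATION_SEC
            and segment_sec >= MIN_CUT_DURATION_SEC)
```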
ASR Inference (Whisper):
gen = self.asr.generate(
feats,
attention_mask=attn,
forced_decoder_ids=forced,
max_length=448,
num_beams=3,
early_stopping=True,
temperature=0.0,
repetition_penalty=1.1,
no_repeat_ngram_size=3,
return_timestamps=False,
use_cache=True,
)
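For context, `feats`, `attn`, and `forced` might be prepared roughly as below. The model id, the int8 loading via `BitsAndBytesConfig`, and the processor calls are assumptions, not excerpts from the repository:

```python
# Hedged sketch of ASR input preparation (illustrative, not project code).
from transformers import (BitsAndBytesConfig, WhisperForConditionalGeneration,
                          WhisperProcessor)

model_id = "openai/whisper-large-v3-turbo"
processor = WhisperProcessor.from_pretrained(model_id)
asr = WhisperForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # int8 weights
    device_map="auto",
)

def prepare_asr_inputs(audio_f32, language="en"):
    """audio_f32: 16 kHz mono float32 numpy array from the capture buffer."""
    batch = processor(audio_f32, sampling_rate=16000,
                      return_tensors="pt", return_attention_mask=True)
    feats = batch.input_features.to(asr.device)
    attn = batch.attention_mask.to(asr.device)
    # Forces language and task tokens so Whisper does not have to guess them
    forced = processor.get_decoder_prompt_ids(language=language, task="transcribe")
    return feats, attn, forced
```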
NMT Inference (NLLB-200):
generated_tokens = self.nmt_model.generate(
**inputs,
forced_bos_token_id=forced_bos_token_id,
max_length=120,
min_length=8,
num_beams=4,
do_sample=False,
repetition_penalty=1.1,
no_repeat_ngram_size=2,
early_stopping=True,
)
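Similarly, `inputs` and `forced_bos_token_id` could be built as in this sketch (model id, language codes, and quantization settings are illustrative):

```python
# Hedged sketch of NLLB-200 input preparation (illustrative, not project code).
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, BitsAndBytesConfig

nmt_id = "facebook/nllb-200-3.3B"
tokenizer = AutoTokenizer.from_pretrained(nmt_id, src_lang="eng_Latn")
nmt_model = AutoModelForSeq2SeqLM.from_pretrained(
    nmt_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)

text = "This is a partial segment coming out of the ASR stage."
inputs = tokenizer(text, return_tensors="pt").to(nmt_model.device)
# NLLB expects the target language code as the first generated token
forced_bos_token_id = tokenizer.convert_tokens_to_ids("spa_Latn")
```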
For GPUs with >20GB VRAM (RTX 4090, A40, A100), CUDA streams can be implemented for ASR/NMT parallelization:
# Suggested modifications for high-end hardware:
asr_lock = threading.Lock() # Instead of shared gpu_lock
nmt_lock = threading.Lock() # Independent locks
stream_asr = torch.cuda.Stream()
stream_nmt = torch.cuda.Stream()
# Estimated potential improvement: +15-25% throughput
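One way those streams could wrap the two `generate` calls, assuming both models share one GPU and each keeps its own lock (a sketch, not a tested configuration):

```python
# Hedged sketch: overlapping ASR and NMT inference on separate CUDA streams.
# Assumes both models fit in VRAM at once (hence the >20GB requirement).
import threading
import torch

asr_lock, nmt_lock = threading.Lock(), threading.Lock()
stream_asr, stream_nmt = torch.cuda.Stream(), torch.cuda.Stream()

def run_asr(asr_model, feats, **gen_kwargs):
    with asr_lock, torch.cuda.stream(stream_asr):
        out = asr_model.generate(feats, **gen_kwargs)
    stream_asr.synchronize()  # wait only for the ASR stream, not for NMT
    return out

def run_nmt(nmt_model, inputs, **gen_kwargs):
    with nmt_lock, torch.cuda.stream(stream_nmt):
        out = nmt_model.generate(**inputs, **gen_kwargs)
    stream_nmt.synchronize()
    return out
```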
- Marvin4000 Code: MIT
- Whisper: MIT (OpenAI)
- SeamlessM4T: CC-BY-NC 4.0 (Meta AI)
- NLLB-200: CC-BY-NC 4.0 (Meta AI)
- ggerganov/whisper.cpp – real-time execution
- TimDettmers/bitsandbytes – quantization
- guillaumekln/faster-whisper – efficient buffering
- snakers4/silero-vad – optimized VAD
- Whisper: Robust Speech Recognition via Large-Scale Weak Supervision
- SeamlessM4T: Massively Multilingual & Multimodal Machine Translation
- NLLB-200: No Language Left Behind
- Efficient Low-Bit Quantization of Transformer-Based Language Models
This project is designed as a flexible foundation. If you want to modify it, use it creatively, improve it, or simply adapt it to your needs...
💪 Go for it.
If you also share improvements or mention us as a reference, it will always be welcome 🙌😜.
© XOREngine · Open source commitment