
Marvin4000

Real-time audio transcription and translation using Whisper and multilingual models (SeamlessM4T / NLLB‑200)


🌐 Languages: English | Español


Marvin4000 captures, transcribes, and translates system audio in real-time using local hardware.


⚠️ IMPORTANT:

  • If you're on Windows, audio capture must be manually implemented using an alternative to parec that provides system audio data in float32 format.
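
For example, on Windows a loopback capture can be approximated with the third-party soundcard package (an assumption for illustration; it is not part of this project) to obtain float32 blocks for the same pipeline:

# Hypothetical sketch: Windows system-audio capture as float32 using the
# third-party `soundcard` package (not part of Marvin4000 itself).
import soundcard as sc

SAMPLE_RATE = 16000      # the pipeline expects 16 kHz mono
BLOCK_FRAMES = 1600      # ~100 ms per read

# Open the default output device as a loopback "microphone"
loopback = sc.get_microphone(id=str(sc.default_speaker().name), include_loopback=True)

with loopback.recorder(samplerate=SAMPLE_RATE, channels=1) as rec:
    while True:
        block = rec.record(numframes=BLOCK_FRAMES)   # numpy float32, shape (frames, 1)
        audio = block[:, 0]                          # mono float32 block
        # ...feed `audio` into the same capture buffer that parec would fill...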

📊 Proven Performance

| GPU & Models Used | Latency (s) | WER | BLEU-1/4/Corpus | VRAM |
|---|---|---|---|---|
| RTX 4060 Ti 16GB · whisper-large-v3-turbo + nllb-200-3.3B | 2-3 | 6 % | 75/38/54 | 14.2 GB |
| RTX 4060 Ti 16GB · whisper-large-v3-turbo + seamless-m4t-v2-large | 2-3 | 6 % | 74/39/52 | 11.4 GB |

Test Corpus

  • Audio: 25 random audiobook fragments from LibriSpeech (avg: 5 min/fragment)
  • Reference Transcription: Official LibriSpeech transcriptions
  • Reference Translation: Generated with Claude & GPT and manually reviewed (English → Spanish)
  • Total Evaluated: ~120 minutes of audio

Metrics Calculation

  • WER: Calculated with jiwer, normalized for punctuation (see the sketch after this list)
  • BLEU: Corpus-level implementation with lowercase tokenization, n-gram clipping and brevity penalty
  • BLEU-1/4/Corpus: 1-gram / 4-gram precision / full corpus score
  • Latency: Measured under real conditions with RTX 4060 Ti 16GB and RTX 2060 6GB
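
As a minimal illustration of the WER setup (a sketch only; the exact normalization used for the published numbers may differ), text is lowercased and stripped of punctuation before scoring with jiwer:

# Minimal sketch of the WER calculation (assumes `pip install jiwer`);
# normalization details may differ from the published evaluation.
import string
import jiwer

def normalize(text: str) -> str:
    # lowercase and drop punctuation so only word choice is scored
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

reference = "Hello world, this is an audiobook fragment."
hypothesis = "hello world this is an audio book fragment"

print(f"WER: {jiwer.wer(normalize(reference), normalize(hypothesis)):.2%}")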

Limitations

While reference translations are high quality, we acknowledge they are not equivalent to professional human translations. However, they provide a consistent standard for comparing system performance, following methodologies similar to those employed in evaluations like FLEURS and CoVoST 2.


🚀 Installation and Usage

Requirements

sudo apt install python3-pip pulseaudio-utils ffmpeg
git clone https://github.com/XOREngine/marvin4000.git
cd marvin4000
pip install -r requirements.txt

Basic Execution

# 1. Play some audio content on your system
vlc example_video.mp4
# ffplay -nodisp -autoexit -ss 1 example.mp3
# or play audio from browser, etc.

# 2. Detect valid audio devices
python detect_audio_devices.py
# Example output:
# $ python marvin4000_seam.py --audio-device "alsa_output.pci-0000_00_1f.3.analog-stereo.monitor"

# 3. Start transcription/translation with appropriate monitor device
python marvin4000_seam.py --audio-device "alsa_output.pci-0000_00_1f.3.analog-stereo.monitor"

python marvin4000_nllb.py --audio-device "alsa_output.pci-0000_00_1f.3.analog-stereo.monitor" --asr-lang "de" --nmt-source "deu_Latn" --nmt-target "spa_Latn"

Language Configuration

Marvin4000 uses Whisper for transcription and SeamlessM4T / NLLB‑200 for translation across 100+ languages, supporting real-time multilingual applications.


🔬 Technical Architecture

  • Threading Separation: Audio capture | ASR | NMT run in independent threads (68% latency reduction); see the sketch after this list
  • Int8 Quantization: bitsandbytes int8 loading for the models
  • Intelligent VAD: WebRTC + conservative segmentation (1.2s minimum silence) + linguistic validation
  • Memory Efficient: Circular buffer + translation cache (0.95 similarity)
  • Hybrid Latency: Progressive partials (2-3s perceived) with explicit attention_mask for enhanced ASR control
  • Adaptive Segmentation: Avoids <0.5s fragments, 2.5s minimum cuts
  • Forced Decoding: Use of forced_decoder_ids to indicate language and task to Whisper, improving transcription accuracy
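
The capture | ASR | NMT separation above boils down to worker threads connected by bounded queues; a hypothetical sketch of the pattern (names are illustrative and do not match the actual source):

# Hypothetical sketch of the threading separation; names are illustrative.
import queue
import threading

audio_q = queue.Queue(maxsize=32)   # float32 blocks from the capture thread
text_q = queue.Queue(maxsize=32)    # transcribed segments awaiting translation

def asr_worker():
    while True:
        segment = audio_q.get()
        # placeholder for Whisper inference on `segment`
        text_q.put(f"transcript ({len(segment)} samples)")

def nmt_worker():
    while True:
        text = text_q.get()
        # placeholder for SeamlessM4T / NLLB inference on `text`
        print("translation of:", text)

threading.Thread(target=asr_worker, daemon=True).start()
threading.Thread(target=nmt_worker, daemon=True).start()

# The capture thread (parec or a loopback recorder) keeps audio_q fed:
# audio_q.put(float32_block)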

Adjustable Configuration Parameters

Note: If you experience too much latency, you can reduce num_beams or lower the maximum output length (max_length / max_new_tokens). This makes inference faster at the cost of a slight quality loss.

Segmentation and Flow:

TIMEOUT_SEC = 12.0           # Maximum time without flush
MIN_SEGMENT_SEC = 0.5        # Minimum accepted segment duration
MIN_PARTIAL_WORDS = 5        # Minimum words to show partial
REUSE_THRESHOLD = 0.95       # Similarity threshold for cache
SILENCE_SEC = 0.8            # Silence required for segmentation
VAD_SILENCE_DURATION_SEC = 1.2
MIN_CUT_DURATION_SEC = 2.5
AUDIO_RMS_THRESHOLD = 0.0025 # Minimum accepted volume level

ASR Inference (Whisper):

gen = self.asr.generate(
    feats,
    attention_mask=attn,
    forced_decoder_ids=forced,
    max_length=448,
    num_beams=3,
    early_stopping=True,
    temperature=0.0,
    repetition_penalty=1.1,
    no_repeat_ngram_size=3,
    return_timestamps=False,
    use_cache=True,
)
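
The feats, attn, and forced values above can be prepared with the Hugging Face WhisperProcessor; a sketch of one way to do it (audio_segment stands for the buffered float32 waveform, and details may differ from the actual implementation):

# Sketch: preparing the inputs for the generate() call above
# (`audio_segment` is the buffered float32 waveform; illustrative only).
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-large-v3-turbo")

inputs = processor(
    audio_segment,               # float32 mono waveform at 16 kHz
    sampling_rate=16000,
    return_tensors="pt",
    return_attention_mask=True,  # explicit attention_mask for enhanced ASR control
)
feats = inputs.input_features.to("cuda")
attn = inputs.attention_mask.to("cuda")

# Pin language and task so Whisper does not re-detect them on every segment
forced = processor.get_decoder_prompt_ids(language="en", task="transcribe")

# After generation, `gen` decodes back to text:
text = processor.batch_decode(gen, skip_special_tokens=True)[0]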

NMT Inference (NLLB-200):

generated_tokens = self.nmt_model.generate(
    **inputs,
    forced_bos_token_id=forced_bos_token_id,
    max_length=120,              
    min_length=8,                
    num_beams=4,                 
    do_sample=False,             
    repetition_penalty=1.1,      
    no_repeat_ngram_size=2,      
    early_stopping=True,         
)
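
forced_bos_token_id pins the target language for NLLB-200; one common way to obtain it with transformers (a sketch, and the exact call may vary between library versions):

# Sketch: obtaining forced_bos_token_id for the NLLB target language
# (exact tokenizer attributes can vary between transformers versions).
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-3.3B", src_lang="eng_Latn")
nmt_model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-3.3B")

inputs = tokenizer("Hello, how are you?", return_tensors="pt")

# NLLB language codes are regular tokens in the tokenizer vocabulary
forced_bos_token_id = tokenizer.convert_tokens_to_ids("spa_Latn")

generated_tokens = nmt_model.generate(**inputs, forced_bos_token_id=forced_bos_token_id, max_length=120)
print(tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0])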

Optimizations for High-End Hardware

For GPUs with >20GB VRAM (RTX 4090, A40, A100), CUDA streams can be implemented for ASR/NMT parallelization:

# Suggested modifications for high-end hardware:
asr_lock = threading.Lock()     # Instead of shared gpu_lock
nmt_lock = threading.Lock()     # Independent locks

stream_asr = torch.cuda.Stream()
stream_nmt = torch.cuda.Stream()
# Estimated potential improvement: +15-25% throughput
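
With independent locks and streams, each generate() call would be wrapped like this (a sketch of the intended pattern, not code from the repo):

# Sketch: running ASR and NMT on separate CUDA streams (illustrative only).
import torch

def run_asr(asr_model, feats, attn, forced):
    # serialize ASR calls with their own lock and issue kernels on the ASR stream
    with asr_lock, torch.cuda.stream(stream_asr):
        out = asr_model.generate(feats, attention_mask=attn, forced_decoder_ids=forced)
    stream_asr.synchronize()    # make results visible before decoding
    return out

def run_nmt(nmt_model, inputs, forced_bos_token_id):
    with nmt_lock, torch.cuda.stream(stream_nmt):
        out = nmt_model.generate(**inputs, forced_bos_token_id=forced_bos_token_id)
    stream_nmt.synchronize()
    return out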

📜 Models and Licenses


🙏 Acknowledgments and References

Models and Libraries Used

Technical Inspiration and Papers



This project is designed as a flexible foundation. If you want to modify it, use it creatively, improve it, or simply adapt it to your needs...

💪 Go for it.

If you also share your improvements or mention us as a reference, that's always welcome 🙌😜.


© XOREngine · Open source commitment