Realtime_mlx_STT


A high-performance speech-to-text transcription library optimized exclusively for Apple Silicon, using the MLX framework for real-time, low-latency, on-device transcription.

⚠️ IMPORTANT: This library is designed for LOCAL USE ONLY on macOS with Apple Silicon. The included server is a development tool and should NOT be exposed to the internet or used in production environments without implementing proper security measures.

Features

  • Real-time transcription with low latency using MLX Whisper
  • Auto-stop after silence - Voice assistant-like "record until silence" behavior
  • Multiple APIs - Python API, REST API, and WebSocket for different use cases
  • Apple Silicon optimization using MLX with Neural Engine acceleration
  • Voice activity detection with WebRTC and Silero (configurable thresholds)
  • Wake word detection using Porcupine ("Jarvis", "Alexa", etc.)
  • OpenAI integration for cloud-based transcription alternative
  • Interactive CLI for easy exploration of features
  • Web UI with modern interface and real-time updates
  • Profile system for quick configuration switching
  • Event-driven architecture with command pattern
  • Thread-safe and production-ready

Language Selection

The Whisper large-v3-turbo model supports 99 languages with intelligent language detection:

  • Language-specific mode: When you select a specific language (e.g., Norwegian, French, Spanish), the model uses language-specific tokens that significantly improve transcription accuracy for that language
  • Multi-language capability: Even with a language selected, Whisper can still transcribe other languages if spoken - it's not restricted to only the selected language
  • Accuracy benefit: Selecting the primary language you'll be speaking provides much more accurate transcription compared to auto-detect mode
  • Auto-detect mode: When no language is specified, the model attempts to detect the language automatically, though with potentially lower accuracy

For example, if you select Norwegian (no) as your language:

  • Norwegian speech will be transcribed with high accuracy
  • English speech will still be transcribed correctly if spoken
  • The model uses the Norwegian language token (50288) to optimize for Norwegian

This behavior matches OpenAI's Whisper API - the language parameter guides but doesn't restrict the model.
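
A short sketch of both modes using the Python API described later in this README (default_language and transcribe_utterance() appear in the Quick Start examples below):

from realtime_mlx_stt import STTClient

# Language-specific mode: guide the model toward Norwegian for best accuracy.
client = STTClient(default_language="no")
print(client.transcribe_utterance())

# Auto-detect mode: omit the language and let the model detect it per utterance.
auto_client = STTClient(default_language=None)
print(auto_client.transcribe_utterance())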

Requirements

  • macOS with Apple Silicon (M1/M2/M3) - Required, not optional
  • Python 3.9+ (3.11+ recommended for best performance)
  • MLX for Apple Silicon optimization
  • PyAudio for audio capture
  • WebRTC VAD and Silero VAD for voice activity detection
  • Porcupine for wake word detection (optional)
  • Torch and NumPy for audio processing

Important Note: This library is specifically optimized for Apple Silicon and will not work on Intel-based Macs or other platforms. It requires the Neural Engine found in Apple Silicon chips to achieve optimal performance.
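
A quick way to confirm the machine meets this requirement before installing (a generic Python check, not part of the library):

import platform
import sys

# Generic pre-install check: Apple Silicon Macs report "darwin" and "arm64".
if sys.platform != "darwin" or platform.machine() != "arm64":
    raise SystemExit("Realtime_mlx_STT requires macOS on Apple Silicon (M1/M2/M3).")
print("Apple Silicon detected.")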

Installation

Install from PyPI (Recommended)

# Basic installation
pip install realtime-mlx-stt

# With OpenAI support for cloud transcription
pip install "realtime-mlx-stt[openai]"

# With development tools
pip install "realtime-mlx-stt[dev]"

# With server support for REST/WebSocket APIs
pip install "realtime-mlx-stt[server]"

# Install everything
pip install "realtime-mlx-stt[openai,server,dev]"

📚 Documentation

Install from Source

# Clone the repository
git clone https://github.com/kristofferv98/Realtime_mlx_STT.git
cd Realtime_mlx_STT

# Set up Python environment (requires Python 3.9+ but 3.11+ recommended)
python -m venv venv
source venv/bin/activate

# Install in development mode
pip install -e .

Quick Start

Interactive CLI (Recommended)

The easiest way to explore all features:

python examples/cli.py

This provides a menu-driven interface for:

  • Quick 10-second transcription
  • Continuous streaming mode
  • OpenAI cloud transcription
  • Wake word detection
  • Audio device selection
  • Language configuration

Python API

from realtime_mlx_stt import STTClient

# Single utterance transcription (most common use case)
client = STTClient()
text = client.transcribe_utterance()
print(f"You said: {text}")

# Configure VAD settings at client level
client = STTClient(
    vad_sensitivity=0.6,
    vad_min_silence_duration=2.0,
    default_language="en"
)

# Voice command pattern (auto-stops after silence)
for result in client.transcribe():  # auto_stop_after_silence=True by default
    print(result.text)

# Continuous streaming (no auto-stop)
with client.stream() as stream:  # auto_stop_after_silence=False by default
    for result in stream:
        print(result.text)
        if "stop" in result.text.lower():
            break

# Voice assistant pattern
while True:
    text = client.transcribe_utterance()
    if "quit" in text.lower():
        break
    print(f"Command: {text}")

# With OpenAI
client = STTClient(openai_api_key="sk-...")
text = client.transcribe_utterance(engine="openai")

# Wake word mode
client.start_wake_word("jarvis")

Server Mode

Security Note: The server is for local development only and binds to localhost by default. Do NOT expose it to the internet without proper authentication and security measures.

# Start server (localhost only - safe)
cd example_server
python server_example.py

# Opens web UI at http://localhost:8000

Auto-Stop After Silence

The library supports automatically stopping after silence is detected, providing voice assistant-like behavior where transcription ends once the speaker has been silent for a configurable period.

Key Features

  • Configurable silence timeout: Set how long to wait before stopping (default: 2.0 seconds)
  • Per-call override: Enable/disable auto-stop for specific transcription calls
  • Multiple API support: Works with transcribe(), stream(), and transcribe_until_silence()
  • Convenience method: transcribe_until_silence() for simple use cases
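
transcribe_until_silence() is not shown in the examples below, so here is a minimal sketch; its keyword arguments are assumed from the configuration options listed further down, not a confirmed signature:

from realtime_mlx_stt import STTClient

client = STTClient()

# Record until the speaker falls silent, then return the final text.
# The keyword arguments here are assumptions, not a verified signature.
text = client.transcribe_until_silence(silence_timeout=2.0, max_duration=30.0)
print(f"You said: {text}")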

Usage Examples

from realtime_mlx_stt import STTClient

# 1. Single utterance (most common) - NEW!
client = STTClient()
text = client.transcribe_utterance()
print(f"You said: {text}")

# 2. Configure VAD at client level - NEW!
client = STTClient(
    vad_sensitivity=0.6,
    vad_min_silence_duration=2.0,
    default_language="en"
)

# 3. Fast startup mode (reduces ~500ms to <100ms) - NEW!
client = STTClient(fast_start=True)
text = client.transcribe_utterance()  # Starts much faster

# 4. Ultra-fast recording (starts in <50ms) - NEW!
client = STTClient(fast_start=True)
client.start_recording_immediate()  # Starts in <50ms
client.wait_for_ready()             # Wait for models to load
text = client.get_transcription()   # Get the result

# 5. Voice command pattern (auto-stops by default)
for result in client.transcribe():
    print(result.text)

# 6. Continuous streaming (no auto-stop by default)
with client.stream() as stream:
    for result in stream:
        print(result.text)
        if "stop" in result.text.lower():
            break

# 7. Voice assistant pattern - simplified
while True:
    text = client.transcribe_utterance()
    if "quit" in text.lower():
        break
    print(f"Command: {text}")

Configuration Options

  • vad_sensitivity: Voice activity detection sensitivity (0.0-1.0, default: 0.5)
  • vad_min_silence_duration: Minimum silence duration to end speech (seconds, default: 2.0)
  • vad_min_speech_duration: Minimum speech duration to start transcription (seconds, default: 0.25)
  • auto_stop_after_silence: Enable/disable auto-stop behavior (default: False for client, True for transcribe(), False for stream())
  • silence_timeout: Override silence timeout (uses vad_min_silence_duration if None)
  • max_duration: Maximum recording duration for safety (default: 30.0 for utterance, 60.0 for others)
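
A hedged sketch of how these options combine. The client-level VAD parameters come from the examples above; passing auto_stop_after_silence, silence_timeout, and max_duration as per-call keyword arguments to transcribe() is an assumption based on the per-call override mentioned above:

from realtime_mlx_stt import STTClient

# Client-level VAD configuration (as in the Quick Start examples).
client = STTClient(vad_sensitivity=0.6, vad_min_silence_duration=2.0)

# Per-call override (assumed keyword arguments): stop after 1.5s of silence,
# but never record longer than 20s.
for result in client.transcribe(auto_stop_after_silence=True,
                                silence_timeout=1.5,
                                max_duration=20.0):
    print(result.text)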

Performance Optimization

  • fast_start: Enable fast startup mode (reduces startup time from ~500ms to <100ms, default: False)

Environment Variables:

  • PRELOAD_STT_MODELS=true: Pre-load models at import time for even faster startup
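
Because pre-loading happens at import time, the variable must be set before the package is imported; a small sketch:

import os

# Must be set before realtime_mlx_stt is imported for pre-loading to take effect.
os.environ["PRELOAD_STT_MODELS"] = "true"

from realtime_mlx_stt import STTClient  # models load here

client = STTClient(fast_start=True)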

Architecture

The library provides two specialized interfaces built on a common Features layer:

┌─────────────────────────────────────────────────┐
│          User Interfaces                         │
│  • CLI (examples/cli.py)                        │
│  • Web UI (example_server/)                     │
├─────────────────────────────────────────────────┤
│          API Layers                             │
│  • Python API (realtime_mlx_stt/)              │
│  • REST/WebSocket (src/Application/Server/)    │
├─────────────────────────────────────────────────┤
│          Features Layer                         │
│  • AudioCapture                                │
│  • VoiceActivityDetection                      │
│  • Transcription (MLX/OpenAI)                  │
│  • WakeWordDetection                           │
├─────────────────────────────────────────────────┤
│          Core & Infrastructure                  │
│  • Command/Event System                         │
│  • Logging & Configuration                      │
└─────────────────────────────────────────────────┘

Key Design Principles

  • Vertical Slice Architecture: Each feature is self-contained with Commands, Events, Handlers, and Models
  • Dual API Design: Python API optimized for direct use, Server API optimized for multi-client scenarios
  • Event-Driven: Features communicate via commands and events, not direct dependencies
  • Production Ready: Thread-safe, lazy initialization, comprehensive error handling
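
The library's Command/Event classes are internal, but the pattern described above can be sketched with hypothetical names (the EventBus and event types below are illustrative only, not the library's actual API):

from dataclasses import dataclass
from typing import Callable, Dict, List, Type

@dataclass
class Event:
    """Base class; features subscribe to event types rather than to each other."""

@dataclass
class TranscriptionUpdated(Event):
    text: str = ""
    is_final: bool = False

class EventBus:
    def __init__(self) -> None:
        self._handlers: Dict[Type[Event], List[Callable[[Event], None]]] = {}

    def subscribe(self, event_type: Type[Event], handler: Callable[[Event], None]) -> None:
        self._handlers.setdefault(event_type, []).append(handler)

    def publish(self, event: Event) -> None:
        for handler in self._handlers.get(type(event), []):
            handler(event)

# The API layer reacts to events published by the transcription feature.
bus = EventBus()
bus.subscribe(TranscriptionUpdated, lambda e: print(e.text))
bus.publish(TranscriptionUpdated(text="hello world", is_final=True))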

API Documentation

Python API (realtime_mlx_stt)

from realtime_mlx_stt import STTClient, TranscriptionSession, create_transcriber

# Method 1: Modern Client API
client = STTClient(
    openai_api_key="sk-...",     # Optional
    default_engine="mlx_whisper", # or "openai"
    default_language="en"         # or None for auto-detect
)

# Transcribe for fixed duration
for result in client.transcribe(duration=10):
    print(f"{result.text} (confidence: {result.confidence})")

# Streaming with stop word
with client.stream() as stream:
    for result in stream:
        print(result.text)
        if "stop" in result.text.lower():
            break

# Method 2: Session-based API
import time
from realtime_mlx_stt import TranscriptionSession, ModelConfig, VADConfig

session = TranscriptionSession(
    model=ModelConfig(engine="mlx_whisper", language="no"),
    vad=VADConfig(sensitivity=0.8),
    on_transcription=lambda r: print(r.text)
)

with session:
    time.sleep(30)  # Listen for 30 seconds

# Method 3: Simple Transcriber
from realtime_mlx_stt import Transcriber
transcriber = Transcriber(language="es")
text = transcriber.transcribe_from_mic(duration=5)
print(f"You said: {text}")

REST API

# Start system with profile
curl -X POST http://localhost:8000/api/v1/system/start \
  -H "Content-Type: application/json" \
  -d '{
    "profile": "vad-triggered",
    "custom_config": {
      "transcription": {"language": "fr"},
      "vad": {"sensitivity": 0.7}
    }
  }'

# Get system status
curl http://localhost:8000/api/v1/system/status

# Transcribe audio file
curl -X POST http://localhost:8000/api/v1/transcription/audio \
  -H "Content-Type: application/json" \
  -d '{"audio_data": "base64_encoded_audio_data"}'

WebSocket Events

const ws = new WebSocket('ws://localhost:8000/events');

ws.onmessage = (event) => {
    const data = JSON.parse(event.data);
    
    switch(data.type) {
        case 'transcription':
            if (data.is_final) {
                console.log(`Final: ${data.text}`);
            } else {
                console.log(`Transcribing: ${data.text}`);
            }
            break;
        case 'wake_word':
            console.log(`Wake word: ${data.wake_word}`);
            break;
    }
};
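
A Python client can consume the same event stream; this sketch uses the third-party websockets library (an assumption, not a dependency of this project):

import asyncio
import json

import websockets  # pip install websockets

async def listen() -> None:
    async with websockets.connect("ws://localhost:8000/events") as ws:
        async for message in ws:
            data = json.loads(message)
            if data.get("type") == "transcription":
                label = "Final" if data.get("is_final") else "Transcribing"
                print(f"{label}: {data.get('text')}")
            elif data.get("type") == "wake_word":
                print(f"Wake word: {data.get('wake_word')}")

asyncio.run(listen())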

Configuration

Environment Variables

# API Keys
export OPENAI_API_KEY="sk-..."        # For OpenAI transcription
export PORCUPINE_ACCESS_KEY="..."     # For wake word detection
# Alternative names for Picovoice universal key (same as PORCUPINE_ACCESS_KEY):
# export PICOVOICE_ACCESS_KEY="..."
# export PICOVOICE_API_KEY="..."

# Logging
export LOG_LEVEL="INFO"               # DEBUG, INFO, WARNING, ERROR
export LOG_FORMAT="human"             # human, json, detailed

Python Configuration

from realtime_mlx_stt import ModelConfig, VADConfig, WakeWordConfig

# Model configuration
model = ModelConfig(
    engine="mlx_whisper",        # or "openai"
    model="whisper-large-v3-turbo",
    language="en"                # or None for auto-detect
)

# VAD configuration
vad = VADConfig(
    enabled=True,
    sensitivity=0.6,             # 0.0-1.0
    min_speech_duration=0.25,    # seconds
    min_silence_duration=0.1     # seconds
)

# Wake word configuration
# Note: Requires PORCUPINE_ACCESS_KEY environment variable
wake_word = WakeWordConfig(
    words=["jarvis", "computer"],
    sensitivity=0.7,
    timeout=30                   # seconds
)
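
These config objects plug into TranscriptionSession as shown in the API documentation above; passing WakeWordConfig through a wake_word parameter is an assumption and may differ from the actual signature:

import time
from realtime_mlx_stt import TranscriptionSession

session = TranscriptionSession(
    model=model,                  # ModelConfig from above
    vad=vad,                      # VADConfig from above
    wake_word=wake_word,          # assumed parameter name for WakeWordConfig
    on_transcription=lambda r: print(r.text),
)

with session:
    time.sleep(60)  # listen for one minute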

Testing

The project includes comprehensive tests for each feature and component:

# Run all tests
python tests/run_tests.py

# Run tests for a specific feature or component
python tests/run_tests.py -f VoiceActivityDetection
python tests/run_tests.py -f Infrastructure
python tests/run_tests.py -f Application  # Server/Client tests

# Run a specific test with verbose output
python tests/run_tests.py -t webrtc_vad_test -v
python tests/run_tests.py -t test_server_module -v

# Test with PYTHONPATH (if imports fail)
PYTHONPATH=/path/to/Realtime_mlx_STT python tests/run_tests.py

The Server implementation includes tests for:

  • API Controllers (Transcription and System)
  • WebSocket connections and event broadcasting
  • Configuration and profile management
  • Command/Event integration

Performance

On Apple Silicon (M1/M2/M3), the MLX-optimized Whisper-large-v3-turbo model typically achieves:

  • Batch mode: ~0.3-0.5x realtime (processes 60 seconds of audio in 20-30 seconds)
  • Streaming mode: ~0.5-0.7x realtime (processes audio with ~2-3 second latency)

The MLX implementation takes full advantage of the Neural Engine in Apple Silicon chips, providing significantly better performance than CPU-based implementations.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add some amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Recent Updates

  • New Python API: Added high-level realtime_mlx_stt package with STTClient, TranscriptionSession, and Transcriber
  • Interactive CLI: New user-friendly CLI at examples/cli.py for exploring all features
  • Dual API Architecture: Python API optimized for direct use, Server API for multi-client scenarios
  • Improved Examples: Consolidated examples with clear documentation
  • Architecture Documentation: Added comprehensive architecture documentation
  • OpenAI Integration: Support for OpenAI's transcription API as alternative to local MLX

License

This project is licensed under the MIT License - see the LICENSE file for details.
