Realtime_mlx_STT


A high-performance speech-to-text transcription library optimized exclusively for Apple Silicon, using the MLX framework for real-time, low-latency, on-device transcription.

⚠️ IMPORTANT: This library is designed for LOCAL USE ONLY on macOS with Apple Silicon. The included server is a development tool and should NOT be exposed to the internet or used in production environments without implementing proper security measures.

Features

  • Real-time transcription with low latency using MLX Whisper
  • Auto-stop after silence - Voice assistant-like "record until silence" behavior
  • Multiple APIs - Python API, REST API, and WebSocket for different use cases
  • Apple Silicon optimization using MLX with Neural Engine acceleration
  • Voice activity detection with WebRTC and Silero (configurable thresholds)
  • Wake word detection using Porcupine ("Jarvis", "Alexa", etc.)
  • OpenAI integration for cloud-based transcription alternative
  • Interactive CLI for easy exploration of features
  • Web UI with modern interface and real-time updates
  • Profile system for quick configuration switching
  • Event-driven architecture with command pattern
  • Thread-safe and production-ready

Language Selection

The Whisper large-v3-turbo model supports 99 languages with intelligent language detection:

  • Language-specific mode: When you select a specific language (e.g., Norwegian, French, Spanish), the model uses language-specific tokens that significantly improve transcription accuracy for that language
  • Multi-language capability: Even with a language selected, Whisper can still transcribe other languages if spoken - it's not restricted to only the selected language
  • Accuracy benefit: Selecting the primary language you'll be speaking provides much more accurate transcription compared to auto-detect mode
  • Auto-detect mode: When no language is specified, the model attempts to detect the language automatically, though with potentially lower accuracy

For example, if you select Norwegian (no) as your language:

  • Norwegian speech will be transcribed with high accuracy
  • English speech will still be transcribed correctly if spoken
  • The model uses the Norwegian language token (50288) to optimize for Norwegian

This behavior matches OpenAI's Whisper API - the language parameter guides but doesn't restrict the model.
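
A short sketch of both modes using the Python API described later in this README (default_language and transcribe_utterance() appear in the Quick Start examples below):

from realtime_mlx_stt import STTClient

# Language-specific mode: guide the model toward Norwegian for best accuracy.
client = STTClient(default_language="no")
print(client.transcribe_utterance())

# Auto-detect mode: omit the language and let the model detect it per utterance.
auto_client = STTClient(default_language=None)
print(auto_client.transcribe_utterance())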

Requirements

  • macOS with Apple Silicon (M1/M2/M3) - Required, not optional
  • Python 3.9+ (3.11+ recommended for best performance)
  • MLX for Apple Silicon optimization
  • PyAudio for audio capture
  • WebRTC VAD and Silero VAD for voice activity detection
  • Porcupine for wake word detection (optional)
  • Torch and NumPy for audio processing

Important Note: This library is specifically optimized for Apple Silicon and will not work on Intel-based Macs or other platforms. It requires the Neural Engine found in Apple Silicon chips to achieve optimal performance.
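
A quick way to confirm the machine meets this requirement before installing (a generic Python check, not part of the library):

import platform
import sys

# Generic pre-install check: Apple Silicon Macs report "darwin" and "arm64".
if sys.platform != "darwin" or platform.machine() != "arm64":
    raise SystemExit("Realtime_mlx_STT requires macOS on Apple Silicon (M1/M2/M3).")
print("Apple Silicon detected.")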

Installation

Install from PyPI (Recommended)

# Basic installation
pip install realtime-mlx-stt

# With OpenAI support for cloud transcription
pip install "realtime-mlx-stt[openai]"

# With development tools
pip install "realtime-mlx-stt[dev]"

# With server support for REST/WebSocket APIs
pip install "realtime-mlx-stt[server]"

# Install everything
pip install "realtime-mlx-stt[openai,server,dev]"

📚 Documentation

Install from Source

# Clone the repository
git clone https://github.com/kristofferv98/Realtime_mlx_STT.git
cd Realtime_mlx_STT

# Set up Python environment (requires Python 3.9+ but 3.11+ recommended)
python -m venv venv
source venv/bin/activate

# Install in development mode
pip install -e .

Quick Start

Interactive CLI (Recommended)

The easiest way to explore all features:

python examples/cli.py

This provides a menu-driven interface for:

  • Quick 10-second transcription
  • Continuous streaming mode
  • OpenAI cloud transcription
  • Wake word detection
  • Audio device selection
  • Language configuration

Python API

from realtime_mlx_stt import STTClient

# Single utterance transcription (most common use case)
client = STTClient()
text = client.transcribe_utterance()
print(f"You said: {text}")

# Configure VAD settings at client level
client = STTClient(
    vad_sensitivity=0.6,
    vad_min_silence_duration=2.0,
    default_language="en"
)

# Voice command pattern (auto-stops after silence)
for result in client.transcribe():  # auto_stop_after_silence=True by default
    print(result.text)

# Continuous streaming (no auto-stop)
with client.stream() as stream:  # auto_stop_after_silence=False by default
    for result in stream:
        print(result.text)
        if "stop" in result.text.lower():
            break

# Voice assistant pattern
while True:
    text = client.transcribe_utterance()
    if "quit" in text.lower():
        break
    print(f"Command: {text}")

# With OpenAI
client = STTClient(openai_api_key="sk-...")
text = client.transcribe_utterance(engine="openai")

# Wake word mode
client.start_wake_word("jarvis")

Server Mode

Security Note: The server is for local development only and binds to localhost by default. Do NOT expose it to the internet without proper authentication and security measures.

# Start server (localhost only - safe)
cd example_server
python server_example.py

# Opens web UI at http://localhost:8000

Auto-Stop After Silence

The library supports automatically stopping after silence is detected, providing voice assistant-like behavior where transcription ends once the speaker has been silent for a configurable period.

Key Features

  • Configurable silence timeout: Set how long to wait before stopping (default: 2.0 seconds)
  • Per-call override: Enable/disable auto-stop for specific transcription calls
  • Multiple API support: Works with transcribe(), stream(), and transcribe_until_silence()
  • Convenience method: transcribe_until_silence() for simple use cases
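
transcribe_until_silence() is not shown in the examples below, so here is a minimal sketch; its keyword arguments are assumed from the configuration options listed further down, not a confirmed signature:

from realtime_mlx_stt import STTClient

client = STTClient()

# Record until the speaker falls silent, then return the final text.
# The keyword arguments here are assumptions, not a verified signature.
text = client.transcribe_until_silence(silence_timeout=2.0, max_duration=30.0)
print(f"You said: {text}")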

Usage Examples

from realtime_mlx_stt import STTClient

# 1. Single utterance (most common) - NEW!
client = STTClient()
text = client.transcribe_utterance()
print(f"You said: {text}")

# 2. Configure VAD at client level - NEW!
client = STTClient(
    vad_sensitivity=0.6,
    vad_min_silence_duration=2.0,
    default_language="en"
)

# 3. Fast startup mode (reduces ~500ms to <100ms) - NEW!
client = STTClient(fast_start=True)
text = client.transcribe_utterance()  # Starts much faster

# 4. Ultra-fast recording (starts in <50ms) - NEW!
client = STTClient(fast_start=True)
client.start_recording_immediate()  # Starts in <50ms
client.wait_for_ready()             # Wait for models to load
text = client.get_transcription()   # Get the result

# 5. Voice command pattern (auto-stops by default)
for result in client.transcribe():
    print(result.text)

# 6. Continuous streaming (no auto-stop by default)
with client.stream() as stream:
    for result in stream:
        print(result.text)
        if "stop" in result.text.lower():
            break

# 7. Voice assistant pattern - simplified
while True:
    text = client.transcribe_utterance()
    if "quit" in text.lower():
        break
    print(f"Command: {text}")

Configuration Options

  • vad_sensitivity: Voice activity detection sensitivity (0.0-1.0, default: 0.5)
  • vad_min_silence_duration: Minimum silence duration to end speech (seconds, default: 2.0)
  • vad_min_speech_duration: Minimum speech duration to start transcription (seconds, default: 0.25)
  • auto_stop_after_silence: Enable/disable auto-stop behavior (default: False for client, True for transcribe(), False for stream())
  • silence_timeout: Override silence timeout (uses vad_min_silence_duration if None)
  • max_duration: Maximum recording duration for safety (default: 30.0 for utterance, 60.0 for others)
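
A hedged sketch of how these options combine. The client-level VAD parameters come from the examples above; passing auto_stop_after_silence, silence_timeout, and max_duration as per-call keyword arguments to transcribe() is an assumption based on the per-call override mentioned above:

from realtime_mlx_stt import STTClient

# Client-level VAD configuration (as in the Quick Start examples).
client = STTClient(vad_sensitivity=0.6, vad_min_silence_duration=2.0)

# Per-call override (assumed keyword arguments): stop after 1.5s of silence,
# but never record longer than 20s.
for result in client.transcribe(auto_stop_after_silence=True,
                                silence_timeout=1.5,
                                max_duration=20.0):
    print(result.text)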

Performance Optimization

  • fast_start: Enable fast startup mode (reduces startup time from ~500ms to <100ms, default: False)

Environment Variables:

  • PRELOAD_STT_MODELS=true: Pre-load models at import time for even faster startup
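
Because pre-loading happens at import time, the variable must be set before the package is imported; a small sketch:

import os

# Must be set before realtime_mlx_stt is imported for pre-loading to take effect.
os.environ["PRELOAD_STT_MODELS"] = "true"

from realtime_mlx_stt import STTClient  # models load here

client = STTClient(fast_start=True)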

Architecture

The library provides two specialized interfaces built on a common Features layer:

┌─────────────────────────────────────────────────┐
│          User Interfaces                         │
│  • CLI (examples/cli.py)                        │
│  • Web UI (example_server/)                     │
├─────────────────────────────────────────────────┤
│          API Layers                             │
│  • Python API (realtime_mlx_stt/)              │
│  • REST/WebSocket (src/Application/Server/)    │
├─────────────────────────────────────────────────┤
│          Features Layer                         │
│  • AudioCapture                                │
│  • VoiceActivityDetection                      │
│  • Transcription (MLX/OpenAI)                  │
│  • WakeWordDetection                           │
├─────────────────────────────────────────────────┤
│          Core & Infrastructure                  │
│  • Command/Event System                         │
│  • Logging & Configuration                      │
└─────────────────────────────────────────────────┘

Key Design Principles

  • Vertical Slice Architecture: Each feature is self-contained with Commands, Events, Handlers, and Models
  • Dual API Design: Python API optimized for direct use, Server API optimized for multi-client scenarios
  • Event-Driven: Features communicate via commands and events, not direct dependencies
  • Production Ready: Thread-safe, lazy initialization, comprehensive error handling
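
The library's Command/Event classes are internal, but the pattern described above can be sketched with hypothetical names (the EventBus and event types below are illustrative only, not the library's actual API):

from dataclasses import dataclass
from typing import Callable, Dict, List, Type

@dataclass
class Event:
    """Base class; features subscribe to event types rather than to each other."""

@dataclass
class TranscriptionUpdated(Event):
    text: str = ""
    is_final: bool = False

class EventBus:
    def __init__(self) -> None:
        self._handlers: Dict[Type[Event], List[Callable[[Event], None]]] = {}

    def subscribe(self, event_type: Type[Event], handler: Callable[[Event], None]) -> None:
        self._handlers.setdefault(event_type, []).append(handler)

    def publish(self, event: Event) -> None:
        for handler in self._handlers.get(type(event), []):
            handler(event)

# The API layer reacts to events published by the transcription feature.
bus = EventBus()
bus.subscribe(TranscriptionUpdated, lambda e: print(e.text))
bus.publish(TranscriptionUpdated(text="hello world", is_final=True))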

API Documentation

Python API (realtime_mlx_stt)

from realtime_mlx_stt import STTClient, TranscriptionSession, create_transcriber

# Method 1: Modern Client API
client = STTClient(
    openai_api_key="sk-...",     # Optional
    default_engine="mlx_whisper", # or "openai"
    default_language="en"         # or None for auto-detect
)

# Transcribe for fixed duration
for result in client.transcribe(duration=10):
    print(f"{result.text} (confidence: {result.confidence})")

# Streaming with stop word
with client.stream() as stream:
    for result in stream:
        print(result.text)
        if "stop" in result.text.lower():
            break

# Method 2: Session-based API
import time
from realtime_mlx_stt import TranscriptionSession, ModelConfig, VADConfig

session = TranscriptionSession(
    model=ModelConfig(engine="mlx_whisper", language="no"),
    vad=VADConfig(sensitivity=0.8),
    on_transcription=lambda r: print(r.text)
)

with session:
    time.sleep(30)  # Listen for 30 seconds

# Method 3: Simple Transcriber
from realtime_mlx_stt import Transcriber
transcriber = Transcriber(language="es")
text = transcriber.transcribe_from_mic(duration=5)
print(f"You said: {text}")

REST API

# Start system with profile
curl -X POST http://localhost:8000/api/v1/system/start \
  -H "Content-Type: application/json" \
  -d '{
    "profile": "vad-triggered",
    "custom_config": {
      "transcription": {"language": "fr"},
      "vad": {"sensitivity": 0.7}
    }
  }'

# Get system status
curl http://localhost:8000/api/v1/system/status

# Transcribe audio file
curl -X POST http://localhost:8000/api/v1/transcription/audio \
  -H "Content-Type: application/json" \
  -d '{"audio_data": "base64_encoded_audio_data"}'

WebSocket Events

const ws = new WebSocket('ws://localhost:8000/events');

ws.onmessage = (event) => {
    const data = JSON.parse(event.data);
    
    switch(data.type) {
        case 'transcription':
            if (data.is_final) {
                console.log(`Final: ${data.text}`);
            } else {
                console.log(`Transcribing: ${data.text}`);
            }
            break;
        case 'wake_word':
            console.log(`Wake word: ${data.wake_word}`);
            break;
    }
};
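
A Python client can consume the same event stream; this sketch uses the third-party websockets library (an assumption, not a dependency of this project):

import asyncio
import json

import websockets  # pip install websockets

async def listen() -> None:
    async with websockets.connect("ws://localhost:8000/events") as ws:
        async for message in ws:
            data = json.loads(message)
            if data.get("type") == "transcription":
                label = "Final" if data.get("is_final") else "Transcribing"
                print(f"{label}: {data.get('text')}")
            elif data.get("type") == "wake_word":
                print(f"Wake word: {data.get('wake_word')}")

asyncio.run(listen())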

Configuration

Environment Variables

# API Keys
export OPENAI_API_KEY="sk-..."        # For OpenAI transcription
export PORCUPINE_ACCESS_KEY="..."     # For wake word detection
# Alternative names for Picovoice universal key (same as PORCUPINE_ACCESS_KEY):
# export PICOVOICE_ACCESS_KEY="..."
# export PICOVOICE_API_KEY="..."

# Logging
export LOG_LEVEL="INFO"               # DEBUG, INFO, WARNING, ERROR
export LOG_FORMAT="human"             # human, json, detailed

Python Configuration

from realtime_mlx_stt import ModelConfig, VADConfig, WakeWordConfig

# Model configuration
model = ModelConfig(
    engine="mlx_whisper",        # or "openai"
    model="whisper-large-v3-turbo",
    language="en"                # or None for auto-detect
)

# VAD configuration
vad = VADConfig(
    enabled=True,
    sensitivity=0.6,             # 0.0-1.0
    min_speech_duration=0.25,    # seconds
    min_silence_duration=0.1     # seconds
)

# Wake word configuration
# Note: Requires PORCUPINE_ACCESS_KEY environment variable
wake_word = WakeWordConfig(
    words=["jarvis", "computer"],
    sensitivity=0.7,
    timeout=30                   # seconds
)
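
These config objects plug into TranscriptionSession as shown in the API documentation above; passing WakeWordConfig through a wake_word parameter is an assumption and may differ from the actual signature:

import time
from realtime_mlx_stt import TranscriptionSession

session = TranscriptionSession(
    model=model,                  # ModelConfig from above
    vad=vad,                      # VADConfig from above
    wake_word=wake_word,          # assumed parameter name for WakeWordConfig
    on_transcription=lambda r: print(r.text),
)

with session:
    time.sleep(60)  # listen for one minute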

Testing

The project includes comprehensive tests for each feature and component:

# Run all tests
python tests/run_tests.py

# Run tests for a specific feature or component
python tests/run_tests.py -f VoiceActivityDetection
python tests/run_tests.py -f Infrastructure
python tests/run_tests.py -f Application  # Server/Client tests

# Run a specific test with verbose output
python tests/run_tests.py -t webrtc_vad_test -v
python tests/run_tests.py -t test_server_module -v

# Test with PYTHONPATH (if imports fail)
PYTHONPATH=/path/to/Realtime_mlx_STT python tests/run_tests.py

The Server implementation includes tests for:

  • API Controllers (Transcription and System)
  • WebSocket connections and event broadcasting
  • Configuration and profile management
  • Command/Event integration

Performance

On Apple Silicon (M1/M2/M3), the MLX-optimized Whisper-large-v3-turbo model typically achieves:

  • Batch mode: ~0.3-0.5x realtime (processes 60 seconds of audio in 20-30 seconds)
  • Streaming mode: ~0.5-0.7x realtime (processes audio with ~2-3 second latency)

The MLX implementation takes full advantage of the Neural Engine in Apple Silicon chips, providing significantly better performance than CPU-based implementations.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add some amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Recent Updates

  • New Python API: Added high-level realtime_mlx_stt package with STTClient, TranscriptionSession, and Transcriber
  • Interactive CLI: New user-friendly CLI at examples/cli.py for exploring all features
  • Dual API Architecture: Python API optimized for direct use, Server API for multi-client scenarios
  • Improved Examples: Consolidated examples with clear documentation
  • Architecture Documentation: Added comprehensive architecture documentation
  • OpenAI Integration: Support for OpenAI's transcription API as alternative to local MLX

License

This project is licensed under the MIT License - see the LICENSE file for details.
