ViStreamASR is a simple Vietnamese streaming Automatic Speech Recognition (ASR) library for real-time audio processing.
- 🎯 Streaming ASR: Real-time audio processing with configurable chunk sizes
- 🇻🇳 Vietnamese Optimized: Specifically designed for Vietnamese speech recognition
- 📦 Simple API: Easy-to-use interface with minimal setup
- ⚡ High Performance: CPU/GPU support
```bash
pip install ViStreamASR
```
For development or to use the latest version:
```bash
# Clone the repository
git clone https://github.com/nguyenvulebinh/ViStreamASR.git
cd ViStreamASR

# Install dependencies
pip install -r requirements.txt

# Option 1: use directly from source
python test_library.py  # Test the installation

# Option 2: install in development mode
pip install -e .
```
When using from source, import the modules directly:
```python
import sys
sys.path.insert(0, 'src')

from streaming import StreamingASR

# Initialize and use
asr = StreamingASR()
for result in asr.stream_from_file("audio.wav"):
    print(result['text'])
```
When installed via pip, import from the package instead:

```python
from ViStreamASR import StreamingASR

# Initialize ASR
asr = StreamingASR()

# Process audio file
for result in asr.stream_from_file("audio.wav"):
    if result['partial']:
        print(f"Partial: {result['text']}")
    if result['final']:
        print(f"Final: {result['text']}")
```
To transcribe live audio, stream from the microphone instead:

```python
from ViStreamASR import StreamingASR

# Initialize ASR
asr = StreamingASR()

# Process microphone input for 10 seconds
for result in asr.stream_from_microphone(duration_seconds=10):
    if result['partial']:
        print(f"Partial: {result['text']}")
    if result['final']:
        print(f"Final: {result['text']}")
```
```bash
# Basic transcription
vistream-asr transcribe audio.wav
```
```python
from ViStreamASR import StreamingASR

# Initialize with options
asr = StreamingASR(
    chunk_size_ms=640,         # Chunk size in milliseconds
    auto_finalize_after=15.0,  # Auto-finalize after this many seconds
    debug=False                # Enable debug logging
)

# Stream from file
for result in asr.stream_from_file("audio.wav"):
    # result contains:
    # - 'partial': True for partial results
    # - 'final': True for final results
    # - 'text': transcription text
    # - 'chunk_info': processing information
    pass
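Partial results are transient hypotheses that later results supersede, while final results are stable; a common pattern is to keep only the final segments when building a transcript. The sketch below simulates the result dictionaries described above (the `fake_results` generator is illustrative only, not part of the library):

```python
def fake_results():
    # Illustrative stand-in for asr.stream_from_file(): partial hypotheses
    # grow and get replaced; final segments are stable and should be kept.
    yield {'partial': True, 'final': False, 'text': 'xin'}
    yield {'partial': True, 'final': False, 'text': 'xin chào'}
    yield {'partial': False, 'final': True, 'text': 'xin chào'}
    yield {'partial': True, 'final': False, 'text': 'các'}
    yield {'partial': False, 'final': True, 'text': 'các bạn'}

segments = []
for result in fake_results():
    if result['final']:
        segments.append(result['text'])

transcript = ' '.join(segments)
print(transcript)  # xin chào các bạn
```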
For low-level control:
```python
from ViStreamASR import ASREngine

engine = ASREngine(chunk_size_ms=640, debug_mode=True)
engine.initialize_models()

# Process audio chunks directly
result = engine.process_audio(audio_chunk, is_last=False)
```
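Driving the engine yourself means splitting the audio into fixed-size chunks and flagging the last one. A minimal chunking loop, with a placeholder list standing in for real 16 kHz samples (the commented-out `process_audio` call follows the snippet above):

```python
SAMPLE_RATE = 16000
CHUNK_MS = 640
chunk_samples = SAMPLE_RATE * CHUNK_MS // 1000  # 10240 samples per 640 ms chunk

# Placeholder for real audio: 2.5 s of silence (40000 samples at 16 kHz)
audio = [0.0] * 40000

chunks = [audio[i:i + chunk_samples] for i in range(0, len(audio), chunk_samples)]
for idx, chunk in enumerate(chunks):
    is_last = (idx == len(chunks) - 1)
    # result = engine.process_audio(chunk, is_last=is_last)  # as in the snippet above

print(len(chunks), len(chunks[-1]))  # 4 chunks; the last one is shorter (9280 samples)
```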
- Language: Vietnamese
- Architecture: U2-based streaming ASR
- Model Size: ~2.7GB (cached after first download)
- Sample Rate: 16kHz (automatically converted)
- Optimal Chunk Size: 640ms
The following picture shows how U2 (Unified Streaming and Non-streaming) architecture works:
The U2 model enables both streaming and non-streaming ASR in a unified framework, providing low-latency real-time transcription while maintaining high accuracy.
- RTF: ~0.34x (faster than real-time)
- Latency: ~640ms with default settings
- GPU Support: Automatic CUDA acceleration when available
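RTF (real-time factor) is processing time divided by audio duration, so values below 1.0 mean faster than real time. A quick sanity check with the figure above:

```python
rtf = 0.34               # reported real-time factor
audio_duration_s = 60.0  # a one-minute recording

processing_time_s = rtf * audio_duration_s
print(processing_time_s)  # 20.4 -> a 60 s file transcribes in about 20 s
```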
- Audio Input Assumption: the system assumes the input is speech; non-speech audio may produce unexpected results.
- Production Recommendation: add Voice Activity Detection (VAD) in front of the streaming ASR to filter out non-speech segments, reducing streaming load and improving efficiency.
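One simple (if crude) way to gate audio before the ASR is frame-level energy thresholding. The sketch below is a generic energy-based VAD, not part of ViStreamASR; real deployments would typically use a trained VAD such as Silero VAD or WebRTC VAD instead:

```python
def energy_vad(samples, frame_len=320, threshold=0.01):
    """Return one boolean per frame: True if mean squared energy exceeds the threshold."""
    flags = []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        flags.append(energy > threshold)
    return flags

# Synthetic signal: one silent 20 ms frame followed by one loud frame (at 16 kHz)
signal = [0.0] * 320 + [0.5] * 320
print(energy_vad(signal))  # [False, True]
```

Only chunks whose frames are flagged as speech would then be forwarded to the streaming ASR.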
```bash
# Transcription
vistream-asr transcribe <file>                   # Basic transcription
vistream-asr transcribe <file> --chunk-size 640  # Custom chunk size
vistream-asr transcribe <file> --no-debug        # Clean output

# Information
vistream-asr info     # Library info
vistream-asr version  # Version
```
- RAM: minimum 5 GB
- CPU: minimum 2 cores
- Performance: an RTF of 0.3-0.4x is achievable on CPU-only systems meeting the above specs
- GPU: optional; CUDA acceleration improves performance, but CPU-only operation still runs faster than real time
- Python 3.8+
- PyTorch 2.5+
- TorchAudio 2.5+
- NumPy 1.19.0+
- Requests 2.25.0+
- flashlight-text
- librosa
MIT License