An audio translation system that combines speech recording, automatic speech recognition (ASR), and text translation to translate speech between Vietnamese and English.
This project integrates three core modules:
- Audio Recording Module: Captures audio from microphone input with support for both fixed-duration recording and smart silence detection.
- ASR Module: Transcribes speech to text using the Whisper model (suzii/vi-whisper-large-v3-turbo-v1).
- Translation Module: Translates text between Vietnamese and English using VinAI's neural machine translation models.
The system provides both programmatic APIs for developers and a user-friendly command-line interface for end users.
- Bidirectional Translation: Support for Vietnamese ↔ English translation
- Real-time Audio Processing: Record and immediately transcribe/translate audio
- Multiple Recording Modes: Fixed duration or automatic silence detection
- Batch Processing: Process multiple audio files or text segments
- Optimized Performance: Efficient model loading and inference
- User-friendly CLI: Interactive command-line interface for all operations
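The automatic silence detection mode can be illustrated with a simple energy threshold. The sketch below is not the project's implementation: the function names, the RMS threshold, the frame size, and the number of quiet frames are all assumptions chosen for illustration. It scans fixed-size frames of normalized samples and reports where a recording could stop once speech is followed by a run of quiet frames.

```python
import math
from typing import Optional, Sequence


def frame_rms(frame: Sequence[float]) -> float:
    """Root-mean-square energy of one frame of samples in [-1.0, 1.0]."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def find_silence_cutoff(
    samples: Sequence[float],
    frame_size: int = 160,        # hypothetical: 10 ms at 16 kHz
    threshold: float = 0.01,      # hypothetical RMS level treated as silence
    max_silent_frames: int = 3,   # hypothetical run length that ends a recording
) -> Optional[int]:
    """Return the sample index where recording could stop, or None.

    Leading silence is ignored; the cutoff is reported only after speech
    has been heard and `max_silent_frames` quiet frames follow in a row.
    """
    heard_speech = False
    silent_run = 0
    for start in range(0, len(samples), frame_size):
        frame = samples[start:start + frame_size]
        if frame_rms(frame) >= threshold:
            heard_speech = True
            silent_run = 0
        elif heard_speech:
            silent_run += 1
            if silent_run >= max_silent_frames:
                return start + len(frame)
    return None
```

In practice the threshold would need tuning for the microphone and the room, since background noise raises the energy of "silent" frames.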
- Python 3.8 or higher
- CUDA-compatible GPU (recommended for faster processing)
pip install -r requirements.txt
├── translator_by_speech/
│   ├── cli.py                  # Command-line interface
│   ├── record.py               # Audio recording module
│   ├── speech_recognition.py   # ASR module and translation pipeline
│   ├── translator.py           # Text translation module
│   └── pipeline.py             # Combines the modules into the pipeline
├── recordings/                 # Directory for stored audio recordings
└── transcripts/                # Directory for transcription and translation outputs
The easiest way to use the system is through the provided CLI:
# Start interactive mode
python main.py
# Record 10 seconds of audio and translate
python main.py --record 10
# Process an existing audio file
python main.py --process recordings/sample.wav
# Change language direction (English to Vietnamese)
python main.py --source en --target vi
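For readers curious how flags like these might be wired up, here is a minimal `argparse` sketch. It is not the project's actual `main.py`: the function names are hypothetical, and the default direction of Vietnamese to English is an assumption inferred from the example above. Only the four flag names come from the commands shown.

```python
import argparse
from typing import List, Optional

LANGUAGES = ("vi", "en")
OTHER = {"vi": "en", "en": "vi"}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="main.py", description="Speech translation CLI (illustrative sketch)"
    )
    parser.add_argument("--record", type=float, metavar="SECONDS",
                        help="record audio for this many seconds, then translate")
    parser.add_argument("--process", metavar="FILE",
                        help="transcribe and translate an existing audio file")
    parser.add_argument("--source", choices=LANGUAGES, help="source language")
    parser.add_argument("--target", choices=LANGUAGES, help="target language")
    return parser


def parse_cli(argv: Optional[List[str]] = None) -> argparse.Namespace:
    parser = build_parser()
    args = parser.parse_args(argv)
    # Fill in whichever side was omitted; assumed default is vi -> en.
    if args.source is None and args.target is None:
        args.source, args.target = "vi", "en"
    elif args.source is None:
        args.source = OTHER[args.target]
    elif args.target is None:
        args.target = OTHER[args.source]
    if args.source == args.target:
        parser.error("--source and --target must differ")
    return args
```

Deriving the missing language from the one given keeps `--source en` on its own valid, since only two languages are supported.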
Once in the interactive mode, you can use these commands:
- record [duration] - Record audio (with optional duration)
- transcribe <file> - Transcribe an audio file
- translate <text> - Translate text
- process <file> - Process audio file (transcribe + translate)
- speak - Record and process audio in one step
- switch - Switch source and target languages
- lang <src> <tgt> - Set source and target languages
- status - Show current status
- help - Show help information
- exit - Exit the application
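As a rough illustration of how an interactive loop could interpret these commands, the sketch below splits an input line into a command name and its arguments. The `parse_command` helper is hypothetical and is not taken from the project's `cli.py`; only the command names come from the list above.

```python
from typing import List, Tuple

COMMANDS = {
    "record", "transcribe", "translate", "process", "speak",
    "switch", "lang", "status", "help", "exit",
}


def parse_command(line: str) -> Tuple[str, List[str]]:
    """Split one input line into (command, arguments).

    An empty line yields ("", []). Unknown commands raise ValueError.
    """
    parts = line.strip().split(None, 1)
    if not parts:
        return "", []
    name = parts[0].lower()
    rest = parts[1].strip() if len(parts) > 1 else ""
    if name not in COMMANDS:
        raise ValueError(f"Unknown command: {name!r}. Type 'help' for a list.")
    if name == "translate":
        # Keep the text to translate as one argument, spaces and quotes included.
        return name, [rest] if rest else []
    return name, rest.split()
```

Treating everything after `translate` as a single argument avoids quoting problems with sentences that contain apostrophes.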
You can also use the individual modules in your own Python code:
# Recording audio
from translator_by_speech.record import AudioRecorder
recorder = AudioRecorder()
audio_path = recorder.record(duration=5) # Record for 5 seconds
# ASR (Speech to Text)
from translator_by_speech.speech_recognition import ASRModel
asr = ASRModel()
transcription = asr.transcribe_audio_file(audio_path)
# Translation
from translator_by_speech.translator import create_vi2en_translator
translator = create_vi2en_translator()
translation = translator.translate(transcription["text"])
# Complete Pipeline
from translator_by_speech.pipeline import SpeechTranslationPipeline
pipeline = SpeechTranslationPipeline()
result = pipeline.translate_speech_from_file(audio_path)
print(f"Original: {result['source_text']}")
print(f"Translation: {result['translated_text']}")
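Batch processing of several recordings can be layered on top of the same call. The helper below is a sketch under assumptions: `translate_batch` is a hypothetical name, and it expects any callable that takes a file path and returns a dictionary with the `source_text` and `translated_text` keys used in the example above. Failures are recorded per file so that one unreadable recording does not stop the batch.

```python
from typing import Callable, Dict, Iterable, List


def translate_batch(
    paths: Iterable,
    translate_file: Callable[[str], Dict[str, str]],
) -> List[Dict[str, str]]:
    """Apply `translate_file` to each path and collect one entry per file."""
    results: List[Dict[str, str]] = []
    for path in paths:
        entry = {"file": str(path)}
        try:
            result = translate_file(str(path))
            entry["source_text"] = result["source_text"]
            entry["translated_text"] = result["translated_text"]
        except Exception as exc:  # keep going; record the failure for this file
            entry["error"] = str(exc)
        results.append(entry)
    return results
```

With the pipeline object from above, a call might look like `translate_batch(Path("recordings").glob("*.wav"), pipeline.translate_speech_from_file)`, using `Path` from `pathlib`.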
This project uses the following AI models:
- ASR: suzii/vi-whisper-large-v3-turbo-v1 (Vietnamese-optimized Whisper model)
- Vietnamese to English: vinai/vinai-translate-vi2en-v2
- English to Vietnamese: vinai/vinai-translate-en2vi-v2
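Because each translation direction uses its own model, code that switches languages has to pick the matching model name. A minimal sketch of that lookup follows; the model identifiers come from the list above, while the constant and function names are hypothetical.

```python
from typing import Dict, Tuple

ASR_MODEL = "suzii/vi-whisper-large-v3-turbo-v1"

TRANSLATION_MODELS: Dict[Tuple[str, str], str] = {
    ("vi", "en"): "vinai/vinai-translate-vi2en-v2",
    ("en", "vi"): "vinai/vinai-translate-en2vi-v2",
}


def select_translation_model(source: str, target: str) -> str:
    """Return the model name for a language pair, or raise ValueError."""
    key = (source.lower(), target.lower())
    try:
        return TRANSLATION_MODELS[key]
    except KeyError:
        raise ValueError(f"Unsupported language pair: {source} -> {target}") from None
```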
- The first run will download the models, which may take some time depending on your internet connection
- Using a GPU significantly improves processing speed
- ASR (speech recognition) is the most resource-intensive part of the pipeline
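To check whether a GPU will actually be used, a small helper such as the one below can report the device before any model is loaded. This is a sketch that assumes the models run on PyTorch, which the source does not state explicitly; `pick_device` is a hypothetical name.

```python
import importlib.util


def pick_device() -> str:
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    if importlib.util.find_spec("torch") is None:
        return "cpu"  # PyTorch is not installed, so there is nothing to query
    import torch  # imported lazily so the helper also works without PyTorch

    return "cuda" if torch.cuda.is_available() else "cpu"


print(f"Models would run on: {pick_device()}")
```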
- Currently supports only Vietnamese and English
- Accuracy may vary depending on audio quality and background noise
- Large models require significant memory (especially for the ASR component)
- Add support for more languages
- Implement streaming ASR for real-time translation
- Create a graphical user interface
- Optimize models for faster inference on CPU
- Add support for batch processing of multiple files
This project is released under the MIT License.