Speech-to-Text Translation System

A comprehensive audio translation system that combines speech recording, automatic speech recognition (ASR), and text translation capabilities to enable seamless translation between Vietnamese and English.

Overview

This project integrates three core modules:

  1. Audio Recording Module: Captures audio from microphone input with support for both fixed-duration recording and smart silence detection.
  2. ASR Module: Transcribes speech to text using the Whisper model (suzii/vi-whisper-large-v3-turbo-v1).
  3. Translation Module: Translates text between Vietnamese and English using VINAI's neural machine translation models.

The system provides both programmatic APIs for developers and a user-friendly command-line interface for end users.

Features

  • Bidirectional Translation: Support for Vietnamese ↔ English translation
  • Real-time Audio Processing: Record and immediately transcribe/translate audio
  • Multiple Recording Modes: Fixed duration or automatic silence detection (see the sketch after this list)
  • Batch Processing: Process multiple audio files or text segments
  • Optimized Performance: Efficient model loading and inference
  • User-friendly CLI: Interactive command-line interface for all operations
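
A minimal sketch of how the two recording modes might be invoked. The fixed-duration call matches the API example later in this README; the no-argument silence-detection call is a hypothetical placeholder, not a documented signature:

from translator_by_speech.record import AudioRecorder

recorder = AudioRecorder()

# Fixed-duration mode: stop after 10 seconds (matches the API example below)
fixed_clip = recorder.record(duration=10)

# Silence-detection mode (hypothetical call): assumed to stop once the speaker goes quiet
# auto_clip = recorder.record()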

Installation

Prerequisites

  • Python 3.8 or higher
  • CUDA-compatible GPU (recommended for faster processing)

Dependencies

pip install -r requirements.txt

Project Structure

├── translator_by_speech/
│   ├── cli.py                # Command-line interface
│   ├── record.py             # Audio recording module
│   ├── speech_recognition.py # ASR module and translation pipeline
│   ├── translator.py         # Text translation module
│   └── pipeline.py           # Combines the modules into the full pipeline
├── recordings/               # Directory for stored audio recordings
└── transcripts/              # Directory for transcription and translation outputs

Usage

Command-Line Interface

The easiest way to use the system is through the provided CLI:

# Start interactive mode
python main.py

# Record 10 seconds of audio and translate
python main.py --record 10

# Process an existing audio file
python main.py --process recordings/sample.wav

# Change language direction (English to Vietnamese)
python main.py --source en --target vi

Interactive Commands

Once in interactive mode, you can use the following commands:

  • record [duration] - Record audio (with optional duration)
  • transcribe <file> - Transcribe an audio file
  • translate <text> - Translate text
  • process <file> - Process audio file (transcribe + translate)
  • speak - Record and process audio in one step
  • switch - Switch source and target languages
  • lang <src> <tgt> - Set source and target languages
  • status - Show current status
  • help - Show help information
  • exit - Exit the application

API Usage

You can also use the individual modules in your own Python code:

# Recording audio
from translator_by_speech.record import AudioRecorder
recorder = AudioRecorder()
audio_path = recorder.record(duration=5)  # Record for 5 seconds

# ASR (Speech to Text)
from translator_by_speech.speech_recognition import ASRModel
asr = ASRModel()
transcription = asr.transcribe_audio_file(audio_path)

# Translation
from translator_by_speech.translator import create_vi2en_translator
translator = create_vi2en_translator()
translation = translator.translate(transcription["text"])

# Complete Pipeline
from translator_by_speech.pipeline import SpeechTranslationPipeline
pipeline = SpeechTranslationPipeline()
result = pipeline.translate_speech_from_file(audio_path)
print(f"Original: {result['source_text']}")
print(f"Translation: {result['translated_text']}")

Models

This project uses the following AI models:

  • ASR: suzii/vi-whisper-large-v3-turbo-v1 (Vietnamese-optimized Whisper model)
  • Vietnamese to English: vinai/vinai-translate-vi2en-v2
  • English to Vietnamese: vinai/vinai-translate-en2vi-v2
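
If you want to experiment with these checkpoints outside of this project, they can be loaded directly with Hugging Face Transformers. This is a generic loading sketch, not necessarily how this repository wires the models up internally:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, pipeline

# ASR checkpoint (Vietnamese-optimized Whisper)
asr = pipeline("automatic-speech-recognition", model="suzii/vi-whisper-large-v3-turbo-v1")

# Vietnamese -> English translation checkpoint
vi2en_tokenizer = AutoTokenizer.from_pretrained("vinai/vinai-translate-vi2en-v2")
vi2en_model = AutoModelForSeq2SeqLM.from_pretrained("vinai/vinai-translate-vi2en-v2")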

Performance Considerations

  • The first run will download the models, which may take some time depending on your internet connection
  • Using a GPU significantly improves processing speed (see the quick check below)
  • ASR (speech recognition) is the most resource-intensive part of the pipeline
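
Since the checkpoints above are standard PyTorch/Transformers models, a quick way to confirm whether a GPU will actually be used is the following small sketch (assuming PyTorch is installed):

import torch

# "cuda" if a compatible GPU and driver are visible to PyTorch, otherwise "cpu"
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Inference device: {device}")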

Limitations

  • Currently supports only Vietnamese and English
  • Accuracy may vary depending on audio quality and background noise
  • Large models require significant memory (especially for the ASR component)

Future Improvements

  • Add support for more languages
  • Implement streaming ASR for real-time translation
  • Create a graphical user interface
  • Optimize models for faster inference on CPU
  • Add support for batch processing of multiple files

License

This project is released under the MIT License.
