This project showcases an Advanced Speech Recognition (ASR) System that transcribes audio to text in real time using OpenAI's Whisper model. With support for multilingual transcription, noise robustness, and a user-friendly web interface, it is well suited to transcription, dictation, and voice-based user interfaces.
Voice-driven technologies are revolutionizing industries like healthcare, customer service, and accessibility. Existing systems often face challenges in noisy environments, handling accents, or processing multiple languages. This project addresses these gaps by leveraging the Whisper model's versatility and accuracy to build a scalable, real-time ASR system.
Traditional speech recognition systems are hindered by:
- 🚫 High latency
- ❌ Limited accuracy in noisy environments
- 🌐 Poor multilingual support
- 🔒 Privacy concerns due to reliance on cloud services
This project tackles these issues, aiming for real-time, accurate, and multilingual transcription, all while maintaining robust privacy measures.
- Real-time Transcription: Accurate audio-to-text conversion.
- Multilingual Support: Dynamic recognition of multiple languages.
- Noise Robustness: Enhanced transcription in noisy environments.
- User-Friendly GUI: Intuitive web interface for seamless interaction.
- Low-latency, real-time conversion for time-sensitive applications.
- Support for additional use cases like subtitles and live translations.
Input and Preprocessing:
- Context-aware transcription using past tokens.
- Starts transcription pipeline upon receiving audio data.
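The context-carryover step can be sketched as a rolling window of previously transcribed text that is fed back as a conditioning prompt for the next chunk. This is a hedged illustration, not the project's actual code; the class name and the `max_chars` limit are assumptions:

```python
class TranscriptionContext:
    """Rolling window of past transcript text used to condition
    the next audio chunk (illustrative sketch)."""

    def __init__(self, max_chars=200):
        # Whisper conditions on a limited number of past tokens,
        # so we cap the retained context (the limit is an assumption).
        self.max_chars = max_chars
        self.text = ""

    def add(self, segment_text):
        # Append the newly transcribed segment to the context,
        # then keep only the most recent characters.
        self.text = (self.text + " " + segment_text).strip()
        if len(self.text) > self.max_chars:
            self.text = self.text[-self.max_chars:]

    def prompt(self):
        # Text that could be passed, e.g., as `initial_prompt`
        # to a Whisper transcribe() call.
        return self.text
```

The `openai-whisper` package accepts an `initial_prompt` argument in `model.transcribe()`; carrying recent text forward this way helps keep terminology and spelling consistent across chunk boundaries.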
Language and Speech Detection:
- Automatic language identification for multilingual support.
- Voice activity detection to filter non-speech audio.
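Voice activity detection can be approximated with a short-time energy gate. The sketch below is a minimal, assumed stand-in for a real VAD (the frame length and threshold are illustrative, not values from this project):

```python
import math

def detect_speech_frames(audio, frame_len=400, threshold=0.01):
    """Return one boolean per frame: True where the frame's RMS
    energy exceeds the threshold (a crude voice-activity proxy).

    `audio` is a sequence of float samples in [-1.0, 1.0].
    """
    flags = []
    for start in range(0, len(audio) - frame_len + 1, frame_len):
        frame = audio[start:start + frame_len]
        # Root-mean-square energy of the frame.
        rms = math.sqrt(sum(s * s for s in frame) / frame_len)
        flags.append(rms > threshold)
    return flags
```

Frames flagged `False` could be dropped before transcription, saving compute and avoiding hallucinated text on silence. Production systems typically use a trained VAD (e.g., WebRTC VAD) rather than a fixed energy threshold.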
Transcription and Translation:
- Time-aligned Transcription: With timestamps for precise indexing.
- Text-only Transcription: Continuous text output.
- Optional translation from non-English to English.
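The optional translation step amounts to choosing Whisper's task mode from the detected language. A minimal sketch, where the helper name is an assumption:

```python
def select_task(detected_language, translate_to_english=True):
    """Pick Whisper's task mode: 'translate' converts non-English
    speech to English text; 'transcribe' keeps the source language."""
    if translate_to_english and detected_language != "en":
        return "translate"
    return "transcribe"
```

With the `openai-whisper` package this value would be passed as `model.transcribe(audio, task=select_task(lang))`.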
Output Generation:
- Generates time-stamped or continuous text.
- Marks transcription completion per audio segment.
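The two output modes above can be sketched as a small formatter over Whisper-style segments (dicts with `start`, `end`, and `text`, the shape found under `model.transcribe()["segments"]`); the function names are illustrative:

```python
def format_timestamp(seconds):
    """Render a time offset in seconds as HH:MM:SS."""
    h, rem = divmod(int(seconds), 3600)
    m, s = divmod(rem, 60)
    return f"{h:02d}:{m:02d}:{s:02d}"

def render_segments(segments, with_timestamps=True):
    """Produce time-stamped lines or one continuous text stream."""
    if with_timestamps:
        return "\n".join(
            f"[{format_timestamp(seg['start'])} -> "
            f"{format_timestamp(seg['end'])}] {seg['text'].strip()}"
            for seg in segments
        )
    return " ".join(seg["text"].strip() for seg in segments)
```

The timestamped form is also a convenient starting point for subtitle export, while the continuous form suits dictation-style output.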
- Streamlit: Web-based user interface.
- Whisper AI: Speech-to-text engine.
Deployment:
- Platform: Streamlit Cloud
- Deployment Link: Hertz ASR App
- High-speed internet for real-time services.
- Scalable cloud-based infrastructure.
```bash
git clone https://github.com/AE-Hertz/speech-to-text/
cd speech-to-text
pip install -r requirements.txt
streamlit run app.py
```
We welcome contributions to improve this project! Please:
- Fork the repository.
- Create a feature branch.
- Submit a pull request.
This project is licensed under the MIT License.
- OpenAI for the Whisper model.
- Streamlit for the web development framework.
Happy coding! 😊