ECHO is a system designed to make online video podcasts accessible to hearing-impaired individuals. It identifies and visualizes the active speaker 🗣️, synchronizes subtitles 📝, and provides Hindi translations 🇮🇳 for enhanced inclusivity.
🖋️ Presented at:
ACM 8th International Conference on Data Science and Management of Data (CODS COMAD)
📅 Date: December 18–21, 2024
📍 Location: IIT Jodhpur, India
- 🗣️ Speaker Identification: Detects the active speaker in a video using lip movement and audio analysis.
- 📝 Subtitle Synchronization: Automatically generates synchronized subtitles.
- 🎥 Multimodal Integration: Combines video (face and lip movement detection) with audio (speech transcription and speaker classification).
- 🌐 Hindi Translations: Automatically translates English subtitles into Hindi for better accessibility (see the translation sketch after this list).
- 📂 Benchmark Dataset: Includes 500 annotated videos, diverse in accents and genres, tailored for speaker identification.
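The README does not name the translation model, so the following is only a minimal sketch of the English→Hindi subtitle pass; the `Helsinki-NLP/opus-mt-en-hi` checkpoint and the helper function are assumptions for illustration, not ECHO's actual implementation.

```python
# Hypothetical sketch of the English -> Hindi subtitle translation step.
# The MarianMT checkpoint below is an assumption; ECHO's real model may differ.
from transformers import MarianMTModel, MarianTokenizer

MODEL_NAME = "Helsinki-NLP/opus-mt-en-hi"  # assumed public en->hi checkpoint

tokenizer = MarianTokenizer.from_pretrained(MODEL_NAME)
model = MarianMTModel.from_pretrained(MODEL_NAME)

def translate_to_hindi(lines):
    """Translate a list of English subtitle lines into Hindi."""
    batch = tokenizer(lines, return_tensors="pt", padding=True, truncation=True)
    generated = model.generate(**batch)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)

print(translate_to_hindi(["Welcome back to the show.", "Thanks for having me."]))
```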
ECHO integrates state-of-the-art models for seamless functionality (illustrative sketches of the audio and video stages follow the architecture figure below):
- Face Detection: MTCNN detects faces with high accuracy.
- Lip Movement Detection: LipNet analyzes lip movements for speaker identification.
- Speech Transcription: Powered by OpenAI’s Whisper.
- Speaker Embeddings: Extracted with Wav2Vec 2.0.
- Clustering: Groups speaker data for robust classification.
- Synchronization: Syncs video and audio streams in real time for seamless output.
- Visualization: Bounding boxes and color-coded speakers make the experience intuitive.
Figure: Block diagram of the ECHO architecture.
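As a rough illustration of the audio stage, the sketch below transcribes an audio track with Whisper, embeds each transcribed segment with Wav2Vec 2.0, and clusters the segments into speakers with scikit-learn. The file path, the fixed speaker count, and the use of the `openai-whisper` package (not in the requirements list) are assumptions; the actual logic in `main.py` may differ.

```python
# Illustrative audio pipeline: transcribe, embed each segment, cluster into speakers.
# Mean-pooled Wav2Vec 2.0 features and a known speaker count are simplifying assumptions.
import librosa
import torch
import whisper  # openai-whisper, assumed installed in addition to the listed requirements
from sklearn.cluster import AgglomerativeClustering
from transformers import Wav2Vec2Model, Wav2Vec2Processor

AUDIO_PATH = "data/input/podcast.wav"  # hypothetical input file
NUM_SPEAKERS = 2                       # assumed known for this sketch

# 1) Transcribe with Whisper; each segment carries start/end times and text.
segments = whisper.load_model("base").transcribe(AUDIO_PATH)["segments"]

# 2) Embed each segment with Wav2Vec 2.0 (mean-pooled hidden states).
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
w2v = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
audio, sr = librosa.load(AUDIO_PATH, sr=16000)

kept, embeddings = [], []
for seg in segments:
    clip = audio[int(seg["start"] * sr): int(seg["end"] * sr)]
    if len(clip) < sr // 10:  # skip clips too short to embed (< 0.1 s)
        continue
    inputs = processor(clip, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        hidden = w2v(**inputs).last_hidden_state  # shape: (1, frames, 768)
    embeddings.append(hidden.mean(dim=1).squeeze(0).numpy())
    kept.append(seg)

# 3) Cluster segment embeddings into speaker groups.
labels = AgglomerativeClustering(n_clusters=NUM_SPEAKERS).fit_predict(embeddings)

for seg, speaker in zip(kept, labels):
    print(f"[{seg['start']:7.1f}s - {seg['end']:7.1f}s] speaker {speaker}: {seg['text'].strip()}")
```

On the video side, face detection and color-coded bounding boxes could be sketched as follows. The `mtcnn` package is an assumed dependency on top of the listed `opencv-python`, the speaker assignment is a placeholder, and the LipNet lip-movement step is omitted entirely.

```python
# Illustrative video pass: detect faces per frame and draw a color-coded box.
import cv2
from mtcnn import MTCNN  # assumed MTCNN implementation; not in the requirements list

SPEAKER_COLORS = [(0, 200, 0), (0, 0, 220)]  # one BGR color per speaker

detector = MTCNN()
cap = cv2.VideoCapture("data/input/podcast.mp4")  # hypothetical input file
ok, frame = cap.read()
while ok:
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    for face in detector.detect_faces(rgb):
        x, y, w, h = face["box"]
        # ECHO picks the active speaker from lip movement + audio; this sketch
        # simply uses speaker 0 as a placeholder for the color choice.
        cv2.rectangle(frame, (x, y), (x + w, y + h), SPEAKER_COLORS[0], 2)
    # ... write `frame` to the output video here ...
    ok, frame = cap.read()
cap.release()
```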
The system excels across multiple benchmarks:
- 🏆 Word Error Rate (WER): 5.3% – delivering accurate transcriptions.
- 🏆 Speaker Error Rate (SER): 9.2% – ensuring precise speaker classification.
- Integration of audio and video models improves accuracy by 8%.
- Incorporating lip detection reduces synchronization errors by 4%.
The dataset includes:
- 🎥 500 conversation videos, annotated with English subtitles in `.srt` format (an example entry is shown after this list).
- 🌐 Designed for diverse accents and genres.
- 📥 Download a sample of the dataset here.
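For reference, each entry in a SubRip `.srt` file has an index, a time range, and one or more lines of text; the snippet below is a made-up illustration of the format, not a line from the dataset.

```
1
00:00:01,000 --> 00:00:04,200
Welcome back to the show.

2
00:00:04,400 --> 00:00:07,000
Thanks for having me.
```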
- Python 3.8 or higher
- Required libraries: `torch`, `transformers`, `opencv-python`, `librosa`, `scikit-learn`
- Clone the repository:
```bash
git clone https://github.com/your-repo/echo.git
cd echo
```
- Install dependencies:
```bash
pip install -r requirements.txt
```
- Place input videos in the `data/input` directory.
- Process the videos:
```bash
python main.py --input data/input --output data/output
```
- Output videos with subtitles will be saved in the `data/output` directory. 🎉
The system supports multiple video genres:
| Genre | WER | SER |
|---|---|---|
| 🎙️ Talk shows | 9.0% | 10.3% |
| 🎤 Interviews | 8.5% | 9.5% |
| 🗳️ Political debates | 5.1% | 9.0% |
If you use this work in your research, please cite:
```bibtex
@article{godhala2025echo,
  title={ECHO: enhanced communication for hearing impaired in online video podcasts},
  author={Godhala, Gouthami and Asam, Vijayasree and Sanyal, Samriddha},
  journal={Discover Data},
  volume={3},
  number={1},
  pages={32},
  year={2025},
  publisher={Springer}
}
```
This project was supported by the Centre for Interdisciplinary Artificial Intelligence (CAI), FLAME University.
- Gouthami Godhala
- Vijayasree Asam
- Samriddha Sanyal