
FAN-Wav2Lip: A Lip Synchronization Model for English to Hindi Dubbing

FAN-Wav2Lip integrates the Wav2Lip model with a Facial Alignment Network (FAN) to improve lip synchronization in dubbed videos. The model generates lip movements tightly synchronized to Hindi audio and enhances video quality with Real-ESRGAN super-resolution. Trained on a benchmark Hindi-dubbed dataset, it outperforms state-of-the-art methods in lip-sync accuracy and visual clarity.

Lip Synchronization Overview

Given Hindi audio and a target English video, our model produces an output video whose lip movements are synchronized to the Hindi audio.


🏆 Key Contributions

  1. Benchmark Dataset: Hindi-dubbed videos paired with high-quality annotations for lip synchronization evaluation.
  2. FAN-Wav2Lip Model: Combines Wav2Lip's audio-visual alignment capabilities with FAN for precise lip landmark detection.
  3. Video Quality Enhancement: Incorporates Real-ESRGAN to enhance visual quality for a seamless viewing experience.

📂 Dataset

A curated dataset (HRS) of 100 video clips with diverse linguistic inputs, Hindi subtitles (.srt format), and annotations. Download Dataset
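
As an illustration of how the .srt annotations could be consumed, the sketch below parses a subtitle file into (start, end, text) segments. It is a minimal example, not this repository's loader: the clips/clip_001.srt path is hypothetical and the third-party pysrt package is an assumed dependency.

```python
# Minimal sketch: read Hindi subtitle annotations from an .srt file.
# The file path below is hypothetical and pysrt (pip install pysrt) is an
# assumed dependency, not necessarily what this repository uses.
import pysrt

def load_segments(srt_path):
    """Return (start_sec, end_sec, text) for each subtitle entry."""
    subs = pysrt.open(srt_path, encoding="utf-8")
    return [
        (sub.start.ordinal / 1000.0,  # .ordinal is the timestamp in milliseconds
         sub.end.ordinal / 1000.0,
         sub.text)
        for sub in subs
    ]

if __name__ == "__main__":
    for start, end, text in load_segments("clips/clip_001.srt"):
        print(f"{start:7.2f}s - {end:7.2f}s  {text}")
```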

🛠 Model Architecture

1. Lip Synchronization

  • Input: Hindi audio and English video.
  • Uses FAN for detailed lip landmark extraction and Wav2Lip for audio-to-lip alignment.
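
For orientation, FAN's 68-point convention places the mouth at indices 48-67. The sketch below shows one plausible way to extract those points with the face-alignment package listed under Pretrained Models; it is an assumption about usage (the LandmarksType enum name varies across versions), not this repository's exact code.

```python
# Sketch: extract lip landmarks from one video frame with FAN
# (https://github.com/1adrianb/face-alignment). Older releases spell the
# enum face_alignment.LandmarksType._2D instead of TWO_D.
import cv2
import numpy as np
import face_alignment

fa = face_alignment.FaceAlignment(face_alignment.LandmarksType.TWO_D, device="cpu")

frame = cv2.imread("frame.jpg")                # BGR frame from the English video
rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)   # FAN expects RGB input
faces = fa.get_landmarks(rgb)                  # list of (68, 2) arrays, or None

if faces:
    lips = faces[0][48:68]                     # points 48-67: outer + inner lip contour
    print("lip landmark centroid:", np.mean(lips, axis=0))
```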

2. Video Quality Enhancement

  • Real-ESRGAN enhances resolution and reduces artifacts.
  • Ensures visually appealing and coherent outputs.
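
The sketch below gives a rough idea of how a single frame could be upscaled with Real-ESRGAN. It follows the Python API used in the Real-ESRGAN repository's inference script (RRDBNet plus RealESRGANer); the weight path and constructor arguments are assumptions and may not match this project's integration.

```python
# Sketch: upscale one frame with Real-ESRGAN (x4plus weights assumed).
# Mirrors Real-ESRGAN's own inference script, not this repository's code.
import cv2
from basicsr.archs.rrdbnet_arch import RRDBNet
from realesrgan import RealESRGANer

model = RRDBNet(num_in_ch=3, num_out_ch=3, num_feat=64,
                num_block=23, num_grow_ch=32, scale=4)
upsampler = RealESRGANer(
    scale=4,
    model_path="pretrained/RealESRGAN_x4plus.pth",   # assumed weight location
    model=model,
    tile=0, tile_pad=10, pre_pad=0,
    half=False,                                      # set True on GPUs with fp16 support
)

frame = cv2.imread("lowres_frame.png")               # frame from the lip-synced video
enhanced, _ = upsampler.enhance(frame, outscale=4)   # returns (image, img_mode)
cv2.imwrite("enhanced_frame.png", enhanced)
```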

Model Architecture

Block diagram of the proposed model. (a) Lip landmarks are extracted from each video frame. (b) Audio features are extracted from the input audio signal. (c) Lip movements synchronized with the audio are generated throughout the video from the concatenated audio and visual features. (d) The low-resolution frames are processed by convolutional layers that capture spatial features. (e) Residuals between the low-resolution frames and the desired high-resolution frames are computed to restore missing fine detail. (f) Remaining artifacts are removed so that the output stays close to the high-resolution target. (g) The final enhanced, lip-synced video in Hindi.
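
Stage (b) above corresponds to the mel-spectrogram features that Wav2Lip conditions on. The snippet below is a minimal sketch of that step using librosa; the parameter values (16 kHz, 80 mel bands, 800-sample FFT, 200-sample hop) are Wav2Lip's published defaults and are assumed here rather than taken from this repository.

```python
# Sketch of stage (b): mel-spectrogram features from the Hindi audio.
# Parameter values follow Wav2Lip's defaults and are assumptions here.
import librosa
import numpy as np

def audio_features(wav_path, sr=16000, n_mels=80, n_fft=800, hop_length=200):
    wav, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=wav, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)   # log-scaled (n_mels, T) array

if __name__ == "__main__":
    print(audio_features("hindi_audio.wav").shape)
```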


🚀 Performance

Quantitative Metrics

| Model       | RLMD  | LSE   | SSIM  |
|-------------|-------|-------|-------|
| Wav2Lip     | 0.404 | 0.417 | 0.674 |
| Diff2Lip    | 0.529 | 0.447 | 0.598 |
| FAN-Wav2Lip | 0.233 | 0.389 | 0.694 |
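
SSIM (and the PSNR figure quoted in the ablation study below) can be reproduced per frame with scikit-image, as sketched here. RLMD and LSE are defined in the paper and are not covered by this snippet; the frame file names are placeholders.

```python
# Sketch: per-frame SSIM and PSNR between a generated frame and ground truth.
# channel_axis requires scikit-image >= 0.19 (older versions use multichannel=True).
import cv2
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

generated = cv2.imread("generated_frame.png")
reference = cv2.imread("ground_truth_frame.png")

ssim = structural_similarity(reference, generated, channel_axis=2)
psnr = peak_signal_noise_ratio(reference, generated)
print(f"SSIM={ssim:.3f}  PSNR={psnr:.2f} dB")
```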

Qualitative Evaluation

We conducted a qualitative evaluation in which participants rated the results produced by our model on a scale of 0-5 for:

  1. Lip-Sync Naturalness: How well lip movements match audio.
  2. Visual Coherence: The smoothness and clarity of the video.

| Set ID | Lip-Sync Naturalness | Visual Coherence |
|--------|----------------------|------------------|
| 1      | 4.1                  | 4.2              |
| 2      | 4.0                  | 4.1              |
| 3      | 4.2                  | 4.3              |
| 4      | 4.0                  | 4.1              |
| 5      | 4.3                  | 4.4              |

Visual Comparison

The outputs of FAN-Wav2Lip closely resemble the ground truth, outperforming other models:

Visual Comparison

Results of lip synchronization generated by various state-of-the-art models on our dataset: (a) ground truth, (b) Wav2Lip, (c) Diff2Lip, (d) TalkLip, and (e) the proposed approach. The results produced by our model (e) resemble the ground truth (a) more closely than those of the other models.


📊 Ablation Study

  • Removing Real-ESRGAN results in a 5-10% drop in PSNR and SSIM scores.
  • FAN integration improves RLMD and LSE by 20%, highlighting the importance of lip landmark accuracy.

🔧 Setup & Installation

  1. Clone the repository:
    git clone https://github.com/gouthamireddy2507/FAN-Wav2Lip.git
  2. Install dependencies:
    pip install -r requirements.txt
    
  3. Download pretrained models and place them in the pretrained directory.

🗂 Pretrained Models

Download pretrained models used in this project:

  1. Wav2Lip
    Repository: https://github.com/Rudrabha/Wav2Lip
    Download Pretrained Model: Direct Link

  2. Facial Alignment Network (FAN)
    Repository: https://github.com/1adrianb/face-alignment
    Download Pretrained Model: Direct Link

  3. Real-ESRGAN
    Repository: https://github.com/xinntao/Real-ESRGAN
    Download Pretrained Model: Direct Link

Place all downloaded weights into the pretrained/ directory to ensure the project runs seamlessly.
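
An expected layout would look roughly like the tree below; the file names are the defaults published by each upstream repository and are assumptions, since the scripts here may expect different names.

```
pretrained/
├── wav2lip_gan.pth          # Wav2Lip weights (upstream default name; assumed)
├── <FAN weights>.pth        # face-alignment / FAN weights (name varies by version)
└── RealESRGAN_x4plus.pth    # Real-ESRGAN x4 weights (upstream default name; assumed)
```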

🎥 Demo

Experience the lip-synced outputs:


🤝 Citation

If you use this code, please cite our work:

@article{Sanyal2024FANWav2Lip,
  title={FAN-Wav2Lip: A Lip Synchronization Model for English to Hindi Dubbing},
  author={Samriddha Sanyal and Gouthami Godhala and Vijayasree Asam and Pankaj Jain},
  journal={Preprint submitted to Journal of Visual Communication and Image Representation},
  year={2024}
}
