RAVEN: Official Repository of Real-Time Audio-Visual Speech Enhancement Using Pre-trained Visual Representations
This is the official repository of the paper Real-Time Audio-Visual Speech Enhancement Using Pre-trained Visual Representations, accepted at Interspeech 2025.
Clone this GitHub repo and run
git submodule update --init --recursive
Create a virtual environment:
conda create -y -n avse python=3.8
conda activate avse
pip install -r requirements.txt
conda install -c conda-forge ffmpeg
❗ FIRST, change `PROJECT_ROOT_PATH` in `config.py` before proceeding.
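For reference, this is a plain assignment in `config.py`; the path below is a placeholder, and the actual file may contain additional settings:

```python
# config.py -- point this at the absolute path of your local clone of this repo
PROJECT_ROOT_PATH = "/your/path/to/this_project"
```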
Then run
export PYTHONPATH='/your/path/to/this_project'
in your terminal and change directory to the `src` folder.
We use VoxCeleb2 for training. Please download the dataset and set `DATA_FOLDER_PATH` in `config.py` to the folder where you saved the data.
We also use MUSAN to create the noisy input mixtures. Please download the dataset and set `MUSAN_FOLDER_PATH` to the folder where you saved the MUSAN data.
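As with `PROJECT_ROOT_PATH`, here is a minimal sketch of what these settings might look like; the paths are placeholders, and it is assumed here that `MUSAN_FOLDER_PATH` also lives in `config.py`:

```python
# config.py -- dataset locations (placeholder paths)
DATA_FOLDER_PATH = "/your/path/to/VoxCeleb2"   # VoxCeleb2 root folder
MUSAN_FOLDER_PATH = "/your/path/to/musan"      # MUSAN root folder
```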
You can find our train/val/test split of VoxCeleb2 at `src/data/split.parquet`, and the train/val/test split of MUSAN at `src/data/musan_split.csv`.
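If you want to inspect the splits, they load directly with pandas; this is just a sketch, and the column layout is whatever the files define:

```python
# Quick look at the data splits (run from the project root; parquet needs pyarrow or fastparquet)
import pandas as pd

vox_split = pd.read_parquet("src/data/split.parquet")
musan_split = pd.read_csv("src/data/musan_split.csv")

print(vox_split.head())
print(musan_split.head())
```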
Before extracting the pretrained embeddings of the dataset, clone the corresponding GitHub repo into the project root folder and download the checkpoints of the selected pretrained model. Follow the environment setup instructions in each pretrained model's GitHub README, then run the corresponding feature extractor script from the `src` folder in that environment (an example invocation follows the table below).
| Encoder Task | Encoder Name | GitHub Repo | Checkpoint Used | Feature Extractor Script |
| --- | --- | --- | --- | --- |
| AVSR | VSRiW | link: save to `/benchmarks/GRID/models/` | GRID visual-only unseen WER=4.8 [src] | `src/data/VSRiW_extract_visual_features.py` |
| AVSR | AVHuBERT Paper 1, 2 | link1: save to `avhubert/conf/finetune/` | base fine-tuned for VSR on LRS3-433h [src] 2 | `src/data/avhubert_extract_visual_features.py` |
| ASD | TalkNet | link: save to repo root folder | [src] | `src/data/TalkNet_extract_visual_features.py` |
| ASD | LoCoNet | link: save to repo root folder | [src] | `src/data/LoCoNet_extract_visual_features.py` |
1 Follow the instructions in the GitHub repo README, then downgrade to `omegaconf==2.0.1` and `hydra-core==1.0.0`; you need `pip < 24.1` to install `omegaconf==2.0.1`, and you may also need `numpy < 1.24`.
2 Go to the official model checkpoint page and sign the license agreement first.
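For example, after setting up the TalkNet environment, the extraction step would be run from the `src` folder roughly as follows; the script path comes from the table above, and any arguments the extractor expects (e.g., a checkpoint path) are defined in the script itself:
python -W ignore data/TalkNet_extract_visual_features.py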
For faster training and processing, convert all the .m4a files to .wav first; a minimal conversion sketch follows the mixing command below. Then, from the `src` folder, run the command below to create the noisy input mixtures.
python -W ignore utils/mix_speech_gpu.py
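The following is a hypothetical helper (not part of this repo) for the .m4a-to-.wav conversion, using the ffmpeg binary installed earlier; the data layout, in-place output, and 16 kHz sample rate are assumptions, so adjust them to whatever your setup expects:

```python
# Hypothetical helper: convert every VoxCeleb2 .m4a file to a .wav next to it via ffmpeg.
import subprocess
from pathlib import Path

DATA_FOLDER_PATH = "/your/path/to/VoxCeleb2"  # same value as in config.py

for m4a in Path(DATA_FOLDER_PATH).rglob("*.m4a"):
    wav = m4a.with_suffix(".wav")
    if not wav.exists():
        subprocess.run(
            ["ffmpeg", "-loglevel", "error", "-i", str(m4a), "-ar", "16000", str(wav)],
            check=True,
        )
```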
Run the terminal command below to start training the model. By default, logs and checkpoints will be saved to the `CHECKPOINT_DIR` defined in `config.py`. You can override parameters such as the visual encoder, batch size, and checkpoint directory using command-line arguments (an illustrative example follows the command).
python -W ignore train.py
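For example, an override run might look like the following; the flag names are illustrative only (except where they mirror `test.py` below), so check the argument parser in `train.py` for the exact names:
python -W ignore train.py --visual_encoder=TalkNet --batch_size=8 --checkpoint_dir=/your/path/to/checkpoints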
To resume training from a saved checkpoint, add the --train_from_checkpoint flag and specify the path using --ckpt_path:
python -W ignore train.py --train_from_checkpoint --ckpt_path=checkpoints/epoch-last.ckpt
First, generate test input mixtures for the different conditions and SNR scenarios by running
python -W ignore data/generate_test_data.py --condition=noise_only --snr=-10
If you want to generate several conditions and SNRs at once, separate them with commas, for example: --condition="noise_only, one_interfering_speaker, three_interfering_speakers" --snr="mixed, -10, -5, 0"
The mixtures will be saved to `/path/to/VoxCeleb2/dev/mixed_wav/{condition}/{snr}/`.
To evaluate a single visual encoder under a specific test condition and SNR, use test.py, which accepts command-line arguments for full flexibility:
python -W ignore test.py --visual_encoder=TalkNet --test_condition=noise_only --test_snr=-10
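To sweep several conditions and SNRs for one encoder, a small driver script can call `test.py` in a loop. This is only a sketch, assuming the corresponding mixtures have already been generated and that it is run from the `src` folder:

```python
# Evaluate one visual encoder across several test conditions and SNRs by invoking test.py.
# Flag names and values are taken from the examples above.
import subprocess

conditions = ["noise_only", "one_interfering_speaker", "three_interfering_speakers"]
snrs = ["mixed", "-10", "-5", "0"]

for condition in conditions:
    for snr in snrs:
        subprocess.run(
            [
                "python", "-W", "ignore", "test.py",
                "--visual_encoder=TalkNet",
                f"--test_condition={condition}",
                f"--test_snr={snr}",
            ],
            check=True,
        )
```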