
RAVEN: Official Repository of Real-Time Audio-Visual Speech Enhancement Using Pre-trained Visual Representations

This is the official repository of the paper Real-Time Audio-Visual Speech Enhancement Using Pre-trained Visual Representations, accepted at Interspeech 2025.

Usage Instructions

Clone this GitHub repo and run

git submodule update --init --recursive

Create a virtual environment:

conda create -y -n avse python=3.8
conda activate avse
pip install -r requirements.txt
conda install -c conda-forge ffmpeg

❗ FIRST, change PROJECT_ROOT_PATH in config.py before proceeding.

Then run export PYTHONPATH='/your/path/to/this_project' in your terminal and change directory to the src folder.
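For orientation, here is a minimal sketch of the config.py settings this README refers to; the actual file may define more options, and the values below are placeholders:

# config.py (illustrative sketch; only the settings mentioned in this README are shown)
PROJECT_ROOT_PATH = "/your/path/to/this_project"  # change this first
DATA_FOLDER_PATH = "/path/to/VoxCeleb2"           # set in the Data Preprocessing step below
MUSAN_FOLDER_PATH = "/path/to/musan"              # set in the Data Preprocessing step below
CHECKPOINT_DIR = "/path/to/checkpoints"           # where training logs and checkpoints are written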

Data Preprocessing

We use VoxCeleb2 for training. Please download the dataset and set DATA_FOLDER_PATH in config.py to the folder where you saved the data.

We also use MUSAN to create the noisy input mixtures. Please download the dataset and set MUSAN_FOLDER_PATH in config.py to the folder where you saved the MUSAN data.

You can find our train/val/test split of VoxCeleb2 at src/data/split.parquet, and the train/val/test split of MUSAN at src/data/musan_split.csv.
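If you want to inspect these splits, they load directly with pandas; a minimal sketch (run from the project root; reading the parquet file requires pyarrow or fastparquet), printing the structure rather than assuming specific column names:

import pandas as pd

# Load the split files shipped with the repo
vox_split = pd.read_parquet("src/data/split.parquet")
musan_split = pd.read_csv("src/data/musan_split.csv")

# Inspect their columns and sizes before relying on specific fields
print(vox_split.columns.tolist(), len(vox_split))
print(musan_split.columns.tolist(), len(musan_split))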

Extract Pretrained Visual Embeddings

Prior to extracting the pretrained embeddings for the dataset, clone the corresponding GitHub repo into the project root folder and download the checkpoints of the selected pretrained model. Follow the environment setup instructions in each pretrained model's GitHub README, then run the corresponding feature extractor script from the src folder inside that environment.

Encoder Task | Encoder Name | GitHub Repo | Checkpoint Used | Feature Extractor Script
AVSR | VSRiW | link: save to /benchmarks/GRID/models/ | GRID visual-only unseen, WER=4.8 [src] | src/data/VSRiW_extract_visual_features.py
AVSR | AVHuBERT (Paper 1, 2) | link (note 1): save to avhubert/conf/finetune/ | base fine-tuned for VSR on LRS3-433h [src] (note 2) | src/data/avhubert_extract_visual_features.py
ASD | TalkNet | link: save to repo root folder | [src] | src/data/TalkNet_extract_visual_features.py
ASD | LoCoNet | link: save to repo root folder | [src] | src/data/LoCoNet_extract_visual_features.py

Note 1: Follow the instructions in the GitHub repo README, then downgrade to omegaconf==2.0.1 and hydra-core==1.0.0; installing omegaconf==2.0.1 requires an older pip (pip < 24.1), and you may also need numpy < 1.24.

Note 2: Go to the official model checkpoint page and sign the license agreement first.
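For example, to extract TalkNet features, run the corresponding script from the src folder (this assumes the script needs no extra command-line arguments; check the script's argument parser if it does):

python -W ignore data/TalkNet_extract_visual_features.py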

Simulate the Noisy Input Mixture

For faster training and processing, convert all of the .m4a files to .wav (see the conversion sketch after the command below). Then run the command below to create the noisy input mixtures, assuming you are in the src folder.

python -W ignore utils/mix_speech_gpu.py
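For the m4a-to-wav conversion mentioned above, here is a minimal sketch that calls the ffmpeg binary installed during setup; the recursive glob over DATA_FOLDER_PATH and the 16 kHz mono output are assumptions, so adjust them to your copy of VoxCeleb2 and to whatever sample rate the pipeline expects:

import subprocess
from pathlib import Path

from config import DATA_FOLDER_PATH  # project config, with PYTHONPATH set as above

# Convert every .m4a under the dataset folder to a .wav file saved next to it
for m4a in Path(DATA_FOLDER_PATH).rglob("*.m4a"):
    wav = m4a.with_suffix(".wav")
    if not wav.exists():
        subprocess.run(
            ["ffmpeg", "-y", "-loglevel", "error", "-i", str(m4a), "-ac", "1", "-ar", "16000", str(wav)],
            check=True,
        )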

Training

Run the terminal command below to start training the model. By default, logs and checkpoints are saved to the CHECKPOINT_DIR defined in config.py. You can override parameters such as the visual encoder, batch size, and checkpoint directory using command-line arguments.

python -W ignore train.py
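For example (the flag names here are illustrative, patterned on the test.py flags shown later; check the argument parser in train.py for the exact names):

python -W ignore train.py --visual_encoder=TalkNet --batch_size=8 --checkpoint_dir=checkpoints/talknet  # illustrative flag names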

To resume training from a saved checkpoint, add the --train_from_checkpoint flag and specify the path using --ckpt_path:

python -W ignore train.py --train_from_checkpoint --ckpt_path=checkpoints/epoch-last.ckpt

Evaluation

First, generate test input mixtures for the different conditions and SNR scenarios by running:

python -W ignore data/generate_test_data.py --condition=noise_only --snr=-10

To generate multiple conditions and SNRs at once, separate them with commas. For example: --condition="noise_only, one_interfering_speaker, three_interfering_speakers" --snr="mixed, -10, -5, 0"

The mixture will be saved to /path/to/VoxCeleb2/dev/mixed_wav/{condition}/{snr}/.

To evaluate a single visual encoder under a specific test condition and SNR, use test.py, which accepts command-line arguments for full flexibility:

python -W ignore test.py --visual_encoder=TalkNet --test_condition=noise_only --test_snr=-10
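To sweep several encoders, test conditions, and SNRs in one run, you can wrap test.py in a small driver; a minimal sketch, using value lists taken from the examples in this README (extend them as needed), run from the src folder:

import subprocess

# Example grids; the values come from the examples above
encoders = ["TalkNet", "LoCoNet"]
conditions = ["noise_only", "one_interfering_speaker", "three_interfering_speakers"]
snrs = ["-10", "-5", "0"]

# Run test.py once per (encoder, condition, SNR) combination
for enc in encoders:
    for cond in conditions:
        for snr in snrs:
            subprocess.run(
                ["python", "-W", "ignore", "test.py",
                 f"--visual_encoder={enc}",
                 f"--test_condition={cond}",
                 f"--test_snr={snr}"],
                check=True,
            )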
