RAVEN: Official Repository of Real-Time Audio-Visual Speech Enhancement Using Pre-trained Visual Representations
This is the official repository of the paper Real-Time Audio-Visual Speech Enhancement Using Pre-trained Visual Representations, accepted at Interspeech 2025.
Clone this GitHub repo and run
git submodule update --init --recursive
Create a virtual environment:
conda create -y -n avse python=3.8
conda activate avse
pip install -r requirements.txt
conda install -c conda-forge ffmpeg
❗ FIRST, change `PROJECT_ROOT_PATH` in `config.py` before proceeding.
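For reference, this is a plain assignment in `config.py`; the path below is a placeholder, and the actual file may contain additional settings:

```python
# config.py -- point this at the absolute path of your local clone of this repo
PROJECT_ROOT_PATH = "/your/path/to/this_project"
```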
Then run
export PYTHONPATH='/your/path/to/this_project'
in your terminal and change directory to the `src` folder.
We use VoxCeleb2 for training. Please download the dataset and set `DATA_FOLDER_PATH` in `config.py` to the folder where you saved the data.
We also use MUSAN to create the noisy input mixtures. Please download the dataset and set `MUSAN_FOLDER_PATH` to the folder where you saved the MUSAN data.
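As with `PROJECT_ROOT_PATH`, here is a minimal sketch of what these settings might look like; the paths are placeholders, and it is assumed here that `MUSAN_FOLDER_PATH` also lives in `config.py`:

```python
# config.py -- dataset locations (placeholder paths)
DATA_FOLDER_PATH = "/your/path/to/VoxCeleb2"   # VoxCeleb2 root folder
MUSAN_FOLDER_PATH = "/your/path/to/musan"      # MUSAN root folder
```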
You can find our train/val/test split of VoxCeleb2 at `src/data/split.parquet`, and the train/val/test split of MUSAN at `src/data/musan_split.csv`.
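If you want to inspect the splits, they load directly with pandas; this is just a sketch, and the column layout is whatever the files define:

```python
# Quick look at the data splits (run from the project root; parquet needs pyarrow or fastparquet)
import pandas as pd

vox_split = pd.read_parquet("src/data/split.parquet")
musan_split = pd.read_csv("src/data/musan_split.csv")

print(vox_split.head())
print(musan_split.head())
```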
Before extracting the pretrained embeddings of the dataset, clone the corresponding GitHub repo into the project root folder and download the checkpoints of the selected pretrained model. Follow the environment setup instructions in each pretrained model's GitHub README, then run the corresponding feature extractor script from the `src` folder in that environment (an example invocation follows the table below).
| Encoder Task | Encoder Name | GitHub Repo | Checkpoint Used | Feature Extractor Script |
| --- | --- | --- | --- | --- |
| AVSR | VSRiW | link: save to `/benchmarks/GRID/models/` | GRID visual-only unseen WER=4.8 [src] | `src/data/VSRiW_extract_visual_features.py` |
| AVSR | AVHuBERT Paper 1, 2 | link1: save to `avhubert/conf/finetune/` | base fine-tuned for VSR on LRS3-433h [src] 2 | `src/data/avhubert_extract_visual_features.py` |
| ASD | TalkNet | link: save to repo root folder | [src] | `src/data/TalkNet_extract_visual_features.py` |
| ASD | LoCoNet | link: save to repo root folder | [src] | `src/data/LoCoNet_extract_visual_features.py` |
1 Follow the instructions in the GitHub repo README, then downgrade to `omegaconf==2.0.1` and `hydra-core==1.0.0`; you need `pip < 24.1` to install `omegaconf==2.0.1`, and you may also need `numpy < 1.24`.
2 Go to the official model checkpoint page and sign the license agreement first.
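For example, after setting up the TalkNet environment, the extraction step would be run from the `src` folder roughly as follows; the script path comes from the table above, and any arguments the extractor expects (e.g., a checkpoint path) are defined in the script itself:
python -W ignore data/TalkNet_extract_visual_features.py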
For faster training and processing, convert all the .m4a files to .wav first; a minimal conversion sketch follows the mixing command below. Then, from the `src` folder, run the command below to create the noisy input mixtures.
python -W ignore utils/mix_speech_gpu.py
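The following is a hypothetical helper (not part of this repo) for the .m4a-to-.wav conversion, using the ffmpeg binary installed earlier; the data layout, in-place output, and 16 kHz sample rate are assumptions, so adjust them to whatever your setup expects:

```python
# Hypothetical helper: convert every VoxCeleb2 .m4a file to a .wav next to it via ffmpeg.
import subprocess
from pathlib import Path

DATA_FOLDER_PATH = "/your/path/to/VoxCeleb2"  # same value as in config.py

for m4a in Path(DATA_FOLDER_PATH).rglob("*.m4a"):
    wav = m4a.with_suffix(".wav")
    if not wav.exists():
        subprocess.run(
            ["ffmpeg", "-loglevel", "error", "-i", str(m4a), "-ar", "16000", str(wav)],
            check=True,
        )
```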
Run the terminal command below to start training the model. By default, logs and checkpoints will be saved to the `CHECKPOINT_DIR` defined in `config.py`. You can override parameters such as the visual encoder, batch size, and checkpoint directory using command-line arguments (an illustrative example follows the command).
python -W ignore train.py
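For example, an override run might look like the following; the flag names are illustrative only (except where they mirror `test.py` below), so check the argument parser in `train.py` for the exact names:
python -W ignore train.py --visual_encoder=TalkNet --batch_size=8 --checkpoint_dir=/your/path/to/checkpoints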
To resume training from a saved checkpoint, add the --train_from_checkpoint flag and specify the path using --ckpt_path:
python -W ignore train.py --train_from_checkpoint --ckpt_path=checkpoints/epoch-last.ckpt
First, generate test input mixtures for the different conditions and SNR scenarios by running
python -W ignore data/generate_test_data.py --condition=noise_only --snr=-10
If you want to generate several conditions and SNRs at once, separate them with commas, for example: --condition="noise_only, one_interfering_speaker, three_interfering_speakers" --snr="mixed, -10, -5, 0"
The mixtures will be saved to `/path/to/VoxCeleb2/dev/mixed_wav/{condition}/{snr}/`.
To evaluate a single visual encoder under a specific test condition and SNR, use test.py, which accepts command-line arguments for full flexibility:
python -W ignore test.py --visual_encoder=TalkNet --test_condition=noise_only --test_snr=-10
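To sweep several conditions and SNRs for one encoder, a small driver script can call `test.py` in a loop. This is only a sketch, assuming the corresponding mixtures have already been generated and that it is run from the `src` folder:

```python
# Evaluate one visual encoder across several test conditions and SNRs by invoking test.py.
# Flag names and values are taken from the examples above.
import subprocess

conditions = ["noise_only", "one_interfering_speaker", "three_interfering_speakers"]
snrs = ["mixed", "-10", "-5", "0"]

for condition in conditions:
    for snr in snrs:
        subprocess.run(
            [
                "python", "-W", "ignore", "test.py",
                "--visual_encoder=TalkNet",
                f"--test_condition={condition}",
                f"--test_snr={snr}",
            ],
            check=True,
        )
```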