[πŸ† IJCV 2025 & ACCV 2024 Best Paper Honorable Mention] Official pytorch implementation of the paper "High-Quality Visually-Guided Sound Separation from Diverse Categories"

High-Quality Visually-Guided Sound Separation from Diverse Categories

Chao Huang, Susan Liang, Yapeng Tian, Anurag Kumar, Chenliang Xu

We propose DAVIS, a Diffusion-based Audio-VIsual Separation framework that solves the audio-visual sound source separation task through generative learning. Existing methods typically frame sound separation as a mask-based regression problem, achieving significant progress. However, they face limitations in capturing the complex data distribution required for high-quality separation of sounds from diverse categories. In contrast, DAVIS leverages a generative diffusion model and a Separation U-Net to synthesize separated sounds directly from Gaussian noise, conditioned on both the audio mixture and the visual information. With its generative objective, DAVIS is better suited to achieving the goal of high-quality sound separation across diverse sound categories. We compare DAVIS to existing state-of-the-art discriminative audio-visual separation methods on the AVE and MUSIC datasets, and results show that DAVIS outperforms other methods in separation quality, demonstrating the advantages of our framework for tackling the audio-visual source separation task.
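
For intuition, the generative objective rests on a DDPM-style forward process: a clean (separated) spectrogram is gradually noised toward a Gaussian, and a conditional model learns to reverse that chain at inference. The sketch below is a minimal numpy illustration of the forward process only; the noise schedule and shapes are placeholder assumptions, not the actual DAVIS Separation U-Net.

```python
import numpy as np

# Minimal sketch of the DDPM-style forward (noising) process that a
# generative separation framework builds on. Placeholder schedule and
# shapes; this is NOT the actual DAVIS model.
rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)        # linear noise schedule (assumed)
alpha_bars = np.cumprod(1.0 - betas)      # cumulative signal retention

def noise_spectrogram(x0, t):
    """Diffuse a clean spectrogram x0 to timestep t in closed form."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return xt, eps

x0 = rng.standard_normal((256, 256))      # stand-in "clean" spectrogram
xt, _ = noise_spectrogram(x0, t=T - 1)
# At t close to T, xt is nearly pure Gaussian noise; at inference the
# conditional denoiser (given the mixture and visual features) reverses
# this chain to synthesize the separated sound from noise.
```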

If our project helps you, please give us a star ⭐ on GitHub to support us.

News

  • 2025-09: πŸš€ DAVIS-Flow is accepted to IJCV!
  • 2025-04: NEW: DAVIS-Flow released! Leveraging Flow Matching for faster training and better separation quality. Try it now!
  • 2024-12: πŸ† DAVIS won ACCV’24 Best Paper Award, Honorable Mention!
  • 2024-09: DAVIS is accepted as an ACCV 2024 Oral Presentation.

Installation

Create a conda environment and install dependencies:

```
git clone https://github.com/WikiChao/DAVIS.git
cd DAVIS

conda create --name DAVIS python=3.8
conda activate DAVIS

pip install -r requirements.txt
```

or install the following libraries yourself:

```
torch
torchvision
librosa
soundfile
clip
einops
tqdm
mir_eval
scipy
imageio
```

Dataset

1. Download Datasets

Note: Some YouTube IDs in the MUSIC dataset are no longer valid. As a temporary solution, we provide zipped data to help you get started: MUSIC Dataset Download.


2. Preprocess Videos

Preprocess the videos as needed, making sure the extracted files stay consistent with the index files.

  • Frame Extraction: Refer to ./preprocessing/extract_frames.py.
  • Audio Extraction: Extract waveforms at 11,025 Hz. You can use ./preprocessing/extract_audio.py.
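
As a sanity check on the 11,025 Hz target rate, here is a minimal resampling sketch (the actual extraction script is ./preprocessing/extract_audio.py; the 44,100 Hz source rate and the helper name below are assumptions for illustration):

```python
import numpy as np
from math import gcd
from scipy.signal import resample_poly

SRC_SR = 44100   # assumed source sample rate
DST_SR = 11025   # target rate used by this repo

def resample_to_target(wav, src_sr=SRC_SR, dst_sr=DST_SR):
    """Polyphase resampling; 44100 -> 11025 Hz is an exact 4:1 decimation."""
    g = gcd(src_sr, dst_sr)
    return resample_poly(wav, dst_sr // g, src_sr // g)

# One second of a 440 Hz tone at the source rate.
t = np.arange(SRC_SR) / SRC_SR
wav = np.sin(2 * np.pi * 440.0 * t)
out = resample_to_target(wav)
print(len(out))  # 11025 samples, i.e. 1 second at the target rate
```

The resampled array can then be written out with soundfile, e.g. `soundfile.write(path, out, DST_SR)`.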

3. Data Splits

We provide .csv index files for training and testing.
The index files are located at:

  • ./data/MUSIC for MUSIC
  • ./data/AVE for AVE

4. Directory Structure

The directory structure for the datasets is as follows:

```
data
β”œβ”€β”€ audio
β”‚   β”œβ”€β”€ acoustic_guitar
β”‚   β”‚   β”œβ”€β”€ M3dekVSwNjY.wav
β”‚   β”‚   └── ...
β”‚   β”œβ”€β”€ trumpet
β”‚   β”‚   β”œβ”€β”€ STKXyBGSGyE.wav
β”‚   β”‚   └── ...
β”‚   └── ...
└── frames
    β”œβ”€β”€ acoustic_guitar
    β”‚   β”œβ”€β”€ M3dekVSwNjY.mp4
    β”‚   β”‚   β”œβ”€β”€ 000001.jpg
    β”‚   β”‚   └── ...
    β”‚   └── ...
    β”œβ”€β”€ trumpet
    β”‚   β”œβ”€β”€ STKXyBGSGyE.mp4
    β”‚   β”‚   β”œβ”€β”€ 000001.jpg
    β”‚   β”‚   └── ...
    β”‚   └── ...
    └── ...
```
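The layout above can be verified mechanically. The helper below is a small sketch (not part of the repo) that reports any .wav clip whose matching frame folder is missing or empty:

```python
from pathlib import Path
import tempfile

# Check that every audio/<category>/<clip>.wav has a matching
# frames/<category>/<clip>.mp4 directory containing extracted .jpg frames.
def check_layout(root):
    root = Path(root)
    missing = []
    for wav in sorted((root / "audio").glob("*/*.wav")):
        frame_dir = root / "frames" / wav.parent.name / (wav.stem + ".mp4")
        if not frame_dir.is_dir() or not any(frame_dir.glob("*.jpg")):
            missing.append(wav.name)
    return missing

# Build a tiny example tree and verify it passes the check.
tmp = Path(tempfile.mkdtemp())
(tmp / "audio" / "trumpet").mkdir(parents=True)
(tmp / "audio" / "trumpet" / "STKXyBGSGyE.wav").touch()
fdir = tmp / "frames" / "trumpet" / "STKXyBGSGyE.mp4"
fdir.mkdir(parents=True)
(fdir / "000001.jpg").touch()
print(check_layout(tmp))  # [] -> no missing frame folders
```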

Training

Set /YOUR_ROOT in ./dataset/ave.py and ./dataset/music.py to the directory where you store the data, and set YOUR_CKPT in run.sh and run_ave.sh to your checkpoint path.

We provide a minimal example to launch the training. To get started, try running:

```
cd scripts

bash run.sh      # for the MUSIC dataset
# or
bash run_ave.sh  # for the AVE dataset
```

Evaluation

To launch the evaluation, modify the following arguments in run.sh or run_ave.sh:

```
OPTS+="--split test "
OPTS+="--mode eval"
```

DAVIS-Flow

DAVIS-Flow is our improved version that leverages flow matching for faster training and better separation quality.
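
For intuition, flow matching replaces the stochastic denoising chain with a learned velocity field along straight-line paths between noise and data. The sketch below is a generic numpy illustration of the training target, not the DAVIS-Flow architecture:

```python
import numpy as np

# Generic flow-matching sketch (not the DAVIS-Flow model): samples are
# transported from Gaussian noise x0 to data x1 along straight-line paths
# x_t = (1 - t) * x0 + t * x1, with regression target v = x1 - x0.
rng = np.random.default_rng(0)

x1 = rng.standard_normal((4, 16))     # stand-in "data" batch
x0 = rng.standard_normal(x1.shape)    # noise samples

def training_pair(x0, x1, t):
    """Interpolated point and the constant velocity target at time t."""
    xt = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0
    return xt, v_target

# With a perfect velocity field, a single Euler step over [0, 1] recovers
# x1 exactly, which is why flow matching needs far fewer integration steps
# than a diffusion sampler.
xt, v = training_pair(x0, x1, t=0.0)
recovered = xt + 1.0 * v
print(np.allclose(recovered, x1))  # True
```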

Training and Inference

Use the following scripts:

  • For MUSIC dataset: run_fm.sh
  • For AVE dataset: run_ave_fm.sh

Model Zoos

Our pre-trained models are available for download. Use these models to quickly get started.

| Dataset | DAVIS | DAVIS-Flow |
| --- | --- | --- |
| MUSIC | Download | Download |
| AVE | - | Download |

All models are ready for inference using the evaluation scripts described in the previous sections.

Acknowledgements

We borrow code from the following repositories: CCoL, diffusion-pytorch, and iQuery.

Citation

If you use this code for your research, please cite the following work:

```
@article{huang2023davis,
  title={DAVIS: High-Quality Audio-Visual Separation with Generative Diffusion Models},
  author={Huang, Chao and Liang, Susan and Tian, Yapeng and Kumar, Anurag and Xu, Chenliang},
  journal={arXiv preprint arXiv:2308.00122},
  year={2023}
}
```

or

```
@InProceedings{Huang_2024_ACCV,
    author    = {Huang, Chao and Liang, Susan and Tian, Yapeng and Kumar, Anurag and Xu, Chenliang},
    title     = {High-Quality Visually-Guided Sound Separation from Diverse Categories},
    booktitle = {Proceedings of the Asian Conference on Computer Vision (ACCV)},
    month     = {December},
    year      = {2024},
    pages     = {35-49}
}
```
