Chao Huang, Susan Liang, Yapeng Tian, Anurag Kumar, Chenliang Xu
We propose DAVIS, a Diffusion-based Audio-VIsual Separation framework that solves the audio-visual sound source separation task through generative learning. Existing methods typically frame sound separation as a mask-based regression problem, achieving significant progress. However, they face limitations in capturing the complex data distribution required for high-quality separation of sounds from diverse categories. In contrast, DAVIS leverages a generative diffusion model and a Separation U-Net to synthesize separated sounds directly from Gaussian noise, conditioned on both the audio mixture and the visual information. With its generative objective, DAVIS is better suited to achieving the goal of high-quality sound separation across diverse sound categories. We compare DAVIS to existing state-of-the-art discriminative audio-visual separation methods on the AVE and MUSIC datasets, and results show that DAVIS outperforms other methods in separation quality, demonstrating the advantages of our framework for tackling the audio-visual source separation task.
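For intuition, the training objective described above can be sketched as a standard conditional denoising-diffusion step. This is a minimal sketch only: the model signature, tensor shapes, and noise schedule below are illustrative assumptions, not the repository's actual API.

```
# Minimal sketch of a conditional DDPM training step (PyTorch).
# `model`, its signature, and the linear beta schedule are assumptions.
import torch
import torch.nn.functional as F

def diffusion_training_step(model, clean_spec, mixture_spec, visual_feats, T=1000):
    """The Separation U-Net learns to predict the noise added to the clean
    (separated) spectrogram, conditioned on the mixture and visual features."""
    b = clean_spec.size(0)
    betas = torch.linspace(1e-4, 2e-2, T, device=clean_spec.device)  # common linear schedule
    alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

    t = torch.randint(0, T, (b,), device=clean_spec.device)  # random timestep per sample
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1)

    noise = torch.randn_like(clean_spec)
    noisy = a_bar.sqrt() * clean_spec + (1.0 - a_bar).sqrt() * noise  # forward diffusion

    # Hypothetical conditioning: channel-stack the mixture; pass visual features separately.
    pred_noise = model(torch.cat([noisy, mixture_spec], dim=1), t, visual_feats)
    return F.mse_loss(pred_noise, noise)
```

At inference, sampling runs this process in reverse, iteratively denoising pure Gaussian noise into the separated spectrogram under the same conditioning.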
- 2025-09: DAVIS-Flow is accepted to IJCV!
- 2025-04: NEW: DAVIS-Flow released! Leveraging Flow Matching for faster training and better separation quality. Try it now!
- 2024-12: DAVIS won the ACCV'24 Best Paper Award, Honorable Mention!
- 2024-09: DAVIS is accepted as an ACCV 2024 Oral Presentation.
Create a conda environment and install dependencies:
```
git clone https://github.com/WikiChao/DAVIS.git
cd DAVIS
conda create --name DAVIS python=3.8
conda activate DAVIS
pip install -r requirements.txt
```
or install the following libraries yourself:
```
torch
torchvision
librosa
soundfile
clip
einops
tqdm
mir_eval
scipy
imageio
```
- MUSIC Dataset: Download from the MUSIC Dataset GitHub.
- AVE Dataset: Download from the AVE Dataset GitHub.
Note: Some YouTube IDs in the MUSIC dataset are no longer valid. As a temporary solution, we provide zipped data to help you get started: MUSIC Dataset Download.
Preprocess the videos according to your needs, making sure the index files stay consistent with the processed data.
- Frame Extraction: Refer to ./preprocessing/extract_frames.py.
- Audio Extraction: Extract waveforms at 11,025 Hz, e.g., with ./preprocessing/extract_audio.py (a minimal sketch follows this list).
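A minimal sketch of the audio step, assuming ffmpeg is on your PATH (./preprocessing/extract_audio.py is the reference implementation; the helper below is hypothetical):

```
# Hypothetical helper: extract a mono WAV at 11,025 Hz from a video via ffmpeg.
import subprocess
from pathlib import Path

def extract_audio(video_path: str, out_dir: str, sr: int = 11025) -> Path:
    out = Path(out_dir) / (Path(video_path).stem + ".wav")
    out.parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path,
         "-vn",           # drop the video stream
         "-ac", "1",      # mono
         "-ar", str(sr),  # 11,025 Hz sample rate
         str(out)],
        check=True,
    )
    return out

# e.g. extract_audio("videos/acoustic_guitar/M3dekVSwNjY.mp4", "data/audio/acoustic_guitar")
```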
We provide .csv index files for training and testing; an illustrative row format is sketched after the list below.
The index files are located at:
- ./data/MUSIC for MUSIC
- ./data/AVE for AVE
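If you regenerate index files for your own videos, mirror the schema of the provided .csv files. Purely as an illustration (the actual column layout is defined by the files in ./data/), a row pairing a sample's audio with its frame folder and frame count might look like:

```
data/audio/acoustic_guitar/M3dekVSwNjY.wav,data/frames/acoustic_guitar/M3dekVSwNjY.mp4,180
```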
The directory structure for the datasets is as follows:
```
data
├── audio
│   ├── acoustic_guitar
│   │   ├── M3dekVSwNjY.wav
│   │   └── ...
│   ├── trumpet
│   │   ├── STKXyBGSGyE.wav
│   │   └── ...
│   └── ...
└── frames
    ├── acoustic_guitar
    │   ├── M3dekVSwNjY.mp4
    │   │   ├── 000001.jpg
    │   │   └── ...
    │   └── ...
    ├── trumpet
    │   ├── STKXyBGSGyE.mp4
    │   │   ├── 000001.jpg
    │   │   └── ...
    │   └── ...
    └── ...
```
In ./dataset/ave.py and ./dataset/music.py, replace /YOUR_ROOT with the directory where you store the data; likewise, replace YOUR_CKPT in run.sh and run_ave.sh (see the snippet below).
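For example, on Linux both substitutions can be made in one pass with GNU sed; the replacement paths here are placeholders for your own setup:

```
# Hypothetical paths; adjust to where your data and checkpoints live.
sed -i 's|/YOUR_ROOT|/path/to/your/data|g' dataset/ave.py dataset/music.py
sed -i 's|YOUR_CKPT|./checkpoints/davis|g' scripts/run.sh scripts/run_ave.sh
```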
We provide a minimal example to launch the training. To get started, try running:
```
cd scripts
bash run.sh # for MUSIC dataset
```
or
```
bash run_ave.sh # for AVE dataset
```
To launch the evaluation, modify the following arguments in run.sh or run_ave.sh:
```
OPTS+="--split test "
OPTS+="--mode eval"
```
DAVIS-Flow is our improved version: it leverages flow matching for faster training and better separation quality (a minimal sketch of the objective follows the script list below).
Use the following scripts:
- For MUSIC dataset: run_fm.sh
- For AVE dataset: run_ave_fm.sh
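For intuition, a conditional flow-matching (rectified-flow) training step can be sketched as below; as with the diffusion sketch above, the model signature and conditioning are illustrative assumptions rather than the repository's actual code:

```
# Minimal sketch of conditional flow matching (PyTorch): regress the constant
# velocity that carries Gaussian noise to the clean spectrogram on a straight path.
import torch
import torch.nn.functional as F

def flow_matching_step(model, clean_spec, mixture_spec, visual_feats):
    b = clean_spec.size(0)
    x0 = torch.randn_like(clean_spec)                    # noise endpoint
    t = torch.rand(b, device=clean_spec.device).view(b, 1, 1, 1)
    xt = (1.0 - t) * x0 + t * clean_spec                 # point on the straight path
    target_v = clean_spec - x0                           # velocity is constant along it
    pred_v = model(torch.cat([xt, mixture_spec], dim=1), t.flatten(), visual_feats)
    return F.mse_loss(pred_v, target_v)
```

At inference, a handful of Euler steps integrate the learned velocity field from noise to the separated spectrogram, which is where the speedup over many-step diffusion sampling comes from.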
Our pre-trained models are available for download. Use these models to quickly get started.
| Dataset | DAVIS | DAVIS-Flow | 
|---|---|---|
| MUSIC | Download | Download | 
| AVE | - | Download | 
All models are ready for inference using the evaluation scripts described in the previous sections.
We borrow code from the following repositories: CCoL, diffusion-pytorch, and iQuery.
If you use this code for your research, please cite the following work:
@article{huang2023davis,
  title={DAVIS: High-Quality Audio-Visual Separation with Generative Diffusion Models},
  author={Huang, Chao and Liang, Susan and Tian, Yapeng and Kumar, Anurag and Xu, Chenliang},
  journal={arXiv preprint arXiv:2308.00122},
  year={2023}
}
or
@InProceedings{Huang_2024_ACCV,
    author    = {Huang, Chao and Liang, Susan and Tian, Yapeng and Kumar, Anurag and Xu, Chenliang},
    title     = {High-Quality Visually-Guided Sound Separation from Diverse Categories},
    booktitle = {Proceedings of the Asian Conference on Computer Vision (ACCV)},
    month     = {December},
    year      = {2024},
    pages     = {35-49}
}
