This project has been accepted to CVPR 2025! 🚀🚀🚀
This is the official repository for "VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling".
In this work, we systematically study music generation conditioned solely on video. First, we present V2M, a large-scale dataset comprising 360K video-music pairs that spans various genres such as movie trailers, advertisements, and documentaries. We then propose VidMuse, a simple framework for generating music aligned with video inputs. VidMuse stands out by producing high-fidelity music that is both acoustically and semantically aligned with the video. By incorporating local and global visual cues through Long-Short-Term modeling, VidMuse produces musically coherent audio tracks that consistently match the video content. Extensive experiments show that VidMuse outperforms existing models in terms of audio quality, diversity, and audio-visual alignment.
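To make the Long-Short-Term idea concrete, here is a minimal sketch of how local and global visual features could be fused into a conditioning sequence for a music decoder. This is not the released implementation: the module name `LongShortTermFusion`, the feature dimension, the window size, and the cross-attention over a mean-pooled global summary are all illustrative assumptions.

```python
# Hypothetical sketch (not the official VidMuse code): fuse a short-term window of
# frame features with a long-term, whole-video summary to condition a music decoder.
import torch
import torch.nn as nn

class LongShortTermFusion(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, frame_feats: torch.Tensor, window: int = 30) -> torch.Tensor:
        # frame_feats: (batch, num_frames, dim) features from a frozen visual encoder
        local = frame_feats[:, -window:, :]                  # short-term: recent frames
        global_ctx = frame_feats.mean(dim=1, keepdim=True)   # long-term: whole-video summary
        fused, _ = self.cross_attn(local, global_ctx, global_ctx)
        return fused  # conditioning sequence for the music language model

feats = torch.randn(2, 120, 768)           # e.g. 120 frames of visual features
print(LongShortTermFusion()(feats).shape)  # torch.Size([2, 30, 768])
```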
Dataset Construction. To ensure data quality, V2M goes through rule-based coarse filtering and content-based fine-grained filtering, and music source separation is applied to remove speech and singing from the audio. After processing, human experts curate the benchmark subset, while the remaining data serves as the pretraining dataset. The pretraining data is then refined with Audio-Visual Alignment Ranking to select the fine-tuning dataset.
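For illustration only, the sketch below strings these curation stages together. Every function is a placeholder: the actual V2M filtering rules, the source-separation step, and the Audio-Visual Alignment Ranking model are not released here, so the predicates and the selection fraction are assumptions.

```python
# Illustrative pipeline skeleton only; every predicate below is a placeholder,
# not the actual V2M filtering or ranking logic.
from pathlib import Path

def passes_coarse_rules(video: Path) -> bool:
    # Rule-based coarse filtering, e.g. duration or resolution thresholds (assumed).
    return True

def passes_content_filter(video: Path) -> bool:
    # Content-based fine-grained filtering after source separation (placeholder).
    return True

def av_alignment_score(video: Path) -> float:
    # Audio-Visual Alignment Ranking score; the real scoring model is not public.
    return 0.0

raw = sorted(Path("raw_videos").glob("*.mp4"))  # hypothetical input directory
pretrain = [v for v in raw if passes_coarse_rules(v) and passes_content_filter(v)]
# Keep the best-aligned pairs for fine-tuning (the fraction kept here is an assumption).
finetune = sorted(pretrain, key=av_alignment_score, reverse=True)[: len(pretrain) // 10]
print(len(pretrain), "pretraining clips;", len(finetune), "fine-tuning clips")
```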
Overview of the VidMuse Framework.
- Create an Anaconda environment. AudioCraft requires Python 3.9 and PyTorch 2.1.0:

  ```bash
  git clone https://github.com/ZeyueT/VidMuse.git
  cd VidMuse
  conda create -n VidMuse python=3.9
  conda activate VidMuse
  pip install git+https://github.com/ZeyueT/VidMuse.git
  ```
- Install ffmpeg:

  ```bash
  sudo apt-get install ffmpeg
  # Or if you are using Anaconda or Miniconda
  conda install "ffmpeg<5" -c conda-forge
  ```
- Download the pretrained audio compression checkpoint `compression_state_dict.bin` and the VidMuse model checkpoint `state_dict.bin`, and put them into the `./model` directory. (The VidMuse model is trained on our private dataset.)

  ```bash
  mkdir -p model
  wget https://huggingface.co/HKUSTAudio/VidMuse/resolve/main/compression_state_dict.bin -O model/compression_state_dict.bin
  wget https://huggingface.co/HKUSTAudio/VidMuse/resolve/main/state_dict.bin -O model/state_dict.bin
  ```
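  If you prefer Python, the same files can be fetched with the `huggingface_hub` client (a sketch; it assumes the `huggingface_hub` package is installed and reuses the repo id and file names from the commands above):

  ```python
  # Alternative download via the Hugging Face Hub client.
  from huggingface_hub import hf_hub_download

  for fname in ("compression_state_dict.bin", "state_dict.bin"):
      path = hf_hub_download(repo_id="HKUSTAudio/VidMuse", filename=fname, local_dir="./model")
      print("saved to", path)
  ```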
- Use the Gradio demo locally by running:

  ```bash
  python -m demos.VidMuse_app --share
  ```

  In the Gradio application, the Model Path field specifies the location of the model files. To load the model correctly, set the Model Path to `./model`.
- Build the `data.jsonl` file:

  ```bash
  python egs/V2M/build_data_jsonl.py
  ```
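  To sanity-check the generated manifest, a snippet like the following can be used (a sketch; the manifest path and the exact fields depend on what `egs/V2M/build_data_jsonl.py` writes):

  ```python
  # Load the JSONL manifest and report how many entries it contains.
  import json

  manifest = "egs/V2M/data.jsonl"  # assumed output location; adjust if the script writes elsewhere
  with open(manifest) as f:
      entries = [json.loads(line) for line in f if line.strip()]

  print(f"{len(entries)} entries")
  if entries:
      print("example keys:", sorted(entries[0]))
  ```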
- Start training:

  ```bash
  bash train.sh
  ```
- To export the trained model, use the following script:

  ```python
  import os
  import torch
  from audiocraft.utils import export
  from audiocraft import train

  # Define codec_model
  codec_model = 'facebook/encodec_32khz'

  # Look up the training run by its signature ('SIG' is a placeholder for your run's signature)
  xp = train.main.get_xp_from_sig('SIG')
  model_save_path = './model'

  # Export model
  export.export_lm(xp.folder / 'checkpoint.th', model_save_path + '/state_dict.bin')
  export.export_pretrained_compression_model(codec_model, model_save_path + '/compression_state_dict.bin')
  ```
- Quick Start with Hugging Face: you can quickly start inference using the Hugging Face model hub. Refer to [VidMuse on Hugging Face](https://huggingface.co/HKUSTAudio/VidMuse) for detailed instructions.
- Local Inference: before running the inference script, make sure the following parameters are defined in `infer.sh`:

  - `model_path`: path to the directory where the model files are stored. Default is `'./model'`.
  - `video_dir`: directory containing the input videos for inference. Default is `'./dataset/example/infer'`.
  - `output_dir`: directory where the generated music will be saved. Default is `'./result/'`.
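  Before launching `infer.sh`, a quick pre-flight check such as the following can confirm that the checkpoints and input videos are in place (a sketch; it assumes `.mp4` inputs and the default paths listed above):

  ```python
  # Verify the downloaded checkpoints exist and count the inference inputs.
  import glob
  import os

  model_path = "./model"
  video_dir = "./dataset/example/infer"

  for fname in ("compression_state_dict.bin", "state_dict.bin"):
      assert os.path.isfile(os.path.join(model_path, fname)), f"missing {fname} in {model_path}"

  videos = glob.glob(os.path.join(video_dir, "*.mp4"))  # video extension is an assumption
  print(f"found {len(videos)} video(s) in {video_dir}")
  ```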
- Run the inference using the following script:

  ```bash
  bash infer.sh
  ```
- The dataset has been released on Hugging Face.
- Data construction details to be released...
- AudioCraft: the codebase we built upon.
If you find our work useful, please consider citing:
```bibtex
@inproceedings{tian2025vidmuse,
  title={{VidMuse}: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling},
  author={Tian, Zeyue and Liu, Zhaoyang and Yuan, Ruibin and Pan, Jiahao and Liu, Qifeng and Tan, Xu and Chen, Qifeng and Xue, Wei and Guo, Yike},
  booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
  pages={18782--18793},
  year={2025}
}
```
If you have any comments or questions, feel free to contact Zeyue Tian (ztianad@connect.ust.hk) or Zhaoyang Liu (zyliumy@gmail.com).
This project is released under the CC BY-NC license.