If you find this project useful, a star ⭐ on GitHub would be greatly appreciated!
📄 Read the paper | 🌐 Online Demo
- [2025.02] 🔥 Online Demo is live — try it now!
- [2025.04] 🔥 OmniAudio paper is released on arXiv.
- [2025.05] 🎉 OmniAudio has been accepted by ICML 2025!
- [2025.05] 🔥 Released inference code and OmniAudio dataset.
- [2025.05] 📦 Released pretrained model weights and dataset on Hugging Face.
✨🔊 Transform your 360-degree videos into immersive spatial audio! 🌍🎶
PyTorch Implementation of OmniAudio, a model for generating spatial audio from 360-degree videos.
The checkpoints and the Sphere360 dataset are now publicly available on Hugging Face.
The overall architecture of OmniAudio is shown below:
Curious about the results? 🎧🌐
👉 Try our demo page here!
We provide an example of how you can perform inference using OmniAudio.
To run inference, follow these steps:
1️⃣ Navigate to the root directory. 📂
2️⃣ Create the Inference Environment.
To set up the environment, ensure you have Python >= 3.8.20 installed. Then, run the following commands:
```bash
pip install -r requirements.txt
pip install git+https://github.com/patrick-kidger/torchcubicspline.git
```
3️⃣ Run inference with the provided script:
```bash
bash demo.sh video_path cuda_id
```
💡 You can also modify `demo.sh` to change the output directory. The `cases` folder contains some sample 360-degree videos in the equirectangular format, so make sure your videos follow the same format! 🎥✨
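If you are unsure whether a video is equirectangular, a quick sanity check is the 2:1 width-to-height ratio that equirectangular frames use. The snippet below is a minimal sketch of that check using OpenCV (it is not part of the repository), and the file path is only a placeholder.

```python
import cv2  # OpenCV is used here only to read video dimensions

def looks_equirectangular(video_path: str) -> bool:
    """Rough check: equirectangular frames have a 2:1 width-to-height ratio."""
    cap = cv2.VideoCapture(video_path)
    width = cap.get(cv2.CAP_PROP_FRAME_WIDTH)
    height = cap.get(cv2.CAP_PROP_FRAME_HEIGHT)
    cap.release()
    return height > 0 and abs(width / height - 2.0) < 0.05

# Placeholder path; point this at one of your own clips.
print(looks_equirectangular("cases/example.mp4"))
```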
By default, the script will automatically download the pretrained model checkpoint from our Hugging Face repository if no custom checkpoint is specified. If you wish to use your own trained model, you can modify `demo.sh` to explicitly pass `--ckpt-path` and point to your checkpoint directory.
We provide Sphere360, a large-scale, high-quality dataset of paired 360-degree video and spatial audio clips, specifically curated to support training and evaluation of spatial audio generation models like OmniAudio.
The dataset includes:
- Over 103,000 10-second clips
- 288 hours of total spatial content
- Paired equirectangular 360-degree video and first-order ambisonics (FOA) 4-channel audio (W, X, Y, Z)
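To get a feel for the audio side of a paired sample, the snippet below loads one clip and checks that it carries the four FOA channels. This is only a sketch: it assumes the audio is stored as a 4-channel WAV file, and the path is a placeholder (see the Sphere360 README for the actual file layout).

```python
import torchaudio

# Placeholder path; the real layout is described in the Sphere360 README.
waveform, sample_rate = torchaudio.load("Sphere360/example_clip.wav")

# First-order ambisonics: four channels ordered W, X, Y, Z.
assert waveform.shape[0] == 4, "expected 4-channel FOA audio"
w, x, y, z = waveform  # omnidirectional component plus three directional components
print(f"{waveform.shape[1] / sample_rate:.1f} s at {sample_rate} Hz")
```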
To explore or use the dataset, follow these steps:
1️⃣ Navigate to the dataset folder:
cd Sphere360
2️⃣ Refer to the detailed usage guide in the README file: 📖 Sphere360 Dataset README
Inside the directory, you’ll find:
- `dataset/`: contains split configurations, metadata, and channel information
- `toolset/`: crawling and cleaning tools for dataset construction
- `docs/`: figures and documentation describing the pipeline
The dataset is split as follows (see `dataset/split/`):
- Training set: ~100.5k samples
- Test set: ~3k samples
- Each sample: 10 seconds of paired video and audio
The dataset was constructed via a two-stage crawling and filtering pipeline (a simplified sketch of both stages follows the list below):
- **Crawling**
  - Uses the YouTube API
  - Retrieves videos by channel and keyword-based queries
  - Employs `yt-dlp` and `FFmpeg` to download and process audio/video streams
  - Details: docs/crawl.md
- **Cleaning**
  - Filters out content using the following criteria:
    - Silent audio
    - Static frames
    - Audio-visual mismatches
    - Human voice presence
  - Relies on models like ImageBind and SenseVoice
  - Details: docs/clean.md
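For orientation, here is a simplified sketch of what the two stages above boil down to: download a candidate video with `yt-dlp`, extract its audio with `FFmpeg`, and discard it if the audio is essentially silent. The URL, file names, and silence threshold are placeholders, and this is not the repository's actual toolset (see docs/crawl.md and docs/clean.md for that).

```python
import subprocess
import torch
import torchaudio

url = "https://www.youtube.com/watch?v=PLACEHOLDER"  # placeholder video URL

# Stage 1 (crawling): download the video and extract a PCM audio track.
subprocess.run(["yt-dlp", "-o", "raw.%(ext)s", url], check=True)
subprocess.run(
    ["ffmpeg", "-y", "-i", "raw.mp4", "-vn", "-acodec", "pcm_s16le", "raw.wav"],
    check=True,  # assumes the merged container is .mp4; adjust to the real extension
)

# Stage 2 (cleaning, silence criterion only): drop clips whose overall RMS
# energy falls below an assumed -60 dB threshold.
waveform, _ = torchaudio.load("raw.wav")
rms = waveform.pow(2).mean().sqrt().clamp(min=1e-10)
if 20 * torch.log10(rms).item() < -60.0:
    print("mostly silent: discard")
else:
    print("keep for further filtering (static frames, AV mismatch, voice)")
```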
- All videos are collected from YouTube under terms consistent with fair use for academic research.
- Videos under Creative Commons licenses are properly attributed.
- No video is used for commercial purposes.
- All channel metadata is recorded in `dataset/channels.csv`.
If OmniAudio contributes to your research or applications, we kindly ask you to cite it using the following BibTeX entry:
```bibtex
@misc{liu2025omniaudiogeneratingspatialaudio,
  title={OmniAudio: Generating Spatial Audio from 360-Degree Video},
  author={Huadai Liu and Tianyi Luo and Qikai Jiang and Kaicheng Luo and Peiwen Sun and Jialei Wan and Rongjie Huang and Qian Chen and Wen Wang and Xiangtai Li and Shiliang Zhang and Zhijie Yan and Zhou Zhao and Wei Xue},
  year={2025},
  eprint={2504.14906},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  url={https://arxiv.org/abs/2504.14906},
}
```
💡 Have fun experimenting with OmniAudio! 🛠️💖