
SELDVisualSynth


SELDVisualSynth is a Python tool designed to generate synthetic visual mixtures tailored for the audio-visual DCASE Challenge Task 3. This tool creates 360-degree synthetic videos based on DCASE CSV metadata files, which provide per-frame information about sound event locations in 3D space. For each sound event specified in the metadata, SELDVisualSynth randomly selects a corresponding visual representation from a library of video and image assets. These assets are then spatially positioned in the video according to their specified coordinates, simulating the visual side of sounds in a dynamic and immersive way.
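The positioning step amounts to projecting each event's azimuth and elevation onto the equirectangular frame. For orientation, here is a minimal sketch of that mapping (illustrative only, not the tool's internal code; it assumes the DCASE convention of azimuth in [-180, 180] degrees, positive to the left, and elevation in [-90, 90] degrees):

```python
# Illustrative sketch (not SELDVisualSynth internals): project a DCASE
# azimuth/elevation pair onto an equirectangular (360-degree) frame.
def sphere_to_equirect(azimuth_deg, elevation_deg, width, height):
    # Azimuth 0 maps to the frame center; positive azimuth (left of the
    # listener) maps left of center. Elevation +90 maps to the top row.
    x = int((0.5 - azimuth_deg / 360.0) * width) % width
    y = min(max(int((0.5 - elevation_deg / 180.0) * height), 0), height - 1)
    return x, y

# An event at azimuth 90, elevation 0 on a 1920x960 frame lands at (480, 480).
print(sphere_to_equirect(90, 0, 1920, 960))
```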


Table of Contents

  1. Installation
  2. Setup Steps
    2.1 Download 360-degree Image Canvas
    2.2 Download Image Assets
    2.3 Download Video Assets
  3. Usage Instructions
  4. Recommended Datasets Structure
  5. Datasets Summary
  6. Citation

Installation

Create and activate a virtual environment (Python 3.8+ is recommended)

python3 -m venv pyenv
source pyenv/bin/activate

Install requirements

pip install -r requirements.txt

Setup Steps

Important

Please follow the steps below. Note that Steps 2 and 3 require you to collect your own data and to ensure that all collected images and videos are correctly categorized into the 13 DCASE sound event classes. We recommend reviewing your dataset to confirm that the assets in each directory match the corresponding category (refer to Sound event classes; a reference list follows below).
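For reference, the 13 sound event classes used in recent DCASE Task 3 setups, with the indices that the Class_0 through Class_12 asset directories below are assumed to follow:

```python
# DCASE Task 3 sound event classes (index -> label). The Class_<i> asset
# directories described below are assumed to follow this indexing.
DCASE_CLASSES = {
    0: "Female speech, woman speaking",
    1: "Male speech, man speaking",
    2: "Clapping",
    3: "Telephone",
    4: "Laughter",
    5: "Domestic sounds",
    6: "Walk, footsteps",
    7: "Door, open or close",
    8: "Music",
    9: "Musical instrument",
    10: "Water tap, faucet",
    11: "Bell",
    12: "Knock",
}
```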

1 - Download 360-degree Image Canvas

Please download the 360-degree image assets to use as the canvas/background for video generation:

Download

2 - Download Image Assets

Download the Flickr30k dataset.

To categorize the data into the 13 DCASE classes, execute:

python categorize_flickr30k.py

Modify the paths within the script to point to your downloaded dataset:

# Paths
metadata_file = "path/to/flickr30k_images/results.csv"  # Path to the Flickr30k metadata file
images_dir = "path/to/flickr30k_images/flickr30k_images"  # Path to the Flickr30k images directory
output_dir = "path/to/destination/flickr30k_images_per_class"  # Path to the output directory where images will be categorized
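The repo script implements the actual mapping; as a rough illustration of caption-keyword categorization, a minimal sketch might look like the following (the keyword lists and column names here are assumptions, not the script's real mapping):

```python
import os
import shutil
import pandas as pd

# Hypothetical keyword map -- illustrative only; see categorize_flickr30k.py
# for the mapping actually used by the repo.
KEYWORDS = {"Class_8": ["music", "concert"], "Class_6": ["walking", "footsteps"]}

# Flickr30k's results.csv is pipe-separated: image_name | comment_number | comment
df = pd.read_csv(metadata_file, sep="|",
                 names=["image_name", "comment_number", "comment"], skiprows=1)
for _, row in df.iterrows():
    caption = str(row["comment"]).lower()
    for class_dir, words in KEYWORDS.items():
        if any(w in caption for w in words):
            dest = os.path.join(output_dir, class_dir)
            os.makedirs(dest, exist_ok=True)
            shutil.copy(os.path.join(images_dir, str(row["image_name"]).strip()), dest)
            break
```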

Note

Some classes, such as "Water tap, faucet," "Bell," and "Knock," may lack sufficient examples in the Flickr30k dataset. We recommend augmenting these categories by sourcing additional images online or from other datasets. Use the same categorization approach as described for Flickr30k.

3 - Download Video Assets

3.1 - Download Our Pre-recorded Videos

We provide some sample videos to illustrate the type of footage we use as video assets. These can be used for training; however, we recommend collecting more samples to achieve diverse visual synthesis. Refer to Sections 2 and 3.2.

Download pre-recorded videos

3.2 - YouTube Video Scraping

The script scrape_yt.py helps you find YouTube videos that match your specified sound event classes.

Features
  • Searches YouTube for videos matching 13 sound event classes.
  • Uses the YouTube Data API to perform searches.
  • Provides timestamps for each video (start and end).
  • Outputs results in CSV format.
  • Filters for shorter videos (under 10 minutes) for cleaner sound examples.
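For orientation, a search of this kind against the YouTube Data API v3 looks roughly like the sketch below (a minimal example, not the repo script; the query string is a placeholder):

```python
from googleapiclient.discovery import build

# Minimal sketch of one class's search with the YouTube Data API v3
# (see scrape_yt.py for the full logic, including timestamps and CSV output).
youtube = build("youtube", "v3", developerKey="YOUR_API_KEY")
response = youtube.search().list(
    q="water tap faucet sound",   # one query per sound event class
    part="id,snippet",
    type="video",
    videoDuration="short",        # API buckets: short < 4 min, medium 4-20 min
    maxResults=5,
).execute()
for item in response["items"]:
    print(item["id"]["videoId"], item["snippet"]["title"])
```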
Setup Instructions
  1. Install required packages:

    pip install google-api-python-client google-auth-httplib2 google-auth-oauthlib pandas
  2. You'll need YouTube API credentials. You can either:

    • Use an API key (simpler but rate-limited)
    • Set up OAuth 2.0 authentication (more complex but higher quotas)
  3. For API key:

    • Go to the Google Cloud Console
    • Create a new project or select an existing one
    • Enable the YouTube Data API v3
    • Create an API key under "Credentials"
  4. For OAuth (if you don't specify an API key):

    • Download the OAuth client configuration file as client_secret.json
    • Place it in the same directory as the script
    • Follow the authorization prompts when running the script
Usage
python scrape_yt.py --api_key YOUR_API_KEY --results 5 --output youtube_sound_events.csv
Parameters:

  • `--api_key`: Your YouTube API key (optional if using OAuth)
  • `--results`: Number of results to fetch per class (default: 5)
  • `--output`: Output CSV file name (default: `youtube_sound_events.csv`)

The script will create two files:

  • A CSV file with just the link, start, end, and class (the format consumed by the download step below)
  • A detailed CSV that also includes video titles and descriptions
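For illustration, a row of the simple CSV might look like this (hypothetical video ID and timestamps):

```
link,start,end,class
https://www.youtube.com/watch?v=XXXXXXXXXXX,12,27,Knock
```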
Data download

Run the download script, pointing it to the YouTube CSV file you generated:

python download.py
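download.py in the repo handles this step; to see the idea, a minimal download-and-trim sketch with yt-dlp (an assumption here, not necessarily what download.py uses) could look like:

```python
import yt_dlp
from yt_dlp.utils import download_range_func

# Hypothetical sketch: fetch one clip trimmed to [start, end] seconds.
# The repo's download.py may work differently.
def download_clip(url, start, end, out_dir):
    opts = {
        "format": "mp4",
        "paths": {"home": out_dir},
        "download_ranges": download_range_func(None, [(start, end)]),
        "force_keyframes_at_cuts": True,  # re-encode at cut points for clean trims
    }
    with yt_dlp.YoutubeDL(opts) as ydl:
        ydl.download([url])

# e.g. download_clip("https://www.youtube.com/watch?v=XXXXXXXXXXX", 12, 27,
#                    "video_assets_dir/Class_12")
```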

Finally, if desired, combine the pre-recorded videos and your downloaded videos into a unified directory structure like the one from Download Pre-recorded Videos.

Note

Some classes, such as "Footsteps," "Bell," "Knock," and "Music," may require manual inspection after downloading. Ideally, the sounding object should play the main role in a video rather than a secondary one. Note that you can adjust the start and end times in the CSV file to trim the video durations as desired. You can also get as creative as you want here; for instance, you can record your own video scenes and adopt them as part of your visual data generation.

Usage Instructions

  1. Generate synthetic spatial audio data using SpatialScaper.
  2. Define your configuration YAML file for the visual data generator (a hypothetical sketch follows this list).
    • Define input paths to video and image assets.
      • Important: include the path to the metadata directory generated by SpatialScaper in Step 1.
    • Define output paths for the generated videos.
    • Define parameters for the visual generator (the defaults are recommended).
    • Note: to start, we recommend modifying only the fields under input and output. The fields under processing require an understanding of how those parameters affect the visual synthesis; for the most part, the comments explain what they do.
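As rough orientation, a config of this shape might look as follows; every field name here is a hypothetical placeholder, so treat configs/visual_config.yaml as the actual reference:

```yaml
# Hypothetical sketch of a visual-synthesis config; the real field names
# live in configs/visual_config.yaml.
input:
  metadata_dir: path/to/metadata_dir        # SpatialScaper output from Step 1
  image_360_path: path/to/image_360_path    # 360-degree background images
  video_assets_dir: path/to/video_assets_dir
  image_assets_dir: path/to/image_assets_dir
output:
  output_dir: path/to/generated_videos
processing:
  fps: 10   # leave processing fields at their defaults to start
```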

Execute SELD visual synthesizer by:

python visual_synth.py --config configs/visual_config.yaml

Recommended Datasets Structure

360-degree image backgrounds

image_360_path/
    ├── image1.jpg
    ├── image2.jpg
    ├── ...

360-degree video backgrounds (optional, but recommended)

video_360_path/
    ├── video1.mp4
    ├── video2.mp4
    ├── ...

Directory containing video assets by event class (video "tiles")

video_assets_dir/
    ├── Class_0/
    │   ├── video1.mp4
    │   ├── video2.mp4
    │   ├── ...
    ├── Class_1/
    │   ├── video1.mp4
    │   ├── video2.mp4
    │   ├── ...
    ├── ...
    ├── Class_12/
    │   ├── video1.mp4
    │   ├── video2.mp4
    │   ├── ...

Directory containing image assets by event class (image "tiles"). Both JPEG and PNG are supported.

image_assets_dir/
    ├── Class_0/
    │   ├── image1.jpeg
    │   ├── image2.png
    │   ├── ...
    ├── Class_1/
    │   ├── image1.jpeg
    │   ├── image2.png
    │   ├── ...
    ├── ...
    ├── Class_12/
    │   ├── image1.jpeg
    │   ├── image2.png
    │   ├── ...
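Since both asset trees must contain exactly Class_0 through Class_12, a quick sanity check along these lines (a hypothetical helper, not part of the repo) can catch misplaced or empty class directories early:

```python
import os

# Hypothetical helper: verify that Class_0..Class_12 exist and are non-empty.
def check_asset_tree(root):
    for i in range(13):
        class_dir = os.path.join(root, f"Class_{i}")
        n_files = len(os.listdir(class_dir)) if os.path.isdir(class_dir) else 0
        print(f"{class_dir}: {n_files} files" + ("" if n_files else "  <-- missing or empty"))

check_asset_tree("video_assets_dir")
check_asset_tree("image_assets_dir")
```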

Metadata directory containing metadata CSV files (DCASE-style metadata)

metadata_dir/
    ├── dev-train-synth/   # From SpatialScaper
    │   ├── file1.csv
    │   ├── ...

Datasets Summary

| Dataset | URL |
| --- | --- |
| 360-degree Image Canvas (background) | Link |
| Flickr30k Image Dataset (foreground) | Link |
| Sample Pre-Recorded Videos (foreground) | Link |
| YouTube Videos (foreground) | Download using scrape_yt.py |
| SpatialScaper Simulated Audio | Link |

Citation

If you find our work useful, please cite our paper:

@article{roman2025generating,
  title={Generating Diverse Audio-Visual 360 Soundscapes for Sound Event Localization and Detection},
  author={Roman, Adrian S and Chang, Aiden and Meza, Gerardo and Roman, Iran R},
  journal={arXiv preprint arXiv:2504.02988},
  year={2025}
}

@inproceedings{roman2024spatial,
  title={Spatial Scaper: a library to simulate and augment soundscapes for sound event localization and detection in realistic rooms},
  author={Roman, Iran R and Ick, Christopher and Ding, Sivan and Roman, Adrian S and McFee, Brian and Bello, Juan P},
  booktitle={IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2024},
  organization={IEEE}
}
