From Hearing to Seeing: Linking Auditory and Visual Place Perceptions with Soundscape-to-Image Generative Artificial Intelligence

[Framework figure]


Citation

If you use this algorithm in your research or applications, please cite this source:

Zhuang, Y., Kang, Y., Fei, T., Bian, M. and Du, Y., 2024. From hearing to seeing: Linking auditory and visual place perceptions with soundscape-to-image generative artificial intelligence. Computers, Environment and Urban Systems, 110, p.102122. https://www.sciencedirect.com/science/article/abs/pii/S0198971524000516

@article{ZHUANG2024102122,
title = {From hearing to seeing: Linking auditory and visual place perceptions with soundscape-to-image generative artificial intelligence},
journal = {Computers, Environment and Urban Systems},
volume = {110},
pages = {102122},
year = {2024},
issn = {0198-9715},
doi = {https://doi.org/10.1016/j.compenvurbsys.2024.102122},
url = {https://www.sciencedirect.com/science/article/pii/S0198971524000516},
author = {Yonggai Zhuang and Yuhao Kang and Teng Fei and Meng Bian and Yunyan Du},
keywords = {Soundscape, Street view images, Sense of place, Stable diffusion, Generative AI, LLMs},
}

About The Project

People experience the world through multiple senses simultaneously, contributing to our sense of place. Prior quantitative geography studies have mostly emphasized human visual perceptions, neglecting human auditory perceptions at place due to the challenges in characterizing the acoustic environment vividly. Also, few studies have synthesized these two (auditory and visual) perceptual dimensions in understanding the human sense of place. To bridge these gaps, we propose a Soundscape-to-Image Stable Diffusion model, a generative Artificial Intelligence (AI) model supported by Large Language Models (LLMs), aiming to visualize soundscapes through the generation of street view images. By creating audio-image pairs, acoustic environments are first represented as high-dimensional semantic audio vectors. Our proposed Soundscape-to-Image Stable Diffusion model, which contains a Low-Resolution Diffusion Model and a Super-Resolution Diffusion Model, can then translate those semantic audio vectors into visual representations of place effectively. We evaluated our proposed model using both machine-based and human-centered approaches and showed that the generated street view images align with common perceptions and accurately recreate several key street elements of the original soundscapes. This also demonstrates that soundscapes provide sufficient visual information about places. This study stands at the forefront of the intersection between generative AI and human geography, demonstrating how human multi-sensory experiences can be linked. We aim to enrich geospatial data science and AI studies with human experiences. The work has the potential to inform multiple domains such as human geography, environmental psychology, and urban design and planning, as well as to advance our knowledge of human-environment relationships.

[Framework figure]

Environment

  1. Environment: Python 3.9 or newer

  2. Install dependencies:

  pip install -r requirements.txt

Download Models

We recommend using our pre-trained audio encoder. Please download wlc.pt before starting training or inference.

We also provide CEUS.pt, a pre-trained checkpoint of our Soundscape-to-Image model trained on 7.5k audio-image pairs gathered from publicly available YouTube videos.
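As a quick sanity check before training, you can confirm that the downloaded files deserialize correctly. This is a minimal sketch using plain PyTorch; the internal structure of wlc.pt and CEUS.pt is not documented here, so the script only prints whatever each file contains:

import torch

# Minimal sanity check: confirm the downloaded checkpoints load.
# The exact contents of wlc.pt and CEUS.pt are not specified in this README,
# so we only inspect the deserialized objects.
for path in ["wlc.pt", "CEUS.pt"]:
    ckpt = torch.load(path, map_location="cpu")
    if isinstance(ckpt, dict):
        print(path, "->", list(ckpt.keys())[:5])
    else:
        print(path, "->", type(ckpt))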

Data Management

  1. Place your training audio dataset and image dataset in the respective directories.
  2. Ensure that paired audio and image files share the same filename. Here’s an example:
|-- data
|   |-- audio
|      |-- example_1.wav
|      |-- example_2.wav
|   |-- image
|      |-- example_1.jpg
|      |-- example_2.jpg
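Before training, it can help to confirm that every audio file has a matching image. The following is a hypothetical helper (the data/audio and data/image paths come from the example above; adjust them to your layout):

from pathlib import Path

# Check that every .wav file in data/audio has a .jpg with the same stem in
# data/image, and vice versa.
audio_dir, image_dir = Path("data/audio"), Path("data/image")
audio_stems = {p.stem for p in audio_dir.glob("*.wav")}
image_stems = {p.stem for p in image_dir.glob("*.jpg")}

missing_images = sorted(audio_stems - image_stems)
missing_audio = sorted(image_stems - audio_stems)
if missing_images or missing_audio:
    print("Audio files without images:", missing_images)
    print("Image files without audio:", missing_audio)
else:
    print(f"{len(audio_stems)} matched audio-image pairs found.")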

Train Model

You can train the Soundscape-to-Image model by running train.py. Our model consists of two U-Net networks, which need to be trained separately. Therefore, you must run train.py at least twice, once for each U-Net. Usage:

python train.py --train-image-path [PATH_TO_IMAGE_DATASET] --train-audio-path [PATH_TO_AUDIO_DATASET] --pre-trained-audio-encoder [PATH_TO_PRETRAINED_AUDIO_ENCODER] --checkpoint-path [PATH_TO_SAVE_CHECKPOINTS] --batch-size [BATCH_SIZE] --epochs [NUMBER_OF_EPOCHS] --lr-unet [LEARNING_RATE] --train-unet-number [UNET_MODEL_INDEX] --save-every [SAVE_FREQUENCY] --continue-unet-ckpt [PATH_TO_CHECKPOINT]

--train-image-path Path to the directory containing the training images.

--train-audio-path Path to the directory containing the training audio files.

--pre-trained-audio-encoder Path to the pre-trained audio encoder model file (e.g., wlc.pt).

--checkpoint-path Directory where model checkpoints will be saved.

--batch-size Number of samples processed per batch during training (e.g., 5).

--epochs Total number of training epochs (e.g., 30).

--lr-unet Learning rate for the U-Net models (e.g., 1e-4).

--train-unet-number Which U-Net model to train (1 or 2).

--save-every Frequency (in epochs) at which model checkpoints are saved (e.g., 1).

Example:

python train.py --train-image-path ./data/image --train-audio-path ./data/audio --pre-trained-audio-encoder ./wlc.pt --checkpoint-path ./checkpoints --batch-size 5 --epochs 30 --lr-unet 1e-4 --train-unet-number 1 --save-every 1

If training was interrupted, you can resume it using:

--continue-unet-ckpt ./checkpoints/imagen_1_10_epochs.pt

Example:

python train.py --train-image-path ./data/image --train-audio-path ./data/audio --pre-trained-audio-encoder ./wlc.pt --checkpoint-path ./checkpoints --batch-size 5 --epochs 30 --lr-unet 1e-4 --train-unet-number 1 --save-every 1 --continue-unet-ckpt ./checkpoints/imagen_1_10_epochs.pt
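Because the two U-Nets are trained separately, a small driver script can run both stages back to back. The sketch below simply shells out to train.py with the same illustrative paths and hyperparameters as the example above; it is a convenience wrapper, not part of the repository:

import subprocess

# Train U-Net 1, then U-Net 2, with identical data paths and hyperparameters.
# All values mirror the example command above; adjust them to your setup.
common_args = [
    "--train-image-path", "./data/image",
    "--train-audio-path", "./data/audio",
    "--pre-trained-audio-encoder", "./wlc.pt",
    "--checkpoint-path", "./checkpoints",
    "--batch-size", "5",
    "--epochs", "30",
    "--lr-unet", "1e-4",
    "--save-every", "1",
]
for unet_number in ("1", "2"):
    subprocess.run(
        ["python", "train.py", "--train-unet-number", unet_number, *common_args],
        check=True,
    )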

Inference

After training, you can test or apply the model by running sample.py.

Usage:

python inference.py --audio-enconder-ckpt [PATH_TO_AUDIO_ENCODER_CKPT] --unet-ckpt [PATH_TO_UNET_CKPT] --test-audio-path [PATH_TO_TEST_AUDIO] --test-image-path [PATH_TO_OUTPUT_IMAGES] --cond-scale [CONDITION_SCALE]
 
--audio-enconder-ckpt Path to the pre-trained audio encoder checkpoint (e.g., ./wlc.pt). 
--unet-ckpt  Path to the trained U-Net model checkpoint (e.g., ./checkpoints/imagen_1_30_epochs.pt). 
--test-audio-path Directory containing test audio files (e.g., ./test_audio). 
--test-image-path Directory where generated images will be saved (e.g., ./generated_images). 
--cond-scale Controls how strongly the model adheres to the audio features (default: 1.0). 

Example:

python inference.py --audio-enconder-ckpt ./wlc.pt --unet-ckpt ./checkpoints/imagen_1_30_epochs.pt --test-audio-path ./test_audio --test-image-path ./generated_images --cond-scale 1.0
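To see how strongly the conditioning scale affects the output, you can sweep --cond-scale over a few values and compare the results. This is a hypothetical wrapper around the example command above; the per-scale output folder names are illustrative, not part of the repository's interface:

import subprocess
from pathlib import Path

# Run inference at several conditioning scales and store each run in its own
# output folder so the generated images can be compared side by side.
for cond_scale in ("0.5", "1.0", "2.0"):
    out_dir = Path("./generated_images") / f"cond_{cond_scale}"
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        [
            "python", "inference.py",
            "--audio-enconder-ckpt", "./wlc.pt",
            "--unet-ckpt", "./checkpoints/imagen_1_30_epochs.pt",
            "--test-audio-path", "./test_audio",
            "--test-image-path", str(out_dir),
            "--cond-scale", cond_scale,
        ],
        check=True,
    )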

Folder Structure

The following folders and files are key components of our project. Please do not delete or move them.

torchvggish → Our audio encoder used for extracting audio embeddings.

imagen_pytorch → Our image generation module, based on Imagen. We have modified some parameters, so you do not need to install the original Imagen separately.

project
|-- torchvggish
|-- imagen_pytorch
|-- train.py
|-- sample.py

Contact

Yonggai Zhuang: start128@163.com

Junbo Wang: bbojunbo@gmail.com

Albert Jiang: albertjiang@utexas.edu

Yuhao Kang: yuhao.kang@austin.utexas.edu
