From Hearing to Seeing: Linking Auditory and Visual Place Perceptions with Soundscape-to-Image Generative Artificial Intelligence
- KXAN Austin: https://www.kxan.com/news/local/austin/can-ai-visualize-an-environment-from-sounds-ut-researchers-put-it-to-the-test/
- AOL: https://www.aol.com/ai-visualize-environment-sounds-ut-170110588.html
- New Atlas: https://newatlas.com/ai-humanoids/ai-street-images-sound/
- Austin Journal: https://austinjournal.com/stories/666193521-ai-converts-sound-into-street-view-images-using-new-generative-technology
- Inavate: https://www.inavateonthenet.net/news/article/researchers-use-ai-to-turn-sounds-into-images
- PetaPixel: https://petapixel.com/2024/12/03/ai-generates-accurate-images-of-streets-from-sound-recordings/
- TALANOA 'O TONGA: https://talanoaotonga.to/ai-turns-street-sounds-into-realistic-images-with-remarkable-accuracy/
- Videomaker: https://www.videomaker.com/news/ai-generates-accurate-street-images-from-only-sound-recordings/
If you use this algorithm in your research or applications, please cite this source:
Zhuang, Y., Kang, Y., Fei, T., Bian, M. and Du, Y., 2024. From hearing to seeing: Linking auditory and visual place perceptions with soundscape-to-image generative artificial intelligence. Computers, Environment and Urban Systems, 110, p.102122. https://www.sciencedirect.com/science/article/abs/pii/S0198971524000516
@article{ZHUANG2024102122,
title = {From hearing to seeing: Linking auditory and visual place perceptions with soundscape-to-image generative artificial intelligence},
journal = {Computers, Environment and Urban Systems},
volume = {110},
pages = {102122},
year = {2024},
issn = {0198-9715},
doi = {https://doi.org/10.1016/j.compenvurbsys.2024.102122},
url = {https://www.sciencedirect.com/science/article/pii/S0198971524000516},
author = {Yonggai Zhuang and Yuhao Kang and Teng Fei and Meng Bian and Yunyan Du},
keywords = {Soundscape, Street view images, Sense of place, Stable diffusion, Generative AI, LLMs},
}
People experience the world through multiple senses simultaneously, which together contribute to our sense of place. Prior quantitative geography studies have mostly emphasized human visual perceptions, neglecting human auditory perceptions of place due to the challenges in characterizing the acoustic environment vividly. Also, few studies have synthesized the two (auditory and visual) dimensions of perception in understanding the human sense of place. To bridge these gaps, we propose a Soundscape-to-Image Stable Diffusion model, a generative Artificial Intelligence (AI) model supported by Large Language Models (LLMs), aiming to visualize soundscapes through the generation of street view images. By creating audio-image pairs, acoustic environments are first represented as high-dimensional semantic audio vectors. Our proposed Soundscape-to-Image Stable Diffusion model, which contains a Low-Resolution Diffusion Model and a Super-Resolution Diffusion Model, can then translate those semantic audio vectors into visual representations of place effectively. We evaluated our proposed model using both machine-based and human-centered approaches and showed that the generated street view images align with our common perceptions and accurately recreate several key street elements of the original soundscapes. It also demonstrates that soundscapes provide sufficient visual information about places. This study stands at the forefront of the intersection between generative AI and human geography, demonstrating how human multi-sensory experiences can be linked. We aim to enrich geospatial data science and AI studies with human experiences. It has the potential to inform multiple domains such as human geography, environmental psychology, and urban design and planning, as well as to advance our knowledge of human-environment relationships.
- Environment: Python 3.9 or newer
- Install the dependencies:
pip install -r requirements.txt
We recommend using our pre-trained audio encoder. Please download wlc.pt before starting training or inference.
We also provide CEUS.pt, a pre-trained checkpoint of our Soundscape-to-Image model trained on 7.5k audio-image pairs gathered from publicly available YouTube videos.
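Before training or inference, it can help to confirm that the downloaded checkpoints deserialize correctly. The snippet below is only a quick sanity check and assumes both files are ordinary PyTorch checkpoint files; the exact contents depend on the release.

```python
# Quick sanity check that the downloaded checkpoints load (assumes they are
# standard PyTorch checkpoint files; exact keys depend on the release).
import torch

for ckpt in ["wlc.pt", "CEUS.pt"]:
    state = torch.load(ckpt, map_location="cpu")
    print(ckpt, type(state).__name__)
```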
- Place your training audio dataset and image dataset in the respective directories.
- Ensure that paired audio and image files share the same filename (a small pairing sketch follows the example layout below). Here's an example:
|-- data
|   |-- audio
|   |   |-- example_1.wav
|   |   |-- example_2.wav
|   |-- image
|   |   |-- example_1.jpg
|   |   |-- example_2.jpg
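As noted above, train.py matches audio and images by their shared filename. A minimal, standalone sketch of that pairing logic is shown below; collect_pairs is a hypothetical helper for illustration, not part of the repository, and it only assumes the ./data layout above with .wav and .jpg files.

```python
# Hypothetical helper (not part of train.py) that pairs audio and image files
# by shared filename stem, assuming the ./data/audio and ./data/image layout above.
from pathlib import Path

def collect_pairs(root: str = "data"):
    audio_dir = Path(root) / "audio"
    image_dir = Path(root) / "image"
    pairs = []
    for wav in sorted(audio_dir.glob("*.wav")):
        jpg = image_dir / f"{wav.stem}.jpg"
        if jpg.exists():
            pairs.append((wav, jpg))
        else:
            print(f"Warning: no image found for {wav.name}")
    return pairs

if __name__ == "__main__":
    for wav, jpg in collect_pairs():
        print(wav.name, "<->", jpg.name)
```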
You can train the Soundscape-to-Image model by running train.py. Our model consists of two U-Net networks, which need to be trained separately. Therefore, you must run train.py at least twice, once for each U-Net (a rough sketch of this two-stage setup follows the examples below). Usage:
python train.py --train-image-path [PATH_TO_IMAGE_DATASET] --train-audio-path [PATH_TO_AUDIO_DATASET] --pre-trained-audio-encoder [PATH_TO_PRETRAINED_AUDIO_ENCODER] --checkpoint-path [PATH_TO_SAVE_CHECKPOINTS] --batch-size [BATCH_SIZE] --epochs [NUMBER_OF_EPOCHS] --lr-unet [LEARNING_RATE] --train-unet-number [UNET_MODEL_INDEX] --save-every [SAVE_FREQUENCY] --continue-unet-ckpt [PATH_TO_CHECKPOINT]
- --train-image-path: Path to the directory containing the training images.
- --train-audio-path: Path to the directory containing the training audio files.
- --pre-trained-audio-encoder: Path to the pre-trained audio encoder model file (e.g., wlc.pt).
- --checkpoint-path: Directory where model checkpoints will be saved.
- --batch-size: Number of samples processed per batch during training (e.g., 5).
- --epochs: Total number of training epochs (e.g., 30).
- --lr-unet: Learning rate for the U-Net models (e.g., 1e-4).
- --train-unet-number: Which U-Net model to train (1 or 2).
- --save-every: Frequency, in epochs, at which to save model checkpoints (e.g., 1).
- --continue-unet-ckpt: Path to a previously saved checkpoint from which to resume training (optional).
Example:
python train.py --train-image-path ./data/image --train-audio-path ./data/audio --pre-trained-audio-encoder ./wlc.pt --checkpoint-path ./checkpoints --batch-size 5 --epochs 30 --lr-unet 1e-4 --train-unet-number 1 --save-every 1
If training was interrupted, you can resume it using:
--continue-unet-ckpt ./checkpoints/imagen_1_10_epochs.pt
Example:
python train.py --train-image-path ./data/image --train-audio-path ./data/audio --pre-trained-audio-encoder ./wlc.pt --checkpoint-path ./checkpoints --batch-size 5 --epochs 30 --lr-unet 1e-4 --train-unet-number 1 --save-every 1 --continue-unet-ckpt ./checkpoints/imagen_1_10_epochs.pt
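For orientation, the two training runs above correspond to the two stages of the cascade (low-resolution generation, then super-resolution). The sketch below illustrates how such a two-U-Net cascade is typically driven with a lucidrains-style imagen_pytorch API, with audio embeddings passed through the text-conditioning path. It is an illustration under those assumptions, not a copy of train.py, and all dimensions, image sizes, and tensors are placeholders.

```python
# Illustrative two-stage training step in the style of the bundled imagen_pytorch
# (lucidrains-style API). All dimensions, image sizes, and tensors are placeholders,
# not the values used by train.py.
import torch
from imagen_pytorch import Unet, Imagen, ImagenTrainer

# Stage 1: low-resolution U-Net; Stage 2: super-resolution U-Net.
unet1 = Unet(dim=32, cond_dim=512, dim_mults=(1, 2, 4, 8), num_resnet_blocks=3,
             layer_attns=(False, True, True, True), layer_cross_attns=(False, True, True, True))
unet2 = Unet(dim=32, cond_dim=512, dim_mults=(1, 2, 4, 8), num_resnet_blocks=(2, 4, 8, 8),
             layer_attns=(False, False, False, True), layer_cross_attns=(False, False, False, True))

imagen = Imagen(
    unets=(unet1, unet2),
    image_sizes=(64, 256),   # low-res output, then super-res output (placeholders)
    timesteps=1000,
    cond_drop_prob=0.1,      # enables classifier-free guidance at sampling time
)

trainer = ImagenTrainer(imagen).cuda()

# Dummy batch: images (B, 3, 256, 256) and audio embeddings standing in for
# text embeddings; the real embedding shape depends on the audio encoder (wlc.pt).
images = torch.randn(4, 3, 256, 256).cuda()
audio_embeds = torch.randn(4, 8, 768).cuda()

for unet_number in (1, 2):   # in practice these are two separate train.py runs
    loss = trainer(images, text_embeds=audio_embeds, unet_number=unet_number)
    trainer.update(unet_number=unet_number)
    print(f"unet {unet_number} loss: {loss:.4f}")
```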
After training, you can test the model or apply it to new audio with the inference script.
Usage:
python inference.py --audio-enconder-ckpt [PATH_TO_AUDIO_ENCODER_CKPT] --unet-ckpt [PATH_TO_UNET_CKPT] --test-audio-path [PATH_TO_TEST_AUDIO] --test-image-path [PATH_TO_OUTPUT_IMAGES] --cond-scale [CONDITION_SCALE]
- --audio-enconder-ckpt: Path to the pre-trained audio encoder checkpoint (e.g., ./wlc.pt).
- --unet-ckpt: Path to the trained U-Net model checkpoint (e.g., ./checkpoints/imagen_1_30_epochs.pt).
- --test-audio-path: Directory containing test audio files (e.g., ./test_audio).
- --test-image-path: Directory where generated images will be saved (e.g., ./generated_images).
- --cond-scale: Controls how strongly the model adheres to the audio features (default: 1.0).
Example:
python inference.py --audio-enconder-ckpt ./wlc.pt --unet-ckpt ./checkpoints/imagen_1_30_epochs.pt --test-audio-path ./test_audio --test-image-path ./generated_images --cond-scale 1.0
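The --cond-scale flag follows the usual classifier-free guidance recipe: at each denoising step the conditional and unconditional noise predictions are blended, and values above 1.0 weight the audio conditioning more heavily. A generic sketch of that blend (not code from this repository) is shown below.

```python
# Generic classifier-free guidance blend, which is what --cond-scale controls.
# eps_cond / eps_uncond stand in for the U-Net's conditional and unconditional
# noise predictions at one denoising step (random placeholders here).
import torch

def guided_prediction(eps_cond, eps_uncond, cond_scale=1.0):
    # cond_scale = 1.0 returns the purely audio-conditioned prediction;
    # larger values push the sample further toward the audio conditioning.
    return eps_uncond + cond_scale * (eps_cond - eps_uncond)

eps_cond, eps_uncond = torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64)
print(guided_prediction(eps_cond, eps_uncond, cond_scale=1.0).shape)
```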
The following folders and files are key components of our project. Please do not delete or move them.
torchvggish → Our audio encoder, used for extracting audio embeddings (a standalone embedding example follows the project layout below).
imagen_pytorch → Our image generation module, based on Imagen. We have modified some parameters, so you do not need to install the original Imagen package separately.
project
|-- torchvggish
|-- imagen_pytorch
|-- train.py
|-- sample.py
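For a quick, standalone look at the kind of embedding the audio encoder produces, the public torch.hub release of VGGish can be used as below. Note that this loads the upstream pre-trained model, not the fine-tuned weights in wlc.pt or the modified copy bundled in torchvggish; the file path is the example from the data layout above.

```python
# Standalone VGGish example via torch.hub (the public upstream model, not the
# fine-tuned wlc.pt encoder or the modified torchvggish bundled in this repo).
import torch

vggish = torch.hub.load("harritaylor/torchvggish", "vggish")
vggish.eval()

# VGGish returns one 128-dimensional embedding per ~1 second of audio.
embeddings = vggish.forward("data/audio/example_1.wav")
print(embeddings.shape)  # e.g. (num_seconds, 128)
```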
Yonggai Zhuang: start128@163.com
Junbo Wang: bbojunbo@gmail.com
Albert Jiang: albertjiang@utexas.edu
Yuhao Kang: yuhao.kang@austin.utexas.edu