Sounding Anagrams

Anagram + Sound in a single image!

Adapted from image-that-sound and visual_anagrams.

Method

We use the combined score

$$\displaystyle \epsilon^t_{\text{combined}}(z_t)=W\epsilon^t_{\text{audio}}(z_t)+(1-W)\epsilon_{\text{image}}^t(z_t),$$

where

$$\epsilon^t_{\text{audio}}(z_t)=\sum_{v \in \text{views}}w^a_{v}v^{-1}(\epsilon_{\theta_{a}}(v(z_t),t,y^a_v)),$$

$$\epsilon^t_{\text{image}}(z_t)=\sum_{v \in \text{views}}w^i_{v}v^{-1}(\epsilon_{\theta_{i}}(v(z_t),t,y^i_v)),$$

subject to $w^a_v=w^i_v=1$ for all $v$ for Gaussian-blur hybrids, and $\sum_v w^a_v=\sum_v w^i_v=1$ for other anagram types.

Notation:

$z_t$ -> latent at timestep $t$
$W$ -> audio weight, balancing the audio and image noise estimates
$w^i_v$ and $w^a_v$ -> anagram image/audio weights, used to control the visual/auditory emphasis on different views
$y_v^i$ and $y_v^a$ -> image/audio prompts for view $v$
$\epsilon_{\theta_a}$ -> audio denoiser
$\epsilon_{\theta_i}$ -> image denoiser

This works for all invertible views by the linearity of the denoising process.
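The combined score above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the repository's implementation: the function name `combined_noise`, the representation of a view as a `(forward, inverse)` pair of callables, and the toy denoiser are all assumptions made for the example.

```python
import numpy as np


def combined_noise(z_t, t, views, audio_denoiser, image_denoiser,
                   audio_prompts, image_prompts,
                   audio_view_weights, image_view_weights, audio_weight):
    """Blend per-view audio and image noise estimates into one estimate.

    `views` is a list of (forward, inverse) callables (hypothetical layout).
    The denoisers are any callables taking (latent, timestep, prompt).
    """
    eps_audio = np.zeros_like(z_t)
    eps_image = np.zeros_like(z_t)
    for (view, inverse), y_a, y_i, w_a, w_i in zip(
            views, audio_prompts, image_prompts,
            audio_view_weights, image_view_weights):
        viewed = view(z_t)
        # Denoise in the transformed frame, then map the estimate back.
        eps_audio += w_a * inverse(audio_denoiser(viewed, t, y_a))
        eps_image += w_i * inverse(image_denoiser(viewed, t, y_i))
    # W * eps_audio + (1 - W) * eps_image
    return audio_weight * eps_audio + (1.0 - audio_weight) * eps_image


# Toy usage: identity view plus a 90-degree rotation of the spatial axes.
identity = (lambda x: x, lambda x: x)
rot90 = (lambda x: np.rot90(x, k=1, axes=(1, 2)),
         lambda x: np.rot90(x, k=-1, axes=(1, 2)))


def toy_denoiser(x, t, prompt):
    # Stand-in for a real diffusion model; ignores t and prompt.
    return 0.1 * x


z = np.random.default_rng(0).standard_normal((1, 8, 8))
eps = combined_noise(
    z, 10, [identity, rot90], toy_denoiser, toy_denoiser,
    ["cat meow", "dog bark"],
    ["painting of cats", "painting of dogs"],
    [0.5, 0.5], [0.5, 0.5], audio_weight=0.5)
```

Each estimate is mapped back through the inverse view before the weighted sum, so every term lives in the frame of the original latent.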

The method only produces greyscale images. To obtain colorized images, we apply a similar idea: we have the diffusion model treat the task as generating a color hybrid image and use the greyscale image as a reference (similar to inpainting).
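One way to read the inpainting analogy is as a projection that keeps whatever color the model proposes while pinning the greyscale part to the reference. The sketch below illustrates that reading only. The greyscale/color split (channel mean plus residual) and the helper names are assumptions, not the repository's code, and a real sampling loop would presumably apply this at each step with the reference noised to the current timestep.

```python
import numpy as np


def grey_component(x):
    """Per-pixel mean over the channel axis, broadcast to all channels.

    x has shape (channels, height, width).
    """
    return np.broadcast_to(x.mean(axis=0, keepdims=True), x.shape)


def impose_grey_reference(x, grey_ref):
    """Keep the color residual of x but replace its greyscale part.

    grey_ref has shape (height, width) and acts as the fixed reference.
    """
    color = x - grey_component(x)      # channel mean of this residual is zero
    return color + grey_ref[None, :, :]


rng = np.random.default_rng(0)
proposal = rng.random((3, 4, 4))       # stand-in for a model's color estimate
reference = rng.random((4, 4))         # stand-in for the greyscale result
colorized = impose_grey_reference(proposal, reference)
```

After the projection, the channel mean of `colorized` equals `reference`, while the differences between channels come entirely from `proposal`.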

Examples

Download audio results from assets.

Anagram type: Patch permutation

image_prompt:

(left) painting of cats, lithograph style, grayscale
(right) painting of a dogs, lithograph style, grayscale

audio_prompt:

(left) cat meow
(right) dog bark

image_prompt:

(left) painting of violin, lithograph style, grayscale
(right) painting of cows, lithograph style, grayscale

audio_prompt:

(left) violin music
(right) cow mooing

Anagram type: Patch rotation (90 degrees)

image_prompt:

(left) painting of a park, lithograph style, grayscale
(right) painting of a plane, lithograph style, grayscale

audio_prompt:

(left) bird chirping
(right) airline fly by

image_prompt:

(left) paiting of car, lithograph style, grayscale
(right) painting of bell tower, lithograph style, grayscale

audio_prompt:

(left) car beeping
(right) bell ringing

Colorization

With exactly the same prompts as above:

Patch permutation example:

Patch rotation example:

Main parameters to tune in config.trainer

views: the anagram type you want to create. See the VIEW_MAP dictionary in this file for the list of types.

anagram_image_weight & anagram_audio_weight: control the visual/auditory emphasis on different views. A slightly biased weight usually produces better results.

image_prompt & audio_prompt: descriptions of the visual/auditory content you want to generate.

audio_weight: controls the balance between image and audio within the same view. To generate a hybrid image without an engraved spectrogram, set this parameter to 0.
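As an illustration of how these parameters fit together, the sketch below mirrors them in a plain Python dict and checks the weight constraints from the Method section. The dict layout, the view names, the weight values, and the helper `check_view_weights` are hypothetical. The actual configuration format and the valid view names are defined by the repository (see VIEW_MAP).

```python
def check_view_weights(weights, gaussian_blur_hybrid=False, tol=1e-6):
    """Return True if per-view weights satisfy the constraint in Method.

    Gaussian-blur hybrids: every weight equals 1.
    Other anagram types: the weights sum to 1.
    """
    if gaussian_blur_hybrid:
        return all(abs(w - 1.0) <= tol for w in weights)
    return abs(sum(weights) - 1.0) <= tol


# Hypothetical values for illustration only.
example_trainer_config = {
    "views": ["VIEW_A", "VIEW_B"],          # placeholders; use keys from VIEW_MAP
    "image_prompt": [
        "painting of a park, lithograph style, grayscale",
        "painting of a plane, lithograph style, grayscale",
    ],
    "audio_prompt": ["bird chirping", "airline fly by"],
    "anagram_image_weight": [0.55, 0.45],   # slightly biased toward the first view
    "anagram_audio_weight": [0.45, 0.55],
    "audio_weight": 0.5,                    # 0 disables the audio term entirely
}

weights_ok = (
    check_view_weights(example_trainer_config["anagram_image_weight"])
    and check_view_weights(example_trainer_config["anagram_audio_weight"])
)
```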

How to run

You can use the following demo for testing.
