Anagram + Sound in a single image!
Adapted from image-that-sound and visual_anagrams.
We use the combined score
where
s.t.
Notation:
- $z_t$ -> latent at timestep $t$
- $W$ -> audio weight of the image
- $w^i_v$ and $w^a_v$ -> anagram image/audio weights, used to control the visual/auditory emphasis on different views
- $y_v^i$ and $y_v^a$ -> image/audio prompts for different views
- $\epsilon_{\theta_s}$ -> audio denoiser
- $\epsilon_{\theta_i}$ -> image denoiser
This works for all invertible views by the linearity of the denoising process.
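The way the notation fits together can be sketched in a few lines. This is a minimal illustration under assumptions, not the repository's implementation: the weighting formula is assumed (image and audio estimates mixed by $1 - W$ and $W$, per-view weights summing to one), the denoisers are stand-in callables, and `VIEWS`, `combined_noise`, and the toy views are hypothetical names.

```python
import numpy as np

# Hypothetical invertible views, each a (forward, inverse) pair acting on
# arrays of shape (channels, height, width).
VIEWS = {
    "identity": (lambda z: z, lambda z: z),
    # A vertical flip is its own inverse.
    "flip": (lambda z: z[..., ::-1, :], lambda z: z[..., ::-1, :]),
    # Clockwise quarter turn, undone by a counter-clockwise quarter turn.
    "rotate_cw": (
        lambda z: np.rot90(z, k=-1, axes=(-2, -1)),
        lambda z: np.rot90(z, k=1, axes=(-2, -1)),
    ),
}


def combined_noise(z_t, views, image_eps, audio_eps,
                   image_prompts, audio_prompts,
                   w_image, w_audio, audio_weight):
    """Combine per-view image and audio noise estimates (assumed formula).

    image_eps / audio_eps stand in for the image and audio denoisers and are
    called as eps(z_view, prompt). w_image / w_audio play the role of
    w^i_v / w^a_v, and audio_weight plays the role of W.
    """
    w_image = np.asarray(w_image, dtype=float)
    w_audio = np.asarray(w_audio, dtype=float)
    # Assumed constraint: the per-view weights of each modality sum to one.
    assert np.isclose(w_image.sum(), 1.0) and np.isclose(w_audio.sum(), 1.0)

    total = np.zeros_like(z_t, dtype=float)
    for v, (forward, inverse) in enumerate(views):
        z_view = forward(z_t)                      # latent seen through view v
        eps_i = image_eps(z_view, image_prompts[v])
        eps_a = audio_eps(z_view, audio_prompts[v])
        mixed = ((1.0 - audio_weight) * w_image[v] * eps_i
                 + audio_weight * w_audio[v] * eps_a)
        total += inverse(mixed)                    # map back to the base view
    return total
```

With `audio_weight=0` the audio term drops out and only the image denoiser contributes, which matches the description of `audio_weight` in the parameter list.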
The method only produces greyscale images. To obtain colorized images, we use an idea similar to the one above: we make the diffusion model think that it is generating a color hybrid image and use the greyscale image as a reference (like inpainting).
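One way to read "use the greyscale image as a reference" is as a projection step that forces the greyscale component of a color estimate to match the reference while leaving its color component untouched. The sketch below assumes greyscale is the channel mean and that the projection is applied to an image estimate during sampling; both are assumptions, and `to_grey` and `project_to_reference` are hypothetical names rather than functions from this repository.

```python
import numpy as np


def to_grey(rgb):
    """Greyscale as the channel mean; rgb has shape (3, H, W)."""
    return rgb.mean(axis=0, keepdims=True)


def project_to_reference(rgb_estimate, grey_reference):
    """Make the greyscale component equal to the reference.

    The same offset is added to every channel, so the color component
    (rgb minus its greyscale) is left unchanged.
    """
    offset = grey_reference - to_grey(rgb_estimate)  # shape (1, H, W)
    return rgb_estimate + offset                     # broadcast over channels
```

After the projection, `to_grey(project_to_reference(x, g))` equals `g`, so the diffusion model is only free to choose the colors.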
Download audio results from assets.
image_prompt:
- (left) painting of cats, lithograph style, grayscale
- (right) painting of a dogs, lithograph style, grayscale
audio_prompt:
- (left) cat meow
- (right) dog bark
image_prompt:
- (left) painting of violin, lithograph style, grayscale
- (right) painting of cows, lithograph style, grayscale
audio_prompt:
- (left) violin music
- (right) cow mooing
image_prompt:
- (left) painting of a park, lithograph style, grayscale
- (right) painting of a plane, lithograph style, grayscale
audio_prompt:
- (left) bird chirping
- (right) airline fly by
image_prompt:
- (left) paiting of car, lithograph style, grayscale
- (right) painting of bell tower, lithograph style, grayscale
audio_prompt:
- (left) car beeping
- (right) bell ringing
With exactly the same prompts as above:
Patch permutation example:
Patch rotation example:
Main parameters to tune in config.trainer
views
This is the anagram type you want to create. See the VIEW_MAP dictionary in this file for the list of available types.
anagram_image_weight & anagram_audio_weight
These control the visual/auditory emphasis on different views. A slightly biased weight usually produces better results.
image_prompt & audio_prompt
These are the descriptions of the visual/auditory content you want to generate.
audio_weight
This controls the visual/auditory balance within the same view. To generate a hybrid image without an engraved spectrogram, set this parameter to 0.
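To make the relationship between these parameters concrete, here is a small sketch of a settings dictionary with a consistency check. Only the parameter names come from the list above. The value formats (one list entry per view), the view keys `"identity"` and `"flip"`, the weight values, the `[0, 1]` range for `audio_weight`, and the helper `check_trainer_config` are all assumptions made for illustration; check VIEW_MAP and the actual config file for the real keys and formats.

```python
# Hypothetical example values; the prompts are taken from the examples above.
trainer_config = {
    "views": ["identity", "flip"],  # placeholder keys, see VIEW_MAP
    "image_prompt": [
        "painting of violin, lithograph style, grayscale",
        "painting of cows, lithograph style, grayscale",
    ],
    "audio_prompt": ["violin music", "cow mooing"],
    "anagram_image_weight": [0.55, 0.45],  # slightly biased towards view 0
    "anagram_audio_weight": [0.45, 0.55],  # slightly biased towards view 1
    "audio_weight": 0.5,                   # 0 disables the spectrogram term
}


def check_trainer_config(cfg):
    """Check that every per-view setting has one entry per view."""
    n_views = len(cfg["views"])
    per_view_keys = (
        "image_prompt",
        "audio_prompt",
        "anagram_image_weight",
        "anagram_audio_weight",
    )
    for key in per_view_keys:
        if len(cfg[key]) != n_views:
            raise ValueError(f"{key} needs {n_views} entries, got {len(cfg[key])}")
    # Assumed range: audio_weight is treated as a mixing factor in [0, 1].
    if not 0.0 <= cfg["audio_weight"] <= 1.0:
        raise ValueError("audio_weight is assumed to lie in [0, 1]")
    return n_views
```

Calling `check_trainer_config(trainer_config)` before a run catches the most common mistake, a prompt or weight list whose length does not match the number of views.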
You can use the following demo for testing.