Commit 7221e2d: Added video and audio sections

unit4/README.md
- [SmartBrush: Text and Shape Guided Object Inpainting with Diffusion Model](https://arxiv.org/abs/2212.05034) fine-tunes a diffusion model for more accurate mask-guided inpainting.
3) Cross-attention Control: using the cross-attention mechanism in diffusion models to control the spatial location of edits for more fine-grained control.
- [Prompt-to-Prompt Image Editing with Cross Attention Control](https://arxiv.org/abs/2208.01626) is the key paper that introduced this idea, and the technique has [since been applied to Stable Diffusion](https://wandb.ai/wandb/cross-attention-control/reports/Improving-Generative-Images-with-Instructions-Prompt-to-Prompt-Image-Editing-with-Cross-Attention-Control--VmlldzoyNjk2MDAy).
- This idea is also used for 'paint-with-words' ([eDiffi](http://arxiv.org/abs/2211.01324), shown above).
4) Fine-tune ('overfit') on a single image and then generate with the fine-tuned model - a rough code sketch of this idea follows the list. The following papers both published variants of this idea at roughly the same time:
- [Imagic: Text-Based Real Image Editing with Diffusion Models](https://arxiv.org/abs/2210.09276)
- [UniTune: Text-Driven Image Editing by Fine Tuning an Image Generation Model on a Single Image](https://arxiv.org/abs/2210.09477)
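To make approach 4 concrete, here is a minimal sketch of the single-image fine-tuning idea using the 🤗 diffusers library. It is illustrative only: the model ID, prompts, learning rate and step count are placeholders, and methods like Imagic add extra stages (such as optimizing the text embedding) on top of this basic loop:

```python
import torch
import torch.nn.functional as F
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
opt = torch.optim.Adam(pipe.unet.parameters(), lr=1e-5)  # illustrative settings

# Placeholder for the real input image: a (1, 3, 512, 512) tensor in [-1, 1]
image = torch.randn(1, 3, 512, 512)
# Text embedding for a prompt describing the input image
ids = pipe.tokenizer("A photo of a dog", return_tensors="pt").input_ids
emb = pipe.text_encoder(ids)[0]

# Encode the image to latents once, then 'overfit' the UNet on it
latents = pipe.vae.encode(image).latent_dist.sample() * 0.18215
for _ in range(100):
    noise = torch.randn_like(latents)
    t = torch.randint(0, pipe.scheduler.config.num_train_timesteps, (1,))
    noisy = pipe.scheduler.add_noise(latents, noise, t)
    pred = pipe.unet(noisy, t, encoder_hidden_states=emb).sample
    loss = F.mse_loss(pred, noise)  # standard noise-prediction objective
    loss.backward()
    opt.step()
    opt.zero_grad()

# Generating with an edited prompt now tends to preserve the overfit image
result = pipe("A photo of a dog wearing a hat").images[0]
```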

## Video

![image](https://user-images.githubusercontent.com/6575163/213657523-be40178a-4357-410b-89e3-a4cbd8528900.png)
_Still frames from [sample videos generated with Imagen Video](https://imagen.research.google/video/)_
A video can be represented as a sequence of images, and the core ideas of diffusion models can be applied to these sequences. Recent work has focused on finding appropriate architectures (such as '3D UNets', which operate on entire sequences) and on working efficiently with video data. Since high-frame-rate video involves a lot more data than still images, current approaches tend to first generate low-resolution, low-frame-rate video and then apply spatial and temporal super-resolution to produce the final high-quality video outputs.
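To make the '3D' part concrete, here is a toy space-time block of the kind such architectures build on (a sketch with made-up names, not any specific paper's code): video is just a 5D tensor, and convolutions can be factorized into a spatial part and a temporal part:

```python
import torch
import torch.nn as nn

class SpaceTimeBlock(nn.Module):
    """Toy factorized space-time conv block (illustrative only)."""
    def __init__(self, channels):
        super().__init__()
        # Spatial conv: acts on each frame independently (kernel size 1 in time)
        self.spatial = nn.Conv3d(channels, channels, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        # Temporal conv: mixes information across frames (kernel size 1 in space)
        self.temporal = nn.Conv3d(channels, channels, kernel_size=(3, 1, 1), padding=(1, 0, 0))
        self.act = nn.SiLU()

    def forward(self, x):  # x: (batch, channels, frames, height, width)
        x = self.act(self.spatial(x))
        return self.act(self.temporal(x))

video = torch.randn(1, 64, 16, 32, 32)   # 16 frames of 32x32 feature maps
print(SpaceTimeBlock(64)(video).shape)   # torch.Size([1, 64, 16, 32, 32])
```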

Key Papers:
- [Video Diffusion Models](https://video-diffusion.github.io/)
- [Imagen Video: High Definition Video Generation with Diffusion Models](https://imagen.research.google/video/paper.pdf)

## Audio

![image](https://user-images.githubusercontent.com/6575163/213657272-a1b54017-216f-453b-9b28-97c6fef21f54.png)
_A spectrogram generated with Riffusion ([image source](https://www.riffusion.com/about))_
While there has been some work on generating audio directly using diffusion models (e.g. [DiffWave](https://arxiv.org/abs/2009.09761)), the most successful approach so far has been to convert the audio signal into something called a spectrogram, which effectively 'encodes' the audio as a 2D "image" that can then be used to train the kinds of diffusion models we're used to using for image generation. The resulting generated spectrograms can then be converted into audio using existing methods. This approach is behind the recently released Riffusion, which fine-tuned Stable Diffusion to generate spectrograms conditioned on text - [try it out here](https://www.riffusion.com/).
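For example, here is a minimal sketch of the spectrogram round trip described above, using librosa (Riffusion's actual pipeline differs; the parameters here are illustrative):

```python
import librosa

# Audio -> mel spectrogram: the 2D 'image' a diffusion model can be trained on
y, sr = librosa.load(librosa.ex("trumpet"))  # demo clip bundled with librosa
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048, hop_length=512, n_mels=128)
log_mel = librosa.power_to_db(mel)  # log-scaling makes it look more image-like

# Spectrogram -> audio: approximate inversion with Griffin-Lim phase recovery
# (a generated spectrogram would be converted back to sound the same way)
y_rec = librosa.feature.inverse.mel_to_audio(mel, sr=sr, n_fft=2048, hop_length=512)
```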
Key references:
- [DiffWave: A Versatile Diffusion Model for Audio Synthesis](https://arxiv.org/abs/2009.09761)
- ['Riffusion'](https://www.riffusion.com/about) (and [code](https://github.com/riffusion/riffusion))

## New Architectures and Approaches - Towards 'Iterative Refinement'

Using a Transformer in place of the UNet: [DiT (Scalable Diffusion Models with Transformers)](https://arxiv.org/abs/2212.09748)
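A toy sketch of the idea (illustrative, not the actual DiT code): patchify the latents into tokens, process them with a standard Transformer, and fold the output back into an image-shaped prediction. Conditioning on the timestep (which DiT handles via adaptive LayerNorm) is omitted for brevity:

```python
import torch
import torch.nn as nn

patch, dim, ch = 2, 256, 4                 # illustrative sizes
latents = torch.randn(1, ch, 32, 32)       # e.g. Stable-Diffusion-style latents

to_tokens = nn.Conv2d(ch, dim, kernel_size=patch, stride=patch)  # patchify
layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
blocks = nn.TransformerEncoder(layer, num_layers=4)
head = nn.Linear(dim, ch * patch * patch)  # predict noise for each patch

tokens = to_tokens(latents).flatten(2).transpose(1, 2)  # (1, 16*16 tokens, dim)
tokens = blocks(tokens)
out = head(tokens)                         # (1, 256, ch * patch * patch)

# Unpatchify: fold the per-patch predictions back into an image-shaped tensor
h = w = 32 // patch
out = out.reshape(1, h, w, patch, patch, ch)
out = torch.einsum("nhwpqc->nchpwq", out).reshape(1, ch, h * patch, w * patch)
print(out.shape)                           # torch.Size([1, 4, 32, 32])
```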
