# Unit 4: Going Further with Diffusion Models
Welcome to Unit 4 of the Hugging Face Diffusion Models Course! In this unit we will look at some of the many improvements and extensions to diffusion models appearing in the latest research. It will be less code-heavy than previous units have been, and is designed to give you a jumping off point for further research.
## Start this Unit :rocket:
Here are the steps for this unit:
- Complete the [TODO some sort of exercise/capstone project]
:loudspeaker: Don't forget to join the [Discord](https://huggingface.co/join/discord), where you can discuss the material and share what you've made in the `#diffusion-models-class` channel.
## Faster Sampling via Distillation
Progressive distillation is a technique for taking an existing diffusion model and using it to train a new version of the model that requires fewer steps for inference. The 'student' model is initialized from the weights of the 'teacher' model. During training, the teacher model performs two sampling steps and the student model tries to match the resulting prediction in a single step. This process can be repeated multiple times, with the previous iteration's student model becoming the teacher for the next stage. The end result is a model that can produce decent samples in far fewer steps (typically 4 or 8) than the original teacher model. The core mechanism is illustrated in this diagram from the [paper that introduced the idea](http://arxiv.org/abs/2202.00512):
The idea of using an existing model to 'teach' a new model can be extended to create guided models where the classifier-free guidance technique is used by the teacher model and the student model must learn to produce an equivalent output in a single step based on an additional input specifying the targeted guidance scale. This further reduces the number of model evaluations required to produce high-quality samples. [This video](https://www.youtube.com/watch?v=ZXuK6IRJlnk) gives an overview of the approach.
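To make the core training loop a little more concrete, here is a minimal sketch of a single progressive-distillation step. It assumes generic `teacher`/`student` noise-prediction UNets and a deterministic `ddim_step` update function; all names are illustrative rather than taken from the paper's code, and the real objective reparametrizes the target and applies a signal-to-noise weighting, whereas this sketch simply matches the denoised outputs with MSE.

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch of one progressive-distillation training step.
# `teacher` and `student` are noise-prediction UNets with the same interface,
# and `ddim_step(eps, x, t_from, t_to)` applies one deterministic DDIM update.

def distillation_loss(student, teacher, ddim_step, x_t, t, t_mid, t_next):
    with torch.no_grad():
        # Teacher takes two deterministic steps: t -> t_mid -> t_next
        x_mid = ddim_step(teacher(x_t, t), x_t, t, t_mid)
        target = ddim_step(teacher(x_mid, t_mid), x_mid, t_mid, t_next)
    # Student must land in the same place with a single step t -> t_next
    pred = ddim_step(student(x_t, t), x_t, t, t_next)
    return F.mse_loss(pred, target)

# After each round of training, the student becomes the teacher for the next
# round, halving the number of sampling steps required each time.
```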
NB: A distilled version of Stable Diffusion is due to be released fairly soon.
Key papers:
- [Progressive Distillation for Fast Sampling of Diffusion Models](http://arxiv.org/abs/2202.00512)
- [On Distillation of Guided Diffusion Models](http://arxiv.org/abs/2210.03142)
## Training Improvements
There have been a number of additional tricks developed to improve diffusion model training. In this section we've tried to capture the core ideas from recent papers. There is a constant stream of research coming out with additional improvements, so if you see a paper you feel should be added here, please let us know!
_Figure 2 from the [ERNIE-ViLG 2.0 paper](http://arxiv.org/abs/2210.15257)_
Key training improvements:
- Tuning the noise schedule, loss weighting and sampling trajectories for more efficient training. An excellent paper exploring some of these design choices is [Elucidating the Design Space of Diffusion-Based Generative Models](http://arxiv.org/abs/2206.00364) by Karras et al. (a short sketch of their proposed sigma schedule follows this list).
- Training on diverse aspect ratios, as described in [this video from the course launch event](https://www.youtube.com/watch?v=g6tIUrMvOec).
- Cascaded diffusion models: training one model at low resolution and then one or more super-resolution models. Used in DALL-E 2, Imagen and more for high-resolution image generation.
- Better conditioning, incorporating rich text embeddings ([Imagen](https://arxiv.org/abs/2205.11487) uses a large language model called T5) or multiple types of conditioning ([eDiffi](http://arxiv.org/abs/2211.01324)).
- 'Knowledge Enhancement': incorporating pre-trained image captioning and object detection models into the training process to create more informative captions and improve performance ([ERNIE-ViLG 2.0](http://arxiv.org/abs/2210.15257)).
- 'Mixture of Denoising Experts' (MoDE): training different variants of the model ('experts') for different noise levels, as illustrated in the image above from the [ERNIE-ViLG 2.0 paper](http://arxiv.org/abs/2210.15257).
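As a concrete example of the first bullet, here is a small sketch of the noise-level (sigma) schedule proposed by Karras et al. in the paper linked above. The `sigma_min`/`sigma_max` defaults are the values used for the image experiments in that paper and are an assumption here; the key idea is that spacing `sigma**(1/rho)` linearly concentrates sampling steps at low noise levels, where fine detail is resolved.

```python
import numpy as np

def karras_sigmas(n_steps: int, sigma_min: float = 0.002,
                  sigma_max: float = 80.0, rho: float = 7.0) -> np.ndarray:
    """Noise levels spaced as in Karras et al. (2022)."""
    ramp = np.linspace(0, 1, n_steps)
    min_inv_rho = sigma_min ** (1 / rho)
    max_inv_rho = sigma_max ** (1 / rho)
    # Interpolate linearly in sigma^(1/rho) space, then map back
    return (max_inv_rho + ramp * (min_inv_rho - max_inv_rho)) ** rho

print(karras_sigmas(10).round(3))  # 80.0 down to 0.002, densest at the low end
```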
Key papers:
- [Elucidating the Design Space of Diffusion-Based Generative Models](http://arxiv.org/abs/2206.00364)
- [eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers](http://arxiv.org/abs/2211.01324)
- [ERNIE-ViLG 2.0: Improving Text-to-Image Diffusion Model with Knowledge-Enhanced Mixture-of-Denoising-Experts](http://arxiv.org/abs/2210.15257)
- [Imagen: Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding](https://arxiv.org/abs/2205.11487) ([demo site](https://imagen.research.google/))
## More Control for Generation and Editing
In addition to training improvements, there have been a number of innovations in the sampling and inference phase, including many approaches that can add new capabilities to existing diffusion models.
_Samples generated by 'paint-with-words' ([eDiffi](http://arxiv.org/abs/2211.01324))_
The video ['Editing Images with Diffusion Models'](https://www.youtube.com/watch?v=zcG7tG3xS3s) gives an overview of the different methods being used to edit existing images with diffusion models. The available techniques can be split into four main categories:
1) Add noise and then denoise with a new prompt. This is the idea behind the img2img pipeline (see the short usage sketch after this list), which has been modified and extended in various papers.
- [SDEdit](https://sde-image-editing.github.io/) and [MagicMix](https://magicmix.github.io/) build on this idea
- DDIM inversion (TODO link tutorial) uses the model to 'reverse' the sampling trajectory rather than adding random noise, giving more control
- [Null-text Inversion](https://null-text-inversion.github.io/) dramatically enhances the performance of this kind of approach by optimizing the unconditional text embeddings used for classifier-free guidance at each step, allowing for extremely high-quality text-based image editing.
2) Extending the ideas in (1), but with a mask to control where the effect is applied.
- [Blended Diffusion](https://omriavrahami.com/blended-diffusion-page/) introduces the basic idea
- [This demo](https://huggingface.co/spaces/nielsr/text-based-inpainting) uses an existing segmentation model (CLIPSeg) to create the mask based on a text description
- [DiffEdit](https://arxiv.org/abs/2210.11427) is an excellent paper that shows how the diffusion model itself can be used to generate an appropriate mask for editing the image based on text.
- [SmartBrush: Text and Shape Guided Object Inpainting with Diffusion Model](https://arxiv.org/abs/2212.05034) fine-tunes a diffusion model for more accurate mask-guided inpainting.
3) Cross-attention Control: using the cross-attention mechanism in diffusion models to control the spatial location of edits, enabling more fine-grained editing.
- [Prompt-to-Prompt Image Editing with Cross Attention Control](https://arxiv.org/abs/2208.01626) is the key paper that introduced this idea, and the technique has [since been applied to Stable Diffusion](https://wandb.ai/wandb/cross-attention-control/reports/Improving-Generative-Images-with-Instructions-Prompt-to-Prompt-Image-Editing-with-Cross-Attention-Control--VmlldzoyNjk2MDAy)
4) Fine-tune ('overfit') on a single image and then generate with the fine-tuned model. Both of the following papers published variants of this idea at roughly the same time:
- [Imagic: Text-Based Real Image Editing with Diffusion Models](https://arxiv.org/abs/2210.09276)
- [UniTune: Text-Driven Image Editing by Fine Tuning an Image Generation Model on a Single Image](https://arxiv.org/abs/2210.09477)
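As a concrete example of category (1), the sketch below uses the `diffusers` img2img pipeline to add noise to an existing image and re-denoise it towards a new prompt. The checkpoint name, file names, prompt and `strength` value are illustrative, and depending on your `diffusers` version the image argument may be called `init_image` rather than `image`.

```python
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from diffusers.utils import load_image

# Illustrative checkpoint - swap in whichever Stable Diffusion model you use
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

init_image = load_image("sketch.png").resize((512, 512))  # any input image

# `strength` controls how much noise is added before re-denoising:
# 0.0 returns the input unchanged, 1.0 nearly ignores it
result = pipe(
    prompt="An oil painting of a wizard's tower at sunset",
    image=init_image,
    strength=0.7,
    guidance_scale=7.5,
).images[0]
result.save("edited.png")
```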
The paper [InstructPix2Pix: Learning to Follow Image Editing Instructions](https://arxiv.org/abs/2211.09800) is notable in that it used some of the image editing techniques described above to build a synthetic dataset of image pairs alongside image edit instructions (generated with GPT3.5), which it then used to train a new model capable of editing images based on natural language instructions.
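A model trained this way can be driven directly with an edit instruction. The sketch below assumes the `diffusers` InstructPix2Pix pipeline and the publicly released checkpoint; the checkpoint name, file names and guidance values are illustrative and may differ from what you end up using.

```python
import torch
from diffusers import StableDiffusionInstructPix2PixPipeline
from diffusers.utils import load_image

# Illustrative checkpoint name - check the Hub for the current release
pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

image = load_image("photo.png")

# `image_guidance_scale` trades faithfulness to the input image against
# how strongly the text instruction is applied
edited = pipe(
    "make it look like a snowy winter scene",
    image=image,
    num_inference_steps=20,
    image_guidance_scale=1.5,
    guidance_scale=7.0,
).images[0]
edited.save("edited.png")
```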