Commit 614b866

Merge pull request #49 from huggingface/unit4-tweaks
Unit4 tweaks
2 parents 732a754 + e77f0fd commit 614b866

File tree

2 files changed (+204, -73 lines)

unit4/02_diffusion_for_audio.ipynb

Lines changed: 173 additions & 47 deletions
Large diffs are not rendered by default.

unit4/README.md

Lines changed: 31 additions & 26 deletions
@@ -1,6 +1,6 @@
# Unit 4: Going Further with Diffusion Models

-Welcome to Unit 4 of the Hugging Face Diffusion Models Course! In this unit we will look at some of the many improvements and extensions to diffusion models appearing in the latest research. It will be less code-heavy than previous units have been, and is designed to give you a jumping off point for further research.
+Welcome to Unit 4 of the Hugging Face Diffusion Models Course! In this unit, we will look at some of the many improvements and extensions to diffusion models appearing in the latest research. It will be less code-heavy than previous units have been and is designed to give you a jumping-off point for further research.

## Start this Unit :rocket:

@@ -11,47 +11,50 @@ Here are the steps for this unit:
- Dive deeper into any specific topics with the linked videos and resources
- Explore the demo notebooks and then read the 'What Next' section for some project suggestions

-:loudspeaker: Don't forget to join the [Discord](https://huggingface.co/join/discord), where you can discuss the material and share what you've made in the `#diffusion-models-class` channel.
+:loudspeaker: Don't forget to join [Discord](https://huggingface.co/join/discord), where you can discuss the material and share what you've made in the `#diffusion-models-class` channel.

## Table of Contents

-- [Faster Sampling via Distillation](#faster-sampling-via-distillation)
-- [Training Improvements](#training-improvements)
-- [More Control for Generation and Editing](more-control-for-generation-and-editing)
-- [Video](#video)
-- [Audio](#audio)
-- [New Architectures and Approaches - Towards 'Iterative Refinement'](#new-architectures-and-approaches---towards-iterative-refinement)
-- [Hands-On Notebooks](#hands-on-notebooks)
-- [Where Next?](#where-next)
+- [Unit 4: Going Further with Diffusion Models](#unit-4-going-further-with-diffusion-models)
+- [Start this Unit :rocket:](#start-this-unit-rocket)
+- [Table of Contents](#table-of-contents)
+- [Faster Sampling via Distillation](#faster-sampling-via-distillation)
+- [Training Improvements](#training-improvements)
+- [More Control for Generation and Editing](#more-control-for-generation-and-editing)
+- [Video](#video)
+- [Audio](#audio)
+- [New Architectures and Approaches - Towards 'Iterative Refinement'](#new-architectures-and-approaches---towards-iterative-refinement)
+- [Hands-On Notebooks](#hands-on-notebooks)
+- [Where Next?](#where-next)

## Faster Sampling via Distillation

-Progressive distillation is a technique for taking an existing diffusion model and using it to train a new version of the model that requires fewer steps for inference. The 'student' model is initialized from the weights of the 'teacher' model. During training, the teacher model performs two sampling steps and the student model tries to match the resulting prediction in a single step. This process can be repeated mutiple times, with the previous iteration's student model becoming the teacher for the next stage. The end result is a model that can produce decent samples in much fewer steps (typically 4 or 8) than the original teacher model. The core mechanism is illustrated in this diagram from the [paper that introduced the idea](http://arxiv.org/abs/2202.00512):
+Progressive distillation is a technique for taking an existing diffusion model and using it to train a new version of the model that requires fewer steps for inference. The 'student' model is initialized from the weights of the 'teacher' model. During training, the teacher model performs two sampling steps and the student model tries to match the resulting prediction in a single step. This process can be repeated multiple times, with the previous iteration's student model becoming the teacher for the next stage. The result is a model that can produce decent samples in much fewer steps (typically 4 or 8) than the original teacher model. The core mechanism is illustrated in this diagram from the [paper that introduced the idea](http://arxiv.org/abs/2202.00512):
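To make the two-steps-into-one mechanism concrete, here is a minimal sketch of a single distillation training step. It assumes `teacher` and `student` are noise-prediction models with the signature `model(x, t) -> predicted noise`, uses a plain deterministic DDIM update, and matches the student to the teacher with a simple MSE loss; the paper itself uses a particular prediction parameterisation and loss weighting, so treat this as an illustration rather than a reference implementation:

```python
import torch
import torch.nn.functional as F

def ddim_step(model, x, t_from, t_to, alphas_cumprod):
    """One deterministic DDIM update from timestep t_from to t_to."""
    a_from, a_to = alphas_cumprod[t_from], alphas_cumprod[t_to]
    eps = model(x, t_from)                                 # predicted noise
    x0 = (x - (1 - a_from).sqrt() * eps) / a_from.sqrt()   # implied clean image
    return a_to.sqrt() * x0 + (1 - a_to).sqrt() * eps

def distillation_step(teacher, student, x_t, t, t_mid, t_next, alphas_cumprod, opt):
    """Train the student to match two teacher steps (t -> t_mid -> t_next) in one."""
    with torch.no_grad():
        x_mid = ddim_step(teacher, x_t, t, t_mid, alphas_cumprod)
        target = ddim_step(teacher, x_mid, t_mid, t_next, alphas_cumprod)
    pred = ddim_step(student, x_t, t, t_next, alphas_cumprod)  # single student step
    loss = F.mse_loss(pred, target)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss
```

Repeating this procedure, with each round's distilled student becoming the next round's teacher, halves the number of sampling steps at every stage.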

![image](https://user-images.githubusercontent.com/6575163/211016659-7dac24a5-37e2-45f9-aba8-0c573937e7fb.png)

_Progressive Distillation illustrated (from the [paper](http://arxiv.org/abs/2202.00512))_

-The idea of using an existing model to 'teach' a new model can be extended to create guided models where the classifier-free guidance technique is used by the teacher model and the student model must learn to produce an equivalent output in a single step based on an additional input specifying the targeted guidance scale. This further reduces the number of model evaluations required to produce high-quality samples. [This video](https://www.youtube.com/watch?v=ZXuK6IRJlnk) gives an overview of the approach.
+The idea of using an existing model to 'teach' a new model can be extended to create guided models where the classifier-free guidance technique is used by the teacher model and the student model must learn to produce an equivalent output in a single step based on an additional input specifying the targeted guidance scale. This further reduces the number of model evaluations required to produce high-quality samples. [This video](https://www.youtube.com/watch?v=ZXuK6IRJlnk) gives an overview of the approach.
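For context on what the student has to absorb: classifier-free guidance normally needs two model evaluations per sampling step, blended according to the guidance scale, whereas a guidance-distilled student takes the scale as an extra input and reproduces the blended prediction in one forward pass. A generic sketch of the blending (the `model(x, t, embeds)` signature and argument names are illustrative, not a specific library API):

```python
def guided_noise_prediction(model, x_t, t, cond_embeds, uncond_embeds, guidance_scale):
    """Classifier-free guidance: blend conditional and unconditional noise predictions.

    Note: model(x, t, embeds) is an assumed generic signature for illustration.
    """
    eps_uncond = model(x_t, t, uncond_embeds)  # prediction with the empty/'null' prompt
    eps_cond = model(x_t, t, cond_embeds)      # prediction with the actual text prompt
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```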

NB: A distilled version of Stable Diffusion is due to be released fairly soon.

Key references:
-- [PROGRESSIVE DISTILLATION FOR FAST SAMPLING OF DIFFUSION MODELS](http://arxiv.org/abs/2202.00512)
-- [ON DISTILLATION OF GUIDED DIFFUSION MODELS](http://arxiv.org/abs/2210.03142)
+- [Progressive Distillation For Fast Sampling Of Diffusion Models](http://arxiv.org/abs/2202.00512)
+- [On Distillation Of Guided Diffusion Models](http://arxiv.org/abs/2210.03142)

## Training Improvements

-There have been a number of additional tricks developed to improve diffusion model training. In this section we've tried to capture the core ideas from recent papers. There is a constant stream of research coming out with additional improvements, so if you see a paper you feel should be added here please let us know!
+There have been several additional tricks developed to improve diffusion model training. In this section we've tried to capture the core ideas from recent papers. There is a constant stream of research coming out with additional improvements, so if you see a paper you feel should be added here please let us know!

![image](https://user-images.githubusercontent.com/6575163/211021220-e87ca296-cf15-4262-9359-7aeffeecbaae.png)
_Figure 2 from the [ERNIE-ViLG 2.0 paper](http://arxiv.org/abs/2210.15257)_

Key training improvements:
- Tuning the noise schedule, loss weighting and sampling trajectories for more efficient training (a small code sketch of one such schedule follows this list). An excellent paper exploring some of these design choices is [Elucidating the Design Space of Diffusion-Based Generative Models](http://arxiv.org/abs/2206.00364) by Karras et al.
-- Training on diverse aspect ratios, as described in [this video from the course launch event](https://www.youtube.com/watch?v=g6tIUrMvOec).
-- Cascaded diffusion models, training one model at low resolution and then one or more super-res models. Used in DALLE-2, Imagen and more for high-resolution image generation.
+- Training on diverse aspect ratios, as described in [this video from the course launch event](https://www.youtube.com/watch?v=g6tIUrMvOec).
+- Cascaded diffusion models, training one model at low resolution and then one or more super-res models. Used in DALLE-2, Imagen and more for high-resolution image generation.
- Better conditioning, incorporating rich text embeddings ([Imagen](https://arxiv.org/abs/2205.11487) uses a large language model called T5) or multiple types of conditioning ([eDiffi](http://arxiv.org/abs/2211.01324))
- 'Knowledge Enhancement' - incorporating pre-trained image captioning and object detection models into the training process to create more informative captions and produce better performance ([ERNIE-ViLG 2.0](http://arxiv.org/abs/2210.15257))
- 'Mixture of Denoising Experts' (MoDE) - training different variants of the model ('experts') for different noise levels as illustrated in the image above from the [ERNIE-ViLG 2.0 paper](http://arxiv.org/abs/2210.15257).
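On the noise schedule point above, here is a small runnable sketch of the sampling noise levels proposed in the Karras et al. paper; the default values are the paper's, while the function name and the `print` example are ours:

```python
import numpy as np

def karras_sigmas(n_steps, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    """Noise levels from Karras et al. (2022): interpolate between sigma_max and
    sigma_min in rho-th-root space, which spaces steps more densely at low noise."""
    ramp = np.linspace(0, 1, n_steps)
    min_inv_rho = sigma_min ** (1 / rho)
    max_inv_rho = sigma_max ** (1 / rho)
    sigmas = (max_inv_rho + ramp * (min_inv_rho - max_inv_rho)) ** rho
    return np.append(sigmas, 0.0)  # end exactly at sigma = 0

print(karras_sigmas(10).round(3))
```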
@@ -64,22 +67,22 @@ Key references:

## More Control for Generation and Editing

-In addition to training improvements, there have been a number of innovations in the sampling and inference phase, including many approaches that can add new capabilities to existing diffusion models.
+In addition to training improvements, there have been several innovations in the sampling and inference phase, including many approaches that can add new capabilities to existing diffusion models.

![image](https://user-images.githubusercontent.com/6575163/212529129-3de41cf4-6f70-4607-8448-e9bbe9d190cf.png)
_Samples generated by 'paint-with-words' ([eDiffi](http://arxiv.org/abs/2211.01324))_

The video ['Editing Images with Diffusion Models'](https://www.youtube.com/watch?v=zcG7tG3xS3s) gives an overview of the different methods being used to edit existing images with diffusion models. The available techniques can be split into four main categories:

-1) Add noise and then denoise with a new prompt. This is the idea behind the img2img pipeline, which has been modified and extended in various papers.
+1) Add noise and then denoise with a new prompt. This is the idea behind the `img2img` pipeline, which has been modified and extended in various papers (a short usage sketch follows this list):
- [SDEdit](https://sde-image-editing.github.io/) and [MagicMix](https://magicmix.github.io/) build on this idea
- DDIM inversion (TODO link tutorial) uses the model to 'reverse' the sampling trajectory rather than adding random noise, giving more control
-- [Null-text Inversion](https://null-text-inversion.github.io/) enhances the performance of this kind of approach dramatically by optimizing the unconditional text embeddings used for classifier-free-guidance at each step, allowing for extremely high-quality text-based image editing.
+- [Null-text Inversion](https://null-text-inversion.github.io/) enhances the performance of this kind of approach dramatically by optimizing the unconditional text embeddings used for classifier-free guidance at each step, allowing for extremely high-quality text-based image editing.
2) Extending the ideas in (1) but with a mask to control where the effect is applied
- [Blended Diffusion](https://omriavrahami.com/blended-diffusion-page/) introduces the basic idea
- [This demo](https://huggingface.co/spaces/nielsr/text-based-inpainting) uses an existing segmentation model (CLIPSeg) to create the mask based on a text description
- [DiffEdit](https://arxiv.org/abs/2210.11427) is an excellent paper that shows how the diffusion model itself can be used to generate an appropriate mask for editing the image based on text.
-- [SmartBrush: Text and Shape Guided Object Inpainting with Diffusion Model](https://arxiv.org/abs/2212.05034) fine-tunes a diffusion model for more accurate mask-guided inpainting.
+- [SmartBrush: Text and Shape Guided Object Inpainting with Diffusion Model](https://arxiv.org/abs/2212.05034) fine-tunes a diffusion model for more accurate mask-guided inpainting.
3) Cross-attention Control: using the cross-attention mechanism in diffusion models to control the spatial location of edits for more fine-grained control.
- [Prompt-to-Prompt Image Editing with Cross Attention Control](https://arxiv.org/abs/2208.01626) is the key paper that introduced this idea, and the technique has [since been applied to Stable Diffusion](https://wandb.ai/wandb/cross-attention-control/reports/Improving-Generative-Images-with-Instructions-Prompt-to-Prompt-Image-Editing-with-Cross-Attention-Control--VmlldzoyNjk2MDAy)
- This idea is also used for 'paint-with-words' ([eDiffi](http://arxiv.org/abs/2211.01324), shown above)
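As an example of category (1), the `img2img` pipeline in the `diffusers` library adds a controllable amount of noise to an input image (via `strength`) and then denoises it with a new prompt. A rough usage sketch; the checkpoint name, file names and parameter values below are placeholders rather than recommendations:

```python
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

# Load a Stable Diffusion checkpoint in the img2img pipeline (checkpoint name is
# just an example) and edit an existing image by re-noising and re-denoising it.
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("input.png").convert("RGB").resize((512, 512))
result = pipe(
    prompt="An oil painting of a wizard's tower at sunset",
    image=init_image,
    strength=0.6,        # 0 = keep the input unchanged, 1 = ignore it entirely
    guidance_scale=7.5,
).images[0]
result.save("edited.png")
```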
@@ -96,7 +99,7 @@ The paper [InstructPix2Pix: Learning to Follow Image Editing Instructions](https
![image](https://user-images.githubusercontent.com/6575163/213657523-be40178a-4357-410b-89e3-a4cbd8528900.png)
_Still frames from [sample videos generated with Imagen Video](https://imagen.research.google/video/)_

-A video can be represented as a sequence of images, and the core ideas of diffusion models can be applied to these sequences. Recent work has focused on finding appropriate architectures (such as '3D UNets' which operate on entire sequences) and on working efficiently with video data. Since high-frame-rate video involves a lot more data than still images, current approaches tend to first generate low-resolution and low-frame-rate video and then apply spatial and temporal super-resolution to produce the final high-quality video outputs.
+A video can be represented as a sequence of images, and the core ideas of diffusion models can be applied to these sequences. Recent work has focused on finding appropriate architectures (such as '3D UNets' which operate on entire sequences) and on working efficiently with video data. Since high-frame-rate video involves a lot more data than still images, current approaches tend to first generate low-resolution and low-frame-rate video and then apply spatial and temporal super-resolution to produce the final high-quality video outputs.

Key references:
- [Video Diffusion Models](https://video-diffusion.github.io/)
@@ -133,9 +136,9 @@ We are slowly moving beyond the original narrow definition of "diffusion" models

_Pipeline from [MaskGIT](http://arxiv.org/abs/2202.04200)_

-The UNet architecture at the heart of many current diffusion models is also being replaced with different alternatives, most notably various transformer-based architectures. In [Scalable Diffusion Models with Transformers (DiT)](https://www.wpeebles.com/DiT) a transformer is used in place of the UNet for a fairly standard diffusion model approach, with excellent results. [Recurrent Interface Networks](https://arxiv.org/pdf/2212.11972.pdf) applies a novel transformer-based architecture and training strategy in pursuit of additional efficiency. [MaskGIT](http://arxiv.org/abs/2202.04200) and [MUSE](http://arxiv.org/abs/2301.00704) use transformer models to work with tokenized representations of images, although the [Paella](https://arxiv.org/abs/2211.07292v1) model demonstrates that a UNet can also be applied successfully to these token-based regimes too.
+The UNet architecture at the heart of many current diffusion models is also being replaced with different alternatives, most notably various transformer-based architectures. In [Scalable Diffusion Models with Transformers (DiT)](https://www.wpeebles.com/DiT) a transformer is used in place of the UNet for a fairly standard diffusion model approach, with excellent results. [Recurrent Interface Networks](https://arxiv.org/pdf/2212.11972.pdf) applies a novel transformer-based architecture and training strategy in pursuit of additional efficiency. [MaskGIT](http://arxiv.org/abs/2202.04200) and [MUSE](http://arxiv.org/abs/2301.00704) use transformer models to work with tokenized representations of images, although the [Paella](https://arxiv.org/abs/2211.07292v1) model demonstrates that a UNet can also be applied successfully to these token-based regimes too.

-With each new paper more efficient approaches are being developed, and it may be some time before we see what peak performance looks like on these kinds of iterative refinement tasks. There is much more still to explore!
+With each new paper, more efficient approaches are being developed, and it may be some time before we see what peak performance looks like on these kinds of iterative refinement tasks. There is much more still to explore!

Key references

@@ -144,11 +147,13 @@ Key references
- [MaskGIT: Masked Generative Image Transformer](http://arxiv.org/abs/2202.04200)
- [Muse: Text-To-Image Generation via Masked Generative Transformers](http://arxiv.org/abs/2301.00704)
- [Fast Text-Conditional Discrete Denoising on Vector-Quantized Latent Spaces (Paella)](https://arxiv.org/abs/2211.07292v1)
-- [Recurrent Interface Networks](https://arxiv.org/pdf/2212.11972.pdf) - a promising new architecture that does well at generating high-resolution images without relying on latent diffusion or super-resolution. See also, [simple diffusion: End-to-end diffusion for high resolution images](https://arxiv.org/abs/2301.11093) which highlights the importance of the noise schedule for training at higher resolutions.
+- [Recurrent Interface Networks](https://arxiv.org/pdf/2212.11972.pdf) - a promising new architecture that does well at generating high-resolution images without relying on latent diffusion or super-resolution. See also, [simple diffusion: End-to-end diffusion for high-resolution images](https://arxiv.org/abs/2301.11093) which highlights the importance of the noise schedule for training at higher resolutions.

## Hands-On Notebooks

-TODO link table
+| Chapter | Colab | Kaggle | Gradient | Studio Lab |
+|:--------|:------|:-------|:---------|:-----------|
+| Diffusion for Audio | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/diffusion-models-class/blob/main/unit4/02_diffusion_for_audio.ipynb) | [![Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://kaggle.com/kernels/welcome?src=https://github.com/huggingface/diffusion-models-class/blob/main/unit4/02_diffusion_for_audio.ipynb) | [![Gradient](https://assets.paperspace.io/img/gradient-badge.svg)](https://console.paperspace.com/github/huggingface/diffusion-models-class/blob/main/unit4/02_diffusion_for_audio.ipynb) | [![Open In SageMaker Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/diffusion-models-class/blob/main/unit4/02_diffusion_for_audio.ipynb) |

We've covered a LOT of different ideas in this unit, many of which deserve much more detailed follow-on lessons in the future. For now, you can explore two of the many topics via the hands-on notebooks we've prepared.
- **DDIM Inversion** shows how a technique called inversion can be used to edit images using existing diffusion models

0 commit comments