Add Pruna optimization framework documentation #11688
Open
davidberenstein1957 wants to merge 11 commits into huggingface:main from PrunaAI:docs/add-pruna-to-diffusers-optimization
Commits
2762d8d Add Pruna optimization framework documentation
19e1226 Enhance Pruna documentation with image alt text and code block format…
323afe1 Add installation section to Pruna documentation
42832ba Merge branch 'main' into docs/add-pruna-to-diffusers-optimization
703e3dd Update pruna.md
f68941d Update pruna.md
e44e109 Update Pruna documentation for model optimization and evaluation
2f05b6e Merge branch 'main' into docs/add-pruna-to-diffusers-optimization
6ea2644 Refactor Pruna documentation for clarity and consistency
f6aeaad Apply suggestions from code review
b680f16 Enhance Pruna documentation with new examples and clarifications
# Pruna

[Pruna](https://github.com/pruna-ai/pruna) is a powerful model optimization framework that helps you unlock maximum performance from your AI models. With Pruna, you can dramatically accelerate inference, reduce memory usage, and improve efficiency, all while maintaining similar output quality.

Pruna provides a comprehensive suite of cutting-edge optimization algorithms, each designed to address a specific performance bottleneck. From quantization and pruning to advanced caching and compilation, Pruna gives you the tools to fine-tune your models for optimal performance. The table below gives a general overview of the optimization methods Pruna supports.
| Technique | Description | Speed | Memory | Quality |
|--------------|-----------------------------------------------------------------------------------------------|:-----:|:------:|:-------:|
| `batcher` | Groups multiple inputs together to be processed simultaneously, improving computational efficiency and reducing processing time. | ✅ | ❌ | ➖ |
| `cacher` | Stores intermediate results of computations to speed up subsequent operations. | ✅ | ➖ | ➖ |
| `compiler` | Optimizes the model with instructions for specific hardware. | ✅ | ➖ | ➖ |
| `distiller` | Trains a smaller, simpler model to mimic a larger, more complex model. | ✅ | ✅ | ❌ |
| `quantizer` | Reduces the precision of weights and activations, lowering memory requirements. | ✅ | ✅ | ❌ |
| `pruner` | Removes less important or redundant connections and neurons, resulting in a sparser, more efficient network. | ✅ | ✅ | ❌ |
| `recoverer` | Restores the performance of a model after compression. | ➖ | ➖ | ✅ |
| `factorizer` | Batches several small matrix multiplications into one large fused operation. | ✅ | ➖ | ➖ |
| `enhancer` | Enhances the model output by applying post-processing algorithms such as denoising or upscaling. | ❌ | ➖ | ✅ |

✅ (improves), ➖ (approx. the same), ❌ (worsens)

Explore the full range of optimization methods in [the Pruna documentation](https://docs.pruna.ai/en/stable/docs_pruna/user_manual/configure.html#configure-algorithms).
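
Each technique family in the table maps to a key on a `SmashConfig`. As a minimal sketch (the algorithm names below are the ones used in the FLUX.1-dev example later on this page; see the Pruna documentation for the full list per family):

```python
from pruna import SmashConfig

# Each technique family from the table above is a SmashConfig key; the
# value selects a concrete algorithm from that family.
smash_config = SmashConfig()
smash_config["cacher"] = "fora"             # cacher: reuse intermediate results
smash_config["compiler"] = "torch_compile"  # compiler: hardware-specific compilation
```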
## Installation

You can install Pruna using the following command:

```bash
pip install pruna
```
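
You can quickly confirm the installation succeeded by querying the package version (a minimal check using only the Python standard library):

```python
from importlib.metadata import version

# Prints the installed Pruna version; raises PackageNotFoundError if the
# package is not installed.
print(version("pruna"))
```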

Now that Pruna is installed, you can start using it to optimize your models.

## Optimize diffusers models

You can optimize any `diffusers` model by defining a `SmashConfig`, which holds the configuration for the optimization.

Pruna supports a broad range of optimization algorithms for `diffusers` models. An overview of the supported algorithms is shown below.

<div class="flex justify-center">
<img src="https://huggingface.co/datasets/PrunaAI/documentation-images/resolve/main/diffusers/diffusers_combinations.png" alt="Overview of the supported optimization algorithms for diffusers models">
</div>

Let's take a look at an example of how to optimize [black-forest-labs/FLUX.1-dev](https://huggingface.co/black-forest-labs/FLUX.1-dev) with Pruna.

<div class="flex justify-center">
<img src="https://huggingface.co/datasets/PrunaAI/documentation-images/resolve/main/diffusers/flux_combination.png" alt="Optimization techniques used for FLUX.1-dev showing the combination of factorizer, compiler, and cacher algorithms">
</div>

This combination accelerates inference by up to 4.2× and cuts peak GPU memory usage from 34.7 GB to 28.0 GB, all while maintaining virtually the same output quality. If you want to learn more about the optimization techniques used in this example, have a look at [the Pruna documentation on optimization](https://docs.pruna.ai/en/stable/docs_pruna/user_manual/configure.html).
```python
import torch
from diffusers import FluxPipeline

from pruna import PrunaModel, SmashConfig, smash

# load the model
# Try segmind/Segmind-Vega or black-forest-labs/FLUX.1-schnell with a small GPU memory
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16
).to("cuda")

# define the configuration
smash_config = SmashConfig()
smash_config["factorizer"] = "qkv_diffusers"
smash_config["compiler"] = "torch_compile"
smash_config["torch_compile_target"] = "module_list"
smash_config["cacher"] = "fora"
smash_config["fora_interval"] = 2

# for the best results in terms of speed you can add these configs
# however they will increase your warmup time from 1.5 min to 10 min
# smash_config["torch_compile_mode"] = "max-autotune-no-cudagraphs"
# smash_config["quantizer"] = "torchao"
# smash_config["torchao_quant_type"] = "fp8dq"
# smash_config["torchao_excluded_modules"] = "norm+embedding"

# optimize the model
smashed_pipe = smash(pipe, smash_config)

# run the model
smashed_pipe("a knitted purple prune").images[0]

# save the model
smashed_pipe.save_to_hub("<username>/FLUX.1-dev-smashed")

# load the model
smashed_pipe = PrunaModel.from_hub("<username>/FLUX.1-dev-smashed")
```
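
The smashed pipeline keeps the regular `diffusers` call signature, so the usual generation arguments still apply. A minimal sketch (the parameter values here are illustrative, not tuned recommendations):

```python
# Generate with the usual FluxPipeline arguments; values are illustrative
image = smashed_pipe(
    "a knitted purple prune",
    num_inference_steps=28,
    guidance_scale=3.5,
).images[0]
image.save("prune.png")
```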

The generated image and the inference time for each optimization configuration are shown below.

<div class="flex justify-center">
<img src="https://huggingface.co/datasets/PrunaAI/documentation-images/resolve/main/diffusers/flux_smashed_comparison.png" alt="Comparison of the generated images and inference times for the different optimization configurations of FLUX.1-dev">
</div>

Besides the results shown above, we have also used Pruna to create [FLUX-juiced, the fastest image generation endpoint alive](https://www.pruna.ai/blog/flux-juiced-the-fastest-image-generation-endpoint). We benchmarked it against FLUX.1-dev versions provided by different inference frameworks and surpassed them all. Full results of this benchmark can be found in [our blog post](https://huggingface.co/blog/PrunaAI/flux-fastest-image-generation-endpoint) and [our InferBench space](https://huggingface.co/spaces/PrunaAI/InferBench).

As you can see, Pruna is a simple, easy-to-use framework that lets you optimize models with minimal effort. The results look good to the naked eye, but you can also use Pruna to benchmark and evaluate your optimized models.

## Evaluate and benchmark diffusers models

Pruna provides a simple way to evaluate the quality of your optimized models with the `EvaluationAgent`. If you want to learn more about evaluating optimized models, have a look at [the Pruna documentation on evaluation](https://docs.pruna.ai/en/stable/docs_pruna/user_manual/evaluate.html).

Let's take a look at an example of how to evaluate the quality of an optimized model by comparing it to the base model.
```python
import torch
from diffusers import FluxPipeline

from pruna import PrunaModel
from pruna.data.pruna_datamodule import PrunaDataModule
from pruna.evaluation.evaluation_agent import EvaluationAgent
from pruna.evaluation.metrics import (
    ThroughputMetric,
    TorchMetricWrapper,
    TotalTimeMetric,
)
from pruna.evaluation.task import Task

# define the device
device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"

# load the model
# Try PrunaAI/Segmind-Vega-smashed or PrunaAI/FLUX.1-dev-smashed with a small GPU memory
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16
).to("cpu")
wrapped_pipe = PrunaModel(model=pipe)
smashed_pipe = PrunaModel.from_hub("PrunaAI/FLUX.1-dev-smashed")

# Define the metrics
metrics = [
    TotalTimeMetric(n_iterations=20, n_warmup_iterations=5),
    ThroughputMetric(n_iterations=20, n_warmup_iterations=5),
    TorchMetricWrapper("clip"),
]

# Define the datamodule
datamodule = PrunaDataModule.from_string("LAION256")
datamodule.limit_datasets(10)

# Define the task and evaluation agent
task = Task(metrics, datamodule=datamodule, device=device)
eval_agent = EvaluationAgent(task)

# Evaluate base model and offload it to CPU
wrapped_pipe.move_to_device(device)
base_model_results = eval_agent.evaluate(wrapped_pipe)
wrapped_pipe.move_to_device("cpu")

# Evaluate smashed model and offload it to CPU
smashed_pipe.move_to_device(device)
smashed_model_results = eval_agent.evaluate(smashed_pipe)
smashed_pipe.move_to_device("cpu")
```
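
Both `evaluate` calls return the computed metric values. A minimal way to inspect them (the exact structure of the results object is a Pruna implementation detail, so plain printing is the safest assumption here):

```python
# Print both result sets side by side for a quick comparison
print("Base model:", base_model_results)
print("Smashed model:", smashed_model_results)
```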

### Evaluate and benchmark standalone diffusers models

Instead of comparing an optimized model to its base model, you can also evaluate a standalone `diffusers` model. This is useful if you want to benchmark a model without any optimization, and you can do so by wrapping it in a `PrunaModel`.

Let's take a look at an example of how to evaluate and benchmark a standalone `diffusers` model.
```python
import torch
from diffusers import FluxPipeline

from pruna import PrunaModel
from pruna.data.pruna_datamodule import PrunaDataModule
from pruna.evaluation.evaluation_agent import EvaluationAgent
from pruna.evaluation.metrics import (
    ThroughputMetric,
    TorchMetricWrapper,
    TotalTimeMetric,
)
from pruna.evaluation.task import Task

# define the device
device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"

# load the model
# Try PrunaAI/Segmind-Vega-smashed or PrunaAI/FLUX.1-dev-smashed with a small GPU memory
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16
).to("cpu")
wrapped_pipe = PrunaModel(model=pipe)

# Define the metrics
metrics = [
    TotalTimeMetric(n_iterations=20, n_warmup_iterations=5),
    ThroughputMetric(n_iterations=20, n_warmup_iterations=5),
    TorchMetricWrapper("clip"),
]

# Define the datamodule
datamodule = PrunaDataModule.from_string("LAION256")
datamodule.limit_datasets(10)

# Define the task and evaluation agent
task = Task(metrics, datamodule=datamodule, device=device)
eval_agent = EvaluationAgent(task)

# Evaluate base model and offload it to CPU
wrapped_pipe.move_to_device(device)
base_model_results = eval_agent.evaluate(wrapped_pipe)
wrapped_pipe.move_to_device("cpu")
```

Now that you have seen how to optimize and evaluate models, you can start using Pruna on your own models. Luckily, there are many examples to help you get started.
## Supported models

Pruna aims to support a wide range of `diffusers` models across different modalities, like text, image, audio, and video, and is constantly expanding its support. An overview of model and modality combinations that have been successfully optimized can be found on [the Pruna tutorial page](https://docs.pruna.ai/en/stable/docs_pruna/tutorials/index.html). Finally, Pruna also supports `transformers` models, as sketched below.
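
Since Pruna also supports `transformers` models, the same `smash` pattern applies. A minimal, hypothetical sketch (the model id and the choice of the `torchao` quantizer are assumptions for illustration, not an officially documented pairing; check the Pruna documentation for the algorithms supported per model type):

```python
import torch
from transformers import AutoModelForCausalLM

from pruna import SmashConfig, smash

# Hypothetical example: the model id and quantizer choice are illustrative
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct",
    torch_dtype=torch.bfloat16,
).to("cuda")

smash_config = SmashConfig()
smash_config["quantizer"] = "torchao"  # same quantizer family as the FLUX example above

smashed_model = smash(model, smash_config)
```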
## Reference

- [Pruna](https://github.com/pruna-ai/pruna)
- [Pruna optimization](https://docs.pruna.ai/en/stable/docs_pruna/user_manual/configure.html#configure-algorithms)
- [Pruna evaluation](https://docs.pruna.ai/en/stable/docs_pruna/user_manual/evaluate.html)
- [Pruna tutorials](https://docs.pruna.ai/en/stable/docs_pruna/tutorials/index.html)