
[Performance] Parallelize modifier compression #1558


Draft: wants to merge 41 commits into base: main
Commits (41)
1aea4dd
wip: alignment context
kylesayrs Jun 3, 2025
6705bf4
touchups based on remaining steps
brian-dellabetta Jun 5, 2025
cf1f87d
implement oneshot_device, pipeline warnings
kylesayrs Jun 6, 2025
97c8d30
simplify example
kylesayrs Jun 6, 2025
ecfe15d
move offloading outside of preprocess, which is shared with train
kylesayrs Jun 6, 2025
6f86244
cleanup
kylesayrs Jun 6, 2025
929f678
update examples, remove offload devicemap utils
kylesayrs Jun 6, 2025
0348243
Merge remote-tracking branch 'origin' into kylesayrs/sequential-onloa…
kylesayrs Jun 6, 2025
a275f53
update examples to load before generating
kylesayrs Jun 6, 2025
9d6c227
remove hooks
kylesayrs Jun 10, 2025
fab6fe1
Merge remote-tracking branch 'origin' into kylesayrs/sequential-onloa…
kylesayrs Jun 10, 2025
6fdcdb1
Merge remote-tracking branch 'origin' into kylesayrs/sequential-onloa…
kylesayrs Jun 10, 2025
8351ac9
name change
kylesayrs Jun 10, 2025
ad71c5b
cleanup and nits
kylesayrs Jun 12, 2025
819df1c
rename function
kylesayrs Jun 12, 2025
6d942cc
Merge remote-tracking branch 'origin' into kylesayrs/sequential-onloa…
kylesayrs Jun 12, 2025
7dd71b9
add dispatch utility
kylesayrs Jun 12, 2025
8ba0f2c
apply style
kylesayrs Jun 12, 2025
fbf2a6d
update examples
kylesayrs Jun 13, 2025
91b349b
update examples 2
kylesayrs Jun 13, 2025
8e58e35
remove fallback_to_cpu, use ct utils
kylesayrs Jun 13, 2025
96631d1
remove hook from module within utils function
kylesayrs Jun 15, 2025
96476fe
remove unused util
kylesayrs Jun 15, 2025
2d87993
Merge remote-tracking branch 'origin' into kylesayrs/sequential-onloa…
kylesayrs Jun 16, 2025
cb965c9
docstring
kylesayrs Jun 16, 2025
8769b85
remove big model example tests
kylesayrs Jun 16, 2025
a389d14
big modeling example readme
kylesayrs Jun 16, 2025
b336fa2
deprecate sequential_targets on modifiers
kylesayrs Jun 16, 2025
34ef394
update examples
kylesayrs Jun 16, 2025
58fe929
fix deprecation warning
kylesayrs Jun 16, 2025
54ef06a
fix layer sequential pipeline
kylesayrs Jun 16, 2025
4bb86e5
remove unused import
kylesayrs Jun 16, 2025
b2367ce
dispatch in pipelines
kylesayrs Jun 16, 2025
06bb661
add train dispatch
kylesayrs Jun 16, 2025
a64a777
use remove_dispatch
kylesayrs Jun 16, 2025
8f71004
fix example
kylesayrs Jun 16, 2025
7d7b00d
remove device arg from e2e
kylesayrs Jun 16, 2025
501056e
simplify pipeline inference logic, add comment
kylesayrs Jun 16, 2025
74aa7c9
update examples imports
kylesayrs Jun 16, 2025
e4487e2
fix call
kylesayrs Jun 16, 2025
f134e56
wip: run compression in parallel
kylesayrs Jun 16, 2025
6 changes: 1 addition & 5 deletions examples/awq/README.md
@@ -18,11 +18,7 @@ recipe = [
 To use your own model, start with an existing example and change the `model_id` to match your own model stub.
 ```python
 model_id = "path/to/your/model"
-model = AutoModelForCausalLM.from_pretrained(
-    model_id,
-    device_map="auto",
-    torch_dtype="auto",
-)
+model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
 ```

 ## Adding Mappings ##
4 changes: 1 addition & 3 deletions examples/awq/llama_example.py
@@ -7,9 +7,7 @@
 # Select model and load it.
 MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"

-model = AutoModelForCausalLM.from_pretrained(
-    MODEL_ID, device_map="auto", torch_dtype="auto"
-)
+model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
 tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

 # Select calibration dataset.
6 changes: 3 additions & 3 deletions examples/awq/qwen3_moe_example.py
@@ -3,13 +3,12 @@

 from llmcompressor import oneshot
 from llmcompressor.modifiers.awq import AWQModifier
+from llmcompressor.utils import dispatch_for_generation

 # Select model and load it.
 MODEL_ID = "Qwen/Qwen3-30B-A3B"

-model = AutoModelForCausalLM.from_pretrained(
-    MODEL_ID, device_map="auto", torch_dtype="auto"
-)
+model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
 tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

 # Select calibration dataset.
@@ -71,6 +70,7 @@ def tokenize(sample):
 # Confirm generations of the quantized model look sane.
 print("\n\n")
 print("========== SAMPLE GENERATION ==============")
+dispatch_for_generation(model)
 input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to("cuda")
 output = model.generate(input_ids, max_new_tokens=100)
 print(tokenizer.decode(output[0]))
95 changes: 0 additions & 95 deletions examples/big_models_with_accelerate/README.md

This file was deleted.

26 changes: 0 additions & 26 deletions examples/big_models_with_accelerate/cpu_offloading_fp8.py

This file was deleted.

81 changes: 0 additions & 81 deletions examples/big_models_with_accelerate/mult_gpus_int8_device_map.py

This file was deleted.

78 changes: 0 additions & 78 deletions examples/big_models_with_accelerate/multi_gpu_int8.py

This file was deleted.

12 changes: 12 additions & 0 deletions examples/big_models_with_sequential_onloading/README.md
@@ -0,0 +1,12 @@
## Big Modeling with Sequential Onloading ##
### What is Sequential Onloading? ###
Sequential onloading is a memory-efficient approach for compressing large language models (LLMs) using only a single GPU. Instead of loading the entire model into memory—which can easily require hundreds of gigabytes—this method loads and compresses one layer at a time. The outputs are offloaded before the next layer is processed, dramatically reducing peak memory usage while maintaining high compression fidelity.

<p align="center">
<img src="assets/sequential_onloading.png"/>
</p>

For more information, see the [RedHat AI blog post](https://developers.redhat.com/articles/2025/05/09/llm-compressor-optimize-llms-low-latency-deployments#generalizing_to_multimodal_and_moe_architectures) or the [LLM Compressor Office Hours Recording](https://www.youtube.com/watch?v=GrhuqQDmBk8).
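
To make the idea concrete, here is a minimal, hypothetical sketch of layer-by-layer onloading. This is not LLM Compressor's actual pipeline; `compress_sequentially` and `compress_layer` are stand-ins for whatever per-layer calibration and compression is applied.

```python
# Conceptual sketch only -- not LLM Compressor's implementation.
import torch


def compress_sequentially(layers, calibration_inputs, compress_layer):
    """Compress a model one layer at a time, keeping only one layer on the GPU."""
    activations = calibration_inputs
    for layer in layers:
        layer.to("cuda")  # onload a single layer
        compress_layer(layer, activations)  # calibrate and compress it in isolation
        with torch.no_grad():
            # cache this layer's outputs as inputs for the next layer
            activations = [layer(x) for x in activations]
        layer.to("cpu")  # offload before the next layer is processed
        torch.cuda.empty_cache()
```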

### Using Sequential Onloading ###
Sequential onloading is enabled by default within LLM Compressor. To disable sequential onloading, add the `pipeline="basic"` argument to the LLM Compressor `oneshot` function call.
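
As a rough illustration of that toggle (the model, dataset, and recipe below are borrowed from other LLM Compressor examples and are not part of this change), a `oneshot` call with the sequential pipeline disabled might look like:

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

# Illustrative values; substitute your own model, dataset, and recipe.
oneshot(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    dataset="open_platypus",
    recipe=GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]),
    pipeline="basic",  # disable the default sequential onloading pipeline
)
```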
6 changes: 1 addition & 5 deletions examples/multimodal_audio/README.md
@@ -21,11 +21,7 @@ This directory contains example scripts for quantizing a variety of audio langua
 To use your own multimodal model, start with an existing example and change the `model_id` to match your own model stub.
 ```python3
 model_id = "path/to/your/model"
-model = AutoModelForCausalLM.from_pretrained(
-    model_id,
-    device_map="auto",
-    torch_dtype="auto",
-)
+model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
 ```

 ## Customizing GPTQModifier Parameters ##
9 changes: 3 additions & 6 deletions examples/multimodal_audio/whisper_example.py
@@ -4,15 +4,12 @@

 from llmcompressor import oneshot
 from llmcompressor.modifiers.quantization import GPTQModifier
+from llmcompressor.utils import dispatch_for_generation

 # Select model and load it.
 MODEL_ID = "openai/whisper-large-v3"

-model = WhisperForConditionalGeneration.from_pretrained(
-    MODEL_ID,
-    device_map="auto",
-    torch_dtype="auto",
-)
+model = WhisperForConditionalGeneration.from_pretrained(MODEL_ID, torch_dtype="auto")
 model.config.forced_decoder_ids = None
 processor = WhisperProcessor.from_pretrained(MODEL_ID)

@@ -91,13 +88,13 @@ def data_collator(batch):
 # Confirm generations of the quantized model look sane.
 print("\n\n")
 print("========== SAMPLE GENERATION ==============")
+dispatch_for_generation(model)
 sample_features = next(iter(ds))["input_features"]
 sample_decoder_ids = [processor.tokenizer.prefix_tokens]
 sample_input = {
     "input_features": torch.tensor(sample_features).to(model.device),
     "decoder_input_ids": torch.tensor(sample_decoder_ids).to(model.device),
 }
-
 output = model.generate(**sample_input, language="en")
 print(processor.batch_decode(output, skip_special_tokens=True))
 print("==========================================\n\n")