
Commit 6a5e420

Merge branch 'main' into gptq_tc
2 parents 200ceac + de8b1c5 commit 6a5e420


73 files changed: +962 -1054 lines


README.md

Lines changed: 5 additions & 2 deletions

```diff
@@ -14,11 +14,14 @@
 
 ## 🚀 What's New!
 
-Big updates have landed in LLM Compressor! Check out these exciting new features:
+Big updates have landed in LLM Compressor! To get a more in-depth look, check out the [deep-dive](https://x.com/RedHat_AI/status/1937865425687093554).
 
+Some of the exciting new features include:
+
+* **Large Model Support with Sequential Onloading:** As of llm-compressor>=0.6.0, you can now quantize very large language models on a single GPU. Models are broken into disjoint layers which are then onloaded to the GPU one layer at a time. For more information on sequential onloading, see [Big Modeling with Sequential Onloading](examples/big_models_with_sequential_onloading/README.md) as well as the [DeepSeek-R1 Example](examples/quantizing_moe/deepseek_r1_example.py).
 * **Preliminary FP4 Quantization Support:** Quantize weights and activations to FP4 and seamlessly run the compressed model in vLLM. Model weights and activations are quantized following the NVFP4 [configuration](https://github.com/neuralmagic/compressed-tensors/blob/f5dbfc336b9c9c361b9fe7ae085d5cb0673e56eb/src/compressed_tensors/quantization/quant_scheme.py#L104). See examples of [weight-only quantization](examples/quantization_w4a16_fp4/llama3_example.py) and [fp4 activation support](examples/quantization_w4a4_fp4/llama3_example.py). Support is currently preliminary and additional support will be added for MoEs.
+* **Updated AWQ Support:** Improved support for MoEs with better handling of larger models.
 * **Axolotl Sparse Finetuning Integration:** Seamlessly finetune sparse LLMs with our Axolotl integration. Learn how to create [fast sparse open-source models with Axolotl and LLM Compressor](https://developers.redhat.com/articles/2025/06/17/axolotl-meets-llm-compressor-fast-sparse-open). See also the [Axolotl integration docs](https://docs.axolotl.ai/docs/custom_integrations.html#llmcompressor).
-* **AutoAWQ Integration:** Perform low-bit weight-only quantization efficiently using AutoAWQ, now part of LLM Compressor. *Note: This integration should be considered experimental for now. Enhanced support, including for MoE models and improved handling of larger models via layer sequential pipelining, is planned for upcoming releases.* [See the details](https://github.com/vllm-project/llm-compressor/pull/1177).
 * **Day 0 Llama 4 Support:** Meta utilized LLM Compressor to create the [FP8-quantized Llama-4-Maverick-17B-128E](https://huggingface.co/meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8), optimized for vLLM inference using [compressed-tensors](https://github.com/neuralmagic/compressed-tensors) format.
 
 ### Supported Formats
```
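
The FP4 weight-only flow mentioned in the README bullet above can be sketched roughly as follows. This is a minimal, hedged sketch rather than the linked example verbatim: the `NVFP4A16` scheme name, the data-free `oneshot` call, and the model choice are assumptions and may differ from the actual `quantization_w4a16_fp4` example.

```python
# Hedged sketch of FP4 weight-only quantization; scheme name and data-free
# oneshot call are assumptions based on the linked example, not verbatim.
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # illustrative model choice
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Quantize all Linear weights to NVFP4, keeping the lm_head at full precision.
recipe = QuantizationModifier(targets="Linear", scheme="NVFP4A16", ignore=["lm_head"])
oneshot(model=model, recipe=recipe)

SAVE_DIR = model_id.split("/")[-1] + "-NVFP4A16"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```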

docs/observers.md

Lines changed: 93 additions & 0 deletions
@@ -0,0 +1,93 @@
# Observers Overview

An `Observer` in `llm-compressor` is a utility class responsible for analyzing tensors (e.g., weights, activations) and producing quantization parameters such as `scale` and `zero_point`. These observers are used by quantization modifiers to compute the statistics necessary for transforming tensors into lower precision formats.

Observers are designed to be flexible and support a variety of quantization strategies, including per-tensor, per-group, per-channel, and per-token quantization.

## Base Class

### [Observer](../src/llmcompressor/observers/base.py)
Base class for all observers. Subclasses must implement the `calculate_qparams` method to define how quantization parameters are computed.

The base class handles:
- Group-wise scale/zero_point computation
- Token-wise and channel-wise quantization logic
- Optional support for `g_idx` (group index mappings)
- Recording observed tokens for logging and analysis
- Resetting internal state during lifecycle transitions

This class is not used directly but provides the scaffolding for all custom observers.
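
For illustration, a custom observer might look like the sketch below. This is a minimal sketch, not part of the library: the `"absmax"` registry name is hypothetical, and it assumes the `calculate_qparams` hook and the `calculate_qparams` helper from `compressed_tensors` behave as they do for the built-in MinMax observer.

```python
from typing import Optional, Tuple

import torch
from compressed_tensors.quantization.utils import calculate_qparams
from llmcompressor.observers import Observer


@Observer.register("absmax")  # hypothetical name, selectable via `observer: absmax`
class AbsMaxObserver(Observer):
    """Toy observer that derives symmetric qparams from the absolute maximum."""

    def calculate_qparams(
        self,
        observed: torch.Tensor,
        reduce_dims: Optional[Tuple[int]] = None,
        **kwargs,  # absorb any extra arguments the base class may pass
    ):
        # Reduce over everything (per-tensor) or only the requested dims
        # (per-channel / per-group), mirroring the base-class conventions.
        if reduce_dims is None:
            absmax = observed.abs().max()
        else:
            absmax = observed.abs().amax(dim=reduce_dims, keepdim=True)

        # Assumed helper (used by the built-in observers): converts min/max
        # ranges into scale and zero_point for the given quantization args.
        return calculate_qparams(-absmax, absmax, self.quantization_args)
```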

## Implemented Observers

### [MinMax](../src/llmcompressor/observers/min_max.py)
Computes `scale` and `zero_point` by tracking the minimum and maximum of the observed tensor. This is the simplest and most common observer. It works well for both symmetric and asymmetric quantization.

Best used for:
- Int8 or Int4 symmetric quantization
- Channel-wise or group-wise strategies

### [MSE](../src/llmcompressor/observers/mse.py)
Computes quantization parameters by minimizing the Mean Squared Error (MSE) between the original and quantized tensor. Optionally maintains a moving average of min/max values for smoother convergence.

Best used when:
- Calibration accuracy is critical
- Quantization error needs to be tightly controlled
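
Swapping in the MSE observer programmatically mirrors the MinMax usage shown later in this document; a minimal sketch, assuming the observer is registered under the name `"mse"`:

```python
# Minimal sketch: loading the MSE observer from the registry; the "mse"
# registry name is assumed to match the module linked above.
import torch
from compressed_tensors.quantization.quant_args import QuantizationArgs
from llmcompressor.observers import Observer

args = QuantizationArgs(num_bits=4, strategy="group", group_size=128)
observer = Observer.load_from_registry("mse", quantization_args=args)

scale, zero_point = observer(torch.randn(64, 512))
```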

## Quantization Strategies

Observers support multiple quantization strategies via the `QuantizationArgs.strategy` field:

- `TENSOR`: Global scale and zero_point across the entire tensor.
- `GROUP`, `TENSOR_GROUP`: Slice the tensor into equal-sized groups along columns.
- `CHANNEL`: Per-channel quantization (e.g., across output dimensions).
- `TOKEN`: Quantize activations along token or sequence dimensions.
- `BLOCK`: *(Not yet implemented)* Placeholder for block-wise quantization.
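
To make the effect of `strategy` concrete, here is an illustrative sketch; the scale and zero_point shapes noted in the comments are assumptions about the observer output rather than documented guarantees:

```python
import torch
from compressed_tensors.quantization.quant_args import QuantizationArgs
from llmcompressor.observers import Observer

x = torch.randn(64, 512)  # e.g., a weight matrix with 64 output channels

for strategy, extra in [
    ("tensor", {}),                  # one scale/zero_point pair for the whole tensor
    ("channel", {}),                 # assumed: one pair per output channel, e.g. (64, 1)
    ("group", {"group_size": 128}),  # assumed: one pair per 128-column group, e.g. (64, 4)
]:
    args = QuantizationArgs(num_bits=4, strategy=strategy, **extra)
    observer = Observer.load_from_registry("minmax", quantization_args=args)
    scale, zero_point = observer(x)
    print(strategy, scale.shape, zero_point.shape)
```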

## Observer Configuration Parameters

Observers can be configured with optional keyword arguments that control their behavior. These are passed through the `QuantizationArgs.observer_kwargs` dictionary and parsed internally when the observer is initialized.

Below are the supported configuration parameters and their default values:

| Argument             | Default Value |
|----------------------|---------------|
| `maxshrink`          | `0.20`        |
| `patience`           | `5`           |
| `averaging_constant` | `0.01`        |
| `grid`               | `100.0`       |
| `norm`               | `2.0`         |
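
These keyword arguments can also be set programmatically; a minimal sketch, assuming `QuantizationArgs` exposes `observer` and `observer_kwargs` fields that mirror the YAML example below:

```python
from compressed_tensors.quantization.quant_args import QuantizationArgs

# Assumed fields: `observer` selects the registered observer, and
# `observer_kwargs` overrides the defaults listed in the table above.
args = QuantizationArgs(
    num_bits=4,
    strategy="channel",
    observer="mse",
    observer_kwargs={"maxshrink": 0.1, "patience": 10},
)
```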

## Example Usage

```python
import torch

from llmcompressor.observers import Observer
from compressed_tensors.quantization.quant_args import QuantizationArgs

args = QuantizationArgs(num_bits=4, strategy="group", group_size=128)
observer = Observer.load_from_registry("minmax", quantization_args=args)

x = torch.randn(64, 512)
scale, zero_point = observer(x)
```

## Example YAML Usage
```yaml
quantization_stage:
  quantization_modifiers:
    GPTQModifier:
      weights:
        observer: mse
        observer_kwargs:
          maxshrink: 0.1
          patience: 10
          averaging_constant: 0.05
          grid: 128.0
          norm: 2.0
        num_bits: 4
        type: int
        symmetric: true
        strategy: channel
      targets:
        - Linear
```
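
A recipe like the one above can be saved to a file and passed directly to `oneshot`. A minimal sketch follows; the recipe path, model, and dataset names are illustrative placeholders rather than part of this doc:

```python
from llmcompressor.transformers import oneshot

# Illustrative values: "observer_recipe.yaml" is the YAML above saved to disk,
# and the model/dataset choices are placeholders.
oneshot(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    dataset="open_platypus",
    recipe="observer_recipe.yaml",
    max_seq_length=2048,
    num_calibration_samples=512,
)
```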
Lines changed: 37 additions & 4 deletions

```diff
@@ -1,5 +1,5 @@
-## Big Modeling with Sequential Onloading ##
-### What is Sequential Onloading? ###
+# Big Modeling with Sequential Onloading #
+## What is Sequential Onloading? ##
 Sequential onloading is a memory-efficient approach for compressing large language models (LLMs) using only a single GPU. Instead of loading the entire model into memory—which can easily require hundreds of gigabytes—this method loads and compresses one layer at a time. The outputs are offloaded before the next layer is processed, dramatically reducing peak memory usage while maintaining high compression fidelity.
 
 <p align="center">
```
@@ -8,5 +8,38 @@

For more information, see the [RedHat AI blog post](https://developers.redhat.com/articles/2025/05/09/llm-compressor-optimize-llms-low-latency-deployments#generalizing_to_multimodal_and_moe_architectures) or the [LLM Compressor Office Hours Recording](https://www.youtube.com/watch?v=GrhuqQDmBk8).

## Using Sequential Onloading ##
Sequential onloading is enabled by default within LLM Compressor. To disable sequential onloading, add the `pipeline="basic"` argument to the LLM Compressor `oneshot` function call, as shown below.
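
For example, the call below is illustrative only: it reuses the `model`, `ds`, and `recipe` objects from the walkthrough further down, and the only point of interest is the added `pipeline` argument.

```python
# Illustrative: identical to the oneshot call in the walkthrough below,
# except that pipeline="basic" disables sequential onloading.
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    pipeline="basic",
)
```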

## Running Llama 3.3 70B ##
Llama 3.3 70B is larger than 80 GB, exceeding the memory capacity of a single A100 GPU. However, with sequential onloading, the model can still be quantized seamlessly using a single GPU.

### Code Walkthrough

```python
model_id = "meta-llama/Llama-3.3-70B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map=None)
```

The model is first loaded onto the CPU, as indicated by passing `device_map=None` to `from_pretrained`.

```python
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)
```

During `oneshot`, only one GPU is required; it is used to onload each layer sequentially for calibration.

```python
dispatch_for_generation(model)
sample = tokenizer("Hello my name is", return_tensors="pt")
sample = {key: value.to("cuda") for key, value in sample.items()}
output = model.generate(**sample, max_new_tokens=100)
print(tokenizer.decode(output[0]))
```

Finally, we call `dispatch_for_generation` to evenly load the model across available devices (potentially offloading the model if required) and run sample generations on the newly quantized model.
Lines changed: 87 additions & 0 deletions
@@ -0,0 +1,87 @@

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot
from llmcompressor.utils import dispatch_for_generation

# Select model and load it.
model_id = "meta-llama/Llama-3.3-70B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map=None,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Select calibration dataset.
DATASET_ID = "HuggingFaceH4/ultrachat_200k"
DATASET_SPLIT = "train_sft"

# Select number of samples. 512 samples is a good place to start.
# Increasing the number of samples can improve accuracy.
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048

# Load dataset and preprocess.
ds = load_dataset(DATASET_ID, split=f"{DATASET_SPLIT}[:{NUM_CALIBRATION_SAMPLES}]")
ds = ds.shuffle(seed=42)


def preprocess(example):
    return {
        "text": tokenizer.apply_chat_template(
            example["messages"],
            tokenize=False,
        )
    }


ds = ds.map(preprocess)


# Tokenize inputs.
def tokenize(sample):
    return tokenizer(
        sample["text"],
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )


ds = ds.map(tokenize, remove_columns=ds.column_names)

# Configure the quantization algorithm to run.
# * apply SmoothQuant to make the activations easier to quantize
# * quantize the weights to int8 with GPTQ (static per channel)
# * quantize the activations to int8 (dynamic per token)
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
]
# Apply algorithms.
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

# Confirm generations of the quantized model look sane.
print("\n\n")
print("========== SAMPLE GENERATION ==============")
dispatch_for_generation(model)
sample = tokenizer("Hello my name is", return_tensors="pt")
sample = {key: value.to("cuda") for key, value in sample.items()}
output = model.generate(**sample, max_new_tokens=100)
print(tokenizer.decode(output[0]))
print("==========================================\n\n")

# Save to disk compressed.
SAVE_DIR = model_id.rstrip("/").split("/")[-1] + "-W8A8"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```
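
As an optional follow-up (not part of the script above), the saved checkpoint can be loaded in vLLM for a quick sanity check. This is a hedged sketch that assumes vLLM is installed and that the directory name matches the `SAVE_DIR` computed above; the tensor-parallel size is illustrative.

```python
from vllm import LLM, SamplingParams

# "Llama-3.3-70B-Instruct-W8A8" is the SAVE_DIR produced by the script above.
# A 70B checkpoint will typically require tensor parallelism across several GPUs.
llm = LLM(model="Llama-3.3-70B-Instruct-W8A8", tensor_parallel_size=4)
outputs = llm.generate(["Hello my name is"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```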

examples/finetuning/example_alternating_recipe.yaml

Lines changed: 2 additions & 2 deletions

```diff
@@ -4,7 +4,7 @@ initial_sparsity_stage:
     SparseGPTModifier:
       sparsity: 0.5
       block_size: 128
-      percdamp: 0.01
+      dampening_frac: 0.01
       mask_structure: "0:0"
       targets: ["Linear"]
       ignore: ["re:.*lm_head"]
@@ -20,7 +20,7 @@ next_sparsity_stage:
     SparseGPTModifier:
       sparsity: 0.7
       block_size: 128
-      percdamp: 0.01
+      dampening_frac: 0.01
       mask_structure: "0:0"
       targets: ["Linear"]
       ignore: ["re:.*lm_head"]
```

examples/quantization_kv_cache/gemma2_fp8_kv_example.py

Lines changed: 2 additions & 1 deletion

```diff
@@ -87,11 +87,12 @@ def process_and_tokenize(example):
 # NOTE: transformers 4.49.0 results in a generation error with gemma2.
 # Consider either downgrading your transformers version to a previous version
 # or use vLLM for sample generation.
+# Note: compile is disabled: https://github.com/huggingface/transformers/issues/38333
 print("\n\n")
 dispatch_for_generation(model)
 print("========== SAMPLE GENERATION ==============")
 input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to("cuda")
-output = model.generate(input_ids, max_new_tokens=100)
+output = model.generate(input_ids, max_new_tokens=100, disable_compile=True)
 print(tokenizer.decode(output[0]))
 print("==========================================\n\n")
 
```

examples/quantization_w8a8_fp8/gemma2_example.py

Lines changed: 2 additions & 1 deletion

```diff
@@ -29,10 +29,11 @@
 # NOTE: transformers 4.49.0 results in a generation error with gemma2.
 # Consider either downgrading your transformers version to a previous version
 # or use vLLM for sample generation.
+# Note: compile is disabled: https://github.com/huggingface/transformers/issues/38333
 print("========== SAMPLE GENERATION ==============")
 dispatch_for_generation(model)
 input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to("cuda")
-output = model.generate(input_ids, max_new_tokens=20)
+output = model.generate(input_ids, max_new_tokens=20, disable_compile=True)
 print(tokenizer.decode(output[0]))
 print("==========================================")
 
```

Lines changed: 86 additions & 0 deletions
@@ -0,0 +1,86 @@

```python
from datasets import load_dataset
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

from llmcompressor.modeling import prepare_for_calibration
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

# Select model and load it.

# This script takes about 48 hours on 1xA100 to complete.
# Future improvements will reduce this runtime (#1561, #1558).

# For DeepSeek-R1, we require a full precision model in order to properly calibrate.
# `DeepSeek-R1-0528-BF16` is a DeepSeek-V3 FP8 model which has been converted to BF16.

model_id = "unsloth/DeepSeek-R1-0528-BF16"
config = AutoConfig.from_pretrained(model_id)
del config.quantization_config  # fp8 qconfig no longer applies to bf16 model
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", config=config
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = prepare_for_calibration(model)

# Select calibration dataset.
DATASET_ID = "HuggingFaceH4/ultrachat_200k"
DATASET_SPLIT = "train_sft"

# Select number of samples. 512 samples is a good place to start.
# Increasing the number of samples can improve accuracy.
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048

# Load dataset and preprocess.
ds = load_dataset(DATASET_ID, split=f"{DATASET_SPLIT}[:{NUM_CALIBRATION_SAMPLES}]")
ds = ds.shuffle(seed=42)


def preprocess(example):
    return {
        "text": tokenizer.apply_chat_template(
            example["messages"],
            tokenize=False,
        )
    }


ds = ds.map(preprocess)


# Tokenize inputs.
def tokenize(sample):
    return tokenizer(
        sample["text"],
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )


ds = ds.map(tokenize, remove_columns=ds.column_names)

# Configure the quantization algorithm to run.
# since the MoE gate layers are sensitive to quantization, we add them to the ignore
# list so they remain at full precision
recipe = GPTQModifier(
    targets="Linear", scheme="W4A16", ignore=["lm_head", "re:.*mlp.gate$"]
)

# Apply algorithms.
# due to the large size of DeepSeekV3, we specify sequential targets such that
# only one MLP is loaded into GPU memory at a time
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    sequential_targets=["DeepseekV3Attention", "DeepseekV3MLP"],
)

# Save to disk compressed.
SAVE_DIR = model_id.rstrip("/").split("/")[-1] + "-W4A16-G128"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```
