Commit f30da30

Merge branch 'vllm-project:main' into main
2 parents 5288bec + 53240c6

File tree: 39 files changed (+685, -227 lines)

Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
+name: prepare code coverage
+description: installs code coverage dependencies and exports an updated 'PYTEST_ADDOPTS' env var
+
+runs:
+  using: composite
+  steps:
+    - run: |-
+        # install dependencies
+        pip3 install coverage pytest-cov https://github.com/neuralmagic/pytest-nm-releng/archive/v0.4.0.tar.gz
+
+        # generate and source flags
+        FLAGS_FILE="coverage_flags.sh"
+        nmre-generate-coverage-flags --package "llmcompressor" --output-file "$FLAGS_FILE"
+        source "$FLAGS_FILE"
+        rm "$FLAGS_FILE"
+
+        # export defined/updated 'PYTEST_ADDOPTS' env var
+        echo "PYTEST_ADDOPTS=$PYTEST_ADDOPTS" | tee -a "$GITHUB_ENV"
+      shell: bash

.github/workflows/test-check-transformers.yaml

Lines changed: 21 additions & 2 deletions
@@ -1,10 +1,16 @@
 name: Test Checks (Transformers)
 on:
   pull_request:
-    branches: main
+    branches: [ main ]
     types: [ labeled, synchronize ]
   push:
-    branches: main
+    branches: [ main ]
+  workflow_dispatch:
+    inputs:
+      code_coverage:
+        description: if enabled, code coverage metrics will be collected during the test run
+        type: boolean
+        default: false
 
 env:
   CADENCE: "commit"
@@ -72,6 +78,9 @@ jobs:
           BUILD_TYPE=nightly pip3 install .
       - name: "Clean compressed-tensors directory"
         run: rm -r compressed-tensors/
+      - name: "⚙️ Prepare code coverage"
+        if: inputs.code_coverage
+        uses: ./.github/actions/prepare-code-coverage
       - name: "🔬 Running transformers tests"
         if: (success() || failure()) && steps.install.outcome == 'success'
         run: |
@@ -104,3 +113,13 @@ jobs:
         if: (success() || failure()) && steps.install.outcome == 'success'
         run: |
           pytest -v tests/llmcompressor/transformers/kv_cache
+      - name: "Upload coverage report"
+        if: (success() || failure()) && inputs.code_coverage
+        uses: actions/upload-artifact@v4
+        with:
+          name: transformers-tests-coverage-results
+          path: |
+            .coverage
+            coverage-html
+            coverage.json
+          retention-days: 5

.github/workflows/test-check.yaml

Lines changed: 46 additions & 0 deletions
@@ -4,6 +4,12 @@ on:
     branches:
       - main
   push:
+  workflow_dispatch:
+    inputs:
+      code_coverage:
+        description: if enabled, code coverage metrics will be collected during the test run
+        type: boolean
+        default: false
 
 env:
   CADENCE: "commit"
@@ -36,8 +42,21 @@ jobs:
           BUILD_TYPE=nightly pip3 install .
       - name: "Clean compressed-tensors directory"
         run: rm -r compressed-tensors/
+      - name: "⚙️ Prepare code coverage"
+        if: inputs.code_coverage
+        uses: ./.github/actions/prepare-code-coverage
       - name: "🔬 Running base tests"
         run: make test
+      - name: "Upload coverage report"
+        if: (success() || failure()) && inputs.code_coverage
+        uses: actions/upload-artifact@v4
+        with:
+          name: base-tests-coverage-results
+          path: |
+            .coverage
+            coverage-html
+            coverage.json
+          retention-days: 5
 
   pytorch-tests:
     runs-on: ubuntu-22.04
@@ -65,9 +84,23 @@ jobs:
           BUILD_TYPE=nightly pip3 install .
       - name: "Clean compressed-tensors directory"
         run: rm -r compressed-tensors/
+      - name: "⚙️ Prepare code coverage"
+        if: inputs.code_coverage
+        uses: ./.github/actions/prepare-code-coverage
       - name: "🔬 Running pytorch tests"
         run: |
           pytest -v tests/llmcompressor/pytorch
+      - name: "Upload coverage report"
+        if: (success() || failure()) && inputs.code_coverage
+        uses: actions/upload-artifact@v4
+        with:
+          name: pytorch-tests-coverage-results
+          path: |
+            .coverage
+            coverage-html
+            coverage.json
+          retention-days: 5
+
 
   compat-pytorch-1_9-pytorch-tests:
     runs-on: ubuntu-22.04
@@ -95,6 +128,19 @@ jobs:
          BUILD_TYPE=nightly pip3 install .
       - name: "Clean compressed-tensors directory"
         run: rm -r compressed-tensors/
+      - name: "⚙️ Prepare code coverage"
+        if: inputs.code_coverage
+        uses: ./.github/actions/prepare-code-coverage
       - name: "🔬 Running pytorch tests"
         run: |
           pytest -v tests/llmcompressor/pytorch
+      - name: "Upload coverage report"
+        if: (success() || failure()) && inputs.code_coverage
+        uses: actions/upload-artifact@v4
+        with:
+          name: compat-pytorch-tests-coverage-results
+          path: |
+            .coverage
+            coverage-html
+            coverage.json
+          retention-days: 5

README.md

Lines changed: 2 additions & 2 deletions
@@ -18,11 +18,11 @@ Big updates have landed in LLM Compressor! To get a more in-depth look, check ou
 
 Some of the exciting new features include:
 
-* **Large Model Support with Sequential Onloading** As of llm-compressor>=0.6.0, you can now quantize very large language models on a single GPU. Models are broken into disjoint layers which are then onloaded to the GPU one layer at a time. For more information on sequential onloading, see [Big Modeling with Sequential Onloading](examples/big_models_with_sequential_onloading/README.md) as well as the [DeepSeek-R1 Example](examples/quantizing_moe/deepseek_r1_example.py).
+* **Llama4 Quantization Support**: Quantize a Llama4 model to [W4A16](examples/multimodal_vision/llama4_example.py) or [NVFP4](examples/quantization_w4a4_fp4/llama4_example.py). The checkpoint produced can seamlessly run in vLLM.
+* **Large Model Support with Sequential Onloading**: As of llm-compressor>=0.6.0, you can now quantize very large language models on a single GPU. Models are broken into disjoint layers which are then onloaded to the GPU one layer at a time. For more information on sequential onloading, see [Big Modeling with Sequential Onloading](examples/big_models_with_sequential_onloading/README.md) as well as the [DeepSeek-R1 Example](examples/quantizing_moe/deepseek_r1_example.py).
 * **Preliminary FP4 Quantization Support:** Quantize weights and activations to FP4 and seamlessly run the compressed model in vLLM. Model weights and activations are quantized following the NVFP4 [configuration](https://github.com/neuralmagic/compressed-tensors/blob/f5dbfc336b9c9c361b9fe7ae085d5cb0673e56eb/src/compressed_tensors/quantization/quant_scheme.py#L104). See examples of [weight-only quantization](examples/quantization_w4a16_fp4/llama3_example.py) and [fp4 activation support](examples/quantization_w4a4_fp4/llama3_example.py). Support is currently preliminary and additional support will be added for MoEs.
 * **Updated AWQ Support:** Improved support for MoEs with better handling of larger models
 * **Axolotl Sparse Finetuning Integration:** Seamlessly finetune sparse LLMs with our Axolotl integration. Learn how to create [fast sparse open-source models with Axolotl and LLM Compressor](https://developers.redhat.com/articles/2025/06/17/axolotl-meets-llm-compressor-fast-sparse-open). See also the [Axolotl integration docs](https://docs.axolotl.ai/docs/custom_integrations.html#llmcompressor).
-* **Day 0 Llama 4 Support:** Meta utilized LLM Compressor to create the [FP8-quantized Llama-4-Maverick-17B-128E](https://huggingface.co/meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8), optimized for vLLM inference using [compressed-tensors](https://github.com/neuralmagic/compressed-tensors) format.
 
 ### Supported Formats
 * Activation Quantization: W8A8 (int8 and fp8)

examples/big_models_with_sequential_onloading/README.md

Lines changed: 2 additions & 2 deletions
@@ -18,7 +18,7 @@ The Llama 3.3 70b is larger than 80 GB, surpassing the size of 1 A100. However,
 
 ```python
 model_id = "meta-llama/Llama-3.3-70B-Instruct"
-model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
+model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map=None)
 ```
 
 The model is first loaded onto the `cpu`, as indicated through the use of `None` for the `device_map` argument in the `from_pretrained` method when loading the model.
@@ -42,4 +42,4 @@ output = model.generate(**sample, max_new_tokens=100)
 print(tokenizer.decode(output[0]))
 ```
 
-Finally, we call `dispatch_for_generation` to evenly load the model across available devices (potentially offloading the model if required) and run sample generations on the newly quantized model.
+Finally, we call `dispatch_for_generation` to evenly load the model across available devices (potentially offloading the model if required) and run sample generations on the newly quantized model.
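
For orientation, a minimal sketch of that final dispatch-and-generate step is shown below. It is not part of this commit, and the import path for `dispatch_for_generation` is an assumption that may differ between llm-compressor releases.

```python
# Minimal sketch of the final step described in the README above.
# Assumption: dispatch_for_generation is importable from llmcompressor.utils;
# verify the exact path against your installed llm-compressor version.
from llmcompressor.utils import dispatch_for_generation

# After oneshot() completes, the quantized model still resides on the CPU.
# dispatch_for_generation spreads it across the available GPUs (offloading
# layers if the model does not fit) so sample generation can run.
dispatch_for_generation(model)

sample = tokenizer("Hello my name is", return_tensors="pt").to(model.device)
output = model.generate(**sample, max_new_tokens=100)
print(tokenizer.decode(output[0]))
```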

examples/big_models_with_sequential_onloading/llama3.3_70b.py

Lines changed: 5 additions & 1 deletion
@@ -8,7 +8,11 @@
 
 # Select model and load it.
 model_id = "meta-llama/Llama-3.3-70B-Instruct"
-model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
+model = AutoModelForCausalLM.from_pretrained(
+    model_id,
+    torch_dtype="auto",
+    device_map=None,
+)
 tokenizer = AutoTokenizer.from_pretrained(model_id)
 
 # Select calibration dataset.

Lines changed: 92 additions & 0 deletions
@@ -0,0 +1,92 @@
+import torch
+from datasets import load_dataset
+from transformers import Llama4ForConditionalGeneration, Llama4Processor
+
+from llmcompressor import oneshot
+from llmcompressor.modeling import prepare_for_calibration
+from llmcompressor.modifiers.quantization import GPTQModifier
+
+# Select model and load it.
+model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"
+model = Llama4ForConditionalGeneration.from_pretrained(model_id, torch_dtype="auto")
+processor = Llama4Processor.from_pretrained(model_id)
+# We update `Llama4TextMoe` modules with custom `SequentialLlama4TextMoe`.
+# This change allows compatibility with vllm.
+# To apply your own custom module for experimentation, consider updating
+# `SequentialLlama4TextMoe` under llmcompressor/modeling/llama4.py
+model = prepare_for_calibration(model)
+
+DATASET_ID = "neuralmagic/calibration"
+NUM_CALIBRATION_SAMPLES = 512
+MAX_SEQUENCE_LENGTH = 8192
+
+ds = load_dataset(DATASET_ID, name="LLM", split=f"train[:{NUM_CALIBRATION_SAMPLES}]")
+
+
+def preprocess_function(example):
+    messages = []
+    for message in example["messages"]:
+        messages.append(
+            {
+                "role": message["role"],
+                "content": [{"type": "text", "text": message["content"]}],
+            }
+        )
+
+    return processor.apply_chat_template(
+        messages,
+        return_tensors="pt",
+        padding=False,
+        truncation=True,
+        max_length=MAX_SEQUENCE_LENGTH,
+        tokenize=True,
+        add_special_tokens=False,
+        return_dict=True,
+        add_generation_prompt=False,
+    )
+
+
+ds = ds.map(preprocess_function, batched=False, remove_columns=ds.column_names)
+
+
+def data_collator(batch):
+    assert len(batch) == 1
+    return {
+        key: torch.tensor(value)
+        if key != "pixel_values"
+        else torch.tensor(value, dtype=torch.bfloat16).squeeze(0)
+        for key, value in batch[0].items()
+    }
+
+
+# Configure the quantization algorithm to run.
+recipe = GPTQModifier(
+    targets="Linear",
+    scheme="W4A16",
+    ignore=[
+        "re:.*lm_head",
+        "re:.*self_attn",
+        "re:.*router",
+        "re:vision_model.*",
+        "re:multi_modal_projector.*",
+        "Llama4TextAttention",
+    ],
+)
+
+# Apply algorithms.
+# Due to the large size of Llama4, we specify sequential targets such that
+# only one MLP is loaded into GPU memory at a time.
+oneshot(
+    model=model,
+    dataset=ds,
+    recipe=recipe,
+    max_seq_length=MAX_SEQUENCE_LENGTH,
+    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
+    data_collator=data_collator,
+    sequential_targets=["Llama4TextMLP"],
+)
+
+# Save to disk compressed.
+SAVE_DIR = model_id.rstrip("/").split("/")[-1] + "-W4A16-G128"
+model.save_pretrained(SAVE_DIR, save_compressed=True)
+processor.save_pretrained(SAVE_DIR)
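
The README entry above notes that the W4A16 checkpoint produced by this example can run directly in vLLM. As a rough sketch (not part of this commit), offline inference with vLLM might look like the following; the directory name mirrors `SAVE_DIR` from the script, and the `tensor_parallel_size` value is an assumption about available hardware.

```python
# Sketch only: load the compressed W4A16 checkpoint produced above with
# vLLM's offline API. The directory name mirrors SAVE_DIR from the example;
# tensor_parallel_size=4 is an assumption about the number of GPUs available.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Llama-4-Scout-17B-16E-Instruct-W4A16-G128",
    tensor_parallel_size=4,
    max_model_len=8192,
)
sampling = SamplingParams(temperature=0.0, max_tokens=64)
outputs = llm.generate(["What does weight-only quantization change?"], sampling)
print(outputs[0].outputs[0].text)
```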

examples/quantization_2of4_sparse_w4a16/llama7b_sparse_w4a16.py

Lines changed: 5 additions & 0 deletions
@@ -1,3 +1,7 @@
+# NOTE: Fine tuning can require more steps than is shown in the example
+# See the Axolotl integration blog post for best fine tuning practices
+# https://developers.redhat.com/articles/2025/06/17/axolotl-meets-llm-compressor-fast-sparse-open
+
 from pathlib import Path
 
 import torch
@@ -74,6 +78,7 @@
 )
 
 # Sparse finetune
+# This step can be supplanted by fine tuning via integrated FT libraries such as Axolotl
 train(
     model=(output_path / "sparsity_stage"),
     **oneshot_kwargs,

Lines changed: 93 additions & 0 deletions
@@ -0,0 +1,93 @@
+import torch
+from datasets import load_dataset
+from transformers import Llama4ForConditionalGeneration, Llama4Processor
+
+from llmcompressor import oneshot
+from llmcompressor.modeling import prepare_for_calibration
+from llmcompressor.modifiers.quantization import QuantizationModifier
+
+# Select model and load it.
+model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"
+model = Llama4ForConditionalGeneration.from_pretrained(model_id, torch_dtype="auto")
+processor = Llama4Processor.from_pretrained(model_id)
+# We update `Llama4TextMoe` modules with custom `SequentialLlama4TextMoe`.
+# This change allows compatibility with vllm.
+# To apply your own custom module for experimentation, consider updating
+# `SequentialLlama4TextMoe` under llmcompressor/modeling/llama4.py
+model = prepare_for_calibration(model)
+
+DATASET_ID = "neuralmagic/calibration"
+NUM_CALIBRATION_SAMPLES = 20
+MAX_SEQUENCE_LENGTH = 8192
+
+ds = load_dataset(DATASET_ID, name="LLM", split=f"train[:{NUM_CALIBRATION_SAMPLES}]")
+
+
+def preprocess_function(example):
+    messages = []
+    for message in example["messages"]:
+        messages.append(
+            {
+                "role": message["role"],
+                "content": [{"type": "text", "text": message["content"]}],
+            }
+        )
+
+    return processor.apply_chat_template(
+        messages,
+        return_tensors="pt",
+        padding=False,
+        truncation=True,
+        max_length=MAX_SEQUENCE_LENGTH,
+        tokenize=True,
+        add_special_tokens=False,
+        return_dict=True,
+        add_generation_prompt=False,
+    )
+
+
+ds = ds.map(preprocess_function, batched=False, remove_columns=ds.column_names)
+
+
+def data_collator(batch):
+    assert len(batch) == 1
+    return {
+        key: torch.tensor(value)
+        if key != "pixel_values"
+        else torch.tensor(value, dtype=torch.bfloat16).squeeze(0)
+        for key, value in batch[0].items()
+    }
+
+
+# Configure the quantization algorithm to run.
+recipe = QuantizationModifier(
+    targets="Linear",
+    scheme="NVFP4",
+    ignore=[
+        "re:.*lm_head",
+        "re:.*self_attn",
+        "re:.*router",
+        "re:vision_model.*",
+        "re:multi_modal_projector.*",
+        "Llama4TextAttention",
+    ],
+)
+
+# Apply algorithms.
+# Due to the large size of Llama4, we specify sequential targets such that
+# only one MLP is loaded into GPU memory at a time.
+oneshot(
+    model=model,
+    dataset=ds,
+    recipe=recipe,
+    max_seq_length=MAX_SEQUENCE_LENGTH,
+    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
+    sequential_targets=["Llama4TextMLP"],
+    data_collator=data_collator,
+)
+
+
+# Save to disk compressed.
+SAVE_DIR = model_id.rstrip("/").split("/")[-1] + "-NVFP4"
+model.save_pretrained(SAVE_DIR)
+processor.save_pretrained(SAVE_DIR)
