
Commit a94ffd8

Merge branch 'main' into kylesayrs/transform-modifier
2 parents: b521cfa + 50bb656

File tree: 90 files changed (+868 additions, -1044 deletions)


CITATION.cff (8 additions, 0 deletions)

@@ -0,0 +1,8 @@
+cff-version: 1.2.0
+message: "If you use this software, please cite it as below."
+authors:
+- name: Red Hat AI
+- name: vLLM Project
+title: "LLM Compressor"
+date-released: 2024-08-08
+url: https://github.com/vllm-project/llm-compressor

Makefile (2 additions, 2 deletions)

@@ -26,14 +26,14 @@ quality:
 	@echo "Running python quality checks";
 	ruff check $(CHECKDIRS);
 	isort --check-only $(CHECKDIRS);
-	flake8 $(CHECKDIRS) --max-line-length 88 --extend-ignore E203;
+	flake8 $(CHECKDIRS) --max-line-length 88 --extend-ignore E203,W605;

 # style the code according to accepted standards for the repo
 style:
 	@echo "Running python styling";
 	ruff format $(CHECKDIRS);
 	isort $(CHECKDIRS);
-	flake8 $(CHECKDIRS) --max-line-length 88 --extend-ignore E203;
+	flake8 $(CHECKDIRS) --max-line-length 88 --extend-ignore E203,W605;

 # run tests for the repo
 test:
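
For reference, W605 is the pycodestyle "invalid escape sequence" warning that flake8 raises when a backslash escape such as `\d` appears in a non-raw string literal. The snippet below is a hypothetical illustration of what the new ignore entry suppresses; it is not taken from this commit.

```python
import re

# flake8 reports W605 here: "\d" is an invalid escape sequence in a normal string.
pattern = "\d+"

# The usual fix is a raw string, which needs no warning suppression at all.
raw_pattern = r"\d+"

print(re.findall(raw_pattern, "order 42, qty 7"))  # ['42', '7']
```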

README.md (18 additions, 3 deletions)

@@ -16,14 +16,14 @@

 Big updates have landed in LLM Compressor! Check out these exciting new features:

-* **FP4 Weight Only Quantization Support:** Quantize weights to FP4 and seamlessly run the compressed model in vLLM. Model weights are quantized following the NVFP4 [configuration](https://github.com/neuralmagic/compressed-tensors/blob/1b6287a4b21c16e0842f32fadecb20bb4c0d4862/src/compressed_tensors/quantization/quant_scheme.py#L103). See an example [here](examples/quantization_w4a16_fp4/llama3_example.py).
-* **Axolotl Sparse Finetuning Integration:** Easily finetune sparse LLMs through our seamless integration with Axolotl. [Learn more here](https://docs.axolotl.ai/docs/custom_integrations.html#llmcompressor).
+* **Preliminary FP4 Quantization Support:** Quantize weights and activations to FP4 and seamlessly run the compressed model in vLLM. Model weights and activations are quantized following the NVFP4 [configuration](https://github.com/neuralmagic/compressed-tensors/blob/f5dbfc336b9c9c361b9fe7ae085d5cb0673e56eb/src/compressed_tensors/quantization/quant_scheme.py#L104). See examples of [weight-only quantization](examples/quantization_w4a16_fp4/llama3_example.py) and [fp4 activation support](examples/quantization_w4a4_fp4/llama3_example.py). Support is currently preliminary and additional support will be added for MoEs.
+* **Axolotl Sparse Finetuning Integration:** Seamlessly finetune sparse LLMs with our Axolotl integration. Learn how to create [fast sparse open-source models with Axolotl and LLM Compressor](https://developers.redhat.com/articles/2025/06/17/axolotl-meets-llm-compressor-fast-sparse-open). See also the [Axolotl integration docs](https://docs.axolotl.ai/docs/custom_integrations.html#llmcompressor).
 * **AutoAWQ Integration:** Perform low-bit weight-only quantization efficiently using AutoAWQ, now part of LLM Compressor. *Note: This integration should be considered experimental for now. Enhanced support, including for MoE models and improved handling of larger models via layer sequential pipelining, is planned for upcoming releases.* [See the details](https://github.com/vllm-project/llm-compressor/pull/1177).
 * **Day 0 Llama 4 Support:** Meta utilized LLM Compressor to create the [FP8-quantized Llama-4-Maverick-17B-128E](https://huggingface.co/meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8), optimized for vLLM inference using [compressed-tensors](https://github.com/neuralmagic/compressed-tensors) format.

 ### Supported Formats
 * Activation Quantization: W8A8 (int8 and fp8)
-* Mixed Precision: W4A16, W8A16, NVFP4A16
+* Mixed Precision: W4A16, W8A16, NVFP4 (W4A4 and W4A16 support)
 * 2:4 Semi-structured and Unstructured Sparsity

 ### Supported Algorithms

@@ -51,6 +51,7 @@ pip install llmcompressor
 Applying quantization with `llmcompressor`:
 * [Activation quantization to `int8`](examples/quantization_w8a8_int8/README.md)
 * [Activation quantization to `fp8`](examples/quantization_w8a8_fp8/README.md)
+* [Activation quantization to `fp4`](examples/quantization_w4a4_fp4/llama3_example.py)
 * [Weight only quantization to `fp4`](examples/quantization_w4a16_fp4/llama3_example.py)
 * [Weight only quantization to `int4` using GPTQ](examples/quantization_w4a16/README.md)
 * [Weight only quantization to `int4` using AWQ](examples/awq/README.md)

@@ -119,3 +120,17 @@ output = model.generate("My name is")
 - If you have any questions or requests open an [issue](https://github.com/vllm-project/llm-compressor/issues) and we will add an example or documentation.
 - We appreciate contributions to the code, examples, integrations, and documentation as well as bug reports and feature requests! [Learn how here](CONTRIBUTING.md).
+
+## Citation
+
+If you find LLM Compressor useful in your research or projects, please consider citing it:
+
+```bibtex
+@software{llmcompressor2024,
+  title={{LLM Compressor}},
+  author={Red Hat AI and vLLM Project},
+  year={2024},
+  month={8},
+  url={https://github.com/vllm-project/llm-compressor},
+}
+```
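
For orientation, the FP4 weight-only path advertised in the README reduces to a short `oneshot` recipe. The sketch below follows the repository's usual `QuantizationModifier` pattern and uses the NVFP4A16 scheme name from the README's format list; the exact arguments in examples/quantization_w4a16_fp4/llama3_example.py may differ, so treat the scheme string and ignore list as assumptions rather than the canonical example.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# FP4 weights with 16-bit activations; scheme name assumed from the README's
# "NVFP4A16" format entry. lm_head is commonly left unquantized.
recipe = QuantizationModifier(targets="Linear", scheme="NVFP4A16", ignore=["lm_head"])
oneshot(model=model, recipe=recipe)

# Save in compressed-tensors format for vLLM.
SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-NVFP4A16"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```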

examples/awq/README.md (1 addition, 5 deletions)

@@ -18,11 +18,7 @@ recipe = [
 To use your own model, start with an existing example change the `model_id` to match your own model stub.
 ```python
 model_id = "path/to/your/model"
-model = AutoModelForCausalLM.from_pretrained(
-    model_id,
-    device_map="auto",
-    torch_dtype="auto",
-)
+model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
 ```

 ## Adding Mappings ##

examples/awq/llama_example.py (2 additions, 4 deletions)

@@ -7,9 +7,7 @@
 # Select model and load it.
 MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"

-model = AutoModelForCausalLM.from_pretrained(
-    MODEL_ID, device_map="auto", torch_dtype="auto"
-)
+model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
 tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

 # Select calibration dataset.

@@ -72,6 +70,6 @@ def tokenize(sample):
 print("==========================================\n\n")

 # Save to disk compressed.
-SAVE_DIR = MODEL_ID.split("/")[-1] + "-awq-asym"
+SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-awq-asym"
 model.save_pretrained(SAVE_DIR, save_compressed=True)
 tokenizer.save_pretrained(SAVE_DIR)
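
The `rstrip("/")` change guards against model identifiers given as local paths with a trailing slash, which would otherwise produce an empty save directory name. A quick illustration with hypothetical paths:

```python
# Without the guard, a trailing slash yields an empty basename.
print("path/to/your/model/".split("/")[-1])               # -> ""

# With the guard, the final path component is recovered as intended.
print("path/to/your/model/".rstrip("/").split("/")[-1])   # -> "model"
```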

examples/awq/qwen3_moe_example.py (4 additions, 4 deletions)

@@ -3,13 +3,12 @@

 from llmcompressor import oneshot
 from llmcompressor.modifiers.awq import AWQModifier
+from llmcompressor.utils import dispatch_for_generation

 # Select model and load it.
 MODEL_ID = "Qwen/Qwen3-30B-A3B"

-model = AutoModelForCausalLM.from_pretrained(
-    MODEL_ID, device_map="auto", torch_dtype="auto"
-)
+model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
 tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

 # Select calibration dataset.

@@ -71,12 +70,13 @@ def tokenize(sample):
 # Confirm generations of the quantized model look sane.
 print("\n\n")
 print("========== SAMPLE GENERATION ==============")
+dispatch_for_generation(model)
 input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to("cuda")
 output = model.generate(input_ids, max_new_tokens=100)
 print(tokenizer.decode(output[0]))
 print("==========================================\n\n")

 # Save to disk compressed.
-SAVE_DIR = MODEL_ID.split("/")[-1] + "-awq-sym"
+SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-awq-sym"
 model.save_pretrained(SAVE_DIR, save_compressed=True)
 tokenizer.save_pretrained(SAVE_DIR)
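
Taken together with the deleted `big_models_with_accelerate` examples below, this diff moves the examples away from loading with `device_map="auto"` and toward dispatching the model only when it is time to generate. The following is a minimal sketch of that load-then-dispatch pattern; the AWQ recipe and calibration steps are elided, and only the imports and calls visible in this diff are assumed.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor.utils import dispatch_for_generation

MODEL_ID = "Qwen/Qwen3-30B-A3B"

# Load without device_map="auto"; the compression pipeline manages placement.
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# ... apply the AWQModifier recipe with oneshot(...) here, as in the full example ...

# Dispatch the compressed model onto available GPUs only for sampling.
dispatch_for_generation(model)
input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to("cuda")
output = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(output[0]))
```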

examples/big_models_with_accelerate/README.md (0 additions, 95 deletions)
This file was deleted.

examples/big_models_with_accelerate/cpu_offloading_fp8.py (0 additions, 26 deletions)
This file was deleted.

examples/big_models_with_accelerate/mult_gpus_int8_device_map.py (0 additions, 81 deletions)
This file was deleted.
