Commit f773617

[Docs] Update README details for FP4 (#1519)

Summary: We should land this once the vLLM integration lands: vllm-project/vllm#18312

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
1 parent 341e27c commit f773617

1 file changed: +3 −2 lines changed

README.md

Lines changed: 3 additions & 2 deletions
@@ -16,14 +16,14 @@
 
 Big updates have landed in LLM Compressor! Check out these exciting new features:
 
-* **FP4 Weight Only Quantization Support:** Quantize weights to FP4 and seamlessly run the compressed model in vLLM. Model weights are quantized following the NVFP4 [configuration](https://github.com/neuralmagic/compressed-tensors/blob/1b6287a4b21c16e0842f32fadecb20bb4c0d4862/src/compressed_tensors/quantization/quant_scheme.py#L103). See an example [here](examples/quantization_w4a16_fp4/llama3_example.py).
+* **Preliminary FP4 Quantization Support:** Quantize weights and activations to FP4 and seamlessly run the compressed model in vLLM. Model weights and activations are quantized following the NVFP4 [configuration](https://github.com/neuralmagic/compressed-tensors/blob/f5dbfc336b9c9c361b9fe7ae085d5cb0673e56eb/src/compressed_tensors/quantization/quant_scheme.py#L104). See examples of [weight-only quantization](examples/quantization_w4a16_fp4/llama3_example.py) and [fp4 activation support](examples/quantization_w4a4_fp4/llama3_example.py). Support is currently preliminary and additional support will be added for MoEs.
 * **Axolotl Sparse Finetuning Integration:** Easily finetune sparse LLMs through our seamless integration with Axolotl. [Learn more here](https://docs.axolotl.ai/docs/custom_integrations.html#llmcompressor).
 * **AutoAWQ Integration:** Perform low-bit weight-only quantization efficiently using AutoAWQ, now part of LLM Compressor. *Note: This integration should be considered experimental for now. Enhanced support, including for MoE models and improved handling of larger models via layer sequential pipelining, is planned for upcoming releases.* [See the details](https://github.com/vllm-project/llm-compressor/pull/1177).
 * **Day 0 Llama 4 Support:** Meta utilized LLM Compressor to create the [FP8-quantized Llama-4-Maverick-17B-128E](https://huggingface.co/meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8), optimized for vLLM inference using [compressed-tensors](https://github.com/neuralmagic/compressed-tensors) format.
 
 ### Supported Formats
 * Activation Quantization: W8A8 (int8 and fp8)
-* Mixed Precision: W4A16, W8A16, NVFP4A16
+* Mixed Precision: W4A16, W8A16, NVFP4 (W4A4 and W4A16 support)
 * 2:4 Semi-structured and Unstructured Sparsity
 
 ### Supported Algorithms
@@ -51,6 +51,7 @@ pip install llmcompressor
 Applying quantization with `llmcompressor`:
 * [Activation quantization to `int8`](examples/quantization_w8a8_int8/README.md)
 * [Activation quantization to `fp8`](examples/quantization_w8a8_fp8/README.md)
+* [Activation quantization to `fp4`](examples/quantization_w4a4_fp4/llama3_example.py)
 * [Weight only quantization to `fp4`](examples/quantization_w4a16_fp4/llama3_example.py)
 * [Weight only quantization to `int4` using GPTQ](examples/quantization_w4a16/README.md)
 * [Weight only quantization to `int4` using AWQ](examples/awq/README.md)
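For context, the weight-only FP4 example linked in the diff (examples/quantization_w4a16_fp4/llama3_example.py) follows LLM Compressor's usual `QuantizationModifier` + `oneshot` flow. The snippet below is a minimal sketch of that flow, not the contents of this commit: the `NVFP4A16` preset name comes from the README's Supported Formats list, while the specific imports, arguments, and `save_compressed` flag follow the project's general example pattern and may differ between llm-compressor versions.

```python
# Sketch of weight-only FP4 (NVFP4A16) quantization with llmcompressor,
# modeled on the linked llama3_example.py; exact API details are assumptions
# and may vary between llm-compressor releases (older versions expose
# oneshot under llmcompressor.transformers instead of the top-level package).
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder model

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Quantize all Linear layers with the NVFP4A16 preset (FP4 weights,
# 16-bit activations), skipping lm_head as in the README examples.
recipe = QuantizationModifier(targets="Linear", scheme="NVFP4A16", ignore=["lm_head"])

oneshot(model=model, recipe=recipe)

# Save in compressed-tensors format so the checkpoint can be loaded in vLLM.
SAVE_DIR = MODEL_ID.split("/")[-1] + "-NVFP4A16"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```

The saved directory is intended to be served with vLLM, which is the integration the commit summary refers to (vllm-project/vllm#18312).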
