
Commit d04fb1c

add CPU FP8 QDQ doc (#2240)
Signed-off-by: Mengni Wang <mengni.wang@intel.com>
1 parent 2c8d8af commit d04fb1c

File tree

1 file changed (+19 −5)

docs/source/3x/PT_FP8Quant.md

Lines changed: 19 additions & 5 deletions
@@ -2,10 +2,11 @@ FP8 Quantization
 =======
 
 1. [Introduction](#introduction)
-2. [Supported Parameters](#supported-parameters)
-3. [Get Start with FP8 Quantization](#get-start-with-fp8-quantization)
-4. [Optimum-habana LLM example](#optimum-habana-LLM-example)
-5. [VLLM example](#VLLM-example)
+2. [Support Matrix](#support-matrix)
+3. [Supported Parameters](#supported-parameters)
+4. [Get Start with FP8 Quantization](#get-start-with-fp8-quantization)
+5. [Optimum-habana LLM example](#optimum-habana-LLM-example)
+6. [VLLM example](#VLLM-example)
 
 ## Introduction
 
@@ -17,7 +18,20 @@ Float point 8 (FP8) is a promising data type for low precision quantization which
 
 Intel Gaudi2, also known as HPU, provides this data type capability for low precision quantization, which includes `E4M3` and `E5M2`. For more information about these two data type, please refer to [link](https://arxiv.org/abs/2209.05433).
 
-Intel Neural Compressor provides general quantization APIs to leverage HPU FP8 capability. with simple with lower memory usage and lower compute cost, 8 bit model
+To harness FP8 capabilities, which offer reduced memory usage and lower computational cost, Intel Neural Compressor provides general quantization APIs to generate FP8 models.
+
+## Support Matrix
+
+| Hardware | FP8 mode | FP8 QDQ mode |
+| :------- | :------- | :----------- |
+| HPU      | &#10004; | &#10004;     |
+| CPU      | &#10005; | &#10004;     |
+
+In FP8 mode, all tensors are represented in FP8 format and kernels are explicitly replaced with their FP8 versions.
+
+In FP8 QDQ mode, activations remain in high precision and quant/dequant pairs are inserted; frameworks can compile and fuse the operators of an FP8 QDQ model based on their own capability.
+
+At runtime, Intel Neural Compressor detects the hardware automatically; the priority is HPU > CPU.
 
 ## Supported Parameters

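For context, the following is a minimal sketch of how the flow described in the new doc text might be driven from Python. It assumes the `FP8Config`, `prepare`, `convert`, and `finalize_calibration` entry points of `neural_compressor.torch.quantization` (as documented for the INC 3.x PyTorch API) and uses a torchvision ResNet purely as a stand-in model; it is illustrative only and not part of this commit.

```python
# Illustrative sketch, not part of this commit. Assumes the INC 3.x PyTorch API:
# FP8Config / prepare / convert / finalize_calibration.
import torch
import torchvision

from neural_compressor.torch.quantization import (
    FP8Config,
    convert,
    finalize_calibration,
    prepare,
)

# Any PyTorch model works; a torchvision ResNet-18 is used here as a stand-in.
model = torchvision.models.resnet18()

# E4M3 is one of the two FP8 formats mentioned in the doc (the other is E5M2).
config = FP8Config(fp8_config="E4M3")

# Insert observers / quant-dequant logic ahead of quantization.
model = prepare(model, config)

# Calibrate on a few representative batches (random tensors here, for brevity).
with torch.no_grad():
    for _ in range(8):
        model(torch.randn(1, 3, 224, 224))
finalize_calibration(model)

# Produce the FP8 model. Hardware is detected automatically (HPU preferred
# over CPU); on CPU this corresponds to the FP8 QDQ mode in the support matrix.
model = convert(model)
```

On a machine without an HPU, this should yield the FP8 QDQ form (quant/dequant pairs around high-precision operators), matching the CPU column of the support matrix added by this commit.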