Floating point 8 (FP8) is a promising data type for low precision quantization.
Intel Gaudi2, also known as HPU, provides this data type capability for low precision quantization, including the `E4M3` and `E5M2` formats. For more information about these two data types, please refer to [this paper](https://arxiv.org/abs/2209.05433).
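
As a quick illustration of the trade-off between the two formats (`E4M3` keeps more mantissa bits for precision, `E5M2` keeps more exponent bits for range), their numeric limits can be inspected through PyTorch's float8 dtypes. This is a minimal sketch assuming PyTorch >= 2.1, independent of Intel Neural Compressor:

```python
# Compare the numeric limits of the two FP8 formats via PyTorch's
# float8 dtypes (available since PyTorch 2.1). Shown only to
# illustrate the formats, not part of Intel Neural Compressor.
import torch

for dtype in (torch.float8_e4m3fn, torch.float8_e5m2):
    info = torch.finfo(dtype)
    # E4M3 trades dynamic range for precision; E5M2 does the opposite.
    print(f"{dtype}: max={info.max}, smallest normal={info.smallest_normal}, eps={info.eps}")
```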
To harness FP8 capability, which offers reduced memory usage and lower compute cost, Intel Neural Compressor provides general quantization APIs to generate FP8 models.
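
As a hedged sketch of what that flow can look like: the class and function names below (`FP8Config`, `prepare`, `convert` from `neural_compressor.torch.quantization`) follow the `neural_compressor.torch` flow but vary between releases, so verify them against your installed version; `MyModel` and `calib_loader` are placeholders.

```python
# Hedged sketch of post-training FP8 quantization with Intel Neural
# Compressor. API names (FP8Config, prepare, convert) may differ
# between releases; MyModel and calib_loader are placeholders.
import torch
from neural_compressor.torch.quantization import FP8Config, prepare, convert

model = MyModel().eval()               # placeholder FP32 PyTorch model
config = FP8Config(fp8_config="E4M3")  # pick E4M3 or E5M2

model = prepare(model, config)         # insert observers for calibration
with torch.no_grad():
    for batch in calib_loader:         # placeholder calibration data
        model(batch)                   # collect scale statistics
model = convert(model)                 # produce the FP8 model
```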
## Support Matrix
| Hardware | FP8 mode | FP8 QDQ mode |
| :------- | :------- | :----------- |
| HPU      | ✔        | ✔             |
| CPU      | ✕        | ✔             |
In FP8 mode, all tensors are represented in FP8 format and kernels are explicitly replaced with FP8 versions.
In FP8 QDQ mode, activations stay in high precision and quant/dequant pairs are inserted around operators. Frameworks can then compile and fuse the operators of an FP8 QDQ model based on their own capabilities.
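
To make the QDQ idea concrete, here is an illustrative quant/dequant pair, a minimal sketch using PyTorch's float8 dtype rather than Intel Neural Compressor's actual implementation: the tensor is scaled into FP8's representable range, cast to FP8, then immediately cast back, so downstream operators still run in high precision but observe the FP8 rounding error.

```python
# Illustrative FP8 quant/dequant (QDQ) pair; a sketch, not INC's code.
import torch

def fp8_qdq(x: torch.Tensor, dtype=torch.float8_e4m3fn) -> torch.Tensor:
    # Per-tensor scale that maps the max magnitude onto FP8's max value.
    scale = x.abs().max() / torch.finfo(dtype).max
    q = (x / scale).to(dtype)         # quantize: cast rounds/saturates to FP8
    return q.to(x.dtype) * scale      # dequantize: back to high precision

x = torch.randn(4, 4)
print((x - fp8_qdq(x)).abs().max())   # the rounding error FP8 introduces
```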
At runtime, Intel Neural Compressor detects the hardware automatically; the priority is HPU > CPU.