Commit a3446ae

sryap authored and spcyppt committed
Allow 1 mantissa bit diff in TestFused8BitRowwiseQuantizationConversion (#2015)
Summary: Pull Request resolved: #2015. The reference implementation of FP8 quantization is in Python, but the actual implementation is in C++/CUDA. Per summerdengfb's investigation, Python has a known floating-point representation issue (https://www.geeksforgeeks.org/floating-point-error-in-python/), which can cause discrepancies in the quantization results. To work around this, we allow a 1-bit difference (the LSB of the mantissa) in the FP8 quantization result in `TestFused8BitRowwiseQuantizationConversion`.

Reviewed By: q10, shintaro-iwasaki

Differential Revision: D49255499

fbshipit-source-id: b28294f8076bda61589e10699119375f03b091a8
1 parent 7927220

File tree

1 file changed: 4 additions, 1 deletion


fbgemm_gpu/test/quantize_ops_test.py

Lines changed: 4 additions & 1 deletion
@@ -118,7 +118,10 @@ def test_quantize_op(
         ncols_aligned = (ncols + 4 - 1) // 4 * 4
         # compare quantized data
         np.testing.assert_allclose(
-            quantized_data_numpy[:, :ncols], reference[:, :ncols]
+            quantized_data_numpy[:, :ncols],
+            reference[:, :ncols],
+            # Allow 1 mantissa bit difference (LSB)
+            atol=1,
         )
         # compare scales
         np.testing.assert_array_almost_equal(
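To illustrate the effect of the new tolerance, here is a minimal, self-contained sketch (not part of the commit; the array values are made up) showing that np.testing.assert_allclose with atol=1 accepts two quantized byte rows that differ only in the least significant bit of one element's stored byte, while the previous exact comparison would fail:

import numpy as np

# Hypothetical fused 8-bit rowwise quantized outputs: the C++/CUDA result
# differs from the Python reference by 1 in the stored byte of one element
# (the LSB of an FP8 mantissa packed into uint8).
reference = np.array([[60, 64, 66, 69]], dtype=np.uint8)
quantized = np.array([[61, 64, 66, 69]], dtype=np.uint8)  # first byte off by 1

# Cast to a signed type so the byte difference cannot wrap around.
ref = reference.astype(np.int16)
qnt = quantized.astype(np.int16)

# The old exact comparison (atol defaults to 0) would raise AssertionError:
#     np.testing.assert_allclose(qnt, ref)
# With atol=1, a 1-unit difference per stored byte is tolerated:
np.testing.assert_allclose(qnt, ref, atol=1)
print("quantized data matches the reference within 1 mantissa LSB")

Note that atol=1 on the raw bytes is a deliberately coarse bound: it permits any 1-unit difference in the stored representation, which in an FP8 encoding corresponds (up to carries into the exponent field) to flipping the mantissa's least significant bit.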
