Commit a3446ae

sryap authored and spcyppt committed
Allow 1 mantissa bit diff in TestFused8BitRowwiseQuantizationConversion (#2015)
Summary: Pull Request resolved: #2015. The reference implementation of FP8 quantization is in Python, but the actual implementation is in C++/CUDA. Per summerdengfb's investigation, Python has a known floating-point representation issue (https://www.geeksforgeeks.org/floating-point-error-in-python/), which can cause discrepancies in the quantization results. To work around this, we allow a 1-bit difference (the LSB of the mantissa) in the FP8 quantization result in `TestFused8BitRowwiseQuantizationConversion`.

Reviewed By: q10, shintaro-iwasaki

Differential Revision: D49255499

fbshipit-source-id: b28294f8076bda61589e10699119375f03b091a8
1 parent 7927220

File tree

1 file changed: 4 additions, 1 deletion


fbgemm_gpu/test/quantize_ops_test.py

Lines changed: 4 additions & 1 deletion
@@ -118,7 +118,10 @@ def test_quantize_op(
         ncols_aligned = (ncols + 4 - 1) // 4 * 4
         # compare quantized data
         np.testing.assert_allclose(
-            quantized_data_numpy[:, :ncols], reference[:, :ncols]
+            quantized_data_numpy[:, :ncols],
+            reference[:, :ncols],
+            # Allow 1 mantissa bit difference (LSB)
+            atol=1,
         )
         # compare scales
         np.testing.assert_array_almost_equal(
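To illustrate the effect of the new tolerance, here is a minimal, self-contained sketch (not part of the commit; the array values are made up) showing that np.testing.assert_allclose with atol=1 accepts two quantized byte rows that differ only in the least significant bit of one element's stored byte, while the previous exact comparison would fail:

import numpy as np

# Hypothetical fused 8-bit rowwise quantized outputs: the C++/CUDA result
# differs from the Python reference by 1 in the stored byte of one element
# (the LSB of an FP8 mantissa packed into uint8).
reference = np.array([[60, 64, 66, 69]], dtype=np.uint8)
quantized = np.array([[61, 64, 66, 69]], dtype=np.uint8)  # first byte off by 1

# Cast to a signed type so the byte difference cannot wrap around.
ref = reference.astype(np.int16)
qnt = quantized.astype(np.int16)

# The old exact comparison (atol defaults to 0) would raise AssertionError:
#     np.testing.assert_allclose(qnt, ref)
# With atol=1, a 1-unit difference per stored byte is tolerated:
np.testing.assert_allclose(qnt, ref, atol=1)
print("quantized data matches the reference within 1 mantissa LSB")

Note that atol=1 on the raw bytes is a deliberately coarse bound: it permits any 1-unit difference in the stored representation, which in an FP8 encoding corresponds (up to carries into the exponent field) to flipping the mantissa's least significant bit.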
