Adjust tolerance for fp16 exp & gelu ops test to handle reasonable calculation discrepancies (#12150)
### Summary
This PR improves the exp_fp16 and gelu_fp16 tests by using a dynamic
tolerance strategy similar to the XNNPACK tolerance calculation for
validating float16 exponential kernels. Instead of relying on fixed
absolute and relative tolerances, the tests now calculate acceptable
error bounds based on the output magnitude and float16 precision
constraints. This change ensures correctness while accommodating the
inherent limitations of float16 arithmetic.
### Problem
While testing the float16 exponential kernel from XNNPACK against
PyTorch's eager-mode implementation, occasional errors occurred. The failures
were due to small mismatches between the output values, often in the
range of ~0.01 to ~0.015. These discrepancies occurred despite both
outputs being reasonably close when viewed through the lens of float16
precision. The original test used fixed tolerance values (atol=1e-3,
rtol=1e-3), which were too strict for float16 results, particularly for
inputs that produced large exponentials.
### Investigation
To understand the failures, I traced specific cases where discrepancies
occurred. For example, for the input 2.2715, PyTorch computes
exp(2.2715) in float32 and rounds the result to float16, yielding
9.6953. In contrast, XNNPACK uses float16-only arithmetic throughout its
kernel, computing a slightly lower value of 9.6797. The difference
between the two outputs is exactly 0.0156, or two ULPs (units in the last
place) at that magnitude in float16, where the spacing between adjacent
representable values is 0.0078125. This led me to
examine the structure of float16 and its numerical limits in detail.
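The reference side of this example can be reproduced directly in eager PyTorch (an illustrative sketch, not code from the test itself; the 9.6797 value comes from the lowered XNNPACK kernel and is not reproduced here):

```python
import torch

# Eager-mode reference path: exponentiate in float32, then round back to float16.
x = torch.tensor(2.2715, dtype=torch.float16)          # stored as 2.271484375
ref = torch.exp(x.to(torch.float32)).to(torch.float16)
print(ref.item())                                      # 9.6953125
```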
Further analysis revealed that IEEE 754 half-precision floating point
(float16) has a limited resolution — only 10 bits for the significand —
meaning the spacing between representable values increases with
magnitude. Near 1.0, the ULP is about 0.00098, but near 9.7 it rises to
0.0078, so a gap of 0.0156 spans only two representable steps. Given this,
it became clear that small absolute differences in
the output were not only expected but within the bounds of what float16
can actually represent.
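The growth in spacing can be checked quickly with NumPy (a small illustrative snippet, not part of the test):

```python
import numpy as np

# Spacing (ULP) between adjacent representable float16 values grows with magnitude.
print(np.spacing(np.float16(1.0)))   # ~0.000977 (2**-10)
print(np.spacing(np.float16(9.7)))   # ~0.0078   (2**-7)

# The two outputs from the failing case differ by exactly two such steps.
print(np.float16(9.6953) - np.float16(9.6797))   # ~0.0156
```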
To confirm the root cause, I reviewed the XNNPACK source code and
documentation. Their float16 exponential kernel uses a 2^z * 2^r
decomposition and evaluates a degree-3 polynomial using multiple steps
of float16 arithmetic exclusively, which accumulates rounding error at each step. More
importantly, I found that XNNPACK’s own test infrastructure accepts
outputs within a mixed tolerance of 2 × ε absolute and 6 × ε relative
error, where ε ≈ 9.77e-4 is the machine epsilon for float16. This
tolerance model is defined by their TolMixed function and effectively
allows up to ~6 ULPs of error, depending on the output value.
### Solution
This PR updates the exp_fp16 and gelu_fp16 tests to use the same
tolerance policy as XNNPACK. For float16 inputs, the test now computes
the reference output using float32 precision, then applies the following
tolerance calculation:
- Absolute tolerance: 2 × ε ≈ 0.00195
- Relative tolerance: 6 × ε ≈ 0.00586
- Final tolerance per output: max(atol, rtol × |y_ref|)
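A minimal sketch of how this policy can be applied in the test, assuming a float32 reference tensor and the float16 output cast to float32 for comparison; the helper name `assert_fp16_close` is hypothetical and not the exact code in this PR:

```python
import torch

FP16_EPS = torch.finfo(torch.float16).eps  # 2**-10 ≈ 9.77e-4

def assert_fp16_close(out: torch.Tensor, ref: torch.Tensor) -> None:
    # Mixed tolerance in the style of XNNPACK's TolMixed:
    # every element must satisfy |out - ref| <= max(atol, rtol * |ref|).
    atol = 2 * FP16_EPS   # ≈ 0.00195
    rtol = 6 * FP16_EPS   # ≈ 0.00586
    bound = torch.clamp(rtol * ref.abs(), min=atol)
    diff = (out.to(torch.float32) - ref).abs()
    assert torch.all(diff <= bound), f"max diff {diff.max().item()} exceeds tolerance"

# Example usage with exp: float16 inputs, float32 reference.
x = torch.linspace(-5.0, 5.0, 1024).to(torch.float16)
ref = torch.exp(x.to(torch.float32))
out = torch.exp(x)  # stand-in for the output of the lowered fp16 kernel
assert_fp16_close(out, ref)
```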
### Test plan
I added the new atol and rtol calculation to the test suite and ran the
exp_fp16 and gelu_fp16 tests with various random inputs to confirm that
they pass.