Commit 74df12b
[TRTLLM-4480][doc] Documentation for new accuracy test suite and trtllm-eval (#3946)
* fix formula
* update doc
* fix
* 1st version
* polish
* fix

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
1 parent 4dfa3cc commit 74df12b

File tree

8 files changed: +426 −113 lines

examples/trtllm-eval/README.md

Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@
+# Accuracy Evaluation Tool `trtllm-eval`
+
+We provide a CLI tool `trtllm-eval` for evaluating model accuracy. It shares the core evaluation logic with the [accuracy test suite](../../tests/integration/defs/accuracy) of TensorRT-LLM.
+
+`trtllm-eval` is built on the offline API -- the [LLM API](https://nvidia.github.io/TensorRT-LLM/llm-api/index.html). It provides developers a unified entrypoint for accuracy evaluation. Compared with the online API [`trtllm-serve`](https://nvidia.github.io/TensorRT-LLM/commands/trtllm-serve.html), the offline API provides clearer error messages and simplifies the debugging workflow.
+
+`trtllm-eval` follows the CLI interface of [`trtllm-serve`](https://nvidia.github.io/TensorRT-LLM/commands/trtllm-serve.html).
+
+```bash
+pip install -r requirements.txt
+
+# Evaluate Llama-3.1-8B-Instruct on MMLU
+wget https://people.eecs.berkeley.edu/~hendrycks/data.tar && tar -xf data.tar
+trtllm-eval --model meta-llama/Llama-3.1-8B-Instruct mmlu --dataset_path data
+
+# Evaluate Llama-3.1-8B-Instruct on GSM8K
+trtllm-eval --model meta-llama/Llama-3.1-8B-Instruct gsm8k
+
+# Evaluate Llama-3.3-70B-Instruct on GPQA Diamond
+trtllm-eval --model meta-llama/Llama-3.3-70B-Instruct gpqa_diamond
+```
+
+The `--model` argument accepts either a Hugging Face model ID or a local checkpoint path. By default, `trtllm-eval` runs the model with the PyTorch backend; pass `--backend tensorrt` to switch to the TensorRT backend. Alternatively, the `--model` argument also accepts a local path to pre-built TensorRT engines; in that case, please pass the Hugging Face tokenizer path to the `--tokenizer` argument.
+
+See `trtllm-eval --help` for more details.
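For context, a minimal sketch of the offline LLM API that `trtllm-eval` builds on; the prompt and sampling settings here are illustrative only, so consult the LLM API docs linked above for the exact options:

```python
from tensorrt_llm import LLM, SamplingParams

# Offline, in-process evaluation: there is no server to launch, so failures
# surface directly in this process -- this is what simplifies debugging
# compared with the online trtllm-serve workflow.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
outputs = llm.generate(["The capital of France is"],
                       SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```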

examples/trtllm-eval/requirements.txt

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+lm_eval[api]==0.4.8

tensorrt_llm/evaluate/interface.py

Lines changed: 0 additions & 1 deletion
@@ -83,6 +83,5 @@ def evaluate(self,
         return score

     @staticmethod
-    @abstractmethod
     def command(ctx, *args, **kwargs) -> None:
         raise NotImplementedError()
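Removing `@abstractmethod` turns `command` into an optional hook: subclasses that expose a CLI entrypoint override it, while the rest inherit the `NotImplementedError` default without being forced to implement it. A minimal sketch of the resulting pattern (the class names here are hypothetical, not from the commit):

```python
from abc import ABC


class Evaluator(ABC):  # hypothetical stand-in for the real interface
    @staticmethod
    def command(ctx, *args, **kwargs) -> None:
        # Optional hook: without @abstractmethod, subclasses may skip it.
        raise NotImplementedError()


class MmluEvaluator(Evaluator):
    @staticmethod
    def command(ctx, *args, **kwargs) -> None:
        print(f"running mmlu evaluation with {ctx}")


MmluEvaluator.command({"model": "meta-llama/Llama-3.1-8B-Instruct"})
```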

tests/README.md

Lines changed: 6 additions & 4 deletions
@@ -55,7 +55,7 @@ pip install -r requirements-dev.txt
 cd tests/integration/defs

 # example 1: run a case
-pytest "accuracy/test_cli_flow.py::TestGpt2CnnDailymail::test_auto_dtype"
+pytest "accuracy/test_llm_api_pytorch.py::TestLlama3_1_8B::test_auto_dtype"

 # example 2: run a test list
 pytest --rootdir . --test-list=<a txt file that contains one test case per line>
@@ -98,7 +98,7 @@ For more options, refer to pytest --help, paying attention to Custom options add
 When you finish setting up the model directory, remember to mount it in the docker container.


-## 4. C++ runtime test
+## 3. C++ runtime test

 TRT-LLM C++ runtime tests use the [google-test](https://github.com/google/googletest) framework, and Pytest is used to run sets of these tests.
@@ -107,7 +107,7 @@ Pytest calls these scripts from fixtures prior to launching the test cases.

 Details on usage of the resources scripts can be found in the [C++ Test document](../cpp/tests/README.md).

-## 5. Performance regression test
+## 4. Performance regression test

 For performance regression testing in QA and CI, see the [performance test guide](./integration/README.md).
@@ -133,7 +133,9 @@ The priority is A10 > A30 > L40s > A100 > H100 > B200.

 Integration tests usually run the entire workflow, including checkpoint conversion, engine building, and evaluation, to check functionality and accuracy.

-Integration tests are stored in `integration/defs`. Once a new integration test case is added, the yml files must be updated to contain the newly added case. Otherwise, the CI will not be able to collect and run this case.
+Integration tests are stored in [`integration/defs`](./integration/defs). In particular, please see [`integration/defs/accuracy`](./integration/defs/accuracy) for more detailed guidance on adding accuracy tests.
+
+Once a new integration test case is added, the yml files must be updated to contain the newly added case. Otherwise, the CI will not be able to collect and run this case.

 ## 3. Add a unit test
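For orientation, a hypothetical sketch of what a test like `test_llm_api_pytorch.py::TestLlama3_1_8B::test_auto_dtype` might look like, combining the LLM API with the task classes from `accuracy_core.py` below; the harness and fixture details are assumptions, not part of this commit:

```python
from tensorrt_llm import LLM

from .accuracy_core import GSM8K  # task class defined in accuracy_core.py


class TestLlama3_1_8B:
    MODEL_NAME = "meta-llama/Llama-3.1-8B"  # assumed key into the reference YAML

    def test_auto_dtype(self):
        task = GSM8K(self.MODEL_NAME)
        with LLM(self.MODEL_NAME) as llm:
            # evaluate() runs the benchmark and checks the score against the
            # hypothesis-testing threshold derived from the reference entry.
            task.evaluate(llm)
```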

tests/integration/defs/accuracy/README.md

Lines changed: 288 additions & 26 deletions
Large diffs are not rendered by default.

tests/integration/defs/accuracy/accuracy_core.py

Lines changed: 24 additions & 16 deletions
@@ -37,24 +37,32 @@
 from ..trt_test_alternative import check_call, exists


+def compute_theta(num_samples: int,
+                  sigma: float,
+                  alpha: float = 0.05,
+                  beta: float = 0.2):
+    scale = (2 * sigma**2 / num_samples)**0.5
+
+    # Single-tail testing
+    z_alpha = scipy.stats.norm.ppf(alpha)
+    z_beta = scipy.stats.norm.ppf(beta)
+    theta = -(z_alpha + z_beta) * scale
+    return theta
+
+
 def compute_threshold(num_samples: int,
                       ref_accuracy: float,
                       sigma: float,
                       alpha: float = 0.05,
-                      beta: float = 0.2,
                       higher_is_better: bool = True):
     scale = (2 * sigma**2 / num_samples)**0.5

     # Single-tail testing
     z_alpha = scipy.stats.norm.ppf(alpha)
     if higher_is_better:
-        threshold = ref_accuracy + z_alpha * scale
+        return ref_accuracy + z_alpha * scale
     else:
-        threshold = ref_accuracy - z_alpha * scale
-
-    z_beta = scipy.stats.norm.ppf(beta)
-    theta = -(z_alpha + z_beta) * scale
-    return threshold, theta
+        return ref_accuracy - z_alpha * scale


 class AccuracyTask:
@@ -82,7 +90,7 @@ class AccuracyTask:

     def __init__(self, model_name: str):
         with open(f"{self.REFERENCE_DIR}/{self.DATASET}.yaml") as f:
-            self.reference = yaml.safe_load(f)[model_name]
+            self.reference: List[dict] = yaml.safe_load(f).get(model_name, [])

     def get_num_samples_and_threshold(self, **acc_specs):
         """Get num_samples and threshold via accuracy specifications.
@@ -116,12 +124,12 @@ def get_num_samples_and_threshold(self, **acc_specs):
         sigma = entry.get("sigma", self.SIGMA)
         num_samples = entry.get("num_samples", self.NUM_SAMPLES)
         higher_is_better = entry.get("higher_is_better", self.HIGHER_IS_BETTER)
-        threshold, theta = compute_threshold(num_samples,
-                                             accuracy,
-                                             sigma=sigma,
-                                             alpha=alpha,
-                                             beta=beta,
-                                             higher_is_better=higher_is_better)
+        theta = compute_theta(num_samples, sigma=sigma, alpha=alpha, beta=beta)
+        threshold = compute_threshold(num_samples,
+                                      accuracy,
+                                      sigma=sigma,
+                                      alpha=alpha,
+                                      higher_is_better=higher_is_better)
         print("===========================================================\n"
               "= ACCURACY HYPOTHESIS TESTING\n"
               "===========================================================\n"
@@ -256,7 +264,7 @@ class MMLU(AccuracyTask):
     DATASET = "mmlu"
     DATASET_DIR = f"{llm_models_root()}/datasets/mmlu"

-    ALPHA = 0.01
+    ALPHA = 0.05
     BETA = 0.2
     SIGMA = 50
     NUM_SAMPLES = 4096
@@ -273,7 +281,7 @@ class GSM8K(AccuracyTask):
     DATASET = "gsm8k"
     DATASET_DIR = f"{llm_models_root()}/datasets/openai/gsm8k"

-    ALPHA = 0.02
+    ALPHA = 0.05
     BETA = 0.2
     SIGMA = 50
     NUM_SAMPLES = 1319  # Full sample
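To make the hypothesis testing concrete, here is a small worked example using the two functions above with the GSM8K defaults from this file (`NUM_SAMPLES = 1319`, `SIGMA = 50`, `ALPHA = 0.05`, `BETA = 0.2`); the reference accuracy of 90 is arbitrary, chosen only for illustration:

```python
import scipy.stats


def compute_theta(num_samples, sigma, alpha=0.05, beta=0.2):
    scale = (2 * sigma**2 / num_samples)**0.5
    z_alpha = scipy.stats.norm.ppf(alpha)  # ≈ -1.645 for alpha = 0.05
    z_beta = scipy.stats.norm.ppf(beta)    # ≈ -0.842 for beta = 0.2
    return -(z_alpha + z_beta) * scale


def compute_threshold(num_samples, ref_accuracy, sigma, alpha=0.05,
                      higher_is_better=True):
    scale = (2 * sigma**2 / num_samples)**0.5
    z_alpha = scipy.stats.norm.ppf(alpha)
    if higher_is_better:
        return ref_accuracy + z_alpha * scale
    return ref_accuracy - z_alpha * scale


# GSM8K defaults; ref_accuracy = 90.0 is illustrative only.
theta = compute_theta(1319, sigma=50)                # ≈ 4.84 points
threshold = compute_threshold(1319, 90.0, sigma=50)  # ≈ 86.80
print(f"theta ≈ {theta:.2f}, threshold ≈ {threshold:.2f}")
```

In words: with 1319 samples and per-sample deviation `sigma = 50`, the test is powered (at `alpha = 0.05`, `beta = 0.2`) to detect a true accuracy regression of about 4.8 points, and an observed score below roughly 86.80 fails against a reference of 90.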
Lines changed: 82 additions & 0 deletions
@@ -0,0 +1,82 @@
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import argparse
+
+import pandas
+import scipy
+
+
+def compute_theta(num_samples: int,
+                  sigma: float,
+                  alpha: float = 0.05,
+                  beta: float = 0.2):
+    scale = (2 * sigma**2 / num_samples)**0.5
+
+    # Single-tail testing
+    z_alpha = scipy.stats.norm.ppf(alpha)
+    z_beta = scipy.stats.norm.ppf(beta)
+    theta = -(z_alpha + z_beta) * scale
+    return theta
+
+
+def compute_threshold(num_samples: int,
+                      ref_accuracy: float,
+                      sigma: float,
+                      alpha: float = 0.05,
+                      higher_is_better: bool = True):
+    scale = (2 * sigma**2 / num_samples)**0.5
+
+    # Single-tail testing
+    z_alpha = scipy.stats.norm.ppf(alpha)
+    if higher_is_better:
+        return ref_accuracy + z_alpha * scale
+    else:
+        return ref_accuracy - z_alpha * scale
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--num_samples_total", type=int, default=8192)
+    parser.add_argument("--sigma", type=float, default=50)
+    parser.add_argument("--alpha", type=float, default=0.05)
+    parser.add_argument("--beta", type=float, default=0.2)
+    args = parser.parse_args()
+
+    data = []
+    num_samples = 32
+    while num_samples < args.num_samples_total:
+        theta = compute_theta(num_samples,
+                              args.sigma,
+                              alpha=args.alpha,
+                              beta=args.beta)
+        threshold = compute_threshold(num_samples,
+                                      0,
+                                      args.sigma,
+                                      alpha=args.alpha)
+        data.append([num_samples, theta, threshold])
+        num_samples *= 2
+
+    num_samples = args.num_samples_total
+    theta = compute_theta(num_samples,
+                          args.sigma,
+                          alpha=args.alpha,
+                          beta=args.beta)
+    threshold = compute_threshold(num_samples, 0, args.sigma, alpha=args.alpha)
+    data.append([num_samples, theta, threshold])
+
+    df = pandas.DataFrame(
+        data, columns=['num_samples', 'theta', 'threshold-reference'])
+    df = df.set_index('num_samples')
+    print(df)
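As a quick sanity check of the table this script prints, the first row (`num_samples = 32` with the default `sigma = 50`, `alpha = 0.05`, `beta = 0.2`) can be reproduced by hand:

```python
import scipy.stats

# Reproduces the first table row (num_samples = 32) of the script above.
scale = (2 * 50**2 / 32)**0.5         # = 12.5
z_alpha = scipy.stats.norm.ppf(0.05)  # ≈ -1.645
z_beta = scipy.stats.norm.ppf(0.2)    # ≈ -0.842
print(-(z_alpha + z_beta) * scale)    # theta ≈ 31.08
print(0 + z_alpha * scale)            # threshold-reference ≈ -20.56
```

Both quantities shrink proportionally to 1/sqrt(num_samples), which is the trade-off the table exposes: doubling the sample count tightens theta and the threshold gap by a factor of about 1.41.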

tests/integration/defs/accuracy/scripts/generate_thresholds.py

Lines changed: 0 additions & 66 deletions
This file was deleted.
