Commit 74df12b
[TRTLLM-4480][doc] Documentation for new accuracy test suite and trtllm-eval (#3946)
* fix formula
* update doc
* fix
* 1st version
* polish
* fix

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
1 parent 4dfa3cc commit 74df12b

File tree

8 files changed: +426 −113 lines

examples/trtllm-eval/README.md

Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@
+# Accuracy Evaluation Tool `trtllm-eval`
+
+We provide a CLI tool `trtllm-eval` for evaluating model accuracy. It shares the core evaluation logic with the [accuracy test suite](../../tests/integration/defs/accuracy) of TensorRT-LLM.
+
+`trtllm-eval` is built on the offline API -- the [LLM API](https://nvidia.github.io/TensorRT-LLM/llm-api/index.html). It provides developers a unified entrypoint for accuracy evaluation. Compared with the online API [`trtllm-serve`](https://nvidia.github.io/TensorRT-LLM/commands/trtllm-serve.html), the offline API provides clearer error messages and simplifies the debugging workflow.
+
+`trtllm-eval` follows the CLI interface of [`trtllm-serve`](https://nvidia.github.io/TensorRT-LLM/commands/trtllm-serve.html).
+
+```bash
+pip install -r requirements.txt
+
+# Evaluate Llama-3.1-8B-Instruct on MMLU
+wget https://people.eecs.berkeley.edu/~hendrycks/data.tar && tar -xf data.tar
+trtllm-eval --model meta-llama/Llama-3.1-8B-Instruct mmlu --dataset_path data
+
+# Evaluate Llama-3.1-8B-Instruct on GSM8K
+trtllm-eval --model meta-llama/Llama-3.1-8B-Instruct gsm8k
+
+# Evaluate Llama-3.3-70B-Instruct on GPQA Diamond
+trtllm-eval --model meta-llama/Llama-3.3-70B-Instruct gpqa_diamond
+```
+
+The `--model` argument accepts either a Hugging Face model ID or a local checkpoint path. By default, `trtllm-eval` runs the model with the PyTorch backend; pass `--backend tensorrt` to switch to the TensorRT backend. Alternatively, the `--model` argument also accepts a local path to pre-built TensorRT engines; in that case, please pass the Hugging Face tokenizer path to the `--tokenizer` argument.
+
+See `trtllm-eval --help` for more details.
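For context, a minimal sketch of the offline LLM API that `trtllm-eval` builds on; the prompt and sampling settings here are illustrative only, so consult the LLM API docs linked above for the exact options:

```python
from tensorrt_llm import LLM, SamplingParams

# Offline, in-process evaluation: there is no server to launch, so failures
# surface directly in this process -- this is what simplifies debugging
# compared with the online trtllm-serve workflow.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
outputs = llm.generate(["The capital of France is"],
                       SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```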

examples/trtllm-eval/requirements.txt

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+lm_eval[api]==0.4.8

tensorrt_llm/evaluate/interface.py

Lines changed: 0 additions & 1 deletion
@@ -83,6 +83,5 @@ def evaluate(self,
         return score

     @staticmethod
-    @abstractmethod
     def command(ctx, *args, **kwargs) -> None:
         raise NotImplementedError()
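Removing `@abstractmethod` turns `command` into an optional hook: subclasses that expose a CLI entrypoint override it, while the rest inherit the `NotImplementedError` default without being forced to implement it. A minimal sketch of the resulting pattern (the class names here are hypothetical, not from the commit):

```python
from abc import ABC


class Evaluator(ABC):  # hypothetical stand-in for the real interface
    @staticmethod
    def command(ctx, *args, **kwargs) -> None:
        # Optional hook: without @abstractmethod, subclasses may skip it.
        raise NotImplementedError()


class MmluEvaluator(Evaluator):
    @staticmethod
    def command(ctx, *args, **kwargs) -> None:
        print(f"running mmlu evaluation with {ctx}")


MmluEvaluator.command({"model": "meta-llama/Llama-3.1-8B-Instruct"})
```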

tests/README.md

Lines changed: 6 additions & 4 deletions
@@ -55,7 +55,7 @@ pip install -r requirements-dev.txt
 cd tests/integration/defs

 # example 1: run a case
-pytest "accuracy/test_cli_flow.py::TestGpt2CnnDailymail::test_auto_dtype"
+pytest "accuracy/test_llm_api_pytorch.py::TestLlama3_1_8B::test_auto_dtype"

 # example 2: run a test list
 pytest --rootdir . --test-list=<a txt file that contains one test case per line>
@@ -98,7 +98,7 @@ For more options, refer to pytest --help, paying attention to Custom options add
 When you finish setting up the model directory, remember to mount it in the docker container.


-## 4. C++ runtime test
+## 3. C++ runtime test

 TRT-LLM C++ runtime tests use the [google-test](https://github.com/google/googletest) framework, and Pytest is used to run sets of these tests.
@@ -107,7 +107,7 @@ Pytest calls these scripts from fixtures prior to launching the test cases.

 Details on usage of the resources scripts can be found in the [C++ Test document](../cpp/tests/README.md).

-## 5. Performance regression test
+## 4. Performance regression test

 For performance regression testing in QA and CI, see the [performance test guide](./integration/README.md).
@@ -133,7 +133,9 @@ The priority is A10 > A30 > L40s > A100 > H100 > B200.

 Integration tests usually run the entire workflow, including checkpoint conversion, engine building, and evaluation, to check functionality and accuracy.

-Integration tests are stored in `integration/defs`. Once a new integration test case is added, the yml files must be updated to contain the newly added case. Otherwise, the CI will not be able to collect and run this case.
+Integration tests are stored in [`integration/defs`](./integration/defs). In particular, please see [`integration/defs/accuracy`](./integration/defs/accuracy) for more detailed guidance on adding accuracy tests.
+
+Once a new integration test case is added, the yml files must be updated to contain the newly added case. Otherwise, the CI will not be able to collect and run this case.

 ## 3. Add a unit test
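For orientation, a hypothetical sketch of what a test like `test_llm_api_pytorch.py::TestLlama3_1_8B::test_auto_dtype` might look like, combining the LLM API with the task classes from `accuracy_core.py` below; the harness and fixture details are assumptions, not part of this commit:

```python
from tensorrt_llm import LLM

from .accuracy_core import GSM8K  # task class defined in accuracy_core.py


class TestLlama3_1_8B:
    MODEL_NAME = "meta-llama/Llama-3.1-8B"  # assumed key into the reference YAML

    def test_auto_dtype(self):
        task = GSM8K(self.MODEL_NAME)
        with LLM(self.MODEL_NAME) as llm:
            # evaluate() runs the benchmark and checks the score against the
            # hypothesis-testing threshold derived from the reference entry.
            task.evaluate(llm)
```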

tests/integration/defs/accuracy/README.md

Lines changed: 288 additions & 26 deletions
Large diffs are not rendered by default.

tests/integration/defs/accuracy/accuracy_core.py

Lines changed: 24 additions & 16 deletions
@@ -37,24 +37,32 @@
 from ..trt_test_alternative import check_call, exists


+def compute_theta(num_samples: int,
+                  sigma: float,
+                  alpha: float = 0.05,
+                  beta: float = 0.2):
+    scale = (2 * sigma**2 / num_samples)**0.5
+
+    # Single-tail testing
+    z_alpha = scipy.stats.norm.ppf(alpha)
+    z_beta = scipy.stats.norm.ppf(beta)
+    theta = -(z_alpha + z_beta) * scale
+    return theta
+
+
 def compute_threshold(num_samples: int,
                       ref_accuracy: float,
                       sigma: float,
                       alpha: float = 0.05,
-                      beta: float = 0.2,
                       higher_is_better: bool = True):
     scale = (2 * sigma**2 / num_samples)**0.5

     # Single-tail testing
     z_alpha = scipy.stats.norm.ppf(alpha)
     if higher_is_better:
-        threshold = ref_accuracy + z_alpha * scale
+        return ref_accuracy + z_alpha * scale
     else:
-        threshold = ref_accuracy - z_alpha * scale
-
-    z_beta = scipy.stats.norm.ppf(beta)
-    theta = -(z_alpha + z_beta) * scale
-    return threshold, theta
+        return ref_accuracy - z_alpha * scale


 class AccuracyTask:
@@ -82,7 +90,7 @@ class AccuracyTask:

     def __init__(self, model_name: str):
         with open(f"{self.REFERENCE_DIR}/{self.DATASET}.yaml") as f:
-            self.reference = yaml.safe_load(f)[model_name]
+            self.reference: List[dict] = yaml.safe_load(f).get(model_name, [])

     def get_num_samples_and_threshold(self, **acc_specs):
         """Get num_samples and threshold via accuracy specifications.
@@ -116,12 +124,12 @@ def get_num_samples_and_threshold(self, **acc_specs):
         sigma = entry.get("sigma", self.SIGMA)
         num_samples = entry.get("num_samples", self.NUM_SAMPLES)
         higher_is_better = entry.get("higher_is_better", self.HIGHER_IS_BETTER)
-        threshold, theta = compute_threshold(num_samples,
-                                             accuracy,
-                                             sigma=sigma,
-                                             alpha=alpha,
-                                             beta=beta,
-                                             higher_is_better=higher_is_better)
+        theta = compute_theta(num_samples, sigma=sigma, alpha=alpha, beta=beta)
+        threshold = compute_threshold(num_samples,
+                                      accuracy,
+                                      sigma=sigma,
+                                      alpha=alpha,
+                                      higher_is_better=higher_is_better)
         print("===========================================================\n"
               "= ACCURACY HYPOTHESIS TESTING\n"
               "===========================================================\n"
@@ -256,7 +264,7 @@ class MMLU(AccuracyTask):
     DATASET = "mmlu"
     DATASET_DIR = f"{llm_models_root()}/datasets/mmlu"

-    ALPHA = 0.01
+    ALPHA = 0.05
     BETA = 0.2
     SIGMA = 50
     NUM_SAMPLES = 4096
@@ -273,7 +281,7 @@ class GSM8K(AccuracyTask):
     DATASET = "gsm8k"
     DATASET_DIR = f"{llm_models_root()}/datasets/openai/gsm8k"

-    ALPHA = 0.02
+    ALPHA = 0.05
     BETA = 0.2
     SIGMA = 50
     NUM_SAMPLES = 1319  # Full sample
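To make the hypothesis testing concrete, here is a small worked example using the two functions above with the GSM8K defaults from this file (`NUM_SAMPLES = 1319`, `SIGMA = 50`, `ALPHA = 0.05`, `BETA = 0.2`); the reference accuracy of 90 is arbitrary, chosen only for illustration:

```python
import scipy.stats


def compute_theta(num_samples, sigma, alpha=0.05, beta=0.2):
    scale = (2 * sigma**2 / num_samples)**0.5
    z_alpha = scipy.stats.norm.ppf(alpha)  # ≈ -1.645 for alpha = 0.05
    z_beta = scipy.stats.norm.ppf(beta)    # ≈ -0.842 for beta = 0.2
    return -(z_alpha + z_beta) * scale


def compute_threshold(num_samples, ref_accuracy, sigma, alpha=0.05,
                      higher_is_better=True):
    scale = (2 * sigma**2 / num_samples)**0.5
    z_alpha = scipy.stats.norm.ppf(alpha)
    if higher_is_better:
        return ref_accuracy + z_alpha * scale
    return ref_accuracy - z_alpha * scale


# GSM8K defaults; ref_accuracy = 90.0 is illustrative only.
theta = compute_theta(1319, sigma=50)                # ≈ 4.84 points
threshold = compute_threshold(1319, 90.0, sigma=50)  # ≈ 86.80
print(f"theta ≈ {theta:.2f}, threshold ≈ {threshold:.2f}")
```

In words: with 1319 samples and per-sample deviation `sigma = 50`, the test is powered (at `alpha = 0.05`, `beta = 0.2`) to detect a true accuracy regression of about 4.8 points, and an observed score below roughly 86.80 fails against a reference of 90.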
Lines changed: 82 additions & 0 deletions
@@ -0,0 +1,82 @@
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import argparse
+
+import pandas
+import scipy
+
+
+def compute_theta(num_samples: int,
+                  sigma: float,
+                  alpha: float = 0.05,
+                  beta: float = 0.2):
+    scale = (2 * sigma**2 / num_samples)**0.5
+
+    # Single-tail testing
+    z_alpha = scipy.stats.norm.ppf(alpha)
+    z_beta = scipy.stats.norm.ppf(beta)
+    theta = -(z_alpha + z_beta) * scale
+    return theta
+
+
+def compute_threshold(num_samples: int,
+                      ref_accuracy: float,
+                      sigma: float,
+                      alpha: float = 0.05,
+                      higher_is_better: bool = True):
+    scale = (2 * sigma**2 / num_samples)**0.5
+
+    # Single-tail testing
+    z_alpha = scipy.stats.norm.ppf(alpha)
+    if higher_is_better:
+        return ref_accuracy + z_alpha * scale
+    else:
+        return ref_accuracy - z_alpha * scale
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--num_samples_total", type=int, default=8192)
+    parser.add_argument("--sigma", type=float, default=50)
+    parser.add_argument("--alpha", type=float, default=0.05)
+    parser.add_argument("--beta", type=float, default=0.2)
+    args = parser.parse_args()
+
+    data = []
+    num_samples = 32
+    while num_samples < args.num_samples_total:
+        theta = compute_theta(num_samples,
+                              args.sigma,
+                              alpha=args.alpha,
+                              beta=args.beta)
+        threshold = compute_threshold(num_samples,
+                                      0,
+                                      args.sigma,
+                                      alpha=args.alpha)
+        data.append([num_samples, theta, threshold])
+        num_samples *= 2
+
+    num_samples = args.num_samples_total
+    theta = compute_theta(num_samples,
+                          args.sigma,
+                          alpha=args.alpha,
+                          beta=args.beta)
+    threshold = compute_threshold(num_samples, 0, args.sigma, alpha=args.alpha)
+    data.append([num_samples, theta, threshold])
+
+    df = pandas.DataFrame(
+        data, columns=['num_samples', 'theta', 'threshold-reference'])
+    df = df.set_index('num_samples')
+    print(df)
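As a quick sanity check of the table this script prints, the first row (`num_samples = 32` with the default `sigma = 50`, `alpha = 0.05`, `beta = 0.2`) can be reproduced by hand:

```python
import scipy.stats

# Reproduces the first table row (num_samples = 32) of the script above.
scale = (2 * 50**2 / 32)**0.5         # = 12.5
z_alpha = scipy.stats.norm.ppf(0.05)  # ≈ -1.645
z_beta = scipy.stats.norm.ppf(0.2)    # ≈ -0.842
print(-(z_alpha + z_beta) * scale)    # theta ≈ 31.08
print(0 + z_alpha * scale)            # threshold-reference ≈ -20.56
```

Both quantities shrink proportionally to 1/sqrt(num_samples), which is the trade-off the table exposes: doubling the sample count tightens theta and the threshold gap by a factor of about 1.41.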

tests/integration/defs/accuracy/scripts/generate_thresholds.py

Lines changed: 0 additions & 66 deletions
This file was deleted.
