Commit 6369426

tanujtiwari1998 authored and Your Name committed

Merge commit 'a5dd03c1ebc5e4f56f3c9d3dc0436e9c582c978f' into tanuj/cleaned

2 parents (bf5e004 + a5dd03c), commit 6369426

822 files changed: +59334 / -15505 lines


.buildkite/nightly-benchmarks/README.md

Lines changed: 40 additions & 2 deletions
@@ -11,7 +11,7 @@ See [vLLM performance dashboard](https://perf.vllm.ai) for the latest performanc

 ## Performance benchmark quick overview

-**Benchmarking Coverage**: latency, throughput and fix-qps serving on A100 (the support for FP8 benchmark on H100 is coming!), with different models.
+**Benchmarking Coverage**: latency, throughput and fix-qps serving on A100 (the support for FP8 benchmark on H100 is coming!) and Intel® Xeon® Processors, with different models.

 **Benchmarking Duration**: about 1hr.

@@ -31,13 +31,27 @@ Performance benchmark will be triggered when:
 - A PR being merged into vllm.
 - Every commit for those PRs with `perf-benchmarks` label AND `ready` label.

+Manually trigger the benchmark:
+
+```bash
+bash .buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh
+```
+
+Runtime environment variables:
+- `ON_CPU`: set the value to '1' on Intel® Xeon® Processors. Default value is 0.
+- `SERVING_JSON`: JSON file to use for the serving tests. Default value is empty string (use default file).
+- `LATENCY_JSON`: JSON file to use for the latency tests. Default value is empty string (use default file).
+- `THROUGHPUT_JSON`: JSON file to use for the throughput tests. Default value is empty string (use default file).
+- `REMOTE_HOST`: IP for the remote vLLM service to benchmark. Default value is empty string.
+- `REMOTE_PORT`: Port for the remote vLLM service to benchmark. Default value is empty string.
+
 Nightly benchmark will be triggered when:
 - Every commit for those PRs with `perf-benchmarks` label and `nightly-benchmarks` label.

 ## Performance benchmark details

 See [performance-benchmarks-descriptions.md](performance-benchmarks-descriptions.md) for detailed descriptions, and use `tests/latency-tests.json`, `tests/throughput-tests.json`, `tests/serving-tests.json` to configure the test cases.
-
+> NOTE: For Intel® Xeon® Processors, use `tests/latency-tests-cpu.json`, `tests/throughput-tests-cpu.json`, `tests/serving-tests-cpu.json` instead.
 ### Latency test

 Here is an example of one test inside `latency-tests.json`:
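
For instance, the manual trigger and the runtime variables documented in this hunk might be combined as in the sketch below; the CPU-specific JSON path and the remote host/port values are illustrative assumptions, not required settings.

```bash
# Illustrative sketch: run the performance benchmarks on an Intel® Xeon® host
# with a CPU-specific serving config. ON_CPU and SERVING_JSON are the variables
# documented above; the JSON path and the remote host/port values are assumptions.
export ON_CPU=1
export SERVING_JSON=.buildkite/nightly-benchmarks/tests/serving-tests-cpu.json
# Optionally benchmark an already-running remote vLLM service instead of a local one:
# export REMOTE_HOST=10.0.0.5
# export REMOTE_PORT=8000
bash .buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh
```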
@@ -119,6 +133,30 @@ If you do not see the table, please wait till the benchmark finish running.
 The json version of the table (together with the json version of the benchmark) will be also attached to the markdown file.
 The raw benchmarking results (in the format of json files) are in the `Artifacts` tab of the benchmarking.

+The `compare-json-results.py` script helps compare benchmark results JSON files converted using `convert-results-json-to-markdown.py`.
+When run, the benchmark script generates results under the `benchmark/results` folder, along with `benchmark_results.md` and `benchmark_results.json`.
+`compare-json-results.py` compares two `benchmark_results.json` files and reports the performance ratio, e.g. for Output Tput, Median TTFT and Median TPOT.
+
+Here is an example using the script to compare result_a and result_b without detailed test names:
+`python3 compare-json-results.py -f results_a/benchmark_results.json -f results_b/benchmark_results.json --ignore_test_name`
+
+|   | results_a/benchmark_results.json | results_b/benchmark_results.json | perf_ratio |
+|---|----------------------------------|----------------------------------|------------|
+| 0 | 142.633982                       | 156.526018                       | 1.097396   |
+| 1 | 241.620334                       | 294.018783                       | 1.216863   |
+| 2 | 218.298905                       | 262.664916                       | 1.203235   |
+| 3 | 242.743860                       | 299.816190                       | 1.235113   |
+
+Here is an example using the script to compare result_a and result_b with detailed test names:
+`python3 compare-json-results.py -f results_a/benchmark_results.json -f results_b/benchmark_results.json`
+
+|   | results_a/benchmark_results.json_name | results_a/benchmark_results.json | results_b/benchmark_results.json_name | results_b/benchmark_results.json | perf_ratio |
+|---|---------------------------------------|----------------------------------|---------------------------------------|----------------------------------|------------|
+| 0 | serving_llama8B_tp1_sharegpt_qps_1 | 142.633982 | serving_llama8B_tp1_sharegpt_qps_1 | 156.526018 | 1.097396 |
+| 1 | serving_llama8B_tp1_sharegpt_qps_16 | 241.620334 | serving_llama8B_tp1_sharegpt_qps_16 | 294.018783 | 1.216863 |
+| 2 | serving_llama8B_tp1_sharegpt_qps_4 | 218.298905 | serving_llama8B_tp1_sharegpt_qps_4 | 262.664916 | 1.203235 |
+| 3 | serving_llama8B_tp1_sharegpt_qps_inf | 242.743860 | serving_llama8B_tp1_sharegpt_qps_inf | 299.816190 | 1.235113 |
+| 4 | serving_llama8B_tp2_random_1024_128_qps_1 | 96.613390 | serving_llama8B_tp4_random_1024_128_qps_1 | 108.404853 | 1.122048 |
+
 ## Nightly test details

 See [nightly-descriptions.md](nightly-descriptions.md) for the detailed description on test workload, models and docker containers of benchmarking other llm engines.
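
Putting the comparison workflow from the hunk above together, a two-run comparison could be scripted roughly as below; the `results_a`/`results_b` directory names mirror the example tables, while the copy steps and the script's location under `scripts/` are assumptions.

```bash
# Rough sketch of a two-run comparison workflow (directory layout is assumed).
# First run: baseline configuration; keep its results.
bash .buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh
cp -r benchmark/results results_a

# Second run: e.g. a different build or tuning; keep those results too.
bash .buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh
cp -r benchmark/results results_b

# Compare the two benchmark_results.json files; besides the console tables,
# the script also writes an HTML report to perf_comparison.html.
python3 .buildkite/nightly-benchmarks/scripts/compare-json-results.py \
    -f results_a/benchmark_results.json \
    -f results_b/benchmark_results.json \
    --ignore_test_name
```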

.buildkite/nightly-benchmarks/nightly-annotation.md

Lines changed: 1 addition & 1 deletion
@@ -16,7 +16,7 @@ Please download the visualization scripts in the post
 - Download `nightly-benchmarks.zip`.
 - In the same folder, run the following code:

-```console
+```bash
 export HF_TOKEN=<your HF token>
 apt update
 apt install -y git

.buildkite/nightly-benchmarks/performance-benchmarks-descriptions.md

Lines changed: 12 additions & 4 deletions
@@ -4,7 +4,8 @@
 - Input length: 32 tokens.
 - Output length: 128 tokens.
 - Batch size: fixed (8).
-- Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
+- GPU Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
+- CPU Models: llama-3.1 8B.
 - Evaluation metrics: end-to-end latency (mean, median, p99).

 {latency_tests_markdown_table}
@@ -14,7 +15,8 @@
 - Input length: randomly sample 200 prompts from ShareGPT dataset (with fixed random seed).
 - Output length: the corresponding output length of these 200 prompts.
 - Batch size: dynamically determined by vllm to achieve maximum throughput.
-- Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
+- GPU Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
+- CPU Models: llama-3.1 8B.
 - Evaluation metrics: throughput.

 {throughput_tests_markdown_table}
@@ -25,12 +27,18 @@
 - Output length: the corresponding output length of these 200 prompts.
 - Batch size: dynamically determined by vllm and the arrival pattern of the requests.
 - **Average QPS (query per second)**: 1, 4, 16 and inf. QPS = inf means all requests come at once. For other QPS values, the arrival time of each query is determined using a random Poisson process (with fixed random seed).
-- Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
-- We also added a speculative decoding test for llama-3 70B, under QPS 2
+- GPU Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
+- We also added a speculative decoding test for llama-3 70B on GPU, under QPS 2.
+- CPU Models: llama-3.1 8B.
 - Evaluation metrics: throughput, TTFT (time to the first token, with mean, median and p99), ITL (inter-token latency, with mean, median and p99).
+- For CPU, we added random dataset tests to benchmark fixed input/output lengths with 100 prompts.

 {serving_tests_markdown_table}

+## Platform Information
+
+{platform_markdown_table}
+
 ## json version of the benchmarking tables

 This section contains the data of the markdown tables above in JSON format.
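
The CPU random-dataset serving cases described above (fixed input/output lengths, 100 prompts) correspond roughly to the invocation sketched below; the model name, request rate and exact flags are illustrative assumptions based on vLLM's `benchmarks/benchmark_serving.py`, not values taken from the committed test JSON files.

```bash
# Illustrative sketch of a "random_1024_128" style serving benchmark invocation
# (model, request rate and flag choices are assumptions, not the committed test config).
python3 benchmarks/benchmark_serving.py \
    --backend vllm \
    --model meta-llama/Meta-Llama-3.1-8B-Instruct \
    --dataset-name random \
    --random-input-len 1024 \
    --random-output-len 128 \
    --num-prompts 100 \
    --request-rate 4
```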

.buildkite/nightly-benchmarks/scripts/compare-json-results.py

Lines changed: 66 additions & 0 deletions
@@ -0,0 +1,66 @@
+# SPDX-License-Identifier: Apache-2.0
+# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
+import argparse
+
+import pandas as pd
+
+
+def compare_data_columns(
+    files, name_column, data_column, drop_column, ignore_test_name=False
+):
+    print("\ncompare_data_column: " + data_column)
+    frames = []
+    compare_frames = []
+    for file in files:
+        data_df = pd.read_json(file)
+        serving_df = data_df.dropna(subset=[drop_column], ignore_index=True)
+        if ignore_test_name is False:
+            serving_df = serving_df.rename(columns={name_column: file + "_name"})
+            frames.append(serving_df[file + "_name"])
+        serving_df = serving_df.rename(columns={data_column: file})
+        frames.append(serving_df[file])
+        compare_frames.append(serving_df[file])
+        if len(compare_frames) >= 2:
+            # Compare numbers among two files
+            ratio_df = compare_frames[1] / compare_frames[0]
+            frames.append(ratio_df)
+            compare_frames.pop(1)
+
+    concat_df = pd.concat(frames, axis=1)
+    return concat_df
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    parser.add_argument(
+        "-f", "--file", action="append", type=str, help="input file name"
+    )
+    parser.add_argument(
+        "--ignore_test_name", action="store_true", help="ignore_test_name or not"
+    )
+    args = parser.parse_args()
+    files = args.file
+    print("comparing : " + ", ".join(files))
+
+    drop_column = "P99"
+    name_column = "Test name"
+    data_cols_to_compare = ["Output Tput (tok/s)", "Median TTFT (ms)", "Median"]
+    html_msgs_for_data_cols = [
+        "Compare Output Tokens /n",
+        "Median TTFT /n",
+        "Median TPOT /n",
+    ]
+    ignore_test_name = args.ignore_test_name
+    with open("perf_comparison.html", "w") as text_file:
+        for i in range(len(data_cols_to_compare)):
+            output_df = compare_data_columns(
+                files,
+                name_column,
+                data_cols_to_compare[i],
+                drop_column,
+                ignore_test_name=ignore_test_name,
+            )
+            print(output_df)
+            html = output_df.to_html()
+            text_file.write(html_msgs_for_data_cols[i])
+            text_file.write(html)

.buildkite/nightly-benchmarks/scripts/convert-results-json-to-markdown.py

Lines changed: 53 additions & 10 deletions
@@ -3,9 +3,11 @@

 import json
 import os
+from importlib import util
 from pathlib import Path

 import pandas as pd
+import psutil
 from tabulate import tabulate

 results_folder = Path("results/")
@@ -29,28 +31,30 @@
 throughput_results_column_mapping = {
     "test_name": "Test name",
     "gpu_type": "GPU",
-    # "num_requests": "# of req.",
-    # "total_num_tokens": "Total # of tokens",
-    # "elapsed_time": "Elapsed time (s)",
+    "num_requests": "# of req.",
+    "total_num_tokens": "Total # of tokens",
+    "elapsed_time": "Elapsed time (s)",
     "requests_per_second": "Tput (req/s)",
-    # "tokens_per_second": "Tput (tok/s)",
+    "tokens_per_second": "Tput (tok/s)",
 }

 # serving results and the keys that will be printed into markdown
 serving_results = []
 serving_column_mapping = {
     "test_name": "Test name",
     "gpu_type": "GPU",
-    # "completed": "# of req.",
+    "completed": "# of req.",
     "request_throughput": "Tput (req/s)",
-    # "input_throughput": "Input Tput (tok/s)",
-    # "output_throughput": "Output Tput (tok/s)",
+    "total_token_throughput": "Total Token Tput (tok/s)",
+    "output_throughput": "Output Tput (tok/s)",
+    "total_input_tokens": "Total input tokens",
+    "total_output_tokens": "Total output tokens",
     "mean_ttft_ms": "Mean TTFT (ms)",
     "median_ttft_ms": "Median TTFT (ms)",
     "p99_ttft_ms": "P99 TTFT (ms)",
-    # "mean_tpot_ms": "Mean TPOT (ms)",
-    # "median_tpot_ms": "Median",
-    # "p99_tpot_ms": "P99",
+    "mean_tpot_ms": "Mean TPOT (ms)",
+    "median_tpot_ms": "Median",
+    "p99_tpot_ms": "P99",
     "mean_itl_ms": "Mean ITL (ms)",
     "median_itl_ms": "Median ITL (ms)",
     "p99_itl_ms": "P99 ITL (ms)",
@@ -75,6 +79,20 @@ def results_to_json(latency, throughput, serving):
     )


+def get_size_with_unit(bytes, suffix="B"):
+    """
+    Scale bytes to its proper format
+    e.g:
+    1253656 => '1.20MB'
+    1253656678 => '1.17GB'
+    """
+    factor = 1024
+    for unit in ["", "K", "M", "G", "T", "P"]:
+        if bytes < factor:
+            return f"{bytes:.2f}{unit}{suffix}"
+        bytes /= factor
+
+
 if __name__ == "__main__":
     # collect results
     for test_file in results_folder.glob("*.json"):
@@ -155,6 +173,27 @@ def results_to_json(latency, throughput, serving):
     serving_results = pd.DataFrame.from_dict(serving_results)
     throughput_results = pd.DataFrame.from_dict(throughput_results)

+    svmem = psutil.virtual_memory()
+    platform_data = {
+        "Physical cores": [psutil.cpu_count(logical=False)],
+        "Total cores": [psutil.cpu_count(logical=True)],
+        "Total Memory": [get_size_with_unit(svmem.total)],
+    }
+
+    if util.find_spec("numa") is not None:
+        from numa import info
+
+        platform_data["Total NUMA nodes"] = [info.get_num_configured_nodes()]
+
+    if util.find_spec("cpuinfo") is not None:
+        from cpuinfo import get_cpu_info
+
+        platform_data["CPU Brand"] = [get_cpu_info()["brand_raw"]]
+
+    platform_results = pd.DataFrame.from_dict(
+        platform_data, orient="index", columns=["Platform Info"]
+    )
+
     raw_results_json = results_to_json(
         latency_results, throughput_results, serving_results
     )
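
The platform table built in the hunk above always needs `psutil`, while the NUMA-node and CPU-brand rows are only emitted when the optional `numa` and `cpuinfo` modules are importable (the `util.find_spec` checks). The pip package names below are an assumption about where those modules typically come from.

```bash
# psutil is required for the core/memory rows; the NUMA-node and CPU-brand rows
# appear only if the optional `numa` and `cpuinfo` modules can be imported.
# Package names are assumptions: py-cpuinfo provides `cpuinfo`, and py-libnuma
# is one common source of the `numa` module.
pip install psutil py-cpuinfo py-libnuma
```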
@@ -200,6 +239,9 @@ def results_to_json(latency, throughput, serving):
     throughput_md_table = tabulate(
         throughput_results, headers="keys", tablefmt="pipe", showindex=False
     )
+    platform_md_table = tabulate(
+        platform_results, headers="keys", tablefmt="pipe", showindex=True
+    )

     # document the result
     with open(results_folder / "benchmark_results.md", "w") as f:
@@ -211,6 +253,7 @@ def results_to_json(latency, throughput, serving):
             latency_tests_markdown_table=latency_md_table,
             throughput_tests_markdown_table=throughput_md_table,
             serving_tests_markdown_table=serving_md_table,
+            platform_markdown_table=platform_md_table,
             benchmarking_results_in_json_string=processed_results_json,
         )
         f.write(results)
