Setup vLLM benchmark CI for H100 #32

huydhn · 2025-05-29T01:20:05Z

The new workflow runs on a schedule every 2 hours, or on demand for a given commit from the vLLM main branch. It works as follows:

When scheduled, the workflow checks the latest commits from the vLLM main branch in order until it finds the latest commit whose vLLM CI Docker image has already been built and which has not been benchmarked yet. The Docker image name is public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:<SHA>
When run on demand, it only checks for the requested Docker image and returns early if that image doesn't exist yet
The workflow uses the official benchmark script from vLLM at https://github.com/vllm-project/vllm/blob/main/.buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh
Instead of using the list of models from vLLM, we are going to use those from https://github.com/pytorch/pytorch-integration-testing/tree/main/vllm-benchmarks/benchmarks so that we can control exactly what to benchmark
On 4xH100, all benchmarks currently take 45 minutes to finish
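
For reviewers, the commit selection described above can be sketched roughly as follows. This is only an illustration, not the code in the workflow: `pick_commit`, `image_exists`, and `already_benchmarked` are hypothetical names, the two checks are passed in as plain callables instead of calling a registry or the benchmark database, and the commit list is assumed to be ordered newest first.

```python
from typing import Callable, Iterable, Optional

# Image repository named in the PR description; the tag is the commit SHA.
IMAGE_REPO = "public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo"


def image_name(sha: str) -> str:
    """Return the vLLM CI Docker image name for a commit."""
    return f"{IMAGE_REPO}:{sha}"


def pick_commit(
    commits: Iterable[str],
    image_exists: Callable[[str], bool],
    already_benchmarked: Callable[[str], bool],
    requested: Optional[str] = None,
) -> Optional[str]:
    """Pick the commit to benchmark, or None if there is nothing to run.

    `commits` is assumed to be ordered newest first (an assumption of this
    sketch). Both predicates are hypothetical stand-ins for the real checks.
    """
    if requested is not None:
        # On demand: only check that the requested image exists.
        return requested if image_exists(image_name(requested)) else None

    # Scheduled: first commit whose image is built and not yet benchmarked.
    for sha in commits:
        if image_exists(image_name(sha)) and not already_benchmarked(sha):
            return sha
    return None


if __name__ == "__main__":
    # Made-up SHAs: c3 is newest but its image is not built yet,
    # c1 has an image and has already been benchmarked.
    built = {image_name("c2"), image_name("c1")}
    done = {"c1"}
    print(pick_commit(["c3", "c2", "c1"], built.__contains__, done.__contains__))
```

Returning `None` in the sketch corresponds to the workflow returning early when no suitable image is available.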

Some more PRs are coming after this:

ROCm MI300x benchmark. I need to figure out the Docker image vLLM uses for this
Adding Llama 4 Scout and Maverick to https://github.com/pytorch/pytorch-integration-testing/tree/main/vllm-benchmarks/benchmarks

Testing

The results are showing up on the dashboard now https://hud.pytorch.org/benchmark/llms?startTime=Fri%2C%2023%20May%202025%2019%3A19%3A35%20GMT&stopTime=Fri%2C%2030%20May%202025%2019%3A19%3A35%20GMT&granularity=day&lBranch=main&lCommit=7f21e8052b5f3948c8a59514a8dc1e9c5eef70d6&rBranch=main&rCommit=7f21e8052b5f3948c8a59514a8dc1e9c5eef70d6&repoName=vllm-project%2Fvllm&benchmarkName=&modelName=All%20Models&backendName=All%20Backends&modeName=All%20Modes&dtypeName=All%20DType&deviceName=All%20Devices&archName=All%20Platforms

Signed-off-by: Huy Do <huydhn@gmail.com>

yangw-dev

LGTM!

yangw-dev · 2025-06-02T18:19:38Z

.github/workflows/vllm-benchmark.yml

+jobs:
+  benchmark-h100:
+    name: Run vLLM benchmarks
+    runs-on: linux.aws.h100.4


For my own knowledge, does this mean an instance with 4 H100s?

How many of those do we have now?

We have 4 of them atm. Also, FYI, there is one 8xH100 runner too.

Setup vLLM benchmark CI for H100

a3f85cf

Signed-off-by: Huy Do <huydhn@gmail.com>

facebook-github-bot added the cla signed label May 29, 2025

Add environment secrets

79db084

Signed-off-by: Huy Do <huydhn@gmail.com>

huydhn had a problem deploying to pytorch-x-vllm May 29, 2025 01:28 — with GitHub Actions Failure

Double check S3 access

fadbeb0

Signed-off-by: Huy Do <huydhn@gmail.com>

huydhn temporarily deployed to pytorch-x-vllm May 29, 2025 01:29 — with GitHub Actions Inactive

Debug

ab55341

Signed-off-by: Huy Do <huydhn@gmail.com>

huydhn had a problem deploying to pytorch-x-vllm May 29, 2025 01:30 — with GitHub Actions Failure

Debug

af902a4

Signed-off-by: Huy Do <huydhn@gmail.com>

huydhn had a problem deploying to pytorch-x-vllm May 29, 2025 01:32 — with GitHub Actions Failure

Debug

9374216

Signed-off-by: Huy Do <huydhn@gmail.com>

huydhn had a problem deploying to pytorch-x-vllm May 29, 2025 01:34 — with GitHub Actions Failure

Debug

14b3e13

Signed-off-by: Huy Do <huydhn@gmail.com>

huydhn had a problem deploying to pytorch-x-vllm May 29, 2025 01:45 — with GitHub Actions Failure

Bash, is this you?

c042a8f

Signed-off-by: Huy Do <huydhn@gmail.com>

huydhn had a problem deploying to pytorch-x-vllm May 29, 2025 01:49 — with GitHub Actions Failure

Debug

264df6d

Signed-off-by: Huy Do <huydhn@gmail.com>

huydhn had a problem deploying to pytorch-x-vllm May 29, 2025 01:53 — with GitHub Actions Failure

Debug

a9ff571

Signed-off-by: Huy Do <huydhn@gmail.com>

huydhn temporarily deployed to pytorch-x-vllm May 29, 2025 01:56 — with GitHub Actions Inactive

Debug

b9f3d29

Signed-off-by: Huy Do <huydhn@gmail.com>

huydhn had a problem deploying to pytorch-x-vllm May 29, 2025 01:59 — with GitHub Actions Failure

Debug

3238361

Signed-off-by: Huy Do <huydhn@gmail.com>

huydhn had a problem deploying to pytorch-x-vllm May 29, 2025 02:00 — with GitHub Actions Failure

Debug

2de2a8a

Signed-off-by: Huy Do <huydhn@gmail.com>

huydhn had a problem deploying to pytorch-x-vllm May 29, 2025 02:09 — with GitHub Actions Failure

Debug

d44cb7a

Signed-off-by: Huy Do <huydhn@gmail.com>

huydhn had a problem deploying to pytorch-x-vllm May 29, 2025 02:10 — with GitHub Actions Failure

Debug

a78c25a

Signed-off-by: Huy Do <huydhn@gmail.com>

huydhn had a problem deploying to pytorch-x-vllm May 29, 2025 02:12 — with GitHub Actions Failure

vLLM Docker is run as root

c6d29f8

Signed-off-by: Huy Do <huydhn@gmail.com>

huydhn temporarily deployed to pytorch-x-vllm May 29, 2025 08:11 — with GitHub Actions Inactive

Add upload step

80a186d

Signed-off-by: Huy Do <huydhn@gmail.com>

huydhn had a problem deploying to pytorch-x-vllm May 29, 2025 20:08 — with GitHub Actions Failure

huydhn added 4 commits May 29, 2025 13:21

Add a comment

344b25a

Signed-off-by: Huy Do <huydhn@gmail.com>

Run every 2 hours

55e7b2e

Signed-off-by: Huy Do <huydhn@gmail.com>

Use the correct commit timestamp

1a6c7bf

Signed-off-by: Huy Do <huydhn@gmail.com>

Add setup-python

dba8cd1

Signed-off-by: Huy Do <huydhn@gmail.com>

huydhn had a problem deploying to pytorch-x-vllm May 29, 2025 20:55 — with GitHub Actions Failure

Missing torch

0352a3d

Signed-off-by: Huy Do <huydhn@gmail.com>

huydhn had a problem deploying to pytorch-x-vllm May 30, 2025 00:34 — with GitHub Actions Failure

Fix upload script mutually exclusive group

d9fb0e2

Signed-off-by: Huy Do <huydhn@gmail.com>

huydhn had a problem deploying to pytorch-x-vllm May 30, 2025 01:35 — with GitHub Actions Failure

Just skip setting ACL

5ea7f89

Signed-off-by: Huy Do <huydhn@gmail.com>

huydhn temporarily deployed to pytorch-x-vllm May 30, 2025 07:53 — with GitHub Actions Inactive

huydhn had a problem deploying to pytorch-x-vllm May 30, 2025 08:46 — with GitHub Actions Error

chown?

2a474ef

Signed-off-by: Huy Do <huydhn@gmail.com>

huydhn temporarily deployed to pytorch-x-vllm May 30, 2025 09:20 — with GitHub Actions Inactive

Debug

2426af6

Signed-off-by: Huy Do <huydhn@gmail.com>

huydhn had a problem deploying to pytorch-x-vllm May 30, 2025 17:02 — with GitHub Actions Error

huydhn marked this pull request as ready for review May 30, 2025 18:01

Found the bug

02ca3ce

Signed-off-by: Huy Do <huydhn@gmail.com>

huydhn temporarily deployed to pytorch-x-vllm May 30, 2025 18:40 — with GitHub Actions Inactive

Add vllm_branch and vllm_commit to workflow dispatch

8dcda1b

Signed-off-by: Huy Do <huydhn@gmail.com>

huydhn temporarily deployed to pytorch-x-vllm May 30, 2025 19:24 — with GitHub Actions Inactive

huydhn requested review from yangw-dev, seemethere and malfet May 30, 2025 19:24

yangw-dev approved these changes Jun 2, 2025

huydhn merged commit 4a7fc56 into main Jun 2, 2025
2 checks passed
