Commit 39043e0

Merge branch 'main' into eagle-fusion-sync-reduce
2 parents e627f0a + c635c5f

165 files changed, +7708 -4619 lines changed


.buildkite/nightly-benchmarks/nightly-annotation.md

Lines changed: 1 addition & 1 deletion

````diff
@@ -16,7 +16,7 @@ Please download the visualization scripts in the post
 - Download `nightly-benchmarks.zip`.
 - In the same folder, run the following code:
 
-```console
+```bash
 export HF_TOKEN=<your HF token>
 apt update
 apt install -y git
````

.buildkite/release-pipeline.yaml

Lines changed: 2 additions & 0 deletions

```diff
@@ -102,6 +102,7 @@ steps:
     commands:
       - "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7"
       - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg GIT_REPO_CHECK=1 --tag public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:$(buildkite-agent meta-data get release-version) --tag public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:latest --progress plain --target vllm-openai -f docker/Dockerfile.cpu ."
+      - "docker push public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:latest"
       - "docker push public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:$(buildkite-agent meta-data get release-version)"
     env:
       DOCKER_BUILDKIT: "1"
@@ -117,6 +118,7 @@ steps:
     commands:
       - "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7"
       - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg GIT_REPO_CHECK=1 --tag public.ecr.aws/q9t5s3a7/vllm-neuron-release-repo:$(buildkite-agent meta-data get release-version) --tag public.ecr.aws/q9t5s3a7/vllm-neuron-release-repo:latest --progress plain -f docker/Dockerfile.neuron ."
+      - "docker push public.ecr.aws/q9t5s3a7/vllm-neuron-release-repo:latest"
       - "docker push public.ecr.aws/q9t5s3a7/vllm-neuron-release-repo:$(buildkite-agent meta-data get release-version)"
     env:
       DOCKER_BUILDKIT: "1"
```
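Both hunks have the same effect: each release build now publishes a moving `latest` tag alongside the pinned release tag. A minimal consumer-side sketch, assuming you want the newest image rather than a pinned version (the repository URL comes from the pipeline above):

```bash
# Pull the most recently released CPU image; pin the release-version tag
# instead if you need reproducible deployments.
docker pull public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:latest
```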

.buildkite/scripts/tpu/config_v6e_1.env

Lines changed: 2 additions & 2 deletions

```diff
@@ -4,8 +4,8 @@ CONTAINER_NAME=vllm-tpu
 
 # vllm config
 MODEL=meta-llama/Llama-3.1-8B-Instruct
-MAX_NUM_SEQS=512
-MAX_NUM_BATCHED_TOKENS=512
+MAX_NUM_SEQS=256
+MAX_NUM_BATCHED_TOKENS=1024
 TENSOR_PARALLEL_SIZE=1
 MAX_MODEL_LEN=2048
 DOWNLOAD_DIR=/mnt/disks/persist
```
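A minimal sketch of how these values might be consumed; the `vllm serve` flags are standard engine arguments, but the actual launcher that reads this env file is not shown in the commit, so treat the wiring as an assumption:

```bash
# Hypothetical launcher: source the TPU config and forward it to vllm serve.
source .buildkite/scripts/tpu/config_v6e_1.env

vllm serve "$MODEL" \
    --max-num-seqs "$MAX_NUM_SEQS" \
    --max-num-batched-tokens "$MAX_NUM_BATCHED_TOKENS" \
    --tensor-parallel-size "$TENSOR_PARALLEL_SIZE" \
    --max-model-len "$MAX_MODEL_LEN" \
    --download-dir "$DOWNLOAD_DIR"
```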

.github/mergify.yml

Lines changed: 14 additions & 1 deletion

```diff
@@ -45,6 +45,7 @@ pull_request_rules:
       - files~=^vllm/entrypoints/openai/tool_parsers/llama.*\.py
       - files~=^vllm/model_executor/models/.*llama.*\.py
       - files~=^vllm/transformers_utils/configs/.*llama.*\.py
+      - title~=(?i)llama
   actions:
     label:
       add:
@@ -65,6 +66,19 @@ pull_request_rules:
       add:
         - multi-modality
 
+- name: label-performance
+  description: Automatically apply performance label
+  conditions:
+    - or:
+      - files~=^benchmarks/
+      - files~=^vllm/benchmarks/
+      - files~=^tests/benchmarks/
+      - files~=^\.buildkite/nightly-benchmarks/
+  actions:
+    label:
+      add:
+        - performance
+
 - name: label-qwen
   description: Automatically apply qwen label
   conditions:
@@ -74,7 +88,6 @@ pull_request_rules:
       - files~=^vllm/model_executor/models/.*qwen.*\.py
       - files~=^vllm/reasoning/.*qwen.*\.py
       - title~=(?i)Qwen
-      - body~=(?i)Qwen
   actions:
     label:
       add:
```
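The new `label-performance` rule labels any PR touching benchmark code. As a quick sanity check, the same regexes can be exercised with `grep -E`; the sample paths below are made up:

```bash
# Paths matching any of the four file patterns would receive the "performance" label.
printf '%s\n' \
    benchmarks/benchmark_serving.py \
    vllm/benchmarks/datasets.py \
    vllm/model_executor/models/llama.py |
    grep -E '^(benchmarks/|vllm/benchmarks/|tests/benchmarks/|\.buildkite/nightly-benchmarks/)'
# Prints the first two paths; the llama model file does not match.
```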

.pre-commit-config.yaml

Lines changed: 5 additions & 0 deletions

```diff
@@ -115,6 +115,11 @@ repos:
     entry: python tools/check_spdx_header.py
     language: python
     types: [python]
+  - id: check-root-lazy-imports
+    name: Check root lazy imports
+    entry: python tools/check_init_lazy_imports.py
+    language: python
+    types: [python]
   - id: check-filenames
     name: Check for spaces in all filenames
     entry: bash
```
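The new hook can be run on its own with the standard pre-commit CLI (hook id taken from the diff above):

```bash
# Run only the new lazy-import check against every file in the repository.
pre-commit run check-root-lazy-imports --all-files
```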

README.md

Lines changed: 2 additions & 0 deletions

```diff
@@ -154,11 +154,13 @@ If you use vLLM for your research, please cite our [paper](https://arxiv.org/abs
 
 ## Contact Us
 
+<!-- --8<-- [start:contact-us] -->
 - For technical questions and feature requests, please use GitHub [Issues](https://github.com/vllm-project/vllm/issues) or [Discussions](https://github.com/vllm-project/vllm/discussions)
 - For discussing with fellow users, please use the [vLLM Forum](https://discuss.vllm.ai)
 - For coordinating contributions and development, please use [Slack](https://slack.vllm.ai)
 - For security disclosures, please use GitHub's [Security Advisories](https://github.com/vllm-project/vllm/security/advisories) feature
 - For collaborations and partnerships, please contact us at [vllm-questions@lists.berkeley.edu](mailto:vllm-questions@lists.berkeley.edu)
+<!-- --8<-- [end:contact-us] -->
 
 ## Media Kit
 
```
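The `--8<--` comments match the section-marker syntax of the pymdownx.snippets extension. Assuming the vLLM docs build uses that extension (not shown in this commit), a docs page could embed just this section with an include line like:

```markdown
--8<-- "README.md:contact-us"
```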
benchmarks/README.md

Lines changed: 190 additions & 0 deletions

````diff
@@ -269,6 +269,21 @@ python3 vllm/benchmarks/benchmark_serving.py \
     --num-prompts 10
 ```
 
+### Running With Ramp-Up Request Rate
+
+The benchmark tool also supports ramping up the request rate over the
+duration of the benchmark run. This can be useful for stress testing the
+server or finding the maximum throughput that it can handle, given some latency budget.
+
+Two ramp-up strategies are supported:
+- `linear`: Increases the request rate linearly from a start value to an end value.
+- `exponential`: Increases the request rate exponentially.
+
+The following arguments can be used to control the ramp-up:
+- `--ramp-up-strategy`: The ramp-up strategy to use (`linear` or `exponential`).
+- `--ramp-up-start-rps`: The request rate at the beginning of the benchmark.
+- `--ramp-up-end-rps`: The request rate at the end of the benchmark.
+
 ---
 ## Example - Offline Throughput Benchmark
 
````
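To make the new flags concrete, here is an illustrative invocation that combines them with the serving benchmark this README documents; the model and dataset choices are placeholders, not part of the diff:

```bash
# Ramp the request rate linearly from 1 to 20 req/s over the run to find the
# highest rate the server sustains within a latency budget.
python3 vllm/benchmarks/benchmark_serving.py \
    --backend vllm \
    --model NousResearch/Hermes-3-Llama-3.1-8B \
    --dataset-name sharegpt \
    --dataset-path /path/ShareGPT_V3_unfiltered_cleaned_split.json \
    --ramp-up-strategy linear \
    --ramp-up-start-rps 1 \
    --ramp-up-end-rps 20 \
    --num-prompts 1000
```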
````diff
@@ -387,3 +402,178 @@ python3 vllm/benchmarks/benchmark_throughput.py \
     --enable-lora \
     --lora-path yard1/llama-2-7b-sql-lora-test
 ```
+
+---
+## Example - Structured Output Benchmark
+
+Benchmark the performance of structured output generation (JSON, grammar, regex).
+
+### Server Setup
+
+```bash
+vllm serve NousResearch/Hermes-3-Llama-3.1-8B --disable-log-requests
+```
+
+### JSON Schema Benchmark
+
+```bash
+python3 benchmarks/benchmark_serving_structured_output.py \
+    --backend vllm \
+    --model NousResearch/Hermes-3-Llama-3.1-8B \
+    --dataset json \
+    --structured-output-ratio 1.0 \
+    --request-rate 10 \
+    --num-prompts 1000
+```
+
+### Grammar-based Generation Benchmark
+
+```bash
+python3 benchmarks/benchmark_serving_structured_output.py \
+    --backend vllm \
+    --model NousResearch/Hermes-3-Llama-3.1-8B \
+    --dataset grammar \
+    --structure-type grammar \
+    --request-rate 10 \
+    --num-prompts 1000
+```
+
+### Regex-based Generation Benchmark
+
+```bash
+python3 benchmarks/benchmark_serving_structured_output.py \
+    --backend vllm \
+    --model NousResearch/Hermes-3-Llama-3.1-8B \
+    --dataset regex \
+    --request-rate 10 \
+    --num-prompts 1000
+```
+
+### Choice-based Generation Benchmark
+
+```bash
+python3 benchmarks/benchmark_serving_structured_output.py \
+    --backend vllm \
+    --model NousResearch/Hermes-3-Llama-3.1-8B \
+    --dataset choice \
+    --request-rate 10 \
+    --num-prompts 1000
+```
+
+### XGrammar Benchmark Dataset
+
+```bash
+python3 benchmarks/benchmark_serving_structured_output.py \
+    --backend vllm \
+    --model NousResearch/Hermes-3-Llama-3.1-8B \
+    --dataset xgrammar_bench \
+    --request-rate 10 \
+    --num-prompts 1000
+```
+
+---
+## Example - Long Document QA Throughput Benchmark
+
+Benchmark the performance of long document question-answering with prefix caching.
+
+### Basic Long Document QA Test
+
+```bash
+python3 benchmarks/benchmark_long_document_qa_throughput.py \
+    --model meta-llama/Llama-2-7b-chat-hf \
+    --enable-prefix-caching \
+    --num-documents 16 \
+    --document-length 2000 \
+    --output-len 50 \
+    --repeat-count 5
+```
+
+### Different Repeat Modes
+
+```bash
+# Random mode (default) - shuffle prompts randomly
+python3 benchmarks/benchmark_long_document_qa_throughput.py \
+    --model meta-llama/Llama-2-7b-chat-hf \
+    --enable-prefix-caching \
+    --num-documents 8 \
+    --document-length 3000 \
+    --repeat-count 3 \
+    --repeat-mode random
+
+# Tile mode - repeat entire prompt list in sequence
+python3 benchmarks/benchmark_long_document_qa_throughput.py \
+    --model meta-llama/Llama-2-7b-chat-hf \
+    --enable-prefix-caching \
+    --num-documents 8 \
+    --document-length 3000 \
+    --repeat-count 3 \
+    --repeat-mode tile
+
+# Interleave mode - repeat each prompt consecutively
+python3 benchmarks/benchmark_long_document_qa_throughput.py \
+    --model meta-llama/Llama-2-7b-chat-hf \
+    --enable-prefix-caching \
+    --num-documents 8 \
+    --document-length 3000 \
+    --repeat-count 3 \
+    --repeat-mode interleave
+```
+
+---
+## Example - Prefix Caching Benchmark
+
+Benchmark the efficiency of automatic prefix caching.
+
+### Fixed Prompt with Prefix Caching
+
+```bash
+python3 benchmarks/benchmark_prefix_caching.py \
+    --model meta-llama/Llama-2-7b-chat-hf \
+    --enable-prefix-caching \
+    --num-prompts 1 \
+    --repeat-count 100 \
+    --input-length-range 128:256
+```
+
+### ShareGPT Dataset with Prefix Caching
+
+```bash
+# download dataset
+# wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
+
+python3 benchmarks/benchmark_prefix_caching.py \
+    --model meta-llama/Llama-2-7b-chat-hf \
+    --dataset-path /path/ShareGPT_V3_unfiltered_cleaned_split.json \
+    --enable-prefix-caching \
+    --num-prompts 20 \
+    --repeat-count 5 \
+    --input-length-range 128:256
+```
+
+---
+## Example - Request Prioritization Benchmark
+
+Benchmark the performance of request prioritization in vLLM.
+
+### Basic Prioritization Test
+
+```bash
+python3 benchmarks/benchmark_prioritization.py \
+    --model meta-llama/Llama-2-7b-chat-hf \
+    --input-len 128 \
+    --output-len 64 \
+    --num-prompts 100 \
+    --scheduling-policy priority
+```
+
+### Multiple Sequences per Prompt
+
+```bash
+python3 benchmarks/benchmark_prioritization.py \
+    --model meta-llama/Llama-2-7b-chat-hf \
+    --input-len 128 \
+    --output-len 64 \
+    --num-prompts 100 \
+    --scheduling-policy priority \
+    --n 2
+```
````
