
Rebase 06 06 #1383


Open

michalkuligowski wants to merge 735 commits into habana_main from rebase_06_06
Commits (735)
e56f44d
Support datasets in `vllm bench serve` and sync with benchmark_[servi…
mgoin May 27, 2025
51e98e4
[Bugfix] Disable prefix caching by default for benchmark (#18771)
cascade812 May 28, 2025
a3896c7
[Build] Fixes for CMake install (#18570)
ProExpertProg May 28, 2025
d73a945
[Core] Improve Tensor serialisation (#18774)
lgeiger May 28, 2025
794ae1f
[rocm] Fix wrong attention log (#18764)
fxmarty-amd May 28, 2025
3e9ce60
[Bugfix] Fix nomic max_model_len (#18755)
noooop May 28, 2025
9a21e33
[Bugfix]: correctly propagate error messages caught at the chat_templ…
gcalmettes May 28, 2025
774c5fd
[V1] fix torch profiling for V1 offline scenarios (#18445)
divakar-amd May 28, 2025
5e13c07
[V1] [Bugfix] eagle bugfix and enable correct lm_head for multimodal …
RonaldBXu May 28, 2025
b78f844
[Bugfix][FailingTest]Fix test_model_load_with_params.py (#18758)
rabi May 28, 2025
7f2c1a8
[Deprecation] Require overriding `get_dummy_text` and `get_dummy_mm_d…
DarkLight1337 May 28, 2025
0f0926b
[Deprecation] Remove unused sync methods in `async_timeout` (#18792)
DarkLight1337 May 28, 2025
0c492b7
[Deprecation] Remove fallbacks for Embeddings API (#18795)
DarkLight1337 May 28, 2025
de65fc8
[CI] improve embed testing (#18747)
noooop May 28, 2025
aa42561
Fix PiecewiseCompileInterpreter (#17338)
zou3519 May 28, 2025
ce75efe
[BugFix] FA2 MLA Accuracy Issue (#18807)
LucasWilkinson May 28, 2025
d781930
[Platform][Dist] Make torch distributed process group extendable (#18…
MengqingCao May 28, 2025
4c2b38c
Enable Pydantic mypy checks and convert configs to Pydantic dataclass…
hmellor May 28, 2025
435fa95
[Frontend] add run batch to CLI (#18804)
reidliu41 May 28, 2025
6e4cea1
decrement server_load on listen for disconnect (#18784)
daniel-salib May 28, 2025
321331b
[Core] Add Lora Support to Beam Search (#18346)
alex-jw-brooks May 28, 2025
fced756
[Chore] update ty configuration (#18839)
aarnphm May 28, 2025
c68b5c6
[Misc] fix olmoe model layer can't load in tp gt 1 (#18828)
lengrongfu May 28, 2025
0e98964
[V1][Metrics] Remove metrics that were deprecated in 0.8 (#18837)
markmc May 28, 2025
a09c7ca
[Chore][Spec Decode] Update check NoneType instead of assigning varia…
aarnphm May 28, 2025
643622b
[Hardware][TPU][V1] Multi-LoRA Optimisations for the V1 TPU backend (…
Akshat-Tripathi May 28, 2025
6dbe5b5
Remove checks for `None` for fields which should never be `None` (#17…
hmellor May 28, 2025
7951d78
[Core] Enable CUDA graphs for DP + All2All kernels (#18724)
varun-sundar-rabindranath May 28, 2025
269d901
[Bugfix][ROCm] fix the power of 2 exception from triton_unified_atten…
hongxiayang May 28, 2025
515b413
Prevent the cross-encoder logic from being applied to classification …
maxdebayser May 29, 2025
26b4fa4
Add ability to use CUDAGraphs with use_inductor=False (#17345)
zou3519 May 29, 2025
8e882ff
[Bugfix][TPU] fix moe custom kernel import (#18853)
yaochengji May 29, 2025
1661a9c
[Doc][Neuron] Update documentation for Neuron (#18868)
elaineyz May 29, 2025
3c49dbd
Skip device and quant Pydantic validation to make plugin device work …
Yikun May 29, 2025
fd7bb88
Fixes a dead link in nightly benchmark readme (#18856)
nerdalert May 29, 2025
972eddf
[Neuron] Add multi-LoRA support for Neuron. (#18284)
aws-satyajith May 29, 2025
34d6c44
[LoRA] Add LoRA support for InternVL (#18842)
jeejeelee May 29, 2025
a652e71
[Doc] Remove redundant spaces from compatibility_matrix.md (#18891)
windsonsea May 29, 2025
e740d07
[doc] add CLI doc (#18871)
reidliu41 May 29, 2025
7fcfd95
[Bugfix] Fix misleading information in the documentation (#18845)
jeejeelee May 29, 2025
24d0ef8
[Misc] Replace TODO in serving transcription (#18895)
NickLucche May 29, 2025
0b1447f
[Bugfix] Ensure tensors are contiguous during serialisation (#18860)
lgeiger May 29, 2025
f274581
[BugFix] Update pydantic to fix error on python 3.10 (#18852)
ProExpertProg May 29, 2025
f8977c2
Fix an error in dummy weight loading for quantization models (#18855)
Chenyaaang May 29, 2025
b169d5f
[Misc][Tools][Benchmark] Add benchmark_serving supports for llama.cpp…
Duyi-Wang May 29, 2025
6f29094
[Doc] Fix codeblocks formatting in LoRA adapters documentation (#18907)
Zerohertz May 29, 2025
c9479b2
[Bugfix] Fix the failing gte embedding test (#18720)
Isotr0py May 29, 2025
da4b69d
[Attention][V1] Toggle for v1 attention backend (#18275)
gshtras May 29, 2025
1b7cfd5
[ROCm][V0][Attention] Revert to the previous FA triton kernel (#18226)
gshtras May 29, 2025
c290340
[Deprecation] Disallow pos-args other than `model` when initializing …
DarkLight1337 May 29, 2025
d58f9c7
[Misc] Remove duplicate init for self.vllm_config (#18896)
googs1025 May 29, 2025
32ce3cf
[V1] Allocate kv_cache with stride order for V1 (#18775)
NickLucche May 29, 2025
d1d61f3
[BugFix] Make DP work with connector-delayed new requests (#18559)
njhill May 29, 2025
64eaf5f
[P/D] NixlConnector DP fixes (#18903)
wseaton May 29, 2025
a521ef0
Use standalone_compile by default in torch >= 2.8.0 (#18846)
zou3519 May 29, 2025
a1cc9f3
[TPU] remove transpose ops in moe kernel (#18923)
yaochengji May 29, 2025
d54af61
[Bugfix] Fix PP default fallback behavior for V1 (#18915)
mgoin May 30, 2025
1aa2f81
[Misc] Update type annotation for rotary embedding `base` (#18914)
DarkLight1337 May 30, 2025
3132290
[TPU][CI/CD] Clean up docker for TPU tests. (#18926)
CAROLZXYZXY May 30, 2025
3de3ead
improve the robustness of parsing vlms config in AutoRound (#18894)
wenhuach21 May 30, 2025
77164da
[Bugfix] Consistent ascii handling in tool parsers (#18883)
chaunceyjiang May 30, 2025
3987e2a
[Model] Use AutoWeightsLoader for mamba2 (#18918)
jinyouzhi May 30, 2025
5acf828
[docs] fix: fix markdown syntax (#18927)
eric-haibin-lin May 30, 2025
77b6e74
[ROCm] Remove unnecessary assertion of max_model_len in ROCM_AITER_ML…
vllmellm May 30, 2025
4d0a154
[Bugfix] Remove NVFP4 scales assertions to fix load_format=dummy (#18…
mgoin May 30, 2025
4f4a6b8
[Deprecation] Remove mean pooling default for `Qwen2EmbeddingModel` (…
DarkLight1337 May 30, 2025
6acb7a6
[Misc]Fix benchmarks/README.md for speculative decoding (#18897)
rabi May 30, 2025
8f8900c
[doc] add mkdocs doc (#18930)
reidliu41 May 30, 2025
c3bb9f2
[Model] Use in-place adds in SigLIP (#18922)
lgeiger May 30, 2025
5f1d0c8
[Bugfix][Failing Test] Fix test_vllm_port.py (#18618)
rabi May 30, 2025
4577fc9
[Misc]Fix typo (#18947)
Always-Naive May 30, 2025
fba02e3
[Bugfix][TPU] Fix tpu model runner testcase failure (#18810)
CAROLZXYZXY May 30, 2025
43ff405
[CI/Build] remove regex from build dependencies (#18945)
dtrifiro May 30, 2025
e1fadf1
[Feature] minicpm eagle support (#18943)
huangyuxiang03 May 30, 2025
ec6833c
[doc] show the count for fork and watch (#18950)
reidliu41 May 30, 2025
b29ca5c
[Docs] Update SECURITY.md with link to our security guide (#18961)
russellb May 30, 2025
84ec470
Improve "failed to get the hash of the compiled graph" error (#18956)
zou3519 May 30, 2025
2dbe8c0
[Perf] API-server scaleout with many-to-many server-engine comms (#1…
njhill May 30, 2025
f49239c
Benchmark script for fp8 vs bf16 gemm (#17126)
mgoin May 30, 2025
5a86416
[VLM] Add PP support and fix GPTQ inference for Ovis models (#18958)
Isotr0py May 30, 2025
7f21e80
[Misc] add group_size is -1 in awq quantization (#18910)
lengrongfu May 30, 2025
1dab4d5
Tool parser regex timeout handling (#18960)
wseaton May 30, 2025
0f71e24
[Docs] Correct multiprocessing design doc (#18964)
lgeiger May 31, 2025
7782464
create util function for batched arange (#18937)
yuguo68 May 31, 2025
dff80b0
[Frontend] Add rerank support to run_batch endpoint (#16278)
pooyadavoodi May 31, 2025
1e12352
[Misc] Fix estimated max model len msg (#18966)
sarckk May 31, 2025
ba5111f
[Bugfix]: Fix the incompatibility issue with Structured Outputs when …
chaunceyjiang May 31, 2025
b8b9047
fix security issue of logging llm output (#18980)
luccafong May 31, 2025
2a50ef5
[Neuron] Add Multi-Modal model support for Neuron (#18921)
aws-satyajith May 31, 2025
749f5bd
[doc] fix the list rendering issue - security.md (#18982)
reidliu41 May 31, 2025
c55d804
[BugFix] Pydantic part 2 (#18911)
ProExpertProg May 31, 2025
0f5e0d5
[FEAT][ROCm] Add AITER grouped topk for DeepSeekV2 (#18825)
vllmellm May 31, 2025
f2c3f66
[Bugfix] Fix for issue 17396 (#18773)
frreiss May 31, 2025
306d604
[ROCm][Kernel] Add gfx950 support for skinny gemms (#18010)
charlifu May 31, 2025
8bf507d
[P/D] NixlConnector use cache device index for memory registration (#…
ptarasiewiczNV May 31, 2025
9a1b9b9
[BugFix] Fix multi-node offline data-parallel (#18981)
njhill May 31, 2025
20079c6
[Misc] add return token strs for tokenize (#18941)
reidliu41 May 31, 2025
bbfa0c6
[Misc][Benchmark] Add support for CustomDataset (#18511)
ekagra-ranjan May 31, 2025
1bc86a3
[Bugfix] Fix EAGLE3 broken logits (#18909)
benchislett Jun 1, 2025
6aa8f9a
[Core] Rework dtype resolution (#18751)
DarkLight1337 Jun 1, 2025
a35ca76
[LoRA] Support dynamically initialize `packed_modules_mapping` for VL…
Isotr0py Jun 1, 2025
c594cbf
[doc] small fix - mkdocs (#18996)
reidliu41 Jun 1, 2025
2ad6194
Let max_num_batched_tokens use human_readable_int for large numbers (…
mgoin Jun 1, 2025
aa54a7b
[BugFix] fix data parallel construct ipv6 url address (#18991)
lengrongfu Jun 1, 2025
2b102d5
[BugFix] Fix incorrect metrics shutdown error log message (#18992)
njhill Jun 1, 2025
432ec99
[doc] wrong output (#19000)
reidliu41 Jun 1, 2025
d6fd3a3
[Misc] reuse num_tokens_across_dp of get_dp_padding to avoid unnecess…
izhuhaoran Jun 1, 2025
b9f61e1
[Bugfix][Nixl] Fix DP Metadata Handshake (#19008)
robertgshaw2-redhat Jun 2, 2025
9760fd8
[Core] Support inplace model weights loading (#18745)
22quinn Jun 2, 2025
5b168b6
[doc] add pytest tips (#19010)
reidliu41 Jun 2, 2025
ebb1ec9
[Model] enable data parallel for Llama4 vision encoder (#18368)
jennyyyyzhen Jun 2, 2025
20133cf
[Frontend] enable custom logging for the uvicorn server (OpenAI API s…
fpaupier Jun 2, 2025
ca2f6b9
[Bugfix][Model] Attempt to fix eagle in V0. (#18978)
gshtras Jun 2, 2025
c57d577
add an absolute path for run.sh (#18258)
calvin0327 Jun 2, 2025
9112b44
[Hardware][TPU] Initial support of model parallelism with single work…
lsy323 Jun 3, 2025
5bc1ad6
[Doc] Remove duplicate TOCs during MkDocs migration (#19021)
Zerohertz Jun 3, 2025
8a57872
[Bugfix][EP+DP] Use pplx-kernel internode instead of intranode (#19034)
tlrmchlsmth Jun 3, 2025
4ce42f9
Adding "LoRA Test %N" to AMD production tests (#18929)
Concurrensee Jun 3, 2025
8655f47
[CPU][CI] Re-enable the CPU CI tests (#19046)
bigPYJ1151 Jun 3, 2025
9e6f61e
[ROCm][Build] Clean up the ROCm build (#19040)
gshtras Jun 3, 2025
bdce64f
[V1] Support DP with Ray (#18779)
ruisearch42 Jun 3, 2025
1282bd8
Add tarsier model support (#18985)
princepride Jun 3, 2025
17430e3
[bugfix] small fix logic issue (#18999)
reidliu41 Jun 3, 2025
cc97728
Reduce logs in CLI scripts and plugin loader (#18970)
mgoin Jun 3, 2025
d32aa2e
[Bugfix] Use cmake 3.26.1 instead of 3.26 to avoid build failure (#19…
houseroad Jun 3, 2025
f32fcd9
[v1][KVCacheManager] Rename BlockHashType to BlockHash (#19015)
heheda12345 Jun 3, 2025
6d18ed2
Update docker docs with ARM CUDA cross-compile (#19037)
mgoin Jun 3, 2025
42243fb
[Doc] Add InternVL LoRA support (#19055)
jeejeelee Jun 3, 2025
ec2dcd8
[Misc] Update `WeightsMapper` for qwen2-vl/qwen2.5-vl (#19054)
Isotr0py Jun 3, 2025
118ff92
[Doc] Update V1 user guide for embedding and enc-dec models (#19060)
DarkLight1337 Jun 3, 2025
4e88723
[doc] clarify windows support (#19088)
youkaichao Jun 3, 2025
4e68ae5
[CI/Build] Remove V0 LoRA test (#19066)
jeejeelee Jun 3, 2025
476844d
Fix underscores in dict keys passed via CLI (#19030)
hmellor Jun 3, 2025
d81edde
[Bugfix] disable processor cache (#19068)
zucchini-nlp Jun 3, 2025
d00dd65
[Doc] Improve the Pull Request template with key components (#19086)
houseroad Jun 3, 2025
4b7817c
[Misc] Add missing `_Backend` enums (#19081)
NickLucche Jun 3, 2025
d054da1
[Misc] fix: add missing best_of param validation (#18555)
googs1025 Jun 3, 2025
02f0c7b
[Misc] Add SPDX-FileCopyrightText (#19100)
simon-mo Jun 3, 2025
19bdaf3
[Doc] Readme standardization (#18695)
SorenDreano Jun 3, 2025
01eee40
[doc] update docker version (#19074)
reidliu41 Jun 3, 2025
fa98d77
[Kernel] DeepEP dispatch-combine kernel integration (#18434)
varun-sundar-rabindranath Jun 3, 2025
bdf1396
[V1] Support cross-layer KV sharing (#18212)
sarckk Jun 3, 2025
e31446b
[Perf] Tune `scaled_fp8_quant` by increasing vectorization (#18844)
mgoin Jun 3, 2025
6865fe0
Fix interaction between `Optional` and `Annotated` in CLI typing (#19…
hmellor Jun 3, 2025
6cac54f
[v1] Re-init input batch for multiple kv cache groups (#18654)
heheda12345 Jun 3, 2025
135cf55
[V1][Spec Decode][Ngram] 1.35x gain -> 1.95x gain on InstructCoder wi…
ekagra-ranjan Jun 3, 2025
b5fd950
[Bugfix] get_num_blocks_to_allocate with null_block (#19031)
heheda12345 Jun 3, 2025
4de790f
[Bugfix]: Fix the incompatibility issue with tool_choice 'required' w…
chaunceyjiang Jun 3, 2025
5d96533
[Bugfix][P/D] Fix Prefix Cache Bug (#18411)
NickLucche Jun 3, 2025
a8da78e
[Bugfix] Max concurrency estimation and check_enough_kv_cache_memory …
heheda12345 Jun 4, 2025
b712be9
feat: add data parallel rank to KVEventBatch (#18925)
PeaBrane Jun 4, 2025
abd7df2
[Misc] Fix path and python alias errors in disagg_prefill examples (#…
Jeffwan Jun 4, 2025
52dceb1
[Docs] Add developer doc about CI failures (#18782)
russellb Jun 4, 2025
4555143
[CPU] V1 support for the CPU backend (#16441)
bigPYJ1151 Jun 4, 2025
1409ef9
[Core] Cast multimodal input in hf processor (#18862)
lgeiger Jun 4, 2025
5d6d1ad
[KERNEL] Sampler. CUDA kernel for applying repetition penalty (#18437)
vadiklyutiy Jun 4, 2025
8d646c2
[Cleanup][v1]: remove guided-decoding-backend for example (#19059)
calvin0327 Jun 4, 2025
41aa578
[NVIDIA] Add Cutlass MLA backend (#17625)
kaixih Jun 4, 2025
b124e10
[Bugfix] Fix FA3 full cuda graph correctness (#19106)
WoosukKwon Jun 4, 2025
3336c8c
Fix #19130 (#19132)
princepride Jun 4, 2025
8e972d9
[TPU] Skip hanging tests (#19115)
lsy323 Jun 4, 2025
2669a0d
Fix ValueError: Missing value for tag key(s): model_name,engine. (#19…
eicherseiji Jun 4, 2025
8711bc5
[Misc] Add packages for benchmark as extra dependency (#19089)
Isotr0py Jun 4, 2025
35cf32d
Improve the output precision of embedding models (#19092)
noooop Jun 4, 2025
01dc9a7
[CI/Build][Bugfix] Ensure compatibility with transformers 4.52 (#18678)
DarkLight1337 Jun 4, 2025
02658c2
Add DeepSeek-R1-0528 function call chat template (#18874)
Xu-Wenqing Jun 4, 2025
5f2cd25
Sm100 blockwise fp8 swap ab (#18564)
IwakuraRein Jun 4, 2025
8f4ffbd
[Doc] Update V1 Guide for embedding models (#19141)
DarkLight1337 Jun 4, 2025
c8dcc15
Allow AsyncLLMEngine.generate to target a specific DP rank (#19102)
jmswen Jun 4, 2025
d459fae
[Bugfix][EP+DP] Fix internode check (#19112)
tlrmchlsmth Jun 4, 2025
53a5a0c
[Perf] Tunings for SM100 FP8 CUTLASS kernel (#18778)
mgoin Jun 4, 2025
7ee2590
[TPU] Update dynamo dump file name in compilation test (#19108)
lsy323 Jun 4, 2025
ef3f98b
[Bugfix] fix v1 cpu worker fails on macOS (#19121)
kebe7jun Jun 4, 2025
c3fd4d6
[Kernel] Integrate batched/masked deepgemm kernel (#19111)
varun-sundar-rabindranath Jun 4, 2025
23027e2
[Misc] refactor: simplify EngineCoreClient.make_async_mp_client in As…
googs1025 Jun 4, 2025
b2fac67
[P/D] Heterogeneous TP (#18833)
NickLucche Jun 4, 2025
78dcf56
[doc] small fix (#19167)
reidliu41 Jun 5, 2025
c56ed8b
[Bugfix][Nixl] Fix full prefix cache hit bug (#18632)
robertgshaw2-redhat Jun 5, 2025
a408820
[Bugfix] Fix port handling in make_zmq_path (#19117)
mgoin Jun 5, 2025
25b918e
[Torch Nightly]add missing dependency (#18770)
yangw-dev Jun 5, 2025
0678b52
Handle non-serializable objects when dumping benchmark results (#19114)
huydhn Jun 5, 2025
af7fc84
[BugFix][Minor] Fix full cuda graph bug when max_num_seqs < 512 (#19171)
WoosukKwon Jun 5, 2025
8fc5750
[Bugfix]: Fix the incompatibility issue with stream when Thinking is …
chaunceyjiang Jun 5, 2025
da40380
[Build] Annotate wheel and container path for release workflow (#19162)
simon-mo Jun 5, 2025
1809308
[Misc] Remove unnecessary fallback to prefill-decode attention (#19138)
vllmellm Jun 5, 2025
188a459
[Misc] Do not override NCCL_CUMEM_ENABLE if set explicitly (#19105)
22quinn Jun 5, 2025
1aeb925
[Frontend] improve vllm run-batch --help display (#19187)
reidliu41 Jun 5, 2025
9bc8bb0
[Bugfix] properly catch PIL-related errors for vision models when inc…
gcalmettes Jun 5, 2025
f20f9f0
[mistral_common] Add v11 tokenizer (#19193)
patrickvonplaten Jun 5, 2025
ec89524
Add H20-3e fused MoE kernel tuning configs for DeepSeek-R1/V3 (#19205)
Xu-Wenqing Jun 5, 2025
61059be
[Hardware][NVIDIA] FP4 MoE kernel optimization (#19110)
dubcyfor3 Jun 5, 2025
85e2b7b
[MISC][Bugfix] Use less CPU when message queue has been empty for som…
p12tic Jun 5, 2025
9ef9173
[P/D][NixlConnector] Enable FlashInfer backend (#19090)
NickLucche Jun 5, 2025
aa49f14
[Quantization] Skip Fp4 Test for `compressed-tensors` (#19217)
dsikka Jun 5, 2025
8736030
[V1] Use FlashInfer by default on Blackwell GPUs (#19118)
mgoin Jun 5, 2025
cb6d572
[Model] NemotronH support (#18863)
vegaluisjose Jun 5, 2025
c8134be
Fix AOPerModuleConfig name changes (#18869)
jerryzh168 Jun 6, 2025
3465b87
[Bugfix] Fix EAGLE vocab embedding construction for Llama 70B (#19033)
benchislett Jun 6, 2025
f8a1a2d
[v1] Hybrid Memory Allocator (#17996)
heheda12345 Jun 6, 2025
b61dc5f
[TPU] update torch_xla pin (#19231)
yaochengji Jun 6, 2025
3da2313
Support allowed_token_ids in ChatCompletionRequest (#19143)
xu-song Jun 6, 2025
91a2ef9
[Chore] update CODEOWNERS (#19247)
aarnphm Jun 6, 2025
90b78ec
[v1][P/D] Fix an edge case in kv cache schedule (#19182)
KingsleyZhang123 Jun 6, 2025
0d49483
[TPU] fix kv cache dtype in model runner (#19244)
yaochengji Jun 6, 2025
9487035
[Quantization] Bump compressed-tensors version; update NVFP4A16 test …
dsikka Jun 6, 2025
65c6944
[Docs] Improve V1 KVConnector interface documentation (#19172)
njhill Jun 6, 2025
da511d5
Fix CompilationConfig repr (#19091)
zou3519 Jun 6, 2025
f168b85
Unit Test for run_dp_sharded_vision_model (#19103)
cryptopic Jun 6, 2025
7661e92
[Model] Optimize nemotron_h implementation (#19249)
jeejeelee Jun 6, 2025
0efcb19
Merge remote-tracking branch 'upstream/main' into habana_main
michalkuligowski Jun 6, 2025
842ece6
Fix return type in sampling_metadata.py
michalkuligowski Jun 6, 2025
da42d35
Fix get_bin_counts_and mask return type sampler.py
michalkuligowski Jun 6, 2025
5a8b573
Update base.py
michalkuligowski Jun 6, 2025
15939ce
Update base.py
michalkuligowski Jun 6, 2025
29d10b0
Update base.py
michalkuligowski Jun 6, 2025
39e7569
Update hpu_model_runner.py
michalkuligowski Jun 6, 2025
634baeb
Update hpu_model_runner.py
michalkuligowski Jun 6, 2025
ee7a9f2
Bring back general triton config for hpu in triton_flash_attention.py
michalkuligowski Jun 6, 2025
f649c0d
Update triton_flash_attention.py
michalkuligowski Jun 6, 2025
89627ff
Fix return type for hpu versions of forward in rotary_embedding.py
michalkuligowski Jun 8, 2025
e43ff48
Update gpt_bigcode.py
michalkuligowski Jun 8, 2025
3024c88
Update granite.py
michalkuligowski Jun 8, 2025
d903b02
Update llama.py
michalkuligowski Jun 8, 2025
ff2fb41
Update mixtral.py
michalkuligowski Jun 8, 2025
43e3202
Fixed dtype for tuple in mllama.py
michalkuligowski Jun 8, 2025
6130737
Bring back missing param in invocation to metadata structure in hpu_m…
michalkuligowski Jun 8, 2025
d63cf3e
Update hpu_model_runner.py
michalkuligowski Jun 8, 2025
c94be5b
Update hpu_model_runner.py
michalkuligowski Jun 8, 2025
52ac8f4
Fix forward for hpu_attn.py
michalkuligowski Jun 8, 2025
e44d9ab
Update llama.py
michalkuligowski Jun 9, 2025
168bf4d
Update fp8_utils.py
michalkuligowski Jun 9, 2025
cbe9b84
Update mllama.py
michalkuligowski Jun 9, 2025
0a0d780
Remove deprecated typing classes usage
michalkuligowski Jun 9, 2025
dc5bd68
Merge branch 'habana_main' into rebase_06_06
michalkuligowski Jun 9, 2025
98c7567
Fix after rebase core_client.py
michalkuligowski Jun 9, 2025
ef86c02
Merge branch 'habana_main' into rebase_06_06
michalkuligowski Jun 9, 2025
4e57516
Merge branch 'habana_main' into rebase_06_06
michalkuligowski Jun 9, 2025
b7d548d
Update fused_moe.py
michalkuligowski Jun 17, 2025
0d5584c
Update fused_moe.py
michalkuligowski Jun 17, 2025
b80e516
Merge branch 'habana_main' into rebase_06_06
michalkuligowski Jun 17, 2025
cbf152d
Update hpu_model_runner.py
michalkuligowski Jun 17, 2025
a478d5f
Update hpu_attn.py
michalkuligowski Jun 17, 2025
6038880
Update block_pool.py
michalkuligowski Jun 17, 2025
8d79007
Update hpu_model_runner.py
michalkuligowski Jun 17, 2025
f15d53e
Update mixtral.py
michalkuligowski Jun 17, 2025
349f7e5
Update hpu_attn.py
michalkuligowski Jun 17, 2025
565245f
Update qwen2_5_vl.py
michalkuligowski Jun 17, 2025
f6d0f6a
Fix for missing enum in hpu_model_runner.py
michalkuligowski Jun 17, 2025
65acff2
Update hpu_model_runner.py
michalkuligowski Jun 17, 2025
5103668
Indentation issue qwen2_5_vl.py...
michalkuligowski Jun 17, 2025
Files changed
21 changes: 13 additions & 8 deletions .buildkite/check-wheel-size.py
@@ -1,4 +1,5 @@
 # SPDX-License-Identifier: Apache-2.0
+# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
 
 import os
 import sys
@@ -8,12 +9,12 @@
 # Note that we have 400 MiB quota, please use it wisely.
 # See https://github.com/pypi/support/issues/3792 .
 # Please also sync the value with the one in Dockerfile.
-VLLM_MAX_SIZE_MB = int(os.environ.get('VLLM_MAX_SIZE_MB', 400))
+VLLM_MAX_SIZE_MB = int(os.environ.get("VLLM_MAX_SIZE_MB", 400))
 
 
 def print_top_10_largest_files(zip_file):
     """Print the top 10 largest files in the given zip file."""
-    with zipfile.ZipFile(zip_file, 'r') as z:
+    with zipfile.ZipFile(zip_file, "r") as z:
         file_sizes = [(f, z.getinfo(f).file_size) for f in z.namelist()]
         file_sizes.sort(key=lambda x: x[1], reverse=True)
         for f, size in file_sizes[:10]:
@@ -28,14 +29,18 @@ def check_wheel_size(directory):
                 wheel_path = os.path.join(root, file_name)
                 wheel_size_mb = os.path.getsize(wheel_path) / (1024 * 1024)
                 if wheel_size_mb > VLLM_MAX_SIZE_MB:
-                    print(f"Not allowed: Wheel {wheel_path} is larger "
-                          f"({wheel_size_mb:.2f} MB) than the limit "
-                          f"({VLLM_MAX_SIZE_MB} MB).")
+                    print(
+                        f"Not allowed: Wheel {wheel_path} is larger "
+                        f"({wheel_size_mb:.2f} MB) than the limit "
+                        f"({VLLM_MAX_SIZE_MB} MB)."
+                    )
                     print_top_10_largest_files(wheel_path)
                     return 1
                 else:
-                    print(f"Wheel {wheel_path} is within the allowed size "
-                          f"({wheel_size_mb:.2f} MB).")
+                    print(
+                        f"Wheel {wheel_path} is within the allowed size "
+                        f"({wheel_size_mb:.2f} MB)."
+                    )
     return 0
 
 
@@ -45,4 +50,4 @@ def check_wheel_size(directory):
         sys.exit(1)
 
     directory = sys.argv[1]
-    sys.exit(check_wheel_size(directory))
\ No newline at end of file
+    sys.exit(check_wheel_size(directory))
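The script reads its limit from the `VLLM_MAX_SIZE_MB` environment variable (defaulting to 400) and exits non-zero when a wheel is too large. A minimal usage sketch; the 300 MB override and the `dist/` directory are hypothetical examples, not values from this PR:

```python
# Sketch only: the 300 MB limit and the dist/ path are illustrative assumptions.
import os
import subprocess

env = dict(os.environ, VLLM_MAX_SIZE_MB="300")  # tighten the default 400 MB limit
result = subprocess.run(
    ["python", ".buildkite/check-wheel-size.py", "dist/"],  # directory is sys.argv[1]
    env=env,
)
print("exit code:", result.returncode)  # 0 = within limit, 1 = a wheel is too large
```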
5 changes: 3 additions & 2 deletions .buildkite/generate_index.py
@@ -1,4 +1,5 @@
 # SPDX-License-Identifier: Apache-2.0
+# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
 
 import argparse
 import os
@@ -22,5 +23,5 @@
     print(f"Generated index.html for {args.wheel}")
     # cloudfront requires escaping the '+' character
     f.write(
-        template.format(wheel=filename,
-                        wheel_html_escaped=filename.replace("+", "%2B")))
+        template.format(wheel=filename, wheel_html_escaped=filename.replace("+", "%2B"))
+    )
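The second hunk keeps the CloudFront workaround noted in the comment: a literal `+` in a wheel filename must be percent-encoded in the generated link. A quick standalone check of the transformation; the wheel filename below is a made-up example:

```python
# Hypothetical wheel filename; only the replace() call mirrors the diff above.
filename = "vllm-0.9.0+cu128-cp38-abi3-manylinux1_x86_64.whl"
wheel_html_escaped = filename.replace("+", "%2B")
print(wheel_html_escaped)  # vllm-0.9.0%2Bcu128-cp38-abi3-manylinux1_x86_64.whl
```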
11 changes: 11 additions & 0 deletions .buildkite/lm-eval-harness/configs/Llama-3.2-1B-Instruct-FP8.yaml
@@ -0,0 +1,11 @@
+# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m RedHatAI/Llama-3.2-1B-Instruct-FP8 -b "auto" -l 1319 -f 5 -t 1
+model_name: "RedHatAI/Llama-3.2-1B-Instruct-FP8"
+tasks:
+- name: "gsm8k"
+  metrics:
+  - name: "exact_match,strict-match"
+    value: 0.335
+  - name: "exact_match,flexible-extract"
+    value: 0.323
+limit: 1319
+num_fewshot: 5
11 changes: 11 additions & 0 deletions .buildkite/lm-eval-harness/configs/Qwen2.5-1.5B-Instruct.yaml
@@ -0,0 +1,11 @@
+# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m Qwen/Qwen2.5-1.5B-Instruct -b auto -l 1319 -f 5 -t 1
+model_name: "Qwen/Qwen2.5-1.5B-Instruct"
+tasks:
+- name: "gsm8k"
+  metrics:
+  - name: "exact_match,strict-match"
+    value: 0.54
+  - name: "exact_match,flexible-extract"
+    value: 0.59
+limit: 1319
+num_fewshot: 5
11 changes: 11 additions & 0 deletions .buildkite/lm-eval-harness/configs/Qwen2.5-VL-3B-Instruct-FP8-dynamic.yaml
@@ -0,0 +1,11 @@
+# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m RedHatAI/Qwen2.5-VL-3B-Instruct-FP8-Dynamic -b auto -l 1319 -f 5 -t 1
+model_name: "RedHatAI/Qwen2.5-VL-3B-Instruct-FP8-Dynamic"
+tasks:
+- name: "gsm8k"
+  metrics:
+  - name: "exact_match,strict-match"
+    value: 0.47
+  - name: "exact_match,flexible-extract"
+    value: 0.64
+limit: 1319
+num_fewshot: 5
1 change: 1 addition & 0 deletions .buildkite/lm-eval-harness/configs/models-large.txt
@@ -3,3 +3,4 @@ Meta-Llama-3-70B-Instruct.yaml
 Mixtral-8x7B-Instruct-v0.1.yaml
 Qwen2-57B-A14-Instruct.yaml
 DeepSeek-V2-Lite-Chat.yaml
+Meta-Llama-3-8B-QQQ.yaml
8 changes: 2 additions & 6 deletions .buildkite/lm-eval-harness/configs/models-small.txt
@@ -1,10 +1,6 @@
-Meta-Llama-3-8B-Instruct.yaml
-Meta-Llama-3-8B-Instruct-FP8-compressed-tensors.yaml
+Qwen2.5-1.5B-Instruct.yaml
 Meta-Llama-3.2-1B-Instruct-INT8-compressed-tensors.yaml
-Meta-Llama-3-8B-Instruct-INT8-compressed-tensors-asym.yaml
-Meta-Llama-3-8B-Instruct-nonuniform-compressed-tensors.yaml
-Meta-Llama-3-8B-Instruct-Channelwise-compressed-tensors.yaml
+Qwen2.5-VL-3B-Instruct-FP8-dynamic.yaml
 Qwen1.5-MoE-W4A16-compressed-tensors.yaml
-Qwen2-1.5B-Instruct-INT8-compressed-tensors.yaml
 Qwen2-1.5B-Instruct-FP8W8.yaml
 Meta-Llama-3-8B-QQQ.yaml
44 changes: 44 additions & 0 deletions .buildkite/lm-eval-harness/conftest.py
@@ -0,0 +1,44 @@
+# SPDX-License-Identifier: Apache-2.0
+# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
+from pathlib import Path
+
+import pytest
+
+
+def pytest_addoption(parser):
+    parser.addoption(
+        "--config-list-file",
+        action="store",
+        help="Path to the file listing model config YAMLs (one per line)",
+    )
+    parser.addoption(
+        "--tp-size",
+        action="store",
+        default="1",
+        help="Tensor parallel size to use for evaluation",
+    )
+
+
+@pytest.fixture(scope="session")
+def config_list_file(pytestconfig, config_dir):
+    rel_path = pytestconfig.getoption("--config-list-file")
+    return config_dir / rel_path
+
+
+@pytest.fixture(scope="session")
+def tp_size(pytestconfig):
+    return pytestconfig.getoption("--tp-size")
+
+
+def pytest_generate_tests(metafunc):
+    if "config_filename" in metafunc.fixturenames:
+        rel_path = metafunc.config.getoption("--config-list-file")
+        config_list_file = Path(rel_path).resolve()
+        config_dir = config_list_file.parent
+        with open(config_list_file, encoding="utf-8") as f:
+            configs = [
+                config_dir / line.strip()
+                for line in f
+                if line.strip() and not line.startswith("#")
+            ]
+        metafunc.parametrize("config_filename", configs)
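The `pytest_generate_tests` hook turns every entry in the `--config-list-file` into its own parametrized test case, skipping blank lines and `#` comments. The filtering can be sanity-checked standalone; the inlined list below stands in for the file contents and reuses names from the configs added in this PR:

```python
# Standalone sketch of the config-list filtering; entries are inlined
# instead of read from a file.
from pathlib import Path

config_dir = Path(".buildkite/lm-eval-harness/configs")
lines = ["# model configs", "", "Qwen2.5-1.5B-Instruct.yaml", "Meta-Llama-3-8B-QQQ.yaml"]
configs = [
    config_dir / line.strip()
    for line in lines
    if line.strip() and not line.startswith("#")
]
print(configs)  # two YAML paths; the comment and the blank line are dropped
```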
59 changes: 0 additions & 59 deletions .buildkite/lm-eval-harness/run-tests.sh

This file was deleted.

62 changes: 24 additions & 38 deletions .buildkite/lm-eval-harness/test_lm_eval_correctness.py
@@ -1,69 +1,55 @@
 # SPDX-License-Identifier: Apache-2.0
+# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
 """
 LM eval harness on model to compare vs HF baseline computed offline.
 Configs are found in configs/$MODEL.yaml
 
-* export LM_EVAL_TEST_DATA_FILE=configs/Meta-Llama-3-70B-Instruct.yaml
-* export LM_EVAL_TP_SIZE=4
-* pytest -s test_lm_eval_correctness.py
+pytest -s -v test_lm_eval_correctness.py \
+    --config-list-file=configs/models-small.txt \
+    --tp-size=1
 """
 
-import os
-from pathlib import Path
-
 import lm_eval
-import numpy
-import pytest
+import numpy as np
 import yaml
 
 RTOL = 0.08
-TEST_DATA_FILE = os.environ.get(
-    "LM_EVAL_TEST_DATA_FILE",
-    ".buildkite/lm-eval-harness/configs/Meta-Llama-3-8B-Instruct.yaml")
-
-TP_SIZE = os.environ.get("LM_EVAL_TP_SIZE", 1)
 
 
-def launch_lm_eval(eval_config):
-    trust_remote_code = eval_config.get('trust_remote_code', False)
-
-    model_args = f"pretrained={eval_config['model_name']}," \
-                 f"tensor_parallel_size={TP_SIZE}," \
-                 f"add_bos_token=true," \
-                 f"trust_remote_code={trust_remote_code}"
-
+def launch_lm_eval(eval_config, tp_size):
+    trust_remote_code = eval_config.get("trust_remote_code", False)
+    model_args = (
+        f"pretrained={eval_config['model_name']},"
+        f"tensor_parallel_size={tp_size},"
+        f"enforce_eager=true,"
+        f"add_bos_token=true,"
+        f"trust_remote_code={trust_remote_code}"
+    )
     results = lm_eval.simple_evaluate(
         model="vllm",
         model_args=model_args,
         tasks=[task["name"] for task in eval_config["tasks"]],
         num_fewshot=eval_config["num_fewshot"],
         limit=eval_config["limit"],
-        batch_size="auto")
-
+        batch_size="auto",
+    )
     return results
 
 
-def test_lm_eval_correctness():
-    eval_config = yaml.safe_load(
-        Path(TEST_DATA_FILE).read_text(encoding="utf-8"))
-
-    if eval_config[
-            "model_name"] == "nm-testing/Meta-Llama-3-70B-Instruct-FBGEMM-nonuniform":  #noqa: E501
-        pytest.skip("FBGEMM is currently failing on main.")
+def test_lm_eval_correctness_param(config_filename, tp_size):
+    eval_config = yaml.safe_load(config_filename.read_text(encoding="utf-8"))
 
     # Launch eval requests.
-    results = launch_lm_eval(eval_config)
+    results = launch_lm_eval(eval_config, tp_size)
 
     # Confirm scores match ground truth.
     success = True
     for task in eval_config["tasks"]:
         for metric in task["metrics"]:
             ground_truth = metric["value"]
             measured_value = results["results"][task["name"]][metric["name"]]
-            print(f'{task["name"]} | {metric["name"]}: '
-                  f'ground_truth={ground_truth} | measured={measured_value}')
-            success = success and numpy.isclose(
-                ground_truth, measured_value, rtol=RTOL)
+            print(
+                f"{task['name']} | {metric['name']}: "
+                f"ground_truth={ground_truth} | measured={measured_value}"
+            )
+            success = success and np.isclose(ground_truth, measured_value, rtol=RTOL)
 
     # Assert at the end, print all scores even on failure for debugging.
     assert success
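The assertion uses a relative tolerance rather than exact equality: with `RTOL = 0.08`, a measured score passes if it is within roughly 8% of the expected value (scaled by the measured score, since it is the second argument to `np.isclose`). A small illustration with made-up scores:

```python
# Made-up scores illustrating the RTOL = 0.08 pass criterion used above.
import numpy as np

RTOL = 0.08
ground_truth = 0.54
print(np.isclose(ground_truth, 0.52, rtol=RTOL))  # True: |0.54 - 0.52| <= 0.08 * 0.52
print(np.isclose(ground_truth, 0.49, rtol=RTOL))  # False: 0.05 > 0.08 * 0.49
```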
2 changes: 1 addition & 1 deletion .buildkite/nightly-benchmarks/README.md
@@ -113,7 +113,7 @@ WARNING: The benchmarking script will save json results by itself, so please do
 
 ### Visualizing the results
 
-The `convert-results-json-to-markdown.py` helps you put the benchmarking results inside a markdown table, by formatting [descriptions.md](tests/descriptions.md) with real benchmarking results.
+The `convert-results-json-to-markdown.py` helps you put the benchmarking results inside a markdown table, by formatting [descriptions.md](performance-benchmarks-descriptions.md) with real benchmarking results.
 You can find the result presented as a table inside the `buildkite/performance-benchmark` job page.
 If you do not see the table, please wait till the benchmark finish running.
 The json version of the table (together with the json version of the benchmark) will be also attached to the markdown file.