Skip to content

Commit f9a5f4e

Automate choice of attention block size; update docs
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
1 parent 2f35a02 commit f9a5f4e

File tree

4 files changed: +130 −82 lines


docs/models/supported_models.md

Lines changed: 5 additions & 5 deletions
@@ -319,7 +319,7 @@ Specified using `--task generate`.
| `AquilaForCausalLM` | Aquila, Aquila2 | `BAAI/Aquila-7B`, `BAAI/AquilaChat-7B`, etc. | ✅︎ | ✅︎ | ✅︎ |
| `ArcticForCausalLM` | Arctic | `Snowflake/snowflake-arctic-base`, `Snowflake/snowflake-arctic-instruct`, etc. | | ✅︎ | ✅︎ |
| `BaiChuanForCausalLM` | Baichuan2, Baichuan | `baichuan-inc/Baichuan2-13B-Chat`, `baichuan-inc/Baichuan-7B`, etc. | ✅︎ | ✅︎ | ✅︎ |
-| `BambaForCausalLM` | Bamba | `ibm-ai-platform/Bamba-9B-fp8`, `ibm-ai-platform/Bamba-9B` | ✅︎ | ✅︎ | |
+| `BambaForCausalLM` | Bamba | `ibm-ai-platform/Bamba-9B-fp8`, `ibm-ai-platform/Bamba-9B` | ✅︎ | ✅︎ | ✅︎ |
| `BloomForCausalLM` | BLOOM, BLOOMZ, BLOOMChat | `bigscience/bloom`, `bigscience/bloomz`, etc. | | ✅︎ | |
| `BartForConditionalGeneration` | BART | `facebook/bart-base`, `facebook/bart-large-cnn`, etc. | | | |
| `ChatGLMModel`, `ChatGLMForConditionalGeneration` | ChatGLM | `THUDM/chatglm2-6b`, `THUDM/chatglm3-6b`, `ShieldLM-6B-chatglm3`, etc. | ✅︎ | ✅︎ | ✅︎ |
@@ -335,7 +335,7 @@ Specified using `--task generate`.
| `ExaoneForCausalLM` | EXAONE-3 | `LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct`, etc. | ✅︎ | ✅︎ | ✅︎ |
| `FalconForCausalLM` | Falcon | `tiiuae/falcon-7b`, `tiiuae/falcon-40b`, `tiiuae/falcon-rw-7b`, etc. | | ✅︎ | ✅︎ |
| `FalconMambaForCausalLM` | FalconMamba | `tiiuae/falcon-mamba-7b`, `tiiuae/falcon-mamba-7b-instruct`, etc. | | ✅︎ | ✅︎ |
-| `FalconH1ForCausalLM` | Falcon-H1 | `tiiuae/Falcon-H1-34B-Base`, `tiiuae/Falcon-H1-34B-Instruct`, etc. | ✅︎ | ✅︎ | |
+| `FalconH1ForCausalLM` | Falcon-H1 | `tiiuae/Falcon-H1-34B-Base`, `tiiuae/Falcon-H1-34B-Instruct`, etc. | ✅︎ | ✅︎ | ✅︎ |
| `GemmaForCausalLM` | Gemma | `google/gemma-2b`, `google/gemma-1.1-2b-it`, etc. | ✅︎ | ✅︎ | ✅︎ |
| `Gemma2ForCausalLM` | Gemma 2 | `google/gemma-2-9b`, `google/gemma-2-27b`, etc. | ✅︎ | ✅︎ | ✅︎ |
| `Gemma3ForCausalLM` | Gemma 3 | `google/gemma-3-1b-it`, etc. | ✅︎ | ✅︎ | ✅︎ |
@@ -348,7 +348,7 @@ Specified using `--task generate`.
| `GPTNeoXForCausalLM` | GPT-NeoX, Pythia, OpenAssistant, Dolly V2, StableLM | `EleutherAI/gpt-neox-20b`, `EleutherAI/pythia-12b`, `OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5`, `databricks/dolly-v2-12b`, `stabilityai/stablelm-tuned-alpha-7b`, etc. | | ✅︎ | ✅︎ |
| `GraniteForCausalLM` | Granite 3.0, Granite 3.1, PowerLM | `ibm-granite/granite-3.0-2b-base`, `ibm-granite/granite-3.1-8b-instruct`, `ibm/PowerLM-3b`, etc. | ✅︎ | ✅︎ | ✅︎ |
| `GraniteMoeForCausalLM` | Granite 3.0 MoE, PowerMoE | `ibm-granite/granite-3.0-1b-a400m-base`, `ibm-granite/granite-3.0-3b-a800m-instruct`, `ibm/PowerMoE-3b`, etc. | ✅︎ | ✅︎ | ✅︎ |
-| `GraniteMoeHybridForCausalLM` | Granite 4.0 MoE Hybrid | `ibm-granite/granite-4.0-tiny-preview`, etc. | ✅︎ | ✅︎ | |
+| `GraniteMoeHybridForCausalLM` | Granite 4.0 MoE Hybrid | `ibm-granite/granite-4.0-tiny-preview`, etc. | ✅︎ | ✅︎ | ✅︎ |
| `GraniteMoeSharedForCausalLM` | Granite MoE Shared | `ibm-research/moe-7b-1b-active-shared-experts` (test model) | ✅︎ | ✅︎ | ✅︎ |
| `GritLM` | GritLM | `parasail-ai/GritLM-7B-vllm`. | ✅︎ | ✅︎ | |
| `Grok1ModelForCausalLM` | Grok1 | `hpcai-tech/grok-1`. | ✅︎ | ✅︎ | ✅︎ |
@@ -367,7 +367,7 @@ Specified using `--task generate`.
| `MixtralForCausalLM` | Mixtral-8x7B, Mixtral-8x7B-Instruct | `mistralai/Mixtral-8x7B-v0.1`, `mistralai/Mixtral-8x7B-Instruct-v0.1`, `mistral-community/Mixtral-8x22B-v0.1`, etc. | ✅︎ | ✅︎ | ✅︎ |
| `MPTForCausalLM` | MPT, MPT-Instruct, MPT-Chat, MPT-StoryWriter | `mosaicml/mpt-7b`, `mosaicml/mpt-7b-storywriter`, `mosaicml/mpt-30b`, etc. | | ✅︎ | ✅︎ |
| `NemotronForCausalLM` | Nemotron-3, Nemotron-4, Minitron | `nvidia/Minitron-8B-Base`, `mgoin/Nemotron-4-340B-Base-hf-FP8`, etc. | ✅︎ | ✅︎ | ✅︎ |
-| `NemotronHForCausalLM` | Nemotron-H | `nvidia/Nemotron-H-8B-Base-8K`, `nvidia/Nemotron-H-47B-Base-8K`, `nvidia/Nemotron-H-56B-Base-8K`, etc. | ✅︎ | ✅︎ | |
+| `NemotronHForCausalLM` | Nemotron-H | `nvidia/Nemotron-H-8B-Base-8K`, `nvidia/Nemotron-H-47B-Base-8K`, `nvidia/Nemotron-H-56B-Base-8K`, etc. | ✅︎ | ✅︎ | ✅︎ |
| `OLMoForCausalLM` | OLMo | `allenai/OLMo-1B-hf`, `allenai/OLMo-7B-hf`, etc. | | ✅︎ | ✅︎ |
| `OLMo2ForCausalLM` | OLMo2 | `allenai/OLMo-2-0425-1B`, etc. | | ✅︎ | ✅︎ |
| `OLMoEForCausalLM` | OLMoE | `allenai/OLMoE-1B-7B-0924`, `allenai/OLMoE-1B-7B-0924-Instruct`, etc. | | ✅︎ | ✅︎ |
@@ -392,7 +392,7 @@ Specified using `--task generate`.
| `XverseForCausalLM` | XVERSE | `xverse/XVERSE-7B-Chat`, `xverse/XVERSE-13B-Chat`, `xverse/XVERSE-65B-Chat`, etc. | ✅︎ | ✅︎ | ✅︎ |
| `MiniMaxM1ForCausalLM` | MiniMax-Text | `MiniMaxAI/MiniMax-M1-40k`, `MiniMaxAI/MiniMax-M1-80k`etc. | | | |
| `MiniMaxText01ForCausalLM` | MiniMax-Text | `MiniMaxAI/MiniMax-Text-01`, etc. | | | |
-| `Zamba2ForCausalLM` | Zamba2 | `Zyphra/Zamba2-7B-instruct`, `Zyphra/Zamba2-2.7B-instruct`, `Zyphra/Zamba2-1.2B-instruct`, etc. | | | |
+| `Zamba2ForCausalLM` | Zamba2 | `Zyphra/Zamba2-7B-instruct`, `Zyphra/Zamba2-2.7B-instruct`, `Zyphra/Zamba2-1.2B-instruct`, etc. | | | ✅︎ |

!!! note
    Currently, the ROCm version of vLLM supports Mistral and Mixtral only for context lengths up to 4096.

docs/usage/v1_guide.md

Lines changed: 12 additions & 3 deletions
@@ -83,7 +83,8 @@ based on assigned priority, with FCFS as a tie-breaker), configurable via the
| **Decoder-only Models** | <nobr>🚀 Optimized</nobr> |
| **Encoder-Decoder Models** | <nobr>🟠 Delayed</nobr> |
| **Embedding Models** | <nobr>🟢 Functional</nobr> |
-| **Mamba Models** | <nobr>🚧 WIP ([PR #19327](https://github.com/vllm-project/vllm/pull/19327))</nobr> |
+| **Mamba Models** | <nobr>🟢 Functional</nobr> |
+| **Hybrid Models** | <nobr>🟢 Functional</nobr> |
| **Multimodal Models** | <nobr>🟢 Functional</nobr> |

vLLM V1 currently excludes model architectures with the `SupportsV0Only` protocol.
@@ -104,8 +105,16 @@ to enable simultaneous generation and embedding using the same engine instance i

#### Mamba Models

-Models using selective state-space mechanisms instead of standard transformer attention (e.g., `MambaForCausalLM`, `JambaForCausalLM`)
-will be supported via [PR #19327](https://github.com/vllm-project/vllm/pull/19327).
+Models using selective state-space mechanisms instead of standard transformer attention are partially supported.
+Models that use Mamba-2 layers (e.g., `Mamba2ForCausalLM`) are supported, but models that use older Mamba-1 layers
+(e.g., `MambaForCausalLM`, `JambaForCausalLM`) are not yet supported. Please note that these models currently require
+enforcing eager mode and disabling prefix caching in V1.
+
+#### Hybrid Models
+
+Models that combine Mamba-2 layers with standard transformer attention layers are supported (e.g., `BambaForCausalLM`,
+`Zamba2ForCausalLM`, `NemotronHForCausalLM`, `FalconH1ForCausalLM` and `GraniteMoeHybridForCausalLM`). Please note that
+these models currently require enforcing eager mode and disabling prefix caching in V1.

#### Encoder-Decoder Models

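The updated guide implies a specific launch configuration for these models. The snippet below is an illustrative sketch, not part of this commit: it shows one way to run a hybrid checkpoint (here `ibm-ai-platform/Bamba-9B`, a placeholder choice) on the V1 engine with eager mode enforced and prefix caching disabled, mirroring the constraints described above.

```python
# Illustrative sketch only; assumes a vLLM build with V1 support and a
# hybrid checkpoint such as ibm-ai-platform/Bamba-9B (placeholder choice).
import os

os.environ["VLLM_USE_V1"] = "1"  # opt in to the V1 engine, as the test below does

from vllm import LLM, SamplingParams

llm = LLM(
    model="ibm-ai-platform/Bamba-9B",
    enforce_eager=True,            # these models currently require eager mode in V1
    enable_prefix_caching=False,   # and prefix caching disabled
)

params = SamplingParams(temperature=0.0, max_tokens=32)
outputs = llm.generate(["The capital of France is"], params)
print(outputs[0].outputs[0].text)
```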
tests/models/language/generation/test_hybrid.py

Lines changed: 1 addition & 15 deletions
@@ -61,14 +61,6 @@
    "tiiuae/Falcon-H1-0.5B-Base",
]

-ATTN_BLOCK_SIZES = {
-    "ibm-ai-platform/Bamba-9B-v1": 528,
-    "Zyphra/Zamba2-1.2B-instruct": 80,
-    "nvidia/Nemotron-H-8B-Base-8K": 528,
-    "ibm-granite/granite-4.0-tiny-preview": 400,
-    "tiiuae/Falcon-H1-0.5B-Base": 800,
-}
-
# Avoid OOM
MAX_NUM_SEQS = 4

@@ -105,11 +97,6 @@ def test_models(
            example_prompts, max_tokens, num_logprobs)

    if model in V1_SUPPORTED_MODELS:
-        if model in HYBRID_MODELS and model in ATTN_BLOCK_SIZES:
-            block_size = ATTN_BLOCK_SIZES[model]
-        else:
-            block_size = 16
-
        with monkeypatch.context() as m:
            m.setenv("VLLM_USE_V1", "1")
            if model in HYBRID_MODELS:
@@ -118,8 +105,7 @@ def test_models(
            with vllm_runner(model,
                             max_num_seqs=MAX_NUM_SEQS,
                             enforce_eager=True,
-                             enable_prefix_caching=False,
-                             block_size=block_size) as vllm_model:
+                             enable_prefix_caching=False) as vllm_model:
                vllm_v1_outputs = vllm_model.generate_greedy_logprobs(
                    example_prompts, max_tokens, num_logprobs)
    else:
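For context on the deleted `ATTN_BLOCK_SIZES` constants, the sketch below is an illustration rather than code from this commit (the engine-side change that automates the choice is not among the diffs shown above). The apparent idea is to pick the smallest attention block size, in multiples of 16 tokens, whose KV-cache page is at least as large as one Mamba state page, so both cache types can share a common page size. The byte counts in the example are assumed values.

```python
# Illustration only -- not from this commit. Rough sketch of the constraint
# the hard-coded block sizes (528, 80, 528, 400, 800) appear to satisfy.
import math


def choose_attn_block_size(mamba_page_bytes: int,
                           attn_bytes_per_token: int,
                           multiple: int = 16) -> int:
    """Smallest multiple of `multiple` tokens whose attention KV-cache page
    covers one Mamba state page (assumed constraint, for illustration)."""
    tokens = math.ceil(mamba_page_bytes / attn_bytes_per_token)
    return multiple * math.ceil(tokens / multiple)


# Assumed numbers: ~4 KiB of KV cache per token and a ~2.1 MiB Mamba state
# page yield a block size of 528 tokens (33 * 16), the value removed for the
# Bamba entry above.
print(choose_attn_block_size(mamba_page_bytes=2_150_000,
                             attn_bytes_per_token=4096))
```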
