
Commit 8b080c3

v1: Add Whisper model support (encoder-decoder)
This brings Whisper support to V1 to close one of the remaining feature gaps with V0. Most of the changes apply to encoder-decoder models generally, though Whisper is the only one explicitly tested and is the only encoder-decoder model updated to support V1.

**Whisper Model Implementation:**
- Remove the SupportsV0Only interface constraint to enable V1 compatibility
- Update get_multimodal_embeddings() to return the list format required by V1

**Flash Attention Backend:**
- Add encoder attention metadata fields (encoder_seq_start_loc, max_encoder_seq_len, cross_slot_mapping)
- Implement encoder self-attention support without using the KV cache
- Add cross-attention support for encoder-decoder models with proper KV cache handling

**KV Cache Manager:**
- Introduce CrossAttentionManager for handling the cross-attention KV cache in encoder-decoder models
- Add CrossAttentionSpec for cross-attention cache specification with encoder-based sizing
- Implement allocate_slots_for_cross_attn() for static, encoder-length-based allocation
- Add cross-attention block allocation logic separate from decoder token growth

**Scheduler:**
- Disable prefix caching for encoder-decoder models
- Implement cross-attention block allocation during request scheduling
- Add cross-attention block tracking in state management

**GPU Model Runner:**
- Add encoder input extraction for audio feature processing
- Implement encoder attention metadata building for both self-attention and cross-attention
- Add cross-attention KV cache group handling with proper slot mapping
- Modify input batch creation to accommodate encoder sequence lengths
- Add encoder input processing in the forward pass with proper device/dtype handling
- Update profiling and memory management for encoder-decoder models

The implementation maintains backward compatibility while adding comprehensive encoder-decoder support, with particular focus on Whisper's audio processing pipeline and the cross-attention mechanism between encoder and decoder.

Related to:
- V0 deprecation: #18571
- 2025 Q3 roadmap: #20336

Signed-off-by: Russell Bryant <rbryant@redhat.com>
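The core idea behind the KV cache manager and scheduler changes is that cross-attention keys and values are computed once from the encoder output and never grow while the decoder generates, so the number of cache blocks a request needs is fixed by the encoder sequence length alone. Below is a minimal sketch of that sizing rule; the class names, the block-pool stand-in, and the 1500-position Whisper encoder length (30 seconds of audio) are illustrative assumptions rather than the actual vLLM V1 API.

```python
# Illustrative sketch only: CrossAttnAllocator and BlockPool are
# placeholder names, not the actual vLLM V1 classes from this commit.
import math
from dataclasses import dataclass, field


@dataclass
class BlockPool:
    """Stand-in for a KV-cache block pool with a fixed block size."""
    block_size: int = 16
    free_blocks: list[int] = field(
        default_factory=lambda: list(range(1024)))

    def allocate(self, n: int) -> list[int]:
        if n > len(self.free_blocks):
            raise RuntimeError("out of KV cache blocks")
        return [self.free_blocks.pop() for _ in range(n)]


class CrossAttnAllocator:
    """Sizes cross-attention KV blocks by encoder length, once per request."""

    def __init__(self, pool: BlockPool) -> None:
        self.pool = pool
        self.blocks_per_request: dict[str, list[int]] = {}

    def allocate_for_request(
        self, request_id: str, encoder_len: int
    ) -> list[int]:
        # Unlike decoder self-attention, this allocation is static: it does
        # not grow as the decoder emits tokens, because cross-attention only
        # ever attends to the fixed-length encoder output.
        num_blocks = math.ceil(encoder_len / self.pool.block_size)
        blocks = self.pool.allocate(num_blocks)
        self.blocks_per_request[request_id] = blocks
        return blocks


# Whisper encodes 30 s of audio into 1500 positions, so with 16-token
# blocks a request needs ceil(1500 / 16) = 94 cross-attention blocks.
allocator = CrossAttnAllocator(BlockPool(block_size=16))
blocks = allocator.allocate_for_request("req-0", encoder_len=1500)
assert len(blocks) == 94
```

Decoder self-attention blocks, by contrast, are allocated incrementally as tokens are generated, which is why the commit keeps the cross-attention allocation path separate from decoder token growth.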
1 parent 0f199f1 commit 8b080c3

34 files changed: +3871 −73 lines

0001-v1-Add-Whisper-model-support.patch

Lines changed: 925 additions & 0 deletions

CLAUDE.md

Lines changed: 186 additions & 0 deletions
@@ -0,0 +1,186 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

vLLM is a high-throughput and memory-efficient inference and serving engine for Large Language Models (LLMs). Originally developed at UC Berkeley's Sky Computing Lab, it's now a community-driven project under the PyTorch Foundation.

**Key Technologies**: Python 3.9-3.12, PyTorch 2.7.0, CUDA/ROCm/XPU/TPU, CMake + C++/CUDA extensions

## Development Commands

### Installation & Setup
```bash
# Development installation
pip install -e .

# Install linting tools and set up pre-commit hooks
pip install -r requirements/lint.txt
pre-commit install
```
### Testing
```bash
# Run basic tests
pytest tests/

# Run with specific markers
pytest -m core_model tests/   # Core model tests (run on every PR)
pytest -m distributed tests/  # Distributed GPU tests
pytest --optional tests/      # Include optional tests

# Run single test file
pytest tests/test_basic_correctness.py

# Test with V1 architecture (default)
pytest tests/

# Exclude legacy tests marked as incompatible with V1
pytest -m "not skip_v1" tests/

# Run tests with verbose output
pytest -v tests/

# Run specific test function
pytest tests/test_basic_correctness.py::test_function_name
```
### Code Quality
```bash
# Run pre-commit hooks manually
pre-commit run --all-files

# Type checking (via tools/mypy.sh)
tools/mypy.sh 0 "local"

# Format code (yapf + ruff)
yapf --in-place --recursive vllm/
ruff check --fix vllm/
```
### Building & Debugging
```bash
# Build documentation
pip install -r docs/requirements.txt
mkdocs serve --dev-addr localhost:8000

# Debug with environment variables
VLLM_LOGGING_LEVEL=DEBUG python ...
CUDA_LAUNCH_BLOCKING=1 python ...  # For CUDA debugging

# Incremental C++/CUDA builds (for kernel development)
python generate_cmake_presets.py
cmake --build build --config DevelARM64
```
## Architecture Overview

### Core Components
- **`vllm/engine/`** - Main inference engines (async/sync LLMEngine)
- **`vllm/model_executor/`** - Model execution, layers, and model implementations
- **`vllm/attention/`** - PagedAttention and memory management
- **`vllm/core/`** - Scheduling and block management interfaces
- **`vllm/worker/`** - Distributed inference workers
- **`vllm/entrypoints/`** - API servers (OpenAI-compatible) and CLI interfaces
- **`vllm/v1/`** - V1 architecture implementation (default)

### Key Features
- **PagedAttention**: Efficient attention key-value memory management
- **Continuous Batching**: Dynamic request batching for throughput
- **Multi-platform**: NVIDIA GPUs, AMD GPUs, Intel CPUs/GPUs, TPU, AWS Neuron
- **Quantization**: GPTQ, AWQ, AutoRound, INT4/INT8/FP8 support
- **Multi-modal**: Text, image, audio, video input processing

### V1 Architecture
vLLM V1 (alpha) provides a 1.7x speedup through architectural improvements:
- Zero-overhead prefix caching
- Enhanced multimodal support
- Optimized execution loop
- Tests marked with `skip_v1` are incompatible with V1

### Model Support
- Transformer-like LLMs (Llama, Mistral, etc.)
- Mixture-of-Experts models (Mixtral, DeepSeek-V2/V3)
- Embedding models (E5-Mistral)
- Multi-modal models (LLaVA)
## Testing Framework

### Test Categories & Markers
- `@pytest.mark.core_model` - Essential tests run on every PR
- `@pytest.mark.distributed` - Multi-GPU distributed tests
- `@pytest.mark.skip_v1` - Legacy tests incompatible with V1
- `@pytest.mark.optional` - Optional tests requiring `--optional` flag
- `@pytest.mark.cpu_model` - CPU-specific model tests
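A minimal, hypothetical example of how these markers are applied in a test; the model name, sampling settings, and assertion are placeholders rather than code from the repository:

```python
# Hypothetical marker usage; not copied from tests/.
import pytest

from vllm import LLM, SamplingParams


@pytest.mark.core_model
def test_greedy_generation_smoke():
    """Greedy decoding on a small model should return non-empty text."""
    llm = LLM(model="facebook/opt-125m")  # placeholder model
    params = SamplingParams(temperature=0.0, max_tokens=8)
    outputs = llm.generate(["Hello, my name is"], params)
    assert outputs[0].outputs[0].text.strip()
```

Running `pytest -m core_model tests/` then selects tests decorated this way.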
### Test Structure
- **`tests/`** - Main test directory
- **Model-specific tests** - Located in `tests/` with descriptive names
- **Distributed tests** - Require multi-GPU setup
- **Correctness tests** - Compare outputs against reference implementations
- **Performance benchmarks** - Located in `benchmarks/`
## Development Workflow

### PR Classification
Use these prefixes for pull requests:
- `[Bugfix]` - Bug fixes
- `[Model]` - New model support
- `[Core]` - Core vLLM changes
- `[Frontend]` - API/entrypoint changes
- `[Kernel]` - CUDA/CPU kernel changes
- `[Build]` - Build system changes
- `[CI]` - CI/CD changes
- `[Doc]` - Documentation updates

### Pre-commit Hooks
Enforced automatically via `.pre-commit-config.yaml`:
- Code formatting (yapf, ruff)
- Type checking (mypy for Python 3.9-3.12)
- Linting (shellcheck, typos, clang-format)
- SPDX header checks
- Import pattern enforcement
- Signed-off-by requirement
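As a concrete example of the SPDX header check, new Python source files are expected to begin with a license identifier comment; assuming the project's Apache-2.0 license, it looks like this:

```python
# SPDX-License-Identifier: Apache-2.0
```

The pre-commit run flags files that are missing such a header.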
### CI/CD Pipeline
Located in `.github/workflows/`:
- `lint-and-deploy.yaml` - Main CI pipeline
- `pre-commit.yml` - Pre-commit validation
- Extensive model compatibility testing
- Multi-platform builds (CUDA, ROCm, CPU, etc.)
- Performance benchmarks (triggered with `perf-benchmarks` label)
## Important Notes

### Code Patterns
- Use existing quantization layers in `vllm/model_executor/layers/`
- Follow attention patterns in `vllm/attention/`
- Model implementations go in `vllm/model_executor/models/`
- Multi-modal processors in `vllm/multimodal/`

### Performance Considerations
- PagedAttention for memory efficiency
- CUDA graphs for optimized execution
- Speculative decoding and chunked prefill support
- Tensor and pipeline parallelism for distributed inference
### V1 Compatibility
When working with models or tests:
- Check for `@pytest.mark.skip_v1` markers
- V1 is the default architecture (alpha release)
- Some legacy functionality may not be V1-compatible

### Architecture Versions
- vLLM has a V0 and a V1 architecture; V0 is deprecated.
- Unless V0 is explicitly specified, assume the V1 code paths are the ones of interest.
- The V1-specific code lives under `vllm/v1/`.
- The V0-specific code paths are scattered throughout `vllm/`.
## Best Practices
- When generating changes, do not generate lines with trailing whitespace
- Lines should be less than 80 characters long
- All environment variables are prefixed with `VLLM_`
- Always run pre-commit hooks before committing
- Include Signed-off-by in commits (DCO requirement)
- For model contributions, include both loading and correctness tests
