# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

vLLM is a high-throughput, memory-efficient inference and serving engine for Large Language Models (LLMs). Originally developed at UC Berkeley's Sky Computing Lab, it is now a community-driven project under the PyTorch Foundation.

**Key Technologies**: Python 3.9-3.12, PyTorch 2.7.0, CUDA/ROCm/XPU/TPU, CMake + C++/CUDA extensions

## Development Commands

### Installation & Setup
```bash
# Development installation
pip install -e .

# Install linting tools and set up pre-commit hooks
pip install -r requirements/lint.txt
pre-commit install
```

### Testing
```bash
# Run basic tests (the V1 architecture is the default)
pytest tests/

# Run with specific markers
pytest -m core_model tests/   # Core model tests (run on every PR)
pytest -m distributed tests/  # Distributed GPU tests
pytest --optional tests/      # Include optional tests

# Run a single test file
pytest tests/test_basic_correctness.py

# Run a specific test function
pytest tests/test_basic_correctness.py::test_function_name

# Deselect tests marked skip_v1 (those tests are incompatible with V1)
pytest -m "not skip_v1" tests/

# Run tests with verbose output
pytest -v tests/
```

### Code Quality
```bash
# Run pre-commit hooks manually
pre-commit run --all-files

# Type checking (via tools/mypy.sh)
tools/mypy.sh 0 "local"

# Format code (yapf + ruff)
yapf --in-place --recursive vllm/
ruff check --fix vllm/
```

### Building & Debugging
```bash
# Build documentation
pip install -r docs/requirements.txt
mkdocs serve --dev-addr localhost:8000

# Debug with environment variables
VLLM_LOGGING_LEVEL=DEBUG python ...
CUDA_LAUNCH_BLOCKING=1 python ...  # For CUDA debugging

# Incremental C++/CUDA builds (for kernel development)
python generate_cmake_presets.py
cmake --build build --config DevelARM64  # config name comes from the generated presets
```

## Architecture Overview

### Core Components
- **`vllm/engine/`** - Main inference engines (async/sync LLMEngine)
- **`vllm/model_executor/`** - Model execution, layers, and model implementations
- **`vllm/attention/`** - PagedAttention and memory management
- **`vllm/core/`** - Scheduling and block management interfaces
- **`vllm/worker/`** - Distributed inference workers
- **`vllm/entrypoints/`** - API servers (OpenAI-compatible) and CLI interfaces
- **`vllm/v1/`** - V1 architecture implementation (default)

### Key Features
- **PagedAttention**: Efficient attention key-value memory management
- **Continuous Batching**: Dynamic request batching for throughput
- **Multi-platform**: NVIDIA GPUs, AMD GPUs, Intel CPUs/GPUs, TPU, AWS Neuron
- **Quantization**: GPTQ, AWQ, AutoRound, INT4/INT8/FP8 support
- **Multi-modal**: Text, image, audio, and video input processing
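
The core PagedAttention idea — storing each sequence's KV cache in fixed-size blocks addressed through a per-sequence block table, so memory is allocated on demand rather than contiguously for a worst-case length — can be illustrated with a toy, stdlib-only sketch (the block size, class, and method names are illustrative, not vLLM's actual allocator):

```python
# Toy sketch of paged KV-cache bookkeeping (illustrative only; not
# vLLM's real implementation). Logical token positions map to
# fixed-size physical blocks via a per-sequence block table.

BLOCK_SIZE = 4  # tokens per block (illustrative; vLLM's default differs)

class BlockAllocator:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # free physical block ids
        self.block_tables = {}               # seq_id -> list of block ids
        self.seq_lens = {}                   # seq_id -> tokens stored

    def append_token(self, seq_id):
        """Reserve space for one more token; return its physical block id."""
        n = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if n % BLOCK_SIZE == 0:          # current block full: grab a new one
            table.append(self.free.pop())
        self.seq_lens[seq_id] = n + 1
        return table[n // BLOCK_SIZE]    # block holding this token's KV

    def free_seq(self, seq_id):
        """Return a finished sequence's blocks to the free pool."""
        self.free.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)
```

Because blocks are handed out one at a time as sequences grow, short sequences never reserve memory for a worst-case length — the essence of PagedAttention's memory efficiency.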

### V1 Architecture
vLLM V1 (alpha) delivers up to a 1.7x speedup through architectural improvements:
- Zero-overhead prefix caching
- Enhanced multimodal support
- Optimized execution loop
- Tests marked `skip_v1` are incompatible with V1
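
Prefix caching can be sketched in the same toy style: full KV blocks are keyed by a hash of the entire token prefix they cover, so a new request sharing a prompt prefix reuses cached blocks instead of recomputing them. The hashing scheme and function names below are illustrative, not V1's actual code:

```python
# Toy prefix-cache sketch (illustrative only). A block is keyed by the
# hash of the whole prefix up to its end, so it is only shared when
# everything before it also matches.
from hashlib import sha256

BLOCK = 4  # tokens per cache block (illustrative)

def block_keys(tokens):
    """One key per *full* block of the token sequence."""
    keys = []
    for end in range(BLOCK, len(tokens) + 1, BLOCK):
        prefix = ",".join(map(str, tokens[:end]))
        keys.append(sha256(prefix.encode()).hexdigest())
    return keys

def cached_prefix_len(cache, tokens):
    """Number of leading tokens whose blocks are already cached."""
    hits = 0
    for key in block_keys(tokens):
        if key not in cache:
            break
        hits += 1
    return hits * BLOCK
```

A second request that shares the first block of tokens with an earlier one would skip recomputing that block's KV values entirely.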

### Model Support
- Transformer-like LLMs (Llama, Mistral, etc.)
- Mixture-of-Experts models (Mixtral, DeepSeek-V2/V3)
- Embedding models (E5-Mistral)
- Multi-modal models (LLaVA)

## Testing Framework

### Test Categories & Markers
- `@pytest.mark.core_model` - Essential tests run on every PR
- `@pytest.mark.distributed` - Multi-GPU distributed tests
- `@pytest.mark.skip_v1` - Legacy tests incompatible with V1
- `@pytest.mark.optional` - Optional tests requiring the `--optional` flag
- `@pytest.mark.cpu_model` - CPU-specific model tests
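
How a marker expression such as `-m "not skip_v1"` selects tests can be illustrated with a stdlib-only toy that mimics pytest's behavior for the simplest expressions (this is not pytest's real expression parser, and the test names are made up):

```python
# Toy marker-based test selection (illustrative; pytest's -m supports
# full boolean expressions, this handles only a bare name or "not <name>").

def select(tests, expr):
    """Filter a {test_name: marker_set} mapping by a marker expression."""
    if expr.startswith("not "):
        marker = expr[len("not "):]
        return [name for name, marks in tests.items() if marker not in marks]
    return [name for name, marks in tests.items() if expr in marks]

# Hypothetical test suite with the markers described above
tests = {
    "test_llama_load": {"core_model"},
    "test_tp_all_reduce": {"distributed"},
    "test_legacy_sampler": {"skip_v1"},
}
```

So `-m core_model` keeps only the essential per-PR tests, while `-m "not skip_v1"` drops the V1-incompatible legacy tests and runs everything else.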

### Test Structure
- **`tests/`** - Main test directory
- **Model-specific tests** - Located in `tests/` with descriptive names
- **Distributed tests** - Require a multi-GPU setup
- **Correctness tests** - Compare outputs against reference implementations
- **Performance benchmarks** - Located in `benchmarks/`

## Development Workflow

### PR Classification
Use these prefixes for pull requests:
- `[Bugfix]` - Bug fixes
- `[Model]` - New model support
- `[Core]` - Core vLLM changes
- `[Frontend]` - API/entrypoint changes
- `[Kernel]` - CUDA/CPU kernel changes
- `[Build]` - Build system changes
- `[CI]` - CI/CD changes
- `[Doc]` - Documentation updates

### Pre-commit Hooks
Enforced automatically via `.pre-commit-config.yaml`:
- Code formatting (yapf, ruff)
- Type checking (mypy for Python 3.9-3.12)
- Linting (shellcheck, typos, clang-format)
- SPDX header checks
- Import pattern enforcement
- Signed-off-by requirement

### CI/CD Pipeline
Located in `.github/workflows/`:
- `lint-and-deploy.yaml` - Main CI pipeline
- `pre-commit.yml` - Pre-commit validation
- Extensive model compatibility testing
- Multi-platform builds (CUDA, ROCm, CPU, etc.)
- Performance benchmarks (triggered with the `perf-benchmarks` label)

## Important Notes

### Code Patterns
- Use existing quantization layers in `vllm/model_executor/layers/`
- Follow attention patterns in `vllm/attention/`
- Model implementations go in `vllm/model_executor/models/`
- Multi-modal processors go in `vllm/multimodal/`

### Performance Considerations
- PagedAttention for memory efficiency
- CUDA graphs for optimized execution
- Speculative decoding and chunked prefill support
- Tensor and pipeline parallelism for distributed inference
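
Speculative decoding, mentioned above, can be sketched as a draft-then-verify loop: a cheap draft model proposes several tokens, the target model checks them, and only the longest agreeing prefix plus one corrected token is accepted. The toy below uses deterministic stand-in "models" (all names and the greedy acceptance rule are illustrative, not vLLM's implementation):

```python
# Toy greedy speculative-decoding step (illustrative only).

def speculative_step(target, draft, context, k=4):
    """Propose k draft tokens; keep the longest prefix the target agrees
    with, plus the target's own token at the first disagreement."""
    proposed, ctx = [], list(context)
    for _ in range(k):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)

    accepted, ctx = [], list(context)
    for tok in proposed:
        if target(ctx) == tok:           # target agrees: accept draft token
            accepted.append(tok)
            ctx.append(tok)
        else:                            # disagreement: take target's token
            accepted.append(target(ctx))
            break
    else:
        accepted.append(target(ctx))     # bonus token when all drafts match
    return accepted

# Deterministic stand-ins: target always emits the context length;
# the draft agrees except when the context has exactly 3 tokens.
target = lambda ctx: len(ctx)
draft = lambda ctx: len(ctx) if len(ctx) != 3 else 99
```

Each step thus emits at least one token (the target's correction) and up to k+1, which is where the speedup comes from when the draft agrees often.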

### V1 Compatibility
When working with models or tests:
- Check for `@pytest.mark.skip_v1` markers
- V1 is the default architecture (alpha release)
- Some legacy functionality may not be V1-compatible

### Architecture Versions
- vLLM has a V0 and a V1 architecture; V0 is deprecated.
- Unless V0 is explicitly specified, assume the V1 code paths are the ones of interest.
- V1-specific code lives under `vllm/v1/`.
- V0-specific code paths are scattered throughout `vllm/`.

## Best Practices
- When generating changes, do not produce lines with trailing whitespace
- Keep lines under 80 characters
- All vLLM environment variables are prefixed with `VLLM_`
- Always run pre-commit hooks before committing
- Include a Signed-off-by line in commits (DCO requirement)
- For model contributions, include both loading and correctness tests