
Commit 8b080c3

v1: Add Whisper model support (encoder-decoder)
This brings Whisper support to V1 to close one of the remaining feature gaps with V0. Most of the changes apply to encoder-decoder models generally, though Whisper is the only one explicitly tested and is the only encoder-decoder model updated to support V1.

**Whisper Model Implementation:**
- Remove the SupportsV0Only interface constraint to enable V1 compatibility
- Update get_multimodal_embeddings() to return the list format required by V1

**Flash Attention Backend:**
- Add encoder attention metadata fields (encoder_seq_start_loc, max_encoder_seq_len, cross_slot_mapping)
- Implement encoder self-attention support without using the KV cache
- Add cross-attention support for encoder-decoder models with proper KV cache handling

**KV Cache Manager:**
- Introduce CrossAttentionManager for handling the cross-attention KV cache in encoder-decoder models
- Add CrossAttentionSpec for cross-attention cache specification with encoder-based sizing
- Implement allocate_slots_for_cross_attn() for static, encoder-length-based allocation
- Add cross-attention block allocation logic separate from decoder token growth

**Scheduler:**
- Disable prefix caching for encoder-decoder models
- Implement cross-attention block allocation during request scheduling
- Add cross-attention block tracking in state management

**GPU Model Runner:**
- Add encoder input extraction for audio feature processing
- Implement encoder attention metadata building for both self-attention and cross-attention
- Add cross-attention KV cache group handling with proper slot mapping
- Modify input batch creation to accommodate encoder sequence lengths
- Add encoder input processing in the forward pass with proper device/dtype handling
- Update profiling and memory management for encoder-decoder models

The implementation maintains backward compatibility while adding comprehensive encoder-decoder support, with particular focus on Whisper's audio processing pipeline and the cross-attention mechanism between encoder and decoder.

Related to:
- V0 deprecation: #18571
- 2025 Q3 roadmap: #20336

Signed-off-by: Russell Bryant <rbryant@redhat.com>
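The core idea behind the KV cache manager and scheduler changes is that cross-attention keys and values are computed once from the encoder output and never grow while the decoder generates, so the number of cache blocks a request needs is fixed by the encoder sequence length alone. Below is a minimal sketch of that sizing rule; the class names, the block-pool stand-in, and the 1500-position Whisper encoder length (30 seconds of audio) are illustrative assumptions rather than the actual vLLM V1 API.

```python
# Illustrative sketch only: CrossAttnAllocator and BlockPool are
# placeholder names, not the actual vLLM V1 classes from this commit.
import math
from dataclasses import dataclass, field


@dataclass
class BlockPool:
    """Stand-in for a KV-cache block pool with a fixed block size."""
    block_size: int = 16
    free_blocks: list[int] = field(
        default_factory=lambda: list(range(1024)))

    def allocate(self, n: int) -> list[int]:
        if n > len(self.free_blocks):
            raise RuntimeError("out of KV cache blocks")
        return [self.free_blocks.pop() for _ in range(n)]


class CrossAttnAllocator:
    """Sizes cross-attention KV blocks by encoder length, once per request."""

    def __init__(self, pool: BlockPool) -> None:
        self.pool = pool
        self.blocks_per_request: dict[str, list[int]] = {}

    def allocate_for_request(
        self, request_id: str, encoder_len: int
    ) -> list[int]:
        # Unlike decoder self-attention, this allocation is static: it does
        # not grow as the decoder emits tokens, because cross-attention only
        # ever attends to the fixed-length encoder output.
        num_blocks = math.ceil(encoder_len / self.pool.block_size)
        blocks = self.pool.allocate(num_blocks)
        self.blocks_per_request[request_id] = blocks
        return blocks


# Whisper encodes 30 s of audio into 1500 positions, so with 16-token
# blocks a request needs ceil(1500 / 16) = 94 cross-attention blocks.
allocator = CrossAttnAllocator(BlockPool(block_size=16))
blocks = allocator.allocate_for_request("req-0", encoder_len=1500)
assert len(blocks) == 94
```

Decoder self-attention blocks, by contrast, are allocated incrementally as tokens are generated, which is why the commit keeps the cross-attention allocation path separate from decoder token growth.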
1 parent 0f199f1 commit 8b080c3

34 files changed: +3871 −73 lines

0001-v1-Add-Whisper-model-support.patch

Lines changed: 925 additions & 0 deletions

CLAUDE.md

Lines changed: 186 additions & 0 deletions
@@ -0,0 +1,186 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

vLLM is a high-throughput and memory-efficient inference and serving engine for Large Language Models (LLMs). Originally developed at UC Berkeley's Sky Computing Lab, it's now a community-driven project under the PyTorch Foundation.

**Key Technologies**: Python 3.9-3.12, PyTorch 2.7.0, CUDA/ROCm/XPU/TPU, CMake + C++/CUDA extensions

## Development Commands

### Installation & Setup
```bash
# Development installation
pip install -e .

# Install linting tools and set up pre-commit hooks
pip install -r requirements/lint.txt
pre-commit install
```
### Testing
```bash
# Run basic tests
pytest tests/

# Run with specific markers
pytest -m core_model tests/   # Core model tests (run on every PR)
pytest -m distributed tests/  # Distributed GPU tests
pytest --optional tests/      # Include optional tests

# Run single test file
pytest tests/test_basic_correctness.py

# Test with V1 architecture (default)
pytest tests/

# Exclude legacy tests marked as incompatible with V1
pytest -m "not skip_v1" tests/

# Run tests with verbose output
pytest -v tests/

# Run specific test function
pytest tests/test_basic_correctness.py::test_function_name
```
### Code Quality
```bash
# Run pre-commit hooks manually
pre-commit run --all-files

# Type checking (via tools/mypy.sh)
tools/mypy.sh 0 "local"

# Format code (yapf + ruff)
yapf --in-place --recursive vllm/
ruff check --fix vllm/
```
### Building & Debugging
```bash
# Build documentation
pip install -r docs/requirements.txt
mkdocs serve --dev-addr localhost:8000

# Debug with environment variables
VLLM_LOGGING_LEVEL=DEBUG python ...
CUDA_LAUNCH_BLOCKING=1 python ...  # For CUDA debugging

# Incremental C++/CUDA builds (for kernel development)
python generate_cmake_presets.py
cmake --build build --config DevelARM64
```
## Architecture Overview

### Core Components
- **`vllm/engine/`** - Main inference engines (async/sync LLMEngine)
- **`vllm/model_executor/`** - Model execution, layers, and model implementations
- **`vllm/attention/`** - PagedAttention and memory management
- **`vllm/core/`** - Scheduling and block management interfaces
- **`vllm/worker/`** - Distributed inference workers
- **`vllm/entrypoints/`** - API servers (OpenAI-compatible) and CLI interfaces
- **`vllm/v1/`** - V1 architecture implementation (default)

### Key Features
- **PagedAttention**: Efficient attention key-value memory management
- **Continuous Batching**: Dynamic request batching for throughput
- **Multi-platform**: NVIDIA GPUs, AMD GPUs, Intel CPUs/GPUs, TPU, AWS Neuron
- **Quantization**: GPTQ, AWQ, AutoRound, INT4/INT8/FP8 support
- **Multi-modal**: Text, image, audio, video input processing

### V1 Architecture
vLLM V1 (alpha) provides a 1.7x speedup through architectural improvements:
- Zero-overhead prefix caching
- Enhanced multimodal support
- Optimized execution loop
- Tests marked with `skip_v1` are incompatible with V1

### Model Support
- Transformer-like LLMs (Llama, Mistral, etc.)
- Mixture-of-Experts models (Mixtral, DeepSeek-V2/V3)
- Embedding models (E5-Mistral)
- Multi-modal models (LLaVA)
## Testing Framework

### Test Categories & Markers
- `@pytest.mark.core_model` - Essential tests run on every PR
- `@pytest.mark.distributed` - Multi-GPU distributed tests
- `@pytest.mark.skip_v1` - Legacy tests incompatible with V1
- `@pytest.mark.optional` - Optional tests requiring `--optional` flag
- `@pytest.mark.cpu_model` - CPU-specific model tests
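A minimal, hypothetical example of how these markers are applied in a test; the model name, sampling settings, and assertion are placeholders rather than code from the repository:

```python
# Hypothetical marker usage; not copied from tests/.
import pytest

from vllm import LLM, SamplingParams


@pytest.mark.core_model
def test_greedy_generation_smoke():
    """Greedy decoding on a small model should return non-empty text."""
    llm = LLM(model="facebook/opt-125m")  # placeholder model
    params = SamplingParams(temperature=0.0, max_tokens=8)
    outputs = llm.generate(["Hello, my name is"], params)
    assert outputs[0].outputs[0].text.strip()
```

Running `pytest -m core_model tests/` then selects tests decorated this way.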
### Test Structure
- **`tests/`** - Main test directory
- **Model-specific tests** - Located in `tests/` with descriptive names
- **Distributed tests** - Require multi-GPU setup
- **Correctness tests** - Compare outputs against reference implementations
- **Performance benchmarks** - Located in `benchmarks/`
## Development Workflow

### PR Classification
Use these prefixes for pull requests:
- `[Bugfix]` - Bug fixes
- `[Model]` - New model support
- `[Core]` - Core vLLM changes
- `[Frontend]` - API/entrypoint changes
- `[Kernel]` - CUDA/CPU kernel changes
- `[Build]` - Build system changes
- `[CI]` - CI/CD changes
- `[Doc]` - Documentation updates

### Pre-commit Hooks
Enforced automatically via `.pre-commit-config.yaml`:
- Code formatting (yapf, ruff)
- Type checking (mypy for Python 3.9-3.12)
- Linting (shellcheck, typos, clang-format)
- SPDX header checks
- Import pattern enforcement
- Signed-off-by requirement
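As a concrete example of the SPDX header check, new Python source files are expected to begin with a license identifier comment; assuming the project's Apache-2.0 license, it looks like this:

```python
# SPDX-License-Identifier: Apache-2.0
```

The pre-commit run flags files that are missing such a header.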
### CI/CD Pipeline
Located in `.github/workflows/`:
- `lint-and-deploy.yaml` - Main CI pipeline
- `pre-commit.yml` - Pre-commit validation
- Extensive model compatibility testing
- Multi-platform builds (CUDA, ROCm, CPU, etc.)
- Performance benchmarks (triggered with `perf-benchmarks` label)
## Important Notes

### Code Patterns
- Use existing quantization layers in `vllm/model_executor/layers/`
- Follow attention patterns in `vllm/attention/`
- Model implementations go in `vllm/model_executor/models/`
- Multi-modal processors in `vllm/multimodal/`

### Performance Considerations
- PagedAttention for memory efficiency
- CUDA graphs for optimized execution
- Speculative decoding and chunked prefill support
- Tensor and pipeline parallelism for distributed inference
### V1 Compatibility
When working with models or tests:
- Check for `@pytest.mark.skip_v1` markers
- V1 is the default architecture (alpha release)
- Some legacy functionality may not be V1-compatible

### Architecture Versions
- vLLM has a V0 and a V1 architecture; V0 is deprecated.
- Unless V0 is explicitly specified, assume the V1 code paths are the ones of interest.
- The V1-specific code lives under `vllm/v1/`.
- The V0-specific code paths are scattered throughout `vllm/`.
## Best Practices
- When generating changes, do not generate lines with trailing whitespace
- Lines should be less than 80 characters long
- All environment variables are prefixed with `VLLM_`
- Always run pre-commit hooks before committing
- Include Signed-off-by in commits (DCO requirement)
- For model contributions, include both loading and correctness tests
