This page is accessible via roadmap.vllm.ai
This is a living document! For each item here, we intend to link the RFC as well as the discussion channel in the vLLM Slack.
Core Themes
In Q3, we continue to iterate towards vLLM 1.0 by fully removing the V0 code path, optimizing and extending the core scheduler, making sure vLLM can serve the world's most demanding workloads, and enhancing out-of-the-box usability and performance.
V1 Engine
- V0 Feature Parity and Native Features (#sig-v1)
- Pooling Model
- Mamba Model ([v1] Support mamba2 #19327)
- Priority Scheduling ([Core] feat: Implement Priority Scheduling in V1 Engine #19057)
- Custom Logits Processing (see the sketch after this list)
- CPU KV Cache
- Investigate Encoder-Decoder Support (v1: Add Whisper model support (encoder-decoder) #21088)
- Performance
- Async Scheduling
- Optimize Input Preparation (Persistent Batch V2)
- Speculative Decoding Enhancements (Suffix Decoding, CUDA Graph/torch.compile support)
- Multimodal Processing
- Simplification
- Parallel Input Processing
- Reduce Serialization & Broadcasting Overheads
- Investigate streaming input and output
- Design Documentation
- Hybrid Memory Allocator
- Core Scheduler Design
- Speculative Decoding Design
- Hardware Platform Guide
- Model API Guide
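
Custom logits processing today flows through the per-request callable that `SamplingParams` has historically accepted; the V1 design may change this surface. A minimal sketch under that assumption (model name is a placeholder):

```python
# Minimal sketch of a custom logits processor using the historical
# SamplingParams(logits_processors=...) callable interface; the finalized
# V1 interface may differ. The model below is a placeholder.
import torch
from vllm import LLM, SamplingParams

def ban_token_42(token_ids: list[int], logits: torch.Tensor) -> torch.Tensor:
    # Mask out token id 42 before sampling.
    logits[42] = float("-inf")
    return logits

llm = LLM(model="facebook/opt-125m")
params = SamplingParams(max_tokens=16, logits_processors=[ban_token_42])
print(llm.generate(["Hello, my name is"], params)[0].outputs[0].text)
```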
User Experience
- Fit and Finish (#feat-startup-ux) - [Feature]: Improve startup time UX #19824
- Fast startup
- Clean startup log
- Clean up configuration items
- Performance Tuning Guide
- Accelerator UX Audit and Document Feature Coverage
- Stability and Testing
- Comprehensive Reproducible Performance Suite
- Enhance and Report Accuracy Suite
- Large Scale Deployment Tested in CI
- Stress and Longevity Testing
- Improve the Stability of the vLLM-torch.compile Integration
- Robust Tool Use Parsing
- Operational Experience
- Request Level SLO Targeting & Enhanced Autoscaling/Tuning
- Improve Logging and Tracing Code Path (see the sketch after this list)
- Debugging tool for perf profiling and numerics
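
For the logging and debugging items above, a sketch of the environment-variable knobs this work builds on; `VLLM_LOGGING_LEVEL` and `VLLM_TRACE_FUNCTION` are existing debugging switches, and both must be set before `vllm` is imported:

```python
# Sketch of today's debugging knobs that the logging/tracing roadmap items
# build on. Set both variables before importing vllm.
import os

os.environ["VLLM_LOGGING_LEVEL"] = "DEBUG"  # verbose engine logs
os.environ["VLLM_TRACE_FUNCTION"] = "1"     # log every function call; very slow, debugging only

from vllm import LLM

llm = LLM(model="facebook/opt-125m")  # placeholder model
print(llm.generate(["Hi"])[0].outputs[0].text)
```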
Large Scale Serving
- Stable Scale-Out Serving for Mixture-of-Experts Models (see the sketch after this list)
- Enhance and Document Data Parallelism
- Stabilize Expert Parallelism Routing x GEMM Options
- Expert Parallel Load Balancing ([Feature] Expert Parallelism Load Balancer (EPLB) #18343)
- Transfer KV Cache through CPU
- Communication & Computation Overlap
- Disaggregated Serving
- Standardized Dataflow between P <> D
- Autoscaling P & D Replicas
- Multi-modality Support
- Speculative Decoding Support
- Prefill-Only Mode
- Elastic EP and Fault Tolerance
- Enhancement to the KV Transfer API for Request Migration and KV Cache Priming
- Investigate Context Parallelism
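
To ground the scale-out items above, a hedged sketch of the offline entry point with expert parallelism enabled; `tensor_parallel_size` and `enable_expert_parallel` are existing engine arguments, while the model name and parallel size are placeholders, not recommendations:

```python
# Hedged sketch: running an MoE model with expert parallelism enabled.
# tensor_parallel_size and enable_expert_parallel are existing engine args;
# the model name and parallel size are placeholders.
from vllm import LLM

llm = LLM(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",  # placeholder MoE model
    tensor_parallel_size=8,       # shard dense layers across 8 GPUs
    enable_expert_parallel=True,  # distribute experts across the same GPUs
)
print(llm.generate(["The capital of France is"])[0].outputs[0].text)
```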
Features
Models
- Support Multiple Training and Model Authoring Frameworks by Opening up the Interface for Tokenizer, Configuration, and Processor
- Investigate Sparse Attention Mechanism
- Performance Enhancements for Small Models (<1B scale)
Hardware
- NVIDIA
- Enhance Blackwell Support
- GB200 NVL72
- AMD
- MI350X: MXFP4 Support
- Large Scale Serving Support
- Official Wheels and Containers Distributed
- Document Feature Parity and Performance Numbers
- TPU
- Progress in Ironwood Support
- Official Wheels and Containers Distributed
- Document Feature Parity and Performance Numbers
- Neuron
- Plugin for V1 Architecture
- Document Feature Parity and Performance Numbers
- Intel
- Stable CPU Release with Wheels and Containers
- Stable XPU Support
- HPU (Gaudi) Move to Plugin
- Platform Plugins
- Stable and Tested Interfaces
Use Cases
- RLHF
- Test Popular Frameworks that Integrate with vLLM for Performance and to Prevent Breakages
- Weight loading optimization for syncing and resharding
- Custom checkpoint loader, custom model format
- Multi-turn scheduling
- Evaluation
- Support Full Determinism (with/without prefix cache) Regardless of Batching Order (see the sketch after this list)
- Batch Inference
- Simple Data Parallel Router for Scale Out with Prefix Caching
- CPU KV cache offloading
- Explore Bundling of Configuration for Specializations
- Low Latency Code Completion
- High Throughput Multi-turn Agentic Rollout
- Large Scale Image and Video Understanding
- Transformer Based Item Recommendation
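
On the determinism item: today's per-request knob is the `seed` in `SamplingParams`, which makes repeated identical requests reproducible; the roadmap work extends this so results also stay identical regardless of batch composition and prefix caching. A minimal sketch of the existing behavior (model name is a placeholder):

```python
# Sketch of today's seeded sampling: repeating the same request with the
# same seed reproduces the output, but batch composition can still perturb
# results; the roadmap item targets determinism across batching orders too.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # placeholder model
params = SamplingParams(temperature=0.8, seed=1234, max_tokens=16)
out1 = llm.generate(["Once upon a time"], params)
out2 = llm.generate(["Once upon a time"], params)
assert out1[0].outputs[0].text == out2[0].outputs[0].text  # seeded repeats match
```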
FAQ
When will vLLM release 1.0?
We believe 1.0 means API stability and a great user experience. We will not commit to an exact release date. The criteria for 1.0 are:
- Stable user-facing APIs such as the CLI, LLM, and AsyncLLMEngine (see the minimal example after this list)
- Stable developer APIs such as logits processors, KV connectors, the model interface, and hardware/platform plugin interfaces
- Polished out-of-the-box user experience
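
For reference, a minimal example of the offline user-facing API that the 1.0 stability guarantee would cover (model name is a placeholder):

```python
# Minimal example of the user-facing offline API that 1.0 would stabilize.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # placeholder model
params = SamplingParams(temperature=0.0, max_tokens=32)
for output in llm.generate(["What is vLLM?"], params):
    print(output.outputs[0].text)
```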
If any item you want is not on the roadmap, your suggestions and contributions are strongly welcome! Please feel free to comment in this thread, open a feature request, or create an RFC.
Historical Roadmap: #15735, #11862, #9006, #5805, #3861, #2681, #244