A vLLM implementation for Diffusion LLMs. D2F is integrated as the core inference strategy, and training-free strategies like Fast-dLLM are also supported.
Based on Nano-vLLM.
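As a quick orientation, here is a minimal usage sketch. Since this project is based on Nano-vLLM, the `LLM` / `SamplingParams` interface below follows Nano-vLLM's conventions; the exact import path, model path, and output format are assumptions, not confirmed API.

```python
# Minimal usage sketch. The interface follows Nano-vLLM, which this
# project is based on; the import path, model path, and output field
# below are assumptions, not a confirmed D2F-vLLM API.
from d2f_vllm import LLM, SamplingParams  # hypothetical import path

llm = LLM("path/to/diffusion-llm")  # placeholder local model path
params = SamplingParams(temperature=0.6, max_tokens=256)

outputs = llm.generate(["Explain diffusion language models."], params)
print(outputs[0]["text"])  # output field name assumed from Nano-vLLM
```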
```bash
pip install d2f_vllm
```

We use UV to manage the whole project.
```bash
uv sync
source .venv/bin/activate
uv pip install -e .
```

For easy activation:

```bash
echo "alias uvon='source .venv/bin/activate'" >> ~/.zshrc # If using bash, change to .bashrc
source ~/.zshrc
```

Then, use `uvon` under the project root path to activate the environment.
```bash
uv pip install vllm
```

D2F-vLLM still depends on some modules of vLLM; however, there are issues with UV's venv management, so vLLM has to be installed separately.
```bash
uv pip install flash-attn --no-build-isolation
```

If this does not work, build flash-attn from source. This may take a while (most of the time is spent compiling CUTLASS).
```bash
git submodule update --init --recursive
cd third_party/flash-attn
MAX_JOBS=$(nproc) python setup.py install --verbose
```

Setting `add_new_block_threshold < 1.0`, together with our D2F training strategy, enables support for the D2F-specific decoding paradigm.
In contrast, setting `add_new_block_threshold = 1.0` allows compatibility with Fast-dLLM inference, which is training-free.
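A sketch of switching between the two decoding modes is shown below. Only the parameter name `add_new_block_threshold` and its semantics come from this README; where the parameter is passed (here: `SamplingParams`) and the import path are assumptions.

```python
# Sketch of selecting the decoding mode via add_new_block_threshold.
# Passing it through SamplingParams is an assumption; only the
# parameter name and its meaning come from the README.
from d2f_vllm import LLM, SamplingParams  # hypothetical import path

llm = LLM("path/to/d2f-finetuned-model")  # placeholder model path

# D2F decoding paradigm: with a threshold below 1.0, a new block can be
# opened before the current one is fully decoded (requires a model
# trained with the D2F strategy).
d2f_params = SamplingParams(max_tokens=256, add_new_block_threshold=0.9)

# Fast-dLLM-compatible decoding (training-free): a new block is only
# added once the current block is fully committed.
fast_dllm_params = SamplingParams(max_tokens=256, add_new_block_threshold=1.0)

outputs = llm.generate(["Explain diffusion language models."], d2f_params)
```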
- Implement KV Cache loading kernel
- Tensor Parallel
- Data Parallel
- Implement Async Engine and Streaming Generation
- Faster Flash Attention Kernel
- Diffusion LM CUDA Graph Capturing
