v0.8.4rc1
Pre-release
Pre-release
This is the first release candidate of v0.8.4 for vllm-ascend. Please follow the official doc to start the journey. From this version, vllm-ascend will follow the newest version of vllm and release every two weeks. For example, if vllm releases v0.8.5 in the next two weeks, vllm-ascend will release v0.8.5rc1 instead of v0.8.4rc2. Please find the detail from the official documentation.
Highlights
- vLLM V1 engine experimental support is included in this version. You can visit official guide to get more detail. By default, vLLM will fallback to V0 if V1 doesn't work, please set
VLLM_USE_V1=1
environment if you want to use V1 forcely. - LoRA、Multi-LoRA And Dynamic Serving is supported now. The performance will be improved in the next release. Please follow the official doc for more usage information. Thanks for the contribution from China Merchants Bank. #521.
- Sleep Mode feature is supported. Currently it's only work on V0 engine. V1 engine support will come soon. #513
Core
- The Ascend scheduler is added for V1 engine. This scheduler is more affinity with Ascend hardware. More scheduler policy will be added in the future. #543
- Disaggregated Prefill feature is supported. Currently only 1P1D works. NPND is under design by vllm team. vllm-ascend will support it once it's ready from vLLM. Follow the official guide to use. #432
- Spec decode feature works now. Currently it's only work on V0 engine. V1 engine support will come soon. #500
- Structured output feature works now on V1 Engine. Currently it only supports xgrammar backend while using guidance backend may get some errors. #555
Other
- A new communicator
pyhccl
is added. It's used for call CANN HCCL library directly instead of usingtorch.distribute
. More usage of it will be added in the next release #503 - The custom ops build is enabled by default. You should install the packages like
gcc
,cmake
first to buildvllm-ascend
from source. SetCOMPILE_CUSTOM_KERNELS=0
environment to disable the compilation if you don't need it. #466 - The custom op
rotay embedding
is enabled by default now to improve the performance. #555