Commit d849be3

Merge branch 'main' into main
2 parents: 777cfa7 + 6061f33

31 files changed: +1033 −231 lines

.github/workflows/vllm_ascend_test.yaml

Lines changed: 13 additions & 4 deletions

@@ -46,11 +46,23 @@ jobs:
       max-parallel: 2
       matrix:
         os: [linux-arm64-npu-1, linux-arm64-npu-4]
-        vllm_verison: [main, v0.8.3]
+        vllm_verison: [main, v0.8.4]
+    concurrency:
+      group: >
+        ${{
+        matrix.os == 'linux-arm64-npu-4'
+        && github.event.pull_request.number
+        && format('pr-{0}-limit-npu-4', github.event.pull_request.number)
+        || format('job-{0}-{1}-{2}', matrix.os, matrix.vllm_verison, github.event.pull_request.number)
+        }}
+      cancel-in-progress: false
     name: vLLM Ascend test
     runs-on: ${{ matrix.os }}
     container:
       image: quay.io/ascend/cann:8.0.0-910b-ubuntu22.04-py3.10
+      env:
+        HF_ENDPOINT: https://hf-mirror.com
+        HF_TOKEN: ${{ secrets.HF_TOKEN }}
     steps:
       - name: Check npu and CANN info
         run: |
@@ -108,7 +120,6 @@ jobs:
       - name: Run vllm-project/vllm-ascend test on V0 engine
         env:
           VLLM_USE_V1: 0
-          HF_ENDPOINT: https://hf-mirror.com
         run: |
           if [[ "${{ matrix.os }}" == "linux-arm64-npu-1" ]]; then
             pytest -sv tests/singlecard
@@ -122,7 +133,6 @@ jobs:
         env:
           VLLM_USE_V1: 1
           VLLM_WORKER_MULTIPROC_METHOD: spawn
-          HF_ENDPOINT: https://hf-mirror.com
         run: |
           if [[ "${{ matrix.os }}" == "linux-arm64-npu-1" ]]; then
             pytest -sv tests/singlecard
@@ -136,6 +146,5 @@ jobs:
         env:
           VLLM_USE_V1: 0
           PYTORCH_NPU_ALLOC_CONF: max_split_size_mb:256
-          HF_ENDPOINT: https://hf-mirror.com
         run: |
           pytest -sv

docs/source/faqs.md

Lines changed: 49 additions & 1 deletion

@@ -55,7 +55,7 @@ After configuration, you can download our container from `m.daocloud.io/quay.io/

 ### 3. What models does vllm-ascend support?

-Currently, we have already fully tested and supported `Qwen` / `Deepseek` (V0 only) / `Llama` models, other models we have tested are shown [<u>here</u>](https://vllm-ascend.readthedocs.io/en/latest/user_guide/supported_models.html). Plus, accoding to users' feedback, `gemma3` and `glm4` are not supported yet. Besides, more models need test.
+Currently, we have fully tested and support the `Qwen` / `Deepseek` (V0 only) / `Llama` models; other models we have tested are shown [<u>here</u>](https://vllm-ascend.readthedocs.io/en/latest/user_guide/supported_models.html). Plus, according to users' feedback, `gemma3` and `glm4` are not supported yet. Besides, more models still need testing.

 ### 4. How to get in touch with our community?

@@ -69,3 +69,51 @@ There are many channels that you can communicate with our community developers /

### 5. What features does vllm-ascend V1 support?

Find more details [<u>here</u>](https://github.com/vllm-project/vllm-ascend/issues/414).

### 6. How to solve the problem of "Failed to infer device type" or "libatb.so: cannot open shared object file"?

Basically, the reason is that the NPU environment is not configured correctly. You can:
1. try `source /usr/local/Ascend/nnal/atb/set_env.sh` to enable the NNAL package.
2. try `source /usr/local/Ascend/ascend-toolkit/set_env.sh` to enable the CANN package.
3. try `npu-smi info` to check whether the NPU is working.

If none of the above steps work, you can run the following Python code to check whether there is any import error:

```
import torch
import torch_npu
import vllm
```

If the problem still persists, feel free to submit a GitHub issue.
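
If those imports succeed, a slightly more detailed check can confirm the devices are actually visible. This is a minimal sketch; it assumes `torch_npu` exposes the CUDA-style `torch.npu.is_available()` / `torch.npu.device_count()` helpers:

```python
import torch
import torch_npu  # noqa: F401  (registers the NPU backend with torch)

# Both calls mirror the familiar torch.cuda API.
print("NPU available:", torch.npu.is_available())
print("NPU device count:", torch.npu.device_count())
```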

### 7. Does vllm-ascend support Atlas 300I Duo?

No, vllm-ascend currently only supports the Atlas A2 series. We are working on it.

### 8. How does vllm-ascend perform?

Currently, only some models, such as `Qwen2 VL` and `Deepseek V3`, have been optimized; the others do not perform well enough yet. In the future, we will support graph mode and custom ops to improve the performance of vllm-ascend. Once the official release of vllm-ascend is out, you can also install `mindie-turbo` alongside `vllm-ascend` to speed up inference.

### 9. How does vllm-ascend work with vllm?

vllm-ascend is a plugin for vllm, and its version should match the version of vllm. For example, if you use vllm 0.7.3, you should use vllm-ascend 0.7.3 as well. For the main branch, we make sure `vllm-ascend` and `vllm` stay compatible on every commit.
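
A quick way to verify the two versions line up is to query the installed package metadata. A minimal sketch, assuming the packages are installed under the distribution names `vllm` and `vllm-ascend`:

```python
from importlib.metadata import PackageNotFoundError, version

# Print the installed versions side by side so mismatches are obvious.
for dist in ("vllm", "vllm-ascend"):
    try:
        print(f"{dist}: {version(dist)}")
    except PackageNotFoundError:
        print(f"{dist}: not installed")
```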

### 10. Does vllm-ascend support the Prefill Disaggregation feature?

Currently, only 1P1D is supported by vllm. For vllm-ascend, it will be added by [this PR](https://github.com/vllm-project/vllm-ascend/pull/432). For NPND, vllm is not stable and fully supported yet; we will make it stable and supported by vllm-ascend in the future.

### 11. Does vllm-ascend support quantization methods?

Currently, vllm-ascend has no native quantization support. Quantization support is work in progress; w8a8 will be supported first.

### 12. How to run the w8a8 DeepSeek model?

Currently, on v0.7.3 you need vllm + vllm-ascend + mindie-turbo to run w8a8; once v0.8.X is released, vllm + vllm-ascend alone will be enough. After installing the above packages, you can follow the steps below to run the w8a8 DeepSeek model:

1. Quantize the bf16 DeepSeek model, e.g. [unsloth/DeepSeek-R1-BF16](https://modelscope.cn/models/unsloth/DeepSeek-R1-BF16), with msModelSlim to get a w8a8 DeepSeek model. Find more details in the [msModelSlim doc](https://gitee.com/ascend/msit/tree/master/msmodelslim/msmodelslim/pytorch/llm_ptq).
2. Copy the content of `quant_model_description_w8a8_dynamic.json` into the `quantization_config` field of the quantized model's `config.json`.
3. Run inference with the quantized DeepSeek model (see the sketch below).
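
For step 3, the following is a minimal offline-inference sketch; the checkpoint path `/path/to/DeepSeek-R1-w8a8`, the parallel size, and the context length are placeholders to adjust for your setup:

```python
from vllm import LLM, SamplingParams

# Load the locally quantized checkpoint produced in steps 1-2.
llm = LLM(model="/path/to/DeepSeek-R1-w8a8",  # placeholder path
          tensor_parallel_size=8,             # adjust to your NPU count
          max_model_len=4096)

outputs = llm.generate(["Hello, how are you today?"],
                       SamplingParams(temperature=0, max_tokens=32))
print(outputs[0].outputs[0].text)
```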

### 13. There is no output in the log when loading models with vllm-ascend. How to solve it?

If you are using vllm 0.7.3, this is a known progress-bar display issue in vLLM, which has been resolved in [this PR](https://github.com/vllm-project/vllm/pull/12428); please cherry-pick it locally yourself. Otherwise, please file an issue.

docs/source/index.md

Lines changed: 1 addition & 0 deletions

@@ -45,6 +45,7 @@ faqs
 :maxdepth: 1
 user_guide/suppoted_features
 user_guide/supported_models
+user_guide/env_vars
 user_guide/release_notes
 :::

docs/source/user_guide/env_vars.md

Lines changed: 9 additions & 0 deletions

# Environment Variables

vllm-ascend uses the following environment variables to configure the system:

:::{literalinclude} ../../../vllm_ascend/envs.py
:language: python
:start-after: begin-env-vars-definition
:end-before: end-env-vars-definition
:::
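
The `literalinclude` directive above pulls the definitions straight from `vllm_ascend/envs.py`, relying on a pair of marker comments in that file. A minimal sketch of the convention only (the dictionary name and variable shown are hypothetical; see the real file for the actual definitions):

```python
import os

# begin-env-vars-definition
env_variables = {
    # Hypothetical entry for illustration; not a real vllm-ascend variable.
    "SOME_ASCEND_FLAG": lambda: os.getenv("SOME_ASCEND_FLAG", "0"),
}
# end-env-vars-definition
```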
Lines changed: 140 additions & 0 deletions

#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
# This file is a part of the vllm-ascend project.
# Adapted from vllm-project/vllm/examples/offline_inference/basic.py
# Copyright 2023 The vLLM team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
import multiprocessing as mp
import os
import time
from multiprocessing import Event, Process


def clean_up():
    import gc

    import torch
    from vllm.distributed.parallel_state import (
        destroy_distributed_environment, destroy_model_parallel)
    destroy_model_parallel()
    destroy_distributed_environment()
    gc.collect()
    torch.npu.empty_cache()


def run_prefill(prefill_done, process_close):
    os.environ["ASCEND_RT_VISIBLE_DEVICES"] = "0,1"

    from vllm import LLM, SamplingParams
    from vllm.config import KVTransferConfig

    prompts = [
        "Hello, how are you today?", "Hi, what is your name?",
        "Tell me a very long story.", "what is your favourite book?"
    ]
    sampling_params = SamplingParams(temperature=0, top_p=0.95, max_tokens=1)

    ktc = KVTransferConfig.from_cli(
        '{"kv_connector":"AscendHcclConnector","kv_buffer_device":"npu","kv_role":"kv_producer", "kv_parallel_size":2}'
    )

    # Set NPU memory utilization to 0.8
    llm = LLM(model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
              kv_transfer_config=ktc,
              max_model_len=2000,
              gpu_memory_utilization=0.8,
              tensor_parallel_size=2)

    llm.generate(prompts, sampling_params)
    print("Prefill node is finished.")
    prefill_done.set()

    # Keep the prefill node running until the decode node is done;
    # otherwise the script might exit prematurely, causing incomplete decoding.
    try:
        while not process_close.is_set():
            time.sleep(1)
    except KeyboardInterrupt:
        print("Script stopped by user.")
    finally:
        print("Cleanup prefill resources")
        del llm
        clean_up()


def run_decode(prefill_done):
    os.environ["ASCEND_RT_VISIBLE_DEVICES"] = "2,3"

    from vllm import LLM, SamplingParams
    from vllm.config import KVTransferConfig

    prompts = [
        "Hello, how are you today?", "Hi, what is your name?",
        "Tell me a very long story.", "what is your favourite book?"
    ]
    sampling_params = SamplingParams(temperature=0, top_p=0.95)

    ktc = KVTransferConfig.from_cli(
        '{"kv_connector":"AscendHcclConnector","kv_buffer_device":"npu","kv_role":"kv_consumer","kv_parallel_size":2}'
    )

    llm = LLM(model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
              kv_transfer_config=ktc,
              max_model_len=2000,
              gpu_memory_utilization=0.8,
              tensor_parallel_size=2)

    # Wait for the prefill (producer) node before starting to consume.
    print("Waiting for prefill node to finish...")
    prefill_done.wait()

    # Once prefill_done is set, the kv-cache should have been
    # transferred to this decode node, so we can start decoding.
    outputs = llm.generate(prompts, sampling_params)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

    del llm
    clean_up()


if __name__ == "__main__":
    mp.get_context('spawn')

    prefill_done = Event()
    process_close = Event()
    prefill_process = Process(target=run_prefill,
                              args=(
                                  prefill_done,
                                  process_close,
                              ))
    decode_process = Process(target=run_decode, args=(prefill_done, ))

    # Start prefill node
    prefill_process.start()

    # Start decode node
    decode_process.start()

    # Wait for the decode node to finish
    decode_process.join()

    # Signal the prefill node to shut down, then wait for it
    process_close.set()
    prefill_process.join()
    prefill_process.terminate()
    print("All process done!")

pyproject.toml

Lines changed: 1 addition & 0 deletions

@@ -4,6 +4,7 @@ requires = [
     "cmake>=3.26",
     "decorator",
     "numpy<2.0.0",
+    "packaging",
     "pip",
     "pybind11",
     "pyyaml",

requirements.txt

Lines changed: 1 addition & 0 deletions

@@ -2,6 +2,7 @@
 cmake>=3.26
 decorator
 numpy<2.0.0
+packaging
 pybind11
 pyyaml
 scipy

tests/conftest.py

Lines changed: 0 additions & 3 deletions

@@ -29,16 +29,13 @@
 from vllm.distributed.parallel_state import (destroy_distributed_environment,
                                              destroy_model_parallel)
 from vllm.inputs import ExplicitEncoderDecoderPrompt, TextPrompt, TokensPrompt
-from vllm.logger import init_logger
 from vllm.outputs import RequestOutput
 from vllm.sampling_params import BeamSearchParams
 from vllm.utils import is_list_of

 from tests.model_utils import (TokensTextLogprobs,
                                TokensTextLogprobsPromptLogprobs)

-logger = init_logger(__name__)
-
 _M = TypeVar("_M")

 _PromptMultiModalInput = Union[List[_M], List[List[_M]]]

vllm_ascend/__init__.py

Lines changed: 4 additions & 0 deletions

@@ -18,6 +18,10 @@

 def register():
     """Register the NPU platform."""
+    # Adapt the global patch here.
+    from vllm_ascend.utils import adapt_patch
+    adapt_patch(is_global_patch=True)
+
     return "vllm_ascend.platform.NPUPlatform"

vllm_ascend/distributed/__init__.py

Lines changed: 6 additions & 0 deletions

from vllm.distributed.kv_transfer.kv_connector.factory import \
    KVConnectorFactory

KVConnectorFactory.register_connector(
    "AscendHcclConnector", "vllm_ascend.distributed.llmdatadist_connector",
    "LLMDataDistConnector")
File renamed without changes.
