[XPU] Supports TP4 deployment on 4,5,6,7 (#2794)

yulangz · web-flow · commit 830de5a9259d · 2025-07-10T16:48:08.000+08:00
* Support running on cards 4,5,6,7, selected via XPU_VISIBLE_DEVICES
* Update the multi-card notes in the XPU documentation
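
For reviewers, the device-ID lookup order introduced by the `config.py` hunk can be sketched in isolation roughly as follows. This is a minimal sketch, not the project's code: `resolve_device_ids`, `is_supported_tp4_group`, and the `env` parameter are hypothetical names used only for illustration, and the `is_xpu` flag stands in for `current_platform.is_xpu()`. The group check is likewise an assumption; the commit only documents the two supported groups and does not validate them.

```python
import os
from typing import Mapping, Optional

# The two 4-card groups named in the updated XPU docs.
SUPPORTED_TP4_GROUPS = ("0,1,2,3", "4,5,6,7")


def resolve_device_ids(
    worker_num_per_node: int,
    is_xpu: bool,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Hypothetical helper mirroring the lookup order in the config.py hunk."""
    env = os.environ if env is None else env
    # Default: "0,1,...,n-1" for n workers on this node.
    device_ids = ",".join(str(i) for i in range(worker_num_per_node))
    # CUDA_VISIBLE_DEVICES overrides the default when it is set.
    device_ids = env.get("CUDA_VISIBLE_DEVICES", device_ids)
    # On XPU, XPU_VISIBLE_DEVICES takes precedence when it is set.
    if is_xpu:
        device_ids = env.get("XPU_VISIBLE_DEVICES", device_ids)
    return device_ids


def is_supported_tp4_group(device_ids: str) -> bool:
    """Hypothetical check (not part of this commit) against the documented groups."""
    return device_ids.replace(" ", "") in SUPPORTED_TP4_GROUPS
```

With `XPU_VISIBLE_DEVICES="4,5,6,7"` set on an XPU host, the sketch resolves to `"4,5,6,7"` instead of the default `"0,1,2,3"`, which is the behavior this commit is meant to enable.
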
diff --git a/docs/get_started/installation/kunlunxin_xpu.md b/docs/get_started/installation/kunlunxin_xpu.md
@@ -156,7 +156,7 @@ python -m fastdeploy.entrypoints.openai.api_server \
 **Deploy the ERNIE-4.5-300B-A47B-Paddle model with WINT4 precision and 32K context length on 4 XPUs**
 
 ```bash
-export XPU_VISIBLE_DEVICES="0,1,2,3"
+export XPU_VISIBLE_DEVICES="0,1,2,3" # Specify which cards to use
 python -m fastdeploy.entrypoints.openai.api_server \
     --model baidu/ERNIE-4.5-300B-A47B-Paddle \
     --port 8188 \
@@ -167,6 +167,11 @@ python -m fastdeploy.entrypoints.openai.api_server \
     --gpu-memory-utilization 0.9
 ```
 
+**Note:** When deploying on 4 XPUs, only two configurations are supported because of hardware limitations such as interconnect capabilities:
+`export XPU_VISIBLE_DEVICES="0,1,2,3"`
+or
+`export XPU_VISIBLE_DEVICES="4,5,6,7"`
+
 Refer to [Parameters](../../parameters.md) for more options.
 
 #### Send requests
diff --git a/docs/zh/get_started/installation/kunlunxin_xpu.md b/docs/zh/get_started/installation/kunlunxin_xpu.md
@@ -157,7 +157,7 @@ python -m fastdeploy.entrypoints.openai.api_server \
 **基于 WINT4 精度和 32K 上下文部署 ERNIE-4.5-300B-A47B-Paddle 模型到 4 卡 P800 服务器**
 
 ```bash
-export XPU_VISIBLE_DEVICES="0,1,2,3"
+export XPU_VISIBLE_DEVICES="0,1,2,3" # 设置使用的 XPU 卡
 python -m fastdeploy.entrypoints.openai.api_server \
     --model baidu/ERNIE-4.5-300B-A47B-Paddle \
     --port 8188 \
@@ -168,6 +168,11 @@ python -m fastdeploy.entrypoints.openai.api_server \
     --gpu-memory-utilization 0.9
 ```
 
+**注意：** 使用 P800 在 4 块 XPU 上进行部署时，由于受到卡间互联拓扑等硬件限制，仅支持以下两种配置方式：
+`export XPU_VISIBLE_DEVICES="0,1,2,3"`
+or
+`export XPU_VISIBLE_DEVICES="4,5,6,7"`
+
 更多参数可以参考 [参数说明](../../parameters.md)。
 
 #### 请求服务
diff --git a/fastdeploy/engine/config.py b/fastdeploy/engine/config.py
@@ -686,6 +686,8 @@ def __init__(
         self.engine_worker_queue_port = engine_worker_queue_port
         self.device_ids = ",".join([str(i) for i in range(self.worker_num_per_node)])
         self.device_ids = os.getenv("CUDA_VISIBLE_DEVICES", self.device_ids)
+        if current_platform.is_xpu():
+            self.device_ids = os.getenv("XPU_VISIBLE_DEVICES", self.device_ids)
 
         self.enable_logprob = enable_logprob