### What this PR does / why we need it?
Add a multi-node data parallel (DP) deployment doc.
### Does this PR introduce _any_ user-facing change?
No functional change; this PR only adds the multi-node data parallel deployment doc.
### How was this patch tested?
- vLLM version: v0.9.1
- vLLM main: vllm-project/vllm@805d62c
Signed-off-by: wangli <wangli858794774@gmail.com>
## Getting Started
vLLM-Ascend now supports Data Parallel (DP) deployment, enabling model weights to be replicated across multiple NPUs or instances, each processing independent batches of requests. This is particularly useful for scaling throughput across devices while maintaining high resource utilization.
Each DP rank is deployed as a separate “core engine” process which communicates with front-end process(es) via ZMQ sockets. Data Parallel can be combined with Tensor Parallel, in which case each DP engine owns a number of per-NPU worker processes equal to the TP size.
For Mixture-of-Experts (MoE) models — especially advanced architectures like DeepSeek that utilize Multi-head Latent Attention (MLA) — a hybrid parallelism approach is recommended:
- Use **Data Parallelism (DP)** for attention layers, which are replicated across devices and handle separate batches.
- Use **Expert or Tensor Parallelism (EP/TP)** for expert layers, which are sharded across devices to distribute the computation.
This division allows attention layers to be replicated across Data Parallel (DP) ranks so that each rank processes a different batch independently, while expert layers are partitioned (sharded) across devices using Expert or Tensor Parallelism (a group of size DP x TP), maximizing hardware utilization and efficiency.
In these cases, the data parallel ranks are not completely independent: forward passes must be aligned, and expert layers across all ranks are required to synchronize during every forward pass, even if there are fewer requests to be processed than there are DP ranks.
For MoE models, when any requests are in progress in any rank, we must ensure that empty “dummy” forward passes are performed in all ranks which don’t currently have any requests scheduled. This is handled via a separate DP `Coordinator` process which communicates with all of the ranks, and a collective operation performed every N steps to determine when all ranks become idle and can be paused. When TP is used in conjunction with DP, expert layers form an EP or TP group of size (DP x TP).
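
As a concrete illustration of the hybrid scheme above, the sketch below shows how the parallel sizes might be combined in a launch command. The flag names come from upstream vLLM's data parallel options; the sizes are only example values, not settings taken from this guide:

```shell
# Example layout only: 2 nodes x 8 NPUs = 16 devices in total.
# DP=4 attention replicas, each with TP=4 workers; with expert parallelism
# enabled, the expert layers are sharded across a DP x TP = 16-way group.
vllm serve <model> \
  --data-parallel-size 4 \
  --tensor-parallel-size 4 \
  --enable-expert-parallel
```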
## Verify Multi-Node Communication Environment

Use `hccn_tool` to check the IP address of each NPU network interface and to verify connectivity between the two nodes:

```shell
for i in {0..7}; do hccn_tool -i $i -ip -g | grep ipaddr; done
hccn_tool -i 0 -ping -g address 10.20.0.20
```
## Run with Docker
Assume you have two Atlas 800 A2 (64GB*8) nodes and want to deploy the `deepseek-v3-w8a8` quantized model across them.
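
To keep the execution environment (model path, Python environment) consistent across nodes, containerized deployment is recommended: start a container on every node with `--net=host` so that the nodes can reach each other. A minimal sketch of such a container launch is shown below; the image tag, container name and device/driver mounts are assumptions, so adjust them to your environment:

```shell
# Run on every node; --net=host is required for cross-node communication.
docker run -itd --name vllm-ascend --net=host \
  --device /dev/davinci0 --device /dev/davinci1 \
  --device /dev/davinci2 --device /dev/davinci3 \
  --device /dev/davinci4 --device /dev/davinci5 \
  --device /dev/davinci6 --device /dev/davinci7 \
  --device /dev/davinci_manager --device /dev/devmm_svm --device /dev/hisi_hdc \
  -v /usr/local/dcmi:/usr/local/dcmi \
  -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
  -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
  -v /etc/ascend_install.info:/etc/ascend_install.info \
  -v /root/.cache:/root/.cache \
  quay.io/ascend/vllm-ascend:latest bash
```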

After setting up the containers and installing vllm-ascend on each node, follow the steps below to launch the online inference service.
:::{note}
Before launching the inference server, ensure the following environment variables are set on each node for multi-node communication.
:::
Run the following scripts on the two nodes respectively.
**node0**
```shell
#!/bin/sh

# nic_name is the network interface name corresponding to local_ip;
# both values can be obtained via ifconfig
nic_name="xxxx"
local_ip="xxxx"

export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=100
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=1024

# The w8a8 weights can be obtained from https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V3-W8A8
# If you want to run the quantization manually, please refer to https://vllm-ascend.readthedocs.io/en/latest/user_guide/quantization.html
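
# A possible launch command for node0 (a sketch; the sizes, ports and the
# quantization flag below are assumptions based on upstream vLLM's data
# parallel CLI options, so adjust them to your deployment).
# With 2 nodes x 8 NPUs, DP=4 and TP=4 use all 16 devices: node0 runs the
# API server plus 2 local DP ranks, while node1 joins with --headless and
# --data-parallel-start-rank 2.
vllm serve /root/.cache/ds_v3 \
  --host 0.0.0.0 \
  --port 8004 \
  --quantization ascend \
  --data-parallel-size 4 \
  --data-parallel-size-local 2 \
  --data-parallel-address $local_ip \
  --data-parallel-rpc-port 13389 \
  --tensor-parallel-size 4 \
  --enable-expert-parallel \
  --trust-remote-code
```

On node1, a mirrored script exports the same variables with node1's own `nic_name` and `local_ip`.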
If you're running DeepSeek V3/R1, please remove the `quantization_config` section from the `config.json` file, since it is not currently supported by vllm-ascend.
The deployment view looks like this:

Once your server is started, you can query the model with input prompts:
```shell
curl http://{node0_ip}:8004/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "/root/.cache/ds_v3",
        "prompt": "The future of AI is",
        "max_tokens": 50,
        "temperature": 0
    }'
```

If the query succeeds, the client receives a JSON completion response and the vLLM server log shows the request being served:

```
INFO: 127.0.0.1:59384 - "POST /v1/completions HTTP/1.1" 200 OK
```
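
The server also exposes the OpenAI-compatible chat API. An illustrative chat query, under the same assumptions about the server address and model path as above:

```shell
curl http://{node0_ip}:8004/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "/root/.cache/ds_v3",
        "messages": [{"role": "user", "content": "What is data parallel inference?"}],
        "max_tokens": 50,
        "temperature": 0
    }'
```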