Commit 0c4aa2b

[Doc] Add multi node data parallel doc (#1685)

### What this PR does / why we need it?
Add the multi-node data parallel doc.

### Does this PR introduce _any_ user-facing change?
Adds the multi-node data parallel doc.

### How was this patch tested?
- vLLM version: v0.9.1
- vLLM main: vllm-project/vllm@805d62c

Signed-off-by: wangli <wangli858794774@gmail.com>

1 parent b4b19ea · commit 0c4aa2b

File tree

2 files changed: +112 −109 lines

docs/source/assets/multi_node_dp.png (115 KB)

docs/source/tutorials/multi_node.md (112 additions, 109 deletions)
@@ -1,11 +1,19 @@
- # Multi-Node (DeepSeek)
+ # Multi-Node-DP (DeepSeek)

- Multi-node inference is suitable for scenarios where the model cannot be deployed on a single NPU. In such cases, the model can be distributed using tensor parallelism and pipeline parallelism. The specific parallelism strategies will be covered in the following sections. To successfully deploy multi-node inference, the following three steps need to be completed:
+ ## Getting Started
+ vLLM-Ascend now supports Data Parallel (DP) deployment, enabling model weights to be replicated across multiple NPUs or instances, each processing independent batches of requests. This is particularly useful for scaling throughput across devices while maintaining high resource utilization.

- * **Verify Multi-Node Communication Environment**
- * **Set Up and Start the Ray Cluster**
- * **Start the Online Inference Service on multinode**
+ Each DP rank is deployed as a separate "core engine" process which communicates with front-end process(es) via ZMQ sockets. Data Parallel can be combined with Tensor Parallel, in which case each DP engine owns a number of per-NPU worker processes equal to the TP size.

+ For Mixture-of-Experts (MoE) models, especially advanced architectures like DeepSeek that utilize Multi-head Latent Attention (MLA), a hybrid parallelism approach is recommended:
+ - Use **Data Parallelism (DP)** for attention layers, which are replicated across devices and handle separate batches.
+ - Use **Expert or Tensor Parallelism (EP/TP)** for expert layers, which are sharded across devices to distribute the computation.
+
+ This division allows attention layers to be replicated across Data Parallel (DP) ranks so that they can process different batches independently, while expert layers are partitioned (sharded) across devices using Expert or Tensor Parallelism over a group of size DP x TP, maximizing hardware utilization and efficiency.
+
+ In these cases the data parallel ranks are not completely independent: forward passes must be aligned, and expert layers across all ranks are required to synchronize during every forward pass, even if there are fewer requests to be processed than DP ranks.
+
+ For MoE models, when any requests are in progress in any rank, we must ensure that empty "dummy" forward passes are performed in all ranks which don't currently have any requests scheduled. This is handled via a separate DP `Coordinator` process which communicates with all of the ranks, and a collective operation performed every N steps to determine when all ranks become idle and can be paused. When TP is used in conjunction with DP, the expert layers form an EP or TP group of size DP x TP.
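As a rough illustration of how these sizes fit together (a sketch only; the node count, NPUs per node, and DP/TP values below are assumptions that mirror the two-node example later on this page):

```shell
# Layout sanity check for the example deployment (assumed values:
# 2 nodes x 8 NPUs, --data-parallel-size 4, --tensor-parallel-size 4).
NODES=2
NPUS_PER_NODE=8
DP_SIZE=4
TP_SIZE=4

TOTAL_NPUS=$((NODES * NPUS_PER_NODE))
WORLD_SIZE=$((DP_SIZE * TP_SIZE))

echo "total NPUs          : $TOTAL_NPUS"
echo "DP x TP world size  : $WORLD_SIZE"
echo "MoE EP/TP group size: $WORLD_SIZE"

# Each DP rank owns TP_SIZE worker processes, so DP x TP should match
# the number of NPUs you intend to use.
if [ "$TOTAL_NPUS" -ne "$WORLD_SIZE" ]; then
    echo "warning: DP * TP does not match the available NPUs"
fi
```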

## Verify Multi-Node Communication Environment

@@ -45,24 +53,19 @@ for i in {0..7}; do hccn_tool -i $i -ip -g | grep ipaddr; done
hccn_tool -i 0 -ping -g address 10.20.0.20
```
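To exercise the links from every local device rather than only device 0, the same `hccn_tool` commands can be placed in a small loop (a sketch; `10.20.0.20` stands in for an NPU IP reported by the peer node, as in the example above):

```shell
# Ping the peer node's NPU IP from each of the 8 local devices.
PEER_NPU_IP=10.20.0.20
for i in {0..7}; do
    echo "== device $i =="
    hccn_tool -i $i -ping -g address $PEER_NPU_IP
done
```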

- ## Set Up and Start the Ray Cluster
- ### Setting Up the Basic Container
- To ensure a consistent execution environment across all nodes, including the model path and Python environment, it is recommended to use Docker images.
-
- For setting up a multi-node inference cluster with Ray, **containerized deployment** is the preferred approach. Containers should be started on both the master and worker nodes, with the `--net=host` option to enable proper network connectivity.
-
- Below is the example container setup command, which should be executed on **all nodes**:
-
-
+ ## Run with docker
+ Assume you have two Atlas 800 A2 (64G*8) nodes and want to deploy the `deepseek-v3-w8a8` quantized model across the two nodes.

```shell
# Define the image and container name
- export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
+ export IMAGE=quay.io/ascend/vllm-ascend:main
export NAME=vllm-ascend

# Run the container using the defined variables
+ # Note: if you are running docker with a bridge network, please expose the ports required for multi-node communication in advance
docker run --rm \
--name $NAME \
+ --net=host \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
@@ -75,121 +78,121 @@ docker run --rm \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
+ -v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
- -v /root/.cache:/root/.cache \
- -p 8000:8000 \
+ -v /mnt/sfs_turbo/.cache:/root/.cache \
-it $IMAGE bash
```
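Before going further, it is worth confirming inside the container on each node that the mounts work and all 8 NPUs are visible (a minimal check reusing the tools mounted above):

```shell
# Run inside the container on every node.
# The mounted npu-smi should list all 8 NPUs.
npu-smi info

# The mounted hccn_tool should still report the NPU network IPs.
for i in {0..7}; do hccn_tool -i $i -ip -g | grep ipaddr; done
```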

- ### Start Ray Cluster
- After setting up the containers and installing vllm-ascend on each node, follow the steps below to start the Ray cluster and execute inference tasks.
-
- Choose one machine as the head node and the others as worker nodes. Before proceeding, use `ip addr` to check your `nic_name` (network interface name).
-
- Set the `ASCEND_RT_VISIBLE_DEVICES` environment variable to specify the NPU devices to use. For Ray versions above 2.1, also set the `RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES` variable to avoid device recognition issues. The `--num-gpus` parameter defines the number of NPUs to be used on each node.
-
- Below are the commands for the head and worker nodes:
-
- **Head node**:
-
:::{note}
- When starting a Ray cluster for multi-node inference, the environment variables on each node must be set **before** starting the Ray cluster for them to take effect.
- Updating the environment variables requires restarting the Ray cluster.
- :::
-
- ```shell
- # Head node
- export HCCL_IF_IP={local_ip}
- export GLOO_SOCKET_IFNAME={nic_name}
- export TP_SOCKET_IFNAME={nic_name}
- export RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES=1
- export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
- ray start --head --num-gpus=8
- ```
- **Worker node**:
-
- :::{note}
- When starting a Ray cluster for multi-node inference, the environment variables on each node must be set **before** starting the Ray cluster for them to take effect. Updating the environment variables requires restarting the Ray cluster.
- :::
-
- ```shell
- # Worker node
- export HCCL_IF_IP={local_ip}
- export GLOO_SOCKET_IFNAME={nic_name}
- export TP_SOCKET_IFNAME={nic_name}
- export RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES=1
- export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
- ray start --address='{head_node_ip}:{port_num}' --num-gpus=8 --node-ip-address={local_ip}
- ```
- :::{tip}
- Before starting the Ray cluster, set the `export ASCEND_PROCESS_LOG_PATH={plog_save_path}` environment variable on each node to redirect the Ascend plog, which helps in debugging issues during multi-node execution.
+ Before launching the inference server, ensure that the environment variables required for multi-node communication are set.
:::

+ Run the following scripts on the two nodes respectively.

- Once the cluster is started on multiple nodes, execute `ray status` and `ray list nodes` to verify the Ray cluster's status. You should see the correct number of nodes and NPUs listed.
-
-
- ## Start the Online Inference Service on multinode
- In the container, you can use vLLM as if all NPUs were on a single node. vLLM will utilize NPU resources across all nodes in the Ray cluster. You only need to run the vllm command on one node.
-
- To set up parallelism, the common practice is to set the `tensor-parallel-size` to the number of NPUs per node, and the `pipeline-parallel-size` to the number of nodes.
-
- For example, with 16 NPUs across 2 nodes (8 NPUs per node), set the tensor parallel size to 8 and the pipeline parallel size to 2:
-
+ **node0**
```shell
- python -m vllm.entrypoints.openai.api_server \
- --model="Deepseek/DeepSeek-V2-Lite-Chat" \
- --trust-remote-code \
- --enforce-eager \
- --distributed_executor_backend "ray" \
- --tensor-parallel-size 8 \
- --pipeline-parallel-size 2 \
- --disable-frontend-multiprocessing \
- --port {port_num}
+ #!/bin/sh
+
+ # these values can be obtained via ifconfig
+ # nic_name is the network interface name corresponding to local_ip
+ nic_name="xxxx"
+ local_ip="xxxx"
+
+ export HCCL_IF_IP=$local_ip
+ export GLOO_SOCKET_IFNAME=$nic_name
+ export TP_SOCKET_IFNAME=$nic_name
+ export HCCL_SOCKET_IFNAME=$nic_name
+ export OMP_PROC_BIND=false
+ export OMP_NUM_THREADS=100
+ export VLLM_USE_V1=1
+ export HCCL_BUFFSIZE=1024
+
+ # The w8a8 weights can be obtained from https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V3-W8A8
+ # If you want to do the quantization manually, please refer to https://vllm-ascend.readthedocs.io/en/latest/user_guide/quantization.html
+ vllm serve /root/.cache/ds_v3 \
+ --host 0.0.0.0 \
+ --port 8004 \
+ --data-parallel-size 4 \
+ --data-parallel-size-local 2 \
+ --data-parallel-address $local_ip \
+ --data-parallel-rpc-port 13389 \
+ --tensor-parallel-size 4 \
+ --seed 1024 \
+ --served-model-name deepseek_v3 \
+ --enable-expert-parallel \
+ --max-num-seqs 16 \
+ --max-model-len 32768 \
+ --quantization ascend \
+ --max-num-batched-tokens 4096 \
+ --trust-remote-code \
+ --no-enable-prefix-caching \
+ --gpu-memory-utilization 0.9 \
+ --additional-config '{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true}}'
```
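The data-parallel flags above encode how the 4 DP ranks are laid out across the two nodes: 4 ranks in total, 2 ranks local to each node, and (on node1 below) a start rank of 2. A tiny sketch of that bookkeeping, with assumed values that mirror these commands:

```shell
# Per-node DP rank bookkeeping for this 2-node example (assumed values).
DP_SIZE=4              # --data-parallel-size: total DP ranks in the deployment
DP_SIZE_LOCAL=2        # --data-parallel-size-local: DP ranks hosted on this node
NODE_INDEX=0           # 0 for node0, 1 for node1

# node1 therefore passes --data-parallel-start-rank 2
DP_START_RANK=$((NODE_INDEX * DP_SIZE_LOCAL))
LAST_LOCAL_RANK=$((DP_START_RANK + DP_SIZE_LOCAL - 1))
echo "this node hosts DP ranks ${DP_START_RANK}..${LAST_LOCAL_RANK} of ${DP_SIZE}"
```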
- :::{note}
- Pipeline parallelism currently requires AsyncLLMEngine, hence the `--disable-frontend-multiprocessing` is set.
- :::

- Alternatively, if you want to use only tensor parallelism, set the tensor parallel size to the total number of NPUs in the cluster. For example, with 16 NPUs across 2 nodes, set the tensor parallel size to 16:
+ **node1**
```shell
- python -m vllm.entrypoints.openai.api_server \
- --model="Deepseek/DeepSeek-V2-Lite-Chat" \
- --trust-remote-code \
- --distributed_executor_backend "ray" \
- --enforce-eager \
- --tensor-parallel-size 16 \
- --port {port_num}
+ #!/bin/sh
+
+ nic_name="xxx"
+ local_ip="xxx"
+
+ export HCCL_IF_IP=$local_ip
+ export GLOO_SOCKET_IFNAME=$nic_name
+ export TP_SOCKET_IFNAME=$nic_name
+ export HCCL_SOCKET_IFNAME=$nic_name
+ export OMP_PROC_BIND=false
+ export OMP_NUM_THREADS=100
+ export VLLM_USE_V1=1
+ export HCCL_BUFFSIZE=1024
+
+ vllm serve /root/.cache/ds_v3 \
+ --host 0.0.0.0 \
+ --port 8004 \
+ --headless \
+ --data-parallel-size 4 \
+ --data-parallel-size-local 2 \
+ --data-parallel-start-rank 2 \
+ --data-parallel-address { node0 ip } \
+ --data-parallel-rpc-port 13389 \
+ --tensor-parallel-size 4 \
+ --seed 1024 \
+ --quantization ascend \
+ --served-model-name deepseek_v3 \
+ --max-num-seqs 16 \
+ --max-model-len 32768 \
+ --max-num-batched-tokens 4096 \
+ --enable-expert-parallel \
+ --trust-remote-code \
+ --no-enable-prefix-caching \
+ --gpu-memory-utilization 0.92 \
+ --additional-config '{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true}}'
```
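Note that node1 runs with `--headless` and registers its DP ranks with the engine on node0, so both scripts must be running before the service is usable. A minimal readiness check (a sketch; it assumes the node0 address and port 8004 configured above) is to list the served models from any machine that can reach node0:

```shell
# Replace { node0 ip } with the real address of node0.
# Once all DP ranks have started, this should return the "deepseek_v3" entry.
curl http://{ node0 ip }:8004/v1/models
```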

- :::{note}
- If you're running DeepSeek V3/R1, please remove the `quantization_config` section in the `config.json` file since it's not supported by vllm-ascend currently.
- :::
+ The deployment view looks like this:
+ ![Multi-node DP deployment](../assets/multi_node_dp.png)

Once your server is started, you can query the model with input prompts:

```shell
- curl -X POST http://127.0.0.1:{port_num}/v1/completions \
- -H "Content-Type: application/json" \
- -d '{
- "model": "Deepseek/DeepSeek-V2-Lite-Chat",
- "prompt": "The future of AI is",
- "max_tokens": 24
- }'
+ curl http://{ node0 ip }:8004/v1/completions \
+ -H "Content-Type: application/json" \
+ -d '{
+ "model": "/root/.cache/ds_v3",
+ "prompt": "The future of AI is",
+ "max_tokens": 50,
+ "temperature": 0
+ }'
```
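Because the server is launched with `--served-model-name deepseek_v3`, the model can also be addressed by that name. For chat-style usage, the OpenAI-compatible chat endpoint can be queried the same way (a sketch under the same host/port assumptions, and assuming the model ships a chat template):

```shell
curl http://{ node0 ip }:8004/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "deepseek_v3",
        "messages": [{"role": "user", "content": "Briefly introduce data parallel inference."}],
        "max_tokens": 50,
        "temperature": 0
    }'
```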

- If you query the server successfully, you can see the info shown below (client):
-
- ```
- {"id":"cmpl-6dfb5a8d8be54d748f0783285dd52303","object":"text_completion","created":1739957835,"model":"/home/data/DeepSeek-V2-Lite-Chat/","choices":[{"index":0,"text":" heavily influenced by neuroscience and cognitiveGuionistes. The goalochondria is to combine the efforts of researchers, technologists,","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":6,"total_tokens":30,"completion_tokens":24,"prompt_tokens_details":null}}
- ```
-
- Logs of the vllm server:
-
- ```
- INFO: 127.0.0.1:59384 - "POST /v1/completions HTTP/1.1" 200 OK
- INFO 02-19 17:37:35 metrics.py:453 Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1.9 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, NPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
+ ## Run benchmarks
+ For details please refer to [benchmark](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks)
+ ```shell
+ vllm bench serve --model /root/.cache/ds_v3 --served-model-name deepseek_v3 \
+ --dataset-name random --random-input-len 128 --random-output-len 128 \
+ --num-prompts 200 --trust-remote-code --base-url "http://{ node0 ip }:8004" --request-rate 1
```
