Commit 5d62393

[DOC] Update multi_node.md (#468)
### What this PR does / why we need it?

- Added instructions for verifying the multi-node communication environment.
- Included explanations of Ray-related environment variables for configuration.
- Provided detailed steps for launching services in a multi-node environment.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually tested.

Signed-off-by: jinyuxin <jinyuxin2@huawei.com>
1 parent f6cf92e commit 5d62393

docs/source/tutorials/multi_node.md

Lines changed: 116 additions & 34 deletions
@@ -1,17 +1,71 @@
# Multi-Node (DeepSeek)

Multi-node inference is suitable for scenarios where the model cannot be deployed on a single NPU. In such cases, the model can be distributed using tensor parallelism and pipeline parallelism. The specific parallelism strategies will be covered in the following sections. To successfully deploy multi-node inference, the following three steps need to be completed:

* **Verify Multi-Node Communication Environment**
* **Set Up and Start the Ray Cluster**
* **Start the Online Inference Service on Multi-Node**

## Verify Multi-Node Communication Environment

### Physical Layer Requirements:

- The physical machines must be located on the same LAN, with network connectivity.
- All NPUs are connected with optical modules, and the connection status must be normal.

### Verification Process:

Execute the following commands on each node in sequence. The results must all be `success` and the status must be `UP`:

```bash
# Check the remote switch ports
for i in {0..7}; do hccn_tool -i $i -lldp -g | grep Ifname; done
# Get the link status of the Ethernet ports (UP or DOWN)
for i in {0..7}; do hccn_tool -i $i -link -g ; done
# Check the network health status
for i in {0..7}; do hccn_tool -i $i -net_health -g ; done
# View the network detected IP configuration
for i in {0..7}; do hccn_tool -i $i -netdetect -g ; done
# View gateway configuration
for i in {0..7}; do hccn_tool -i $i -gateway -g ; done
# View NPU network configuration
cat /etc/hccn.conf
```
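
Rather than eyeballing eight lines of output per check, a small helper can flag problems directly. This is a hedged sketch, not part of the original guide: the exact wording that `hccn_tool` prints can vary between driver versions, so the grep patterns below are assumptions.

```bash
# Hedged sketch: report any NPU whose link is not UP or whose network health check fails.
# The "UP" / "success" patterns assume the usual hccn_tool output wording.
for i in {0..7}; do
    hccn_tool -i $i -link -g | grep -q "UP" || echo "NPU $i: link is not UP"
    hccn_tool -i $i -net_health -g | grep -qi "success" || echo "NPU $i: net health check failed"
done
```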

### NPU Interconnect Verification:
#### 1. Get NPU IP Addresses
```bash
for i in {0..7}; do hccn_tool -i $i -ip -g | grep ipaddr; done
```

#### 2. Cross-Node PING Test
```bash
# Ping the NPU IP of another node (replace 10.20.0.20 with an address obtained in step 1)
hccn_tool -i 0 -ping -g address 10.20.0.20
```
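
For a full-mesh check it can help to script the ping test across every local device and every remote NPU IP. The sketch below is an assumption: `REMOTE_IPS` is a placeholder you fill in with the addresses collected in step 1 on the other node(s).

```bash
# Hedged sketch: ping every remote NPU IP from every local device.
REMOTE_IPS="10.20.0.20 10.20.0.21"   # placeholder values from step 1
for i in {0..7}; do
    for ip in $REMOTE_IPS; do
        echo "device $i -> $ip"
        hccn_tool -i $i -ping -g address $ip
    done
done
```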

## Set Up and Start the Ray Cluster
### Setting Up the Basic Container
To ensure a consistent execution environment across all nodes, including the model path and Python environment, it is recommended to use Docker images.

For setting up a multi-node inference cluster with Ray, **containerized deployment** is the preferred approach. Containers should be started on both the head and worker nodes, with the `--net=host` option to enable proper network connectivity.

Below is the example container setup command, which should be executed on **all nodes**:

```shell
# Define the image and container name
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
export NAME=vllm-ascend

# Run the container using the defined variables
docker run --rm \
--name $NAME \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
# ... (lines 72-80 of the updated file are not shown in this diff hunk: @@ -27,65 +81,93 @@) ...
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it $IMAGE bash
```
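
As a quick sanity check (not part of the original instructions), you can confirm inside each container that the NPUs are visible before moving on; this assumes the host's Ascend driver tools are mounted into the container, as in the full command above.

```shell
# Inside the container on each node: list the NPUs the container can see.
# Assumes npu-smi and the driver libraries are mounted from the host.
npu-smi info
```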

### Start Ray Cluster
After setting up the containers and installing vllm-ascend on each node, follow the steps below to start the Ray cluster and execute inference tasks.

Choose one machine as the head node and the others as worker nodes. Before proceeding, use `ip addr` to check your `nic_name` (network interface name).

Set the `ASCEND_RT_VISIBLE_DEVICES` environment variable to specify the NPU devices to use. For Ray versions above 2.1, also set the `RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES` variable to avoid device recognition issues. The `--num-gpus` parameter defines the number of NPUs to be used on each node.

Below are the commands for the head and worker nodes:

**Head node**:

:::{note}
When starting a Ray cluster for multi-node inference, the environment variables on each node must be set **before** starting the Ray cluster for them to take effect. Updating the environment variables requires restarting the Ray cluster.
:::

```shell
# Head node
export HCCL_IF_IP={local_ip}
export GLOO_SOCKET_IFNAME={nic_name}
export TP_SOCKET_IFNAME={nic_name}
export RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES=1
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
ray start --head --num-gpus=8
```
**Worker node**:

```shell
# Worker node
export HCCL_IF_IP={local_ip}
export GLOO_SOCKET_IFNAME={nic_name}
export TP_SOCKET_IFNAME={nic_name}
export RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES=1
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
ray start --address='{head_node_ip}:{port_num}' --num-gpus=8 --node-ip-address={local_ip}
```
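
Because changed environment variables only take effect after the cluster is restarted (see the note above), a restart on a node looks roughly like the hedged sketch below; the worker-node `ray start` line is the one from the block above, and the head node would use `ray start --head --num-gpus=8` instead.

```shell
# Hedged sketch: restart Ray on a worker node after updating environment variables.
ray stop   # stop the local Ray processes on this node
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7   # re-export the updated variables first
ray start --address='{head_node_ip}:{port_num}' --num-gpus=8 --node-ip-address={local_ip}
```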
:::{tip}
Before starting the Ray cluster, export the `ASCEND_PROCESS_LOG_PATH={plog_save_path}` environment variable on each node to redirect the Ascend plog, which helps in debugging issues during multi-node execution.
:::

Once the cluster is started on multiple nodes, execute `ray status` and `ray list nodes` to verify the Ray cluster's status. You should see the correct number of nodes and NPUs listed.
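
If you script the launch, a small hedged check like the one below (assuming two nodes and that the `ray` CLI is available inside the container) can wait until every node has joined before vLLM is started:

```shell
# Hedged sketch: block until the expected number of Ray nodes report ALIVE.
EXPECTED_NODES=2
until [ "$(ray list nodes 2>/dev/null | grep -c ALIVE)" -ge "$EXPECTED_NODES" ]; do
    echo "Waiting for all Ray nodes to join the cluster..."
    sleep 5
done
ray status   # NPUs registered via --num-gpus appear as GPU resources in this summary
```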

## Start the Online Inference Service on Multi-Node
In the container, you can use vLLM as if all NPUs were on a single node. vLLM will utilize NPU resources across all nodes in the Ray cluster. You only need to run the vLLM command on one node.

To set up parallelism, the common practice is to set the `tensor-parallel-size` to the number of NPUs per node, and the `pipeline-parallel-size` to the number of nodes.

For example, with 16 NPUs across 2 nodes (8 NPUs per node), set the tensor parallel size to 8 and the pipeline parallel size to 2:

```shell
python -m vllm.entrypoints.openai.api_server \
--model="Deepseek/DeepSeek-V2-Lite-Chat" \
--trust-remote-code \
--enforce-eager \
--distributed_executor_backend "ray" \
--tensor-parallel-size 8 \
--pipeline-parallel-size 2 \
--disable-frontend-multiprocessing \
--port {port_num}
```
:::{note}
Pipeline parallelism currently requires AsyncLLMEngine, hence `--disable-frontend-multiprocessing` is set.
:::

Alternatively, if you want to use only tensor parallelism, set the tensor parallel size to the total number of NPUs in the cluster. For example, with 16 NPUs across 2 nodes, set the tensor parallel size to 16:

```shell
python -m vllm.entrypoints.openai.api_server \
--model="Deepseek/DeepSeek-V2-Lite-Chat" \
--trust-remote-code \
--distributed_executor_backend "ray" \
--enforce-eager \
--tensor-parallel-size 16 \
--port {port_num}
```

:::{note}
If you're running DeepSeek V3/R1, please remove the `quantization_config` section from the `config.json` file, since it's not supported by vllm-ascend currently.
:::
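
A minimal sketch of that edit, assuming the weights are already downloaded and using `{model_path}` as a placeholder for their local directory (the path is illustrative, not from the original guide):

```shell
# Hedged sketch: drop the quantization_config entry from the model's config.json.
# {model_path} is a placeholder; consider backing the file up first.
python3 -c "
import json
path = '{model_path}/config.json'
with open(path) as f:
    cfg = json.load(f)
cfg.pop('quantization_config', None)
with open(path, 'w') as f:
    json.dump(cfg, f, indent=2)
"
```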

Once your server is started, you can query the model with input prompts:
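
The original request example falls outside the hunks shown in this diff. A minimal request against the OpenAI-compatible completions endpoint might look like the following; the prompt, `max_tokens`, and the assumption that the server is reachable at `localhost:{port_num}` are illustrative:

```shell
curl http://localhost:{port_num}/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Deepseek/DeepSeek-V2-Lite-Chat",
        "prompt": "The future of AI is",
        "max_tokens": 64,
        "temperature": 0
    }'
```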

@@ -109,5 +191,5 @@ Logs of the vllm server:

```
INFO: 127.0.0.1:59384 - "POST /v1/completions HTTP/1.1" 200 OK
INFO 02-19 17:37:35 metrics.py:453] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1.9 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, NPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
```
