
Commit 1ce41a7

ttanzhiqiang authored and wangxiaoxin (A) committed
etp best a2 (#1101)
### What this PR does / why we need it?
Single-machine, 16-card DeepSeek-R1: attention runs with tp8/dp2, MoE runs with etp. Best performance relies on:
- vllm-ascend commit id: da9acfca6053352730fce75fb772e214755d0341
- vllm commit id: b124e1085b1bf977e3dac96d99ffd9d8ddfdb6cc
- #910
- #1100 [Reduce _npu_flash_attention mask to 128x128 for memory savings]
- [Reduce memory usage by splitting tokens in fused_experts]

---------

Signed-off-by: ttanzhiqiang <389825161@qq.com>
1 parent 1288956 commit 1ce41a7
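For orientation, the parallel layout named in the title maps onto the 16 NPUs roughly as sketched below. This is an interpretation of the flags used in examples/run_dp_attention_etp16.sh, not text from the commit itself.

# Sketch (assumption): how the launch flags partition the 16 NPUs.
TP=8    # -tp=8: tensor parallelism for attention layers
DP=2    # -dp=2: data-parallel attention groups
ETP=16  # "expert_tensor_parallel_size":16: MoE expert weights sharded across all cards
echo "attention uses $((TP * DP)) NPUs (TP x DP)"      # prints 16
echo "MoE expert weights are split across $ETP NPUs"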

File tree

2 files changed: +79, -0 lines changed


examples/run_dp_attention_etp16.sh

Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
export VLLM_ENABLE_MC2=0
export VLLM_USE_V1=1
export TASK_QUEUE_ENABLE=1
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export ASCEND_LAUNCH_BLOCKING=0
export VLLM_VERSION=0.9.0

nohup python -m vllm.entrypoints.openai.api_server --model=/mnt/deepseek/DeepSeek-R1-W8A8-VLLM \
  --quantization ascend \
  --trust-remote-code \
  --distributed-executor-backend=mp \
  --port 8006 \
  -tp=8 \
  -dp=2 \
  --max-num-seqs 24 \
  --max-model-len 32768 \
  --max-num-batched-tokens 32768 \
  --block-size 128 \
  --no-enable-prefix-caching \
  --additional-config '{"torchair_graph_config":{"enabled":true,"use_cached_graph":true,"graph_batch_sizes":[24]},"ascend_scheduler_config":{"enabled":true},"expert_tensor_parallel_size":16}' \
  --gpu-memory-utilization 0.96 &> run.log &
disown
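Once run.log shows the API server is up, a quick smoke test is a standard OpenAI-style completion request against the configured port. This is a minimal sketch, not part of the commit; it assumes the default vLLM OpenAI-compatible /v1/completions route and uses an arbitrary prompt.

# Smoke test for the server launched above (assumes it is listening on port 8006).
curl -s http://localhost:8006/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "/mnt/deepseek/DeepSeek-R1-W8A8-VLLM",
        "prompt": "Hello, my name is",
        "max_tokens": 16
      }'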
Lines changed: 56 additions & 0 deletions
@@ -0,0 +1,56 @@
#!/bin/bash
# Concurrency array
concurrency_array=(48)
# best rate
rate_array=(0.7)

# Result file
result_file="benchmark_results.txt"
echo "Benchmark Results" > $result_file
echo "===================" >> $result_file

# Loop through all combinations
for concurrency in "${concurrency_array[@]}"; do
  for rate in "${rate_array[@]}"; do
    echo "Testing with concurrency=$concurrency, rate=$rate"
    echo "" >> $result_file
    echo "Concurrency: $concurrency, Request Rate: $rate" >> $result_file
    echo "-------------------" >> $result_file

    # Run benchmark test
    python /mnt/deepseek/vllm/benchmarks/benchmark_serving.py \
      --backend vllm \
      --trust-remote-code \
      --model /mnt/deepseek/DeepSeek-R1-W8A8-VLLM \
      --dataset-name random \
      --random-input-len 4096 \
      --random-output-len 1536 \
      --ignore-eos \
      --num-prompts 400 \
      --max-concurrency $concurrency \
      --request-rate $rate \
      --metric-percentiles 90 \
      --base-url http://localhost:8006 2>&1 | tee -a $result_file

    # Wait for system cool down
    sleep 30
  done
done

# Analyze results
echo "Analysis Results" > analysis_results.txt
echo "=================" >> analysis_results.txt

# Extract and analyze TPOT data
echo "TPOT Analysis:" >> analysis_results.txt
grep "Mean TPOT" $result_file | awk -F':' '{
  printf "Concurrency %s, Rate %s: %s ms\n", $1, $2, $3
}' >> analysis_results.txt

# Extract and analyze throughput data
echo -e "\nThroughput Analysis:" >> analysis_results.txt
grep "Output token throughput" $result_file | awk -F':' '{
  printf "Concurrency %s, Rate %s: %s tokens/s\n", $1, $2, $3
}' >> analysis_results.txt

echo "Testing completed. Results saved in $result_file and analysis in analysis_results.txt"
