Commit a5f3359

angazenn authored
[CORE] Initial support for torchair with non-MLA backend (#1506)
### What this PR does / why we need it?

This PR supports torchair graph mode with a non-MLA backend on both 800I A2 and 300I Duo platforms. The main change is the addition of `attention_v1_torchair.py`, which supports the attention-related operations required by torchair.

### Does this PR introduce _any_ user-facing change?

Before this PR, vLLM-Ascend only allowed DeepSeek models to use torchair. Now it can also be used with Pangu. In addition, we add a supported-model list to control which types of models can use torchair.

### How was this patch tested?

We have tested it with PanguProMoE on both 800I A2 and 300I Duo platforms, and the model generates answers normally.

---------

Signed-off-by: angazenn <zengyanjia@huawei.com>
Signed-off-by: tianyitang <tangtianyi4@huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: tianyitang <tangtianyi4@huawei.com>
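To illustrate the user-facing change, here is a minimal sketch of enabling torchair graph mode for PanguProMoE (a non-MLA model). It assumes the offline `LLM` API with the `additional_config` argument; the model path, parallel size, and config values are illustrative assumptions taken from the docs and e2e tests changed in this commit, not prescriptions.

```python
# Sketch only: enabling torchair graph mode for a non-MLA model (PanguProMoE).
# `additional_config` keys follow the docs updated in this commit; the model
# path and parallel size are assumptions borrowed from the new e2e tests.
import os

from vllm import LLM, SamplingParams

os.environ["VLLM_USE_V1"] = "1"  # torchair graph mode requires the V1 engine

llm = LLM(
    model="vllm-ascend/pangu-pro-moe-pruing",  # pruned PanguProMoE test weights
    tensor_parallel_size=4,
    additional_config={
        "torchair_graph_config": {"enabled": True},
        # torchair currently only works without chunked prefill, so the tests
        # also enable the ascend scheduler
        "ascend_scheduler_config": {"enabled": True},
    },
)

outputs = llm.generate(["Hello, my name is"],
                       SamplingParams(temperature=0.0, max_tokens=5))
print(outputs[0].outputs[0].text)
```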
1 parent 9fbd801 commit a5f3359

File tree

19 files changed: +1130 −84 lines changed

.github/workflows/doc_codespell.yaml

Lines changed: 1 addition & 1 deletion

@@ -28,6 +28,6 @@ jobs:
       - name: Run codespell check
         run: |
           CODESPELL_EXCLUDES=('--skip' 'tests/prompts/**,./benchmarks/sonnet.txt,*tests/lora/data/**,build/**,./vllm_ascend.egg-info/**')
-          CODESPELL_IGNORE_WORDS=('-L' 'CANN,cann,NNAL,nnal,ASCEND,ascend,EnQue,CopyIn,assertIn')
+          CODESPELL_IGNORE_WORDS=('-L' 'CANN,cann,NNAL,nnal,ASCEND,ascend,EnQue,CopyIn,assertIn,rever')
 
           codespell --toml pyproject.toml "${CODESPELL_EXCLUDES[@]}" "${CODESPELL_IGNORE_WORDS[@]}"

.github/workflows/vllm_ascend_test.yaml

Lines changed: 1 addition & 1 deletion

@@ -86,7 +86,7 @@ jobs:
       - name: Run codespell check
         run: |
           CODESPELL_EXCLUDES=('--skip' 'tests/prompts/**,./benchmarks/sonnet.txt,*tests/lora/data/**,build/**,./vllm_ascend.egg-info/**')
-          CODESPELL_IGNORE_WORDS=('-L' 'CANN,cann,NNAL,nnal,ASCEND,ascend,EnQue,CopyIn,assertIn')
+          CODESPELL_IGNORE_WORDS=('-L' 'CANN,cann,NNAL,nnal,ASCEND,ascend,EnQue,CopyIn,assertIn,rever')
 
           codespell --toml pyproject.toml "${CODESPELL_EXCLUDES[@]}" "${CODESPELL_IGNORE_WORDS[@]}"
       - name: Analysing the code with ruff

docs/source/user_guide/additional_config.md

Lines changed: 4 additions & 4 deletions

@@ -40,14 +40,14 @@ The details of each config option are as follows:
 
 | Name | Type | Default | Description |
 | ---- | ---- | ------- | ----------- |
-| `enabled` | bool | `False` | Whether to enable torchair graph mode |
-| `enable_multistream_mla`| bool | `False` | Whether to put vector ops of MLA to another stream |
-| `enable_multistream_moe`| bool | `False` | Whether to enable multistream shared expert |
+| `enabled` | bool | `False` | Whether to enable torchair graph mode. Currently only DeepSeek series models and PanguProMoE support torchair graph mode |
+| `enable_multistream_mla`| bool | `False` | Whether to put vector ops of MLA to another stream. This option only takes effect on models using MLA (e.g., DeepSeek). |
+| `enable_multistream_moe`| bool | `False` | Whether to enable multistream shared expert. This option only takes effect on DeepSeek MoE models. |
 | `enable_view_optimize` | bool | `True` | Whether to enable torchair view optimization |
 | `use_cached_graph` | bool | `False` | Whether to use cached graph |
 | `graph_batch_sizes` | list[int] | `[]` | The batch size for torchair graph cache |
 | `graph_batch_sizes_init` | bool | `False` | Init graph batch size dynamically if `graph_batch_sizes` is empty |
-| `enable_kv_nz`| bool | `False` | Whether to enable kvcache NZ layout |
+| `enable_kv_nz`| bool | `False` | Whether to enable kvcache NZ layout. This option only takes effect on models using MLA (e.g., DeepSeek). |
 
 **ascend_scheduler_config**
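For reference, a hedged sketch of a `torchair_graph_config` exercising several of the options documented above; the values are illustrative assumptions, not recommended settings.

```python
# Illustrative only: key names follow the table above; values are assumptions.
additional_config = {
    "torchair_graph_config": {
        "enabled": True,                  # turn on torchair graph mode
        "use_cached_graph": True,         # reuse a previously compiled graph
        "graph_batch_sizes": [1, 4, 8],   # batch sizes captured for the graph cache
        "graph_batch_sizes_init": False,  # not needed when graph_batch_sizes is set
        "enable_view_optimize": True,
        # MLA-only options; they have no effect on non-MLA models such as PanguProMoE
        "enable_multistream_mla": False,
        "enable_kv_nz": False,
    },
}
```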

docs/source/user_guide/graph_mode.md

Lines changed: 1 addition & 1 deletion

@@ -12,7 +12,7 @@ From v0.9.1rc1 with V1 Engine, vLLM Ascend will run models in graph mode by default
 
 There are two kinds of graph mode supported by vLLM Ascend:
 - **ACLGraph**: This is the default graph mode supported by vLLM Ascend. In v0.9.1rc1, only Qwen series models are well tested.
-- **TorchAirGraph**: This is the GE graph mode. In v0.9.1rc1, only DeepSeek series models are supported.
+- **TorchAirGraph**: This is the GE graph mode. In v0.9.1rc1, only DeepSeek series models are supported. In v0.9.1rc2, we also support PanguProMoE with torchair.
 
 ## Using ACLGraph
 ACLGraph is enabled by default. Take Qwen series models as an example: just setting them to use the V1 Engine is enough.
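As a concrete (and hedged) reading of the "just use the V1 Engine" note above, a minimal sketch follows; the model name is illustrative and `VLLM_USE_V1` is the standard switch for vLLM's V1 engine.

```python
# Sketch only: under the V1 engine, ACLGraph is the default graph mode on Ascend,
# so no extra graph configuration is needed for a Qwen-series model.
import os

os.environ["VLLM_USE_V1"] = "1"  # select the V1 engine before importing vLLM

from vllm import LLM

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")  # illustrative Qwen-series model
print(llm.generate(["The capital of France is"])[0].outputs[0].text)
```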

format.sh

Lines changed: 1 addition & 1 deletion

@@ -145,7 +145,7 @@ CODESPELL_EXCLUDES=(
 )
 
 CODESPELL_IGNORE_WORDS=(
-  '-L' 'CANN,cann,NNAL,nnal,ASCEND,ascend,EnQue,CopyIn,assertIn'
+  '-L' 'CANN,cann,NNAL,nnal,ASCEND,ascend,EnQue,CopyIn,assertIn,rever'
 )
 
 # check spelling of specified files

tests/e2e/multicard/test_offline_inference_distributed.py

Lines changed: 17 additions & 0 deletions

@@ -165,3 +165,20 @@ def test_models_distributed_DeepSeek_W8A8():
             quantization="ascend",
     ) as vllm_model:
         vllm_model.generate_greedy(example_prompts, max_tokens)
+
+
+def test_models_distributed_pangu():
+    example_prompts = [
+        "Hello, my name is",
+    ]
+    max_tokens = 5
+
+    with VllmRunner(
+            snapshot_download("vllm-ascend/pangu-pro-moe-pruing"),
+            max_model_len=8192,
+            enforce_eager=True,
+            dtype="auto",
+            tensor_parallel_size=4,
+            distributed_executor_backend="mp",
+    ) as vllm_model:
+        vllm_model.generate_greedy(example_prompts, max_tokens)

tests/e2e/multicard/test_torchair_graph_mode.py

Lines changed: 60 additions & 0 deletions

@@ -99,3 +99,63 @@ def test_e2e_deepseekv3_with_torchair_ms_mla():
         },
     }
     _deepseek_torchair_test_fixture(additional_config)
+
+
+def _pangu_torchair_test_fixture(
+        additional_config: Dict,
+        *,
+        tensor_parallel_size=4,
+):
+    example_prompts = [
+        "Hello, my name is",
+        "The president of the United States is",
+        "The capital of France is",
+        "The future of AI is",
+    ]
+
+    # torchair only works without chunked prefill for now
+    kwargs = {
+        "ascend_scheduler_config": {
+            "enabled": True,
+        },
+        "refresh": True,
+    }
+    additional_config.update(**kwargs)
+
+    with VllmRunner(
+            "vllm-ascend/pangu-pro-moe-pruing",
+            dtype="half",
+            tensor_parallel_size=tensor_parallel_size,
+            distributed_executor_backend="mp",
+            enforce_eager=False,
+            additional_config=additional_config,
+    ) as vllm_model:
+        # use a greedy sampler to make sure the generated results are fixed
+        vllm_output = vllm_model.generate_greedy(example_prompts, 5)
+
+    # NOTE: vllm-ascend/pangu-pro-moe-pruing is only part of PanguProMoE
+    # with 2 hidden layers, so the golden results may look inaccurate.
+    # This will only change if accuracy changes with the official weights
+    # of PanguProMoE.
+    golden_results = [
+        'Hello, my name is Remempondeprecatedmiot忱',
+        'The president of the United States is Remem下的一个 rever ceremoni Segnali',
+        'The capital of France is Rememvoud administrativ Remem投',
+        'The future of AI isotope Segnali Zoeken精细化 supus',
+    ]
+
+    assert len(golden_results) == len(vllm_output)
+    for i in range(len(vllm_output)):
+        assert golden_results[i] == vllm_output[i][1]
+        print(f"Generated text: {vllm_output[i][1]!r}")
+
+
+@pytest.mark.skipif(os.getenv("VLLM_USE_V1") == "0",
+                    reason="torchair graph is not supported on v0")
+def test_e2e_pangu_with_torchair():
+    additional_config = {
+        "torchair_graph_config": {
+            "enabled": True,
+        },
+    }
+    _pangu_torchair_test_fixture(additional_config)
