Description
The model to consider.
https://huggingface.co/deepseek-ai/DeepSeek-V2-Lite
The closest model vllm already supports.
https://huggingface.co/deepseek-ai/DeepSeek-V3
What's your difficulty of supporting the model you want?
The operators provided by torch_npu do not support this model. The errors are as follows.
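For context, this is a rough sketch of the failing script (test_vllm_ascend_deepseek_v2.py from the traceback). Only the llm.generate(prompt_token_ids=..., sampling_params=...) call is taken from the traceback; the model path, prompt, tensor_parallel_size, and sampling settings are illustrative assumptions:

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

# Illustrative placeholder; a local checkpoint path was used in the original run.
model_path = "deepseek-ai/DeepSeek-V2-Lite"

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
prompt_token_ids = [tokenizer.encode("Hello, my name is")]

llm = LLM(
    model=model_path,
    trust_remote_code=True,
    tensor_parallel_size=2,  # assumption: the traceback goes through the multiprocess distributed executor
    max_model_len=4096,
)
sampling_params = SamplingParams(temperature=0.0, max_tokens=32)

# This is the call at test_vllm_ascend_deepseek_v2.py:36 in the traceback.
outputs = llm.generate(prompt_token_ids=prompt_token_ids,
                       sampling_params=sampling_params)
for output in outputs:
    print(output.outputs[0].text)

Running a script along these lines on NPU fails while torchair compiles the graph, producing the traceback below: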
[rank0]: During handling of the above exception, another exception occurred:
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/libingnan/project/test_vllm/test_vllm_ascend_deepseek_v2.py", line 36, in
[rank0]: outputs = llm.generate(prompt_token_ids=prompt_token_ids, sampling_params=sampling_params)
[rank0]: File "/vllm-workspace/vllm-main/vllm/utils.py", line 1212, in inner
[rank0]: return fn(*args, **kwargs)
[rank0]: File "/vllm-workspace/vllm-main/vllm/entrypoints/llm.py", line 479, in generate
[rank0]: outputs = self._run_engine(use_tqdm=use_tqdm)
[rank0]: File "/vllm-workspace/vllm-main/vllm/entrypoints/llm.py", line 1464, in _run_engine
[rank0]: step_outputs = self.llm_engine.step()
[rank0]: File "/vllm-workspace/vllm-main/vllm/engine/llm_engine.py", line 1405, in step
[rank0]: outputs = self.model_executor.execute_model(
[rank0]: File "/vllm-workspace/vllm-main/vllm/executor/executor_base.py", line 299, in execute_model
[rank0]: driver_outputs = self._driver_execute_model(execute_model_req)
[rank0]: File "/vllm-workspace/vllm-main/vllm/executor/mp_distributed_executor.py", line 144, in _driver_execute_model
[rank0]: return self.driver_worker.execute_model(execute_model_req)
[rank0]: File "/vllm-workspace/vllm-main/vllm/worker/worker_base.py", line 420, in execute_model
[rank0]: output = self.model_runner.execute_model(
[rank0]: File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/vllm-workspace/vllm-ascend-main/vllm_ascend/worker/model_runner.py", line 1402, in execute_model
[rank0]: hidden_or_intermediate_states = model_executable(
[rank0]: File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 465, in _fn
[rank0]: return fn(*args, **kwargs)
[rank0]: File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/vllm-workspace/vllm-ascend-main/vllm_ascend/models/deepseek_v2.py", line 670, in forward
[rank0]: def forward(
[rank0]: File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 632, in _fn
[rank0]: return fn(*args, **kwargs)
[rank0]: File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py", line 1100, in forward
[rank0]: return compiled_fn(full_args)
[rank0]: File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 321, in runtime_wrapper
[rank0]: all_outs = call_func_at_runtime_with_args(
[rank0]: File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/utils.py", line 124, in call_func_at_runtime_with_args
[rank0]: out = normalize_as_list(f(args))
[rank0]: File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 667, in inner_fn
[rank0]: outs = compiled_fn(args)
[rank0]: File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 488, in wrapper
[rank0]: return compiled_fn(runtime_args)
[rank0]: File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/utils.py", line 98, in g
[rank0]: return f(*args)
[rank0]: File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch_npu/dynamo/torchair/npu_fx_compiler.py", line 260, in call
[rank0]: gm_result = self.runner(*args, **kwargs)
[rank0]: File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch_npu/dynamo/torchair/_ge_concrete_graph/fx2ge_converter.py", line 537, in call
[rank0]: self.compile()
[rank0]: File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch_npu/dynamo/torchair/_ge_concrete_graph/fx2ge_converter.py", line 627, in compile
[rank0]: self.graph.compile()
[rank0]: File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch_npu/dynamo/torchair/ge/_ge_graph.py", line 657, in compile
[rank0]: self._executor.compile()
[rank0]: File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch_npu/dynamo/torchair/utils/error_code.py", line 46, in wapper
[rank0]: raise type(e)("\n".join(msg))
[rank0]: RuntimeError: EZ9999: Inner Error!
[rank0]: EZ9999: [PID: 62938] 2025-05-27-06:52:12.455.807 numHeads / numKvHeads = 8, MLA only support {32, 64, 128}.[FUNC:CheckMlaAttrs][FILE:incre_flash_attention_tiling_check.cc][LINE:1218]
[rank0]: TraceBack (most recent call last):
[rank0]: Assert (((funcs->tiling)(reinterpret_cast<gert::TilingContext *>(tiling_context_holder.context))) == ge::GRAPH_SUCCESS) failed[FUNC:RtParseAndTiling][FILE:op_tiling_rt2.cc][LINE:526]
[rank0]: [GenTask][CalcExtOpRunningParam] CalcTilingSinkRunningParam failed.[FUNC:CalcExtOpRunningParam][FILE:aicore_ops_kernel_builder.cc][LINE:259]
[rank0]: [GenTask][CalcOpRunningParam] CalcExtOpRunningParam failed.[FUNC:CalcOpRunningParam][FILE:aicore_ops_kernel_builder.cc][LINE:227]
[rank0]: Call Calculate op:FusedInferAttentionScore(FusedInferAttentionScore) running param failed[FUNC:CalcOpParam][FILE:graph_builder.cc][LINE:211]
[rank0]: [Call][PreRun] Failed, graph_id:0, session_id:0.[FUNC:CompileGraph][FILE:graph_manager.cc][LINE:4545]
[rank0]: [Compile][Graph]Compile graph failed, error code:1343225857, session_id:0, graph_id:0.[FUNC:CompileGraph][FILE:ge_api.cc][LINE:1280]
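The failing check is the FusedInferAttentionScore operator's MLA tiling (incre_flash_attention_tiling_check.cc): it only accepts a numHeads / numKvHeads ratio of 32, 64, or 128, while this model produces a ratio of 8. The difference from DeepSeek-V3 shows up directly in the model configs; the snippet below is a hedged sketch for inspecting them (it assumes Hub access and that both remote configs expose num_attention_heads and num_key_value_heads):

from transformers import AutoConfig

# Both repos ship custom config classes, hence trust_remote_code.
for name in ("deepseek-ai/DeepSeek-V2-Lite", "deepseek-ai/DeepSeek-V3"):
    cfg = AutoConfig.from_pretrained(name, trust_remote_code=True)
    print(name,
          "num_attention_heads =", cfg.num_attention_heads,
          "num_key_value_heads =", cfg.num_key_value_heads)

# DeepSeek-V2-Lite reports 16 attention heads while DeepSeek-V3 reports 128,
# which is presumably why the per-rank head ratio seen by the kernel here is 8
# and falls outside the supported set {32, 64, 128}.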