
Commit 594721a

add todo and faq
Signed-off-by: MengqingCao <cmq0113@163.com>
1 parent 5a8c482 commit 594721a

2 files changed: +11 -0 lines changed


docs/source/faqs.md

Lines changed: 10 additions & 0 deletions
@@ -119,3 +119,13 @@ In scenarios where NPUs have limited HBM (High Bandwidth Memory) capacity, dynam
- **Adjust `--gpu-memory-utilization`**: If unspecified, the default value of `0.9` is used. You can decrease this parameter to reserve more memory and reduce fragmentation risks. See more notes in: [vLLM - Inference and Serving - Engine Arguments](https://docs.vllm.ai/en/latest/serving/engine_args.html#vllm.engine.arg_utils-_engine_args_parser-cacheconfig).

- **Configure `PYTORCH_NPU_ALLOC_CONF`**: Set this environment variable to optimize NPU memory management. For example, you can `export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True` to enable the virtual memory feature, which mitigates memory fragmentation caused by frequent dynamic memory size adjustments at runtime. See more notes in: [PYTORCH_NPU_ALLOC_CONF](https://www.hiascend.com/document/detail/zh/Pytorch/700/comref/Envvariables/Envir_012.html). A combined sketch of both settings is shown below.
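Both knobs can also be applied through the vLLM Python API rather than the CLI. A minimal, illustrative sketch (the model name and utilization value are placeholders, not recommendations):

```python
import os

# Set the allocator option before vLLM initializes any NPU memory.
os.environ["PYTORCH_NPU_ALLOC_CONF"] = "expandable_segments:True"

from vllm import LLM

# Lower gpu_memory_utilization from the default 0.9 to leave extra headroom.
llm = LLM(model="deepseek-ai/DeepSeek-V2-Lite", gpu_memory_utilization=0.85)
```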

### 15. Failed to enable NPU graph mode when running DeepSeek?

You may encounter the following error when running DeepSeek with NPU graph mode enabled. When both MLA and graph mode are enabled, the allowed number of queries per kv is limited to {32, 64, 128}, **so this is not supported for DeepSeek-V2-Lite**, which only has 16 attention heads. NPU graph mode support for DeepSeek-V2-Lite will be added in the future.

If you're using DeepSeek-V3 or DeepSeek-R1, please make sure that after the tensor parallel split, `num_heads / num_kv_heads` is in {32, 64, 128} (see the sketch after the error output below).

```bash
[rank0]: RuntimeError: EZ9999: Inner Error!
[rank0]: EZ9999: [PID: 62938] 2025-05-27-06:52:12.455.807 numHeads / numKvHeads = 8, MLA only support {32, 64, 128}.[FUNC:CheckMlaAttrs][FILE:incre_flash_attention_tiling_check.cc][LINE:1218]
```
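As a quick way to validate this constraint before launching, here is a minimal sketch; it is not part of vllm-ascend, and it assumes that with MLA the queries-per-kv value equals the per-rank attention head count, i.e. `num_heads / num_kv_heads` after the tensor parallel split:

```python
# Hypothetical pre-flight check; the helper name and structure are illustrative only.
ALLOWED_NUM_QUERIES_PER_KV = {32, 64, 128}  # values quoted by the FAQ entry above

def check_mla_graph_mode(num_attention_heads: int, tensor_parallel_size: int) -> int:
    """Return queries-per-kv, raising if MLA + NPU graph mode would reject it."""
    if num_attention_heads % tensor_parallel_size != 0:
        raise ValueError("num_attention_heads must be divisible by tensor_parallel_size")
    num_queries_per_kv = num_attention_heads // tensor_parallel_size
    if num_queries_per_kv not in ALLOWED_NUM_QUERIES_PER_KV:
        raise ValueError(
            f"num_queries_per_kv={num_queries_per_kv} is not in {sorted(ALLOWED_NUM_QUERIES_PER_KV)}; "
            "disable graph mode or pick a different tensor parallel size")
    return num_queries_per_kv

print(check_mla_graph_mode(num_attention_heads=128, tensor_parallel_size=4))  # 32 -> allowed
# check_mla_graph_mode(num_attention_heads=16, tensor_parallel_size=2)  # 8 -> rejected, matching the log above
```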

vllm_ascend/attention/attention.py

Lines changed: 1 addition & 0 deletions
```diff
@@ -1008,6 +1008,7 @@ def __init__(
         if additional_config:
             self.enable_graph_mode = additional_config.get(
                 "enable_graph_mode", False)
+        # TODO: support numHeads / numKvHeads < 16 in MLA kernel
         if self.enable_graph_mode:
             assert self.num_queries_per_kv in ALLOWED_NUM_QUERIES_PER_KV, \
                 ("The allowed number of queries per kv when enabling both MLA and Graph mode"
```
