
Commit 2690697

[Bugfix] Reset all unused positions to prevent out-of-bounds in GatherV3 (#1416)
### What this PR does / why we need it?

Reset all unused positions in `NPUModelRunner` to prevent out-of-bounds asserts in the `GatherV3` operator.

Currently, in [`get_splitfuse_attn_mask`](https://github.com/vllm-project/vllm-ascend/blob/main/vllm_ascend/attention/attention.py#L124), the `position` tensor may contain values that exceed the dimensions of the attention mask, triggering a `GatherV3` boundary-check failure. These invalid indices are stale "dirty" entries left in `position` by the padding logic of the ACL graph. Specifically, in [`_process_reqs`](https://github.com/vllm-project/vllm-ascend/blob/main/vllm_ascend/worker/model_runner_v1.py#L989), `num_input_tokens` is always greater than or equal to `total_num_scheduled_tokens`, so any positions not explicitly cleared after a previous batch persist and cause this sporadic error.

Note that in the original vLLM implementation, masks are constructed internally from other arguments, so these lingering values never surface. On the Ascend platform, however, split-fuse attention requires externally supplied masks, so the residual indices become critical and lead to this elusive, hard-to-reproduce failure.

The fix is to explicitly reset (zero out) all unused entries in the `position` tensor before it is used to index the attention mask, ensuring that every index lies within the mask's valid range.

Closes: #1038

### Does this PR introduce _any_ user-facing change?

No

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
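A minimal CPU-side sketch of the failure mode (the tensor names and sizes below are illustrative only, not taken from the repository): if the padded tail of the positions buffer keeps stale values larger than the mask's first dimension, indexing the mask with those positions fails with the same bounds violation that `GatherV3` asserts on when gathering mask rows on the NPU.

```python
import torch

# Hypothetical sizes, chosen only to illustrate the bug.
max_seq_len = 128
attn_mask = torch.zeros(max_seq_len, max_seq_len)

# Persistent positions buffer, reused across batches by the model runner.
positions = torch.full((16,), 500, dtype=torch.long)  # stale entries from a longer batch

total_num_scheduled_tokens = 4   # tokens actually scheduled this step
num_input_tokens = 16            # padded size required by the ACL graph
positions[:total_num_scheduled_tokens] = torch.arange(total_num_scheduled_tokens)

# Gathering mask rows with the dirty tail would read indices >= max_seq_len;
# on NPU this trips GatherV3's boundary check:
#   attn_mask[positions]  ->  IndexError / GatherV3 assert

# The fix: reset every entry between the scheduled tokens and the padded size.
positions[total_num_scheduled_tokens:num_input_tokens].zero_()
mask_rows = attn_mask[positions]  # safe: every index is now within range
print(mask_rows.shape)            # torch.Size([16, 128])
```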
1 parent 06ccce1 commit 2690697

File tree

1 file changed: +1 -0


vllm_ascend/worker/model_runner_v1.py

Lines changed: 1 addition & 0 deletions
```diff
@@ -953,6 +953,7 @@ def _process_reqs(
                 self.mrope_positions_cpu[:, :total_num_scheduled_tokens],
                 non_blocking=True)
 
+        self.positions[total_num_scheduled_tokens:num_input_tokens].zero_()
         self.positions[:total_num_scheduled_tokens].copy_(
             self.positions_cpu[:total_num_scheduled_tokens], non_blocking=True)
         positions = self.positions[:num_input_tokens]
```
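One note on the shape of the fix, inferred from the description above: the tail is zeroed in place rather than `positions` being truncated to `total_num_scheduled_tokens`, presumably because the ACL graph's padding requires the tensor to keep its fixed `num_input_tokens` size; overwriting the unused slots with zero keeps that shape while guaranteeing every entry is a valid mask index.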
