Spec decode support for V1 Engine #874


Merged: 1 commit merged into vllm-project:main from spec_v0.8.5rc1 on May 23, 2025

Conversation

@ponix-j (Contributor) commented May 15, 2025

What this PR does / why we need it?

Make spec decode support for V1 Engine

  • Ascend currently does not support Triton kernels, so the Triton kernel in rejection_sampler.py is rewritten in PyTorch. Since the PyTorch version does not match Triton's performance, the function will be reimplemented in Ascend C in the future (a sketch of the PyTorch approach is shown below).
  • Spec decode currently supports only the ngram algorithm; the eagle algorithm still needs further adaptation.
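
A minimal sketch of the standard rejection-sampling acceptance rule in plain PyTorch (the function name and shapes here are illustrative, not the PR's actual code):

```python
import torch

def rejection_sample(draft_token_ids: torch.Tensor,
                     draft_probs: torch.Tensor,
                     target_probs: torch.Tensor) -> torch.Tensor:
    """Return a boolean mask over draft tokens: True = accepted.

    draft_token_ids: [num_tokens]         tokens proposed by the draft model
    draft_probs:     [num_tokens, vocab]  draft-model probabilities
    target_probs:    [num_tokens, vocab]  target-model probabilities
    """
    idx = torch.arange(draft_token_ids.numel())
    p_target = target_probs[idx, draft_token_ids]
    p_draft = draft_probs[idx, draft_token_ids]
    # Accept each draft token with probability min(1, p_target / p_draft).
    accept = torch.rand_like(p_target) <= (p_target / p_draft).clamp(max=1.0)
    # Spec decode keeps only a contiguous accepted prefix: once a token is
    # rejected, all later draft tokens are dropped as well.
    return accept.int().cumprod(dim=0).bool()
```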

Does this PR introduce any user-facing change?

No user-facing change.

How was this patch tested?

Tested by tests/singlecard/spec_decode/e2e/test_v1_spec_decode.py and tests/sample/test_rejection_sampler.py, which cover the base functionality of the rejection sampler and the end-to-end spec decode flow.

@ponix-j changed the title from "Spec v0.8.5rc1" to "Spec decode v0.8.5rc1" on May 15, 2025
@@ -0,0 +1,587 @@
# SPDX-License-Identifier: Apache-2.0
from typing import Optional
Collaborator

rejection_sampler and eagle_proposer are fully overwritten.
Can't this be implemented so that the main body uses vllm code and only the problematic parts are patched in vllm_ascend?

Contributor Author

Done: class AscendRejectionSampler now inherits from class RejectionSampler, extracting only the changed parts.
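
A minimal sketch of that inheritance pattern (import path per vllm's v1 code at the time; the actual method split lives in the PR):

```python
from vllm.v1.sample.rejection_sampler import RejectionSampler


class AscendRejectionSampler(RejectionSampler):
    """Reuses vllm's RejectionSampler wholesale, overriding only the
    Triton-backed pieces with PyTorch equivalents that run on Ascend
    NPUs. Everything not overridden is inherited unchanged."""
```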

@mengwei805 (Collaborator)

Spec decode is a key feature, can you add ngram and eagle e2e UTs in vllm-ascend?

@ponix-j (Contributor Author) commented May 16, 2025

Spec decode is a key feature, can you add ngram and eagle e2e UTs in vllm-ascend?

Yes, already added test_spec_decode.py.

@mengwei805 (Collaborator)

Please rebase all your commits into 1 commit.

@ponix-j ponix-j force-pushed the spec_v0.8.5rc1 branch 2 times, most recently from 9f5ae8a to ed32fcf Compare May 19, 2025 04:04
@github-actions github-actions bot added documentation Improvements or additions to documentation module:ops module:quantization labels May 19, 2025
@github-actions github-actions bot removed documentation Improvements or additions to documentation module:ops module:quantization labels May 19, 2025
@ponix-j ponix-j force-pushed the spec_v0.8.5rc1 branch 2 times, most recently from 9e7bc8b to aa2bb74 Compare May 20, 2025 01:16

@pytest.fixture
def model_name():
return "meta-llama/Meta-Llama-3-8B-Instruct"
Collaborator

Let's use modelscope instead

Suggested change:
-    return "meta-llama/Meta-Llama-3-8B-Instruct"
+    return "LLM-Research/Meta-Llama-3.1-8B-Instruct"


@pytest.fixture
def eagle_model_name():
return "yuhuili/EAGLE-LLaMA3-Instruct-8B"
Collaborator

ditto

@@ -0,0 +1,70 @@
# SPDX-License-Identifier: Apache-2.0
Collaborator

It's not recommended to make a separate patch dir for spec-decode. Let's move these patches to vllm_ascend/patch/platform or vllm_ascend/patch/worker, depending on which phase is appropriate.

Most importantly, please add comments in vllm_ascend/patch/__init__.py describing why we make this patch.

Contributor Author

Already fixed.

# a, a + 1, ..., a + b - n2 - 1,
# a + b, a + b + 1, ..., a + b + c - n3 - 1]

# [0, a, a + b, a + b + c] -> [a, b, c]
Collaborator

Please remove the useless comments.

Contributor Author

This is copied from vllm/eagle.py; why is it useless?

Collaborator

I mistakenly thought it was a CUDA-specific comment, please ignore it.
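
For context, the quoted comment describes turning cumulative query offsets plus per-request rejection counts into flattened token indices; a hedged PyTorch illustration (names and signature assumed, not the PR's code):

```python
import torch

def build_token_indices(cu_query_lens: torch.Tensor,
                        num_rejected: torch.Tensor) -> torch.Tensor:
    # [0, a, a + b, a + b + c] -> [a, b, c]
    query_lens = cu_query_lens[1:] - cu_query_lens[:-1]
    # Keep only each request's accepted prefix: [a - n1, b - n2, c - n3]
    kept = query_lens - num_rejected
    starts = cu_query_lens[:-1]
    # [0, 1, ..., a - n1 - 1,  a, a + 1, ..., a + b - n2 - 1, ...]
    return torch.cat([torch.arange(s, s + k)
                      for s, k in zip(starts.tolist(), kept.tolist())])
```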

@@ -100,6 +100,7 @@ def adapt_patch(is_global_patch: bool = False):
if is_global_patch:
from vllm_ascend.patch import platform # noqa: F401
else:
from vllm_ascend.patch import spec_decode # noqa: F401
Collaborator

Let's remove this once the above suggestion on the patch dir is resolved.

Contributor Author

Already fixed.

# [0, 1, 2, 5, 6, 9]
target_logits_indices += arange

# TODO: Optimize the CPU -> GPU copy.
Collaborator

Suggested change:
- # TODO: Optimize the CPU -> GPU copy.
+ # TODO: Optimize the CPU -> NPU copy.

Contributor Author

Already fixed.

@@ -737,12 +888,92 @@ def execute_model(
if max_gen_len == 1:
# No spec decode tokens.
valid_sampled_token_ids = sampled_token_ids.tolist()
else:
# Includes spec decode tokens.
Collaborator

Could we extract the model execution code into a separate function to make the code clearer?

Contributor Author

Already fixed.

@ponix-j (Contributor Author) commented May 21, 2025

Please rebase all your commits into 1 commit.

Will squash into a single commit in the end.

@wangxiyuan changed the title from "Spec decode v0.8.5rc1" to "Spec decode support for V1 Engine" on May 21, 2025
@ponix-j ponix-j force-pushed the spec_v0.8.5rc1 branch 3 times, most recently from b408e85 to a86f051 Compare May 22, 2025 12:13
@wangxiyuan (Collaborator) left a comment

Basically I'm fine with the change, just some nits. Thanks.

# Re-implementation the `prepare_input_kernel` triton kernel by pytorch
# Related PR (if no, explain why): 1. refused by vllm. 2. vllm doesn't support 3. prepare to submit....
# - https://github.com/vllm-project/vllm-ascend/pull/874
Collaborator

This is the PR itself, not a vllm PR; the content can be changed to say that Ascend doesn't support Triton.
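
A hedged example of how the corrected comment could read (wording assumed, not the merged text):

```python
# In vllm_ascend/patch/__init__.py:
#
# Why: Ascend does not support Triton, so the `prepare_input_kernel`
#      Triton kernel is re-implemented in PyTorch.
# Related PR: https://github.com/vllm-project/vllm-ascend/pull/874
```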

@@ -141,24 +141,22 @@ def reorder_batch(self, input_batch: "InputBatch",

def build(self, num_reqs, num_actual_tokens, max_query_len,
common_prefix_len):
if vllm_version_is("0.8.5") or vllm_version_is("0.8.5.post1"):
Collaborator

Needs rebase; this change is already merged to main: 7aa4f85.

return output_token_ids


def expand_batch_to_tokens(
Collaborator

Where is this function called?
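
For reference, `expand_batch_to_tokens` expands per-request values (e.g., sampling temperatures) to per-draft-token values; a hedged PyTorch equivalent of what the Triton-backed version computes (shapes assumed):

```python
import torch

def expand_batch_to_tokens(x: torch.Tensor,
                           cu_num_tokens: torch.Tensor) -> torch.Tensor:
    # Recover per-request token counts from cumulative offsets:
    # [0, a, a + b, a + b + c] -> [a, b, c]
    counts = cu_num_tokens[1:] - cu_num_tokens[:-1]
    # Repeat each request's value once per draft token.
    return torch.repeat_interleave(x, counts)
```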

assert isinstance(self.drafter, NgramProposer)
spec_token_ids = self.generate_draft_token_ids(
valid_sampled_token_ids, sampling_metadata)
elif self.speculative_config.method == "eagle":
Collaborator

Add a note/TODO here to highlight that eagle mode doesn't work currently. And if eagle doesn't work now, I think we'd better raise an error here directly and complete the function in a future PR:

 elif self.speculative_config.method == "eagle":
    raise NotImplementedError("eagle method for spec decode doesn't work on vllm-ascend currently")

@@ -220,11 +218,11 @@ def forward(
key: shape = [batch_size, seq_len, num_kv_heads * head_size]
value: shape = [batch_size, seq_len, num_kv_heads * head_size]
kv_cache: shape = [2, num_blocks, block_size,
num_kv_heads * head_size]
num_kv_heads, head_size]
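
The docstring fix keeps the KV-head and head-size dimensions separate, matching the Ascend cache layout; an illustrative shape check (all sizes assumed):

```python
import torch

# Illustrative sizes only: 2 (K and V), 4 blocks, block_size 16,
# 8 KV heads, head_size 128.
kv_cache = torch.zeros(2, 4, 16, 8, 128)
key_cache, value_cache = kv_cache.unbind(0)
assert key_cache.shape == (4, 16, 8, 128)
```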

@@ -687,6 +811,92 @@ def apply_grammar_bitmask(
)
return logits.to(self.device).to(logits_dtype)

def get_spec_token_ids(
Collaborator

rename to _get_spec_token_ids

Contributor Author

done

@@ -1083,3 +1344,35 @@ def capture_model(self) -> None:
# This usually takes 5~20 seconds.
logger.info("Graph capturing finished in %.0f secs, took %.2f GiB",
elapsed_time, npu_graph_size / (1 << 30))

def generate_draft_token_ids(
Collaborator

ditto

Contributor Author

done

Signed-off-by: ponix-j <657511300@qq.com>
target_probs[token_idx, draft_token_id] = orig_prob


rs.expand_batch_to_tokens = expand_batch_to_tokens
Collaborator

This should be moved to the patch module.
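
A hedged sketch of what moving this into the patch module could look like (the module path and import are assumptions, not the merged layout):

```python
# e.g. vllm_ascend/patch/worker/patch_common/patch_rejection_sampler.py
import vllm.v1.sample.rejection_sampler as rs

from vllm_ascend.sample.rejection_sampler import expand_batch_to_tokens

# Swap the Triton-backed helper for the PyTorch implementation,
# since Triton kernels are not supported on Ascend.
rs.expand_batch_to_tokens = expand_batch_to_tokens
```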

valid_sampled_token_ids, sampling_metadata)
elif self.speculative_config.method == "eagle":
raise NotImplementedError(
"eagle method for spec decode doesn't work on vllm-ascend currently"
Collaborator

let's add eagle support in the future

@ganyi1996ppo ganyi1996ppo merged commit df58fb8 into vllm-project:main May 23, 2025
16 checks passed
wangxiyuan pushed a commit that referenced this pull request May 30, 2025
### What this PR does / why we need it?
Add basic v1 MTP features. Please merge it after
#874 and
#844.

### Does this PR introduce _any_ user-facing change?
Now we support basic v1 MTP; only TP, eager mode, and k=1 are supported for now.
We will continue to expand to more scenarios.

### How was this patch tested?
Tested locally.

Signed-off-by: XWFAlone <xuewenfei2@huawei.com>
Co-authored-by: mengwei805 <mengwei25@huawei.com>
Co-authored-by: JC-ut0 <xuyexiong@huawei.com>
momo609 pushed a commit to momo609/vllm-ascend that referenced this pull request May 30, 2025
### What this PR does / why we need it?
Make spec decode support for V1 Engine
- Ascend currently does not support Triton kernels, so the Triton kernel in
`rejection_sampler.py` is rewritten in PyTorch. Since the PyTorch version does
not match Triton's performance, the function will be reimplemented in Ascend C
in the future.
- Spec decode currently supports only the ngram algorithm; the eagle algorithm
still needs further adaptation.
### Does this PR introduce _any_ user-facing change?
No user-facing change.

### How was this patch tested?
Tested by `tests/singlecard/spec_decode/e2e/test_v1_spec_decode.py` and
`tests/sample/test_rejection_sampler.py`, which cover the base functionality
of the rejection sampler and the end-to-end spec decode flow.

Signed-off-by: ponix-j <657511300@qq.com>
Signed-off-by: wangxiaoxin (A) <w00664509@china.huawei.com>
momo609 pushed a commit to momo609/vllm-ascend that referenced this pull request Jun 3, 2025
### What this PR does / why we need it?
Add basic v1 MTP features. Please merge it after
vllm-project#874 and
vllm-project#844.

### Does this PR introduce _any_ user-facing change?
Now we support basic v1 MTP; only TP, eager mode, and k=1 are supported for now.
We will continue to expand to more scenarios.

### How was this patch tested?
Tested locally.

Signed-off-by: XWFAlone <xuewenfei2@huawei.com>
Co-authored-by: mengwei805 <mengwei25@huawei.com>
Co-authored-by: JC-ut0 <xuyexiong@huawei.com>
Signed-off-by: wangxiaoxin (A) <w00664509@china.huawei.com>
David9857 pushed a commit to David9857/vllm-ascend that referenced this pull request Jun 3, 2025
### What this PR does / why we need it?
Add basic v1 MTP features. Please merge it after
vllm-project#874 and
vllm-project#844.

### Does this PR introduce _any_ user-facing change?
Now we support basic v1 MTP; only TP, eager mode, and k=1 are supported for now.
We will continue to expand to more scenarios.

### How was this patch tested?
Tested locally.

Signed-off-by: XWFAlone <xuewenfei2@huawei.com>
Co-authored-by: mengwei805 <mengwei25@huawei.com>
Co-authored-by: JC-ut0 <xuyexiong@huawei.com>
Yikun added a commit that referenced this pull request Jun 8, 2025
…nc graph typo fix (#1121)

### What this PR does / why we need it?
1. The dependency was introduced by
#874:
- Move numba/quart from requirements-dev to requirements
- Align pyproject.toml with requirements

2. This patch also fixes the deepseek accuracy baseline, which
#1118 did not address.
According to https://huggingface.co/deepseek-ai/DeepSeek-V2-Lite, the
gsm8k score is about `41.1`.

3. This also syncs the vLLM upstream changes:
vllm-project/vllm@eaa2e51

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed
vllm ascend test (basic workflow)
vllm longterm test (spec decode)

Closes: #1120

---------

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>