speculative decoding and mtp optimization #1435

inkcherry · 2025-06-16T08:53:02Z

Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai>

This PR enables a profiler for each step of the LLMEngine. The following four environment variables control the profiler:

- `VLLM_ENGINE_PROFILER_ENABLED`: set to `true` to enable the device profiler.
- `VLLM_ENGINE_PROFILER_WARMUP_STEPS`: number of steps to ignore for profiling.
- `VLLM_ENGINE_PROFILER_STEPS`: number of steps to capture for profiling.
- `VLLM_ENGINE_PROFILER_REPEAT`: number of cycles for (warmup + profile).

> Please refer to [torch.profiler.schedule](https://pytorch.org/docs/stable/profiler.html#torch.profiler.schedule) for more details about the profiler schedule arguments.

> A step in profiling means a step of the LLM engine and **excludes** the profile and warmup runs in HabanaModelRunner. Please use `export VLLM_PROFILER_ENABLED=True` to enable the high-level vLLM profiler and use the result to choose the steps for detailed LLMEngine profiling.
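To make the interaction of the warmup, capture, and repeat settings concrete, here is a minimal stdlib-only sketch of which engine steps would be captured. It is not the code from this PR: the helper names `read_profiler_config` and `captured_steps`, the default values, and the accepted spellings of "true" are assumptions made for illustration. It also assumes each cycle skips the warmup steps first and then captures the profile steps, as the "(warmup + profile)" wording suggests, and that the repeat count is positive.

```python
import os


def read_profiler_config(env=os.environ):
    """Read the four engine-profiler variables (defaults here are hypothetical)."""
    enabled = env.get("VLLM_ENGINE_PROFILER_ENABLED", "false").lower() in ("1", "true")
    warmup = int(env.get("VLLM_ENGINE_PROFILER_WARMUP_STEPS", "0"))
    steps = int(env.get("VLLM_ENGINE_PROFILER_STEPS", "1"))
    repeat = int(env.get("VLLM_ENGINE_PROFILER_REPEAT", "1"))
    return enabled, warmup, steps, repeat


def captured_steps(warmup, steps, repeat):
    """Return the 0-based engine step indices that fall in a capture window.

    Assumption: every cycle is `warmup` ignored steps followed by `steps`
    captured steps, and the cycle runs `repeat` times (repeat >= 1).
    """
    cycle = warmup + steps
    return [c * cycle + warmup + i for c in range(repeat) for i in range(steps)]


if __name__ == "__main__":
    enabled, warmup, steps, repeat = read_profiler_config()
    if enabled:
        print("captured engine steps:", captured_steps(warmup, steps, repeat))
    else:
        print("engine profiler disabled")
```

For example, with 2 warmup steps, 3 profile steps, and 2 cycles, the sketch reports engine steps 2 to 4 and 7 to 9 as captured. The sketch does not model the `repeat=0` case of `torch.profiler.schedule`, which keeps cycling until profiling ends.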

Signed-off-by: jkyu <jiankang.yu@intel.com>

Reduce the communication to optimize the perf.

Signed-off-by: inkcherry <mingzhi.liu@intel.com>

benchislett and others added 29 commits April 24, 2025 08:51

Expand DeepSeek MTP code to support k > n_predict (vllm-project#13626)

e792c2a

Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai>

Draft version for MTP perf optimization

3a3d25e

Signed-off-by: jkyu <jiankang.yu@intel.com>

mydebug

9daa0e9

my debug2

c10b26c

fix hidden1

b90b323

wa for crash

c7a42a9

update metadata more previous

1446058

update

83fd06c

update

a3f5818

accuracy pass

dc015fd

wa for text accuracy

122b1c1

delete test files

4491171

remove some debug code

a3986eb

refine the code logic and remove some print

2d7251f

Signed-off-by: jkyu <jiankang.yu@intel.com>

add the batch size for the fake token

2cbd5a9

Signed-off-by: jkyu <jiankang.yu@intel.com>

update to prepare for bs > 1

11e697b

Signed-off-by: jkyu <jiankang.yu@intel.com>

eos support

dd803ab

eos stop support

e455f70

token_max_len stop support

aabc738

re-implement the code logic and enable the support for bs > 1

ce084e0

Signed-off-by: jkyu <jiankang.yu@intel.com>

Merge branch 'spdecode_and_mpt_optimization' into HEAD

194eaf7

fix for batch_size=1

f92a446

my debug

79694c0

optimization perf

645f211

update1

20caa19

merge

daac15d

clean up

e37cecd

clean up

d43a676

inkcherry changed the title ~~speculative decoding and mpt optimization~~ speculative decoding and mtp optimization Jun 18, 2025

# This is a combination of 2 commits.

d97c7dd

reduce the communication to optimize the perf

Signed-off-by: inkcherry <mingzhi.liu@intel.com>
