[Perf][MTP] Improve speculative decoding efficiency #2818

freeliuzc · 2025-07-11T11:08:58Z

Reduced kernel launch latency:

The original custom op library was too large, which pushed kernel launch latency up to ~200 µs. Switching to PyBind reduced launch latency to ~20 µs.

Synchronization improvements:

Optimized several synchronization points in the speculative decoding path.

Conclusion:

In the ERNIE-LITE + MTP pipeline, the time from the verify step to the next main-model C-Embedding was reduced from 8.5 ms to 4 ms (a ~113% speedup, roughly 2.1x faster).
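For readers checking the percentage, here is a minimal sketch of the arithmetic behind the figures quoted above. The variable names are illustrative and not taken from the codebase. It assumes "113%" is the throughput gain for this segment (old time divided by new time, minus one), not the reduction in elapsed time. The launch-latency figures are the approximate values from the description.

```python
# Figures quoted in the PR description.
verify_to_embed_before_ms = 8.5
verify_to_embed_after_ms = 4.0
launch_before_us = 200.0  # approximate ("~200 µs")
launch_after_us = 20.0    # approximate ("~20 µs")

# Speedup ratio: how many times faster the segment runs.
ratio = verify_to_embed_before_ms / verify_to_embed_after_ms  # 2.125

# Speed gain in percent; this is the assumed reading of "113%".
speedup_pct = (ratio - 1.0) * 100.0  # 112.5

# Reduction in elapsed time, a different metric (about 52.9%).
time_saved_pct = (1.0 - verify_to_embed_after_ms / verify_to_embed_before_ms) * 100.0

# Kernel launch latency improvement from the PyBind change.
launch_ratio = launch_before_us / launch_after_us  # 10.0

print(f"segment: {ratio:.3f}x faster (+{speedup_pct:.1f}% speed, -{time_saved_pct:.1f}% time)")
print(f"kernel launch: {launch_ratio:.0f}x lower latency")
```

Under this reading, the segment time drops by a little more than half while the speed gain is about 112.5%, which rounds to the 113% quoted.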

paddle-bot · 2025-07-11T11:09:03Z

Thanks for your contribution!

optimize infer speed

679546a

freeliuzc added 2 commits July 11, 2025 19:21

fix merge bug

1395fd8

fix merge bug

e9867bf

freeliuzc closed this Jul 14, 2025
