..
.. Copyright 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
..
.. Redistribution and use in source and binary forms, with or without
.. modification, are permitted provided that the following conditions
.. are met:
..  * Redistributions of source code must retain the above copyright
..    notice, this list of conditions and the following disclaimer.
..  * Redistributions in binary form must reproduce the above copyright
..    notice, this list of conditions and the following disclaimer in the
..    documentation and/or other materials provided with the distribution.
..  * Neither the name of NVIDIA CORPORATION nor the names of its
..    contributors may be used to endorse or promote products derived
..    from this software without specific prior written permission.
..
.. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
.. EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.. IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
.. PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
.. CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
.. EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
.. PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
.. PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
.. OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
.. (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
.. OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

About Speculative Decoding
==========================

Speculative Decoding (also referred to as Speculative Sampling) is a set of techniques designed
to allow generation of more than one token per forward pass. This can reduce the average
per-token latency in situations where the GPU is underutilized due to small batch sizes.

Speculative decoding involves predicting a sequence of future tokens, referred to as draft tokens,
using a method that is substantially more efficient than repeatedly executing the target Large Language
Model (LLM). These draft tokens are then validated collectively by processing them through the target LLM
in a single forward pass. The underlying assumptions are twofold:

1. Processing multiple draft tokens concurrently is roughly as fast as processing a single token.
2. Multiple draft tokens are validated successfully over the course of the full generation.

If the first assumption holds, the latency of speculative decoding is no worse than that of the
standard approach. If the second holds, generation advances, on average, by more than one token per
forward pass. Together, these two effects allow speculative decoding to reduce latency.

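The draft-and-verify loop described above can be sketched in plain Python. This is a toy
illustration under stated assumptions, not a real implementation: ``draft_next`` and
``target_next`` are hypothetical deterministic stand-ins for the draft and target models, and the
single target forward pass over all drafts is simulated by per-token calls.

.. code-block:: python

    # Toy sketch of speculative decoding. The "models" here are
    # hypothetical stand-ins, not a real LLM API.

    def draft_next(context):
        # Cheap draft model: guess the next token from the last one.
        return (context[-1] + 1) % 50

    def target_next(context):
        # "Expensive" target model: the ground truth the drafts must match.
        # It agrees with the draft except at every 5th position.
        nxt = (context[-1] + 1) % 50
        return nxt if len(context) % 5 else (nxt + 7) % 50

    def speculative_generate(prompt, num_tokens, k=4):
        """Generate num_tokens tokens, drafting k tokens per iteration.

        Each iteration: (1) run the cheap draft model k times,
        (2) verify all k drafts against the target model "in one pass"
        (simulated here by k calls), (3) accept the longest matching
        prefix, then take one token from the target model itself.
        """
        out = list(prompt)
        forward_passes = 0
        while len(out) - len(prompt) < num_tokens:
            # Step 1: draft k tokens autoregressively with the cheap model.
            drafts = []
            ctx = list(out)
            for _ in range(k):
                t = draft_next(ctx)
                drafts.append(t)
                ctx.append(t)
            # Steps 2-3: one target forward pass validates all drafts;
            # accept the agreeing prefix, then append the target's token.
            forward_passes += 1
            for t in drafts:
                expected = target_next(out)
                if t == expected:
                    out.append(t)        # draft token accepted
                else:
                    out.append(expected)  # first mismatch: take correction
                    break
            else:
                # All k drafts accepted; the same target pass also
                # yields one additional token for free.
                out.append(target_next(out))
        return out[len(prompt):len(prompt) + num_tokens], forward_passes

Because the toy draft model is usually right, each target forward pass advances generation by
several tokens, which is exactly the mechanism that reduces per-token latency.
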
Performance Improvements
========================

It's important to note that the effectiveness of speculative decoding techniques is highly dependent
on the specific task at hand. For instance, forecasting subsequent tokens in a code-completion scenario
may prove simpler than generating a summary for an article. `Spec-Bench <https://sites.google.com/view/spec-bench>`__
shows the performance of different speculative decoding approaches on different tasks.