+ "description": "This paper presents Medusa, which augments LLM inference by adding extra decoding heads to predict multiple subsequent tokens in parallel. They also introduce a form of tree-based attention to process candidates. Through the Medusa heads, they obtain probability predictions for the subsequent K+1 tokens. These predictions enable them to create length-K+1 continuations as the candidates. In order to process multiple candidates concurrently, they structure their attention such that only tokens from the same continuation are regarded as historical data. For instance, Figure 2 shows an example where the first Medusa head generates its top two predictions while the second Medusa head generates a top three for each of the top two from the first head. Instead of filling the entire attention mask, they only consider the mask from these 2*3 = 6 tokens, plus the standard identity line.",
+ "link": "https://arxiv.org/pdf/2401.10774"
+ },
  {
  "title": "Recurrent Drafter for Fast Speculative Decoding in Large Language Models",
  "author": "Yunfei Cheng et al",
  "year": "2024",
  "topic": "speculative decoding, drafting, llm",
  "venue": "Arxiv",
- "description": "This paper indroduces ReDrafter (Recurrent Drafter) that uses an RNN as the draft model and conditions on the LLM's hidden states. They use a beam search to explore the candidate seqeunces and then apply a dynamic tree attention alg to remove duplicated prefixes among the candidates to improve the speedup. They also train via knowledge distillation from LLMs to improve the alignment of the draft model's predictions with those of the LLM.",
+ "description": "This paper introduces ReDrafter (Recurrent Drafter) that uses an RNN as the draft model and conditions on the LLM's hidden states. They use a beam search to explore the candidate seqeunces and then apply a dynamic tree attention alg to remove duplicated prefixes among the candidates to improve the speedup. They also train via knowledge distillation from LLMs to improve the alignment of the draft model's predictions with those of the LLM.",
papers_read.html: 13 additions & 3 deletions
@@ -16,10 +16,10 @@ <h1>Here's where I keep a list of papers I have read.</h1>
  I typically use this to organize papers I found interesting. Please feel free to do whatever you want with it. Note that this is not every single paper I have ever read, just a collection of ones that I remember to put down.
  </p>
  <p id="paperCount">
- So far, we have read 205 papers. Let's keep it up!
+ So far, we have read 206 papers. Let's keep it up!
  </p>
  <small id="searchCount">
- Your search returned 205 papers. Nice!
+ Your search returned 206 papers. Nice!
  </small>

  <div class="search-inputs">
@@ -46,13 +46,23 @@ <h1>Here's where I keep a list of papers I have read.</h1>
  </thead>
  <tbody>

+ <tr>
+ <td>Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads</td>
+ <td>Tianle Cai et al</td>
+ <td>2024</td>
+ <td>speculative decoding, drafting, llm</td>
+ <td>ICML</td>
+ <td>This paper presents Medusa, which augments LLM inference by adding extra decoding heads to predict multiple subsequent tokens in parallel. They also introduce a form of tree-based attention to process candidates. Through the Medusa heads, they obtain probability predictions for the subsequent K+1 tokens. These predictions enable them to create length-K+1 continuations as the candidates. In order to process multiple candidates concurrently, they structure their attention such that only tokens from the same continuation are regarded as historical data. For instance, Figure 2 shows an example where the first Medusa head generates its top two predictions while the second Medusa head generates a top three for each of the top two from the first head. Instead of filling the entire attention mask, they only consider the mask from these 2*3 = 6 tokens, plus the standard identity line.</td>
  <td>Recurrent Drafter for Fast Speculative Decoding in Large Language Models</td>
  <td>Yunfei Cheng et al</td>
  <td>2024</td>
  <td>speculative decoding, drafting, llm</td>
  <td>Arxiv</td>
- <td>This paper indroduces ReDrafter (Recurrent Drafter) that uses an RNN as the draft model and conditions on the LLM's hidden states. They use a beam search to explore the candidate seqeunces and then apply a dynamic tree attention alg to remove duplicated prefixes among the candidates to improve the speedup. They also train via knowledge distillation from LLMs to improve the alignment of the draft model's predictions with those of the LLM.</td>
+ <td>This paper introduces ReDrafter (Recurrent Drafter) that uses an RNN as the draft model and conditions on the LLM's hidden states. They use a beam search to explore the candidate seqeunces and then apply a dynamic tree attention alg to remove duplicated prefixes among the candidates to improve the speedup. They also train via knowledge distillation from LLMs to improve the alignment of the draft model's predictions with those of the LLM.</td>
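The tree-attention idea in the Medusa description added by this diff (top two from the first head, top three from the second for each of those, and each token attending only to tokens in its own continuation) can be pictured with a small mask-building sketch. This is a minimal illustration in plain Python, not the paper's implementation: the function name `build_tree_attention_mask` is hypothetical, and the sketch assumes the tree covers only the drafted tokens, leaving out the prompt and the token produced by the base model's own head.

```python
from itertools import product


def build_tree_attention_mask(top_k_per_head):
    """Return (paths, mask) for a candidate tree.

    top_k_per_head[d] is how many predictions head d contributes,
    e.g. [2, 3] for a top-2 first head and a top-3 second head.
    (Hypothetical helper, written only for illustration.)
    """
    # Each tree node is identified by the tuple of choices leading to it,
    # listed shallowest level first.
    paths = []
    for depth in range(1, len(top_k_per_head) + 1):
        paths.extend(product(*(range(k) for k in top_k_per_head[:depth])))

    # mask[i][j] == 1 when node i may attend to node j, i.e. when j is
    # node i itself or one of its ancestors (j's path is a prefix of i's).
    mask = [[1 if p[:len(q)] == q else 0 for q in paths] for p in paths]
    return paths, mask


paths, mask = build_tree_attention_mask([2, 3])
leaves = [p for p in paths if len(p) == 2]
print(len(leaves))  # 6 continuations, matching the 2*3 = 6 in the description
for path, row in zip(paths, mask):
    print(path, row)
```

Each row has a 1 on the diagonal (the identity line) plus a 1 for every ancestor, so tokens from different continuations never see each other even though all candidates are processed in one pass.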