papers/list.json (18 additions & 0 deletions)
@@ -1,4 +1,22 @@
 [
+    {
+        "title": "Sequence-Level Knowledge Distillation",
+        "author": "Yoon Kim et al",
+        "year": "2016",
+        "topic": "knowledge distillation",
+        "venue": "Arxiv",
+        "description": "This paper introduces sequence-level knowledge distillation for neural machine translation, allowing smaller student models to achieve performance comparable to larger teacher models. The authors demonstrate that their approach works better than standard word-level knowledge distillation by having students learn from complete translations generated by the teacher rather than just matching word-level probabilities. Remarkably, their method enables student models to perform well even with greedy decoding, eliminating the need for computationally expensive beam search at inference time. Combining their distillation techniques with weight pruning, they produce models with 13× fewer parameters than the original teacher model while maintaining strong translation performance, making efficient NMT deployment possible even on mobile devices.",
+        "link": "https://arxiv.org/pdf/1606.07947"
+    },
+    {
+        "title": "The Mamba in the Llama: Distilling and Accelerating Hybrid Models",
+        "author": "Junxiong Wang et al",
+        "year": "2025",
+        "topic": "knowledge distillation, llm",
+        "venue": "Arxiv",
+        "description": "This paper demonstrates how large Transformer models can be effectively distilled into hybrid models that incorporate linear RNNs such as Mamba while maintaining much of their generation quality, notably by reusing the weights from the attention layers. The researchers developed a multistage distillation approach combining progressive distillation, supervised fine-tuning, and direct preference optimization, which outperforms hybrid models trained from scratch with trillions of tokens. They also introduced a hardware-aware speculative decoding algorithm that significantly accelerates inference for both Mamba and hybrid architectures, achieving impressive throughput for large language models. The resulting hybrid models perform comparably to the original Transformers on chat benchmarks while requiring fewer computational resources for deployment, highlighting how Transformer knowledge can be effectively transferred to other architectures with customized inference profiles.",
+        "link": "https://arxiv.org/pdf/2408.15237"
+    },
     {
         "title": "Compact Language Models via Pruning and Knowledge Distillation",
papers_read.html (22 additions & 2 deletions)
@@ -16,10 +16,10 @@ <h1>Here's where I keep a list of papers I have read.</h1>
     I typically use this to organize papers I found interesting. Please feel free to do whatever you want with it. Note that this is not every single paper I have ever read, just a collection of ones that I remember to put down.
     </p>
     <p id="paperCount">
-    So far, we have read 229 papers. Let's keep it up!
+    So far, we have read 231 papers. Let's keep it up!
     </p>
     <small id="searchCount">
-    Your search returned 229 papers. Nice!
+    Your search returned 231 papers. Nice!
     </small>
 
     <div class="search-inputs">
@@ -46,6 +46,26 @@ <h1>Here's where I keep a list of papers I have read.</h1>
     </thead>
     <tbody>
 
+        <tr>
+            <td>Sequence-Level Knowledge Distillation</td>
+            <td>Yoon Kim et al</td>
+            <td>2016</td>
+            <td>knowledge distillation</td>
+            <td>Arxiv</td>
+            <td>This paper introduces sequence-level knowledge distillation for neural machine translation, allowing smaller student models to achieve performance comparable to larger teacher models. The authors demonstrate that their approach works better than standard word-level knowledge distillation by having students learn from complete translations generated by the teacher rather than just matching word-level probabilities. Remarkably, their method enables student models to perform well even with greedy decoding, eliminating the need for computationally expensive beam search at inference time. Combining their distillation techniques with weight pruning, they produce models with 13× fewer parameters than the original teacher model while maintaining strong translation performance, making efficient NMT deployment possible even on mobile devices.</td>
+            <td>The Mamba in the Llama: Distilling and Accelerating Hybrid Models</td>
+            <td>Junxiong Wang et al</td>
+            <td>2025</td>
+            <td>knowledge distillation, llm</td>
+            <td>Arxiv</td>
+            <td>This paper demonstrates how large Transformer models can be effectively distilled into hybrid models that incorporate linear RNNs such as Mamba while maintaining much of their generation quality, notably by reusing the weights from the attention layers. The researchers developed a multistage distillation approach combining progressive distillation, supervised fine-tuning, and direct preference optimization, which outperforms hybrid models trained from scratch with trillions of tokens. They also introduced a hardware-aware speculative decoding algorithm that significantly accelerates inference for both Mamba and hybrid architectures, achieving impressive throughput for large language models. The resulting hybrid models perform comparably to the original Transformers on chat benchmarks while requiring fewer computational resources for deployment, highlighting how Transformer knowledge can be effectively transferred to other architectures with customized inference profiles.</td>
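As a rough illustration of the speculative decoding idea mentioned in the hybrid-model entry, here is a greedy-verification sketch. It is not the paper's hardware-aware algorithm (which also has to manage Mamba state and attention caches); `draft_next` and `target_logits` are hypothetical callables returning a next-token id and full per-position logits, respectively.

import torch

@torch.no_grad()
def speculative_step(prefix, draft_next, target_logits, k=4):
    # 1) The small draft model proposes k tokens autoregressively (cheap).
    proposal = prefix
    for _ in range(k):
        proposal = torch.cat([proposal, draft_next(proposal)], dim=-1)

    # 2) The large target model scores the whole proposal in one forward pass.
    preds = target_logits(proposal).argmax(dim=-1)  # target's greedy choice at every position

    # 3) Keep drafted tokens while they agree with the target; at the first
    #    disagreement, take the target's own token instead and stop.
    n = prefix.size(-1)
    accepted = prefix
    for i in range(k):
        drafted = proposal[:, n + i : n + i + 1]
        verified = preds[:, n + i - 1 : n + i]
        if torch.equal(drafted, verified):
            accepted = torch.cat([accepted, drafted], dim=-1)
        else:
            return torch.cat([accepted, verified], dim=-1)
    # All k drafted tokens accepted: the target's last prediction is a free bonus token.
    return torch.cat([accepted, preds[:, -1:]], dim=-1)

Each call advances by at least one token (the target's correction) and by up to k + 1 tokens when the draft is accepted in full, which is where the throughput gain comes from.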