papers/list.json (18 additions & 0 deletions)
@@ -1,4 +1,22 @@
 [
+    {
+        "title": "Sequence-Level Knowledge Distillation",
+        "author": "Yoon Kim et al",
+        "year": "2016",
+        "topic": "knowledge distillation",
+        "venue": "Arxiv",
+        "description": "This paper introduces sequence-level knowledge distillation for neural machine translation, allowing smaller student models to achieve performance comparable to larger teacher models. The authors demonstrate that their approach works better than standard word-level knowledge distillation by having students learn from complete translations generated by the teacher rather than just matching word-level probabilities. Remarkably, their method enables student models to perform well even with greedy decoding, eliminating the need for computationally expensive beam search at inference time. Combining their distillation techniques with weight pruning, they produce models with 13× fewer parameters than the original teacher model while maintaining strong translation performance, making efficient NMT deployment possible even on mobile devices.",
+        "link": "https://arxiv.org/pdf/1606.07947"
+    },
+    {
+        "title": "The Mamba in the Llama: Distilling and Accelerating Hybrid Models",
+        "author": "Junxiong Wang et al",
+        "year": "2025",
+        "topic": "knowledge distillation, llm",
+        "venue": "Arxiv",
+        "description": "This paper demonstrates how large Transformer models can be effectively distilled into hybrid models that incorporate linear RNNs such as Mamba while maintaining much of their generation quality, notably by reusing the weights from the attention layers. The researchers developed a multistage distillation approach combining progressive distillation, supervised fine-tuning, and direct preference optimization, which outperforms hybrid models trained from scratch with trillions of tokens. They also introduced a hardware-aware speculative decoding algorithm that significantly accelerates inference for both Mamba and hybrid architectures, achieving impressive throughput for large language models. The resulting hybrid models perform comparably to the original Transformers on chat benchmarks while requiring fewer computational resources for deployment, highlighting how Transformer knowledge can be effectively transferred to other architectures with customized inference profiles.",
+        "link": "https://arxiv.org/pdf/2408.15237"
+    },
     {
         "title": "Compact Language Models via Pruning and Knowledge Distillation",
papers_read.html (22 additions & 2 deletions)
@@ -16,10 +16,10 @@ <h1>Here's where I keep a list of papers I have read.</h1>
     I typically use this to organize papers I found interesting. Please feel free to do whatever you want with it. Note that this is not every single paper I have ever read, just a collection of ones that I remember to put down.
     </p>
     <p id="paperCount">
-    So far, we have read 229 papers. Let's keep it up!
+    So far, we have read 231 papers. Let's keep it up!
     </p>
     <small id="searchCount">
-    Your search returned 229 papers. Nice!
+    Your search returned 231 papers. Nice!
     </small>
 
     <div class="search-inputs">
@@ -46,6 +46,26 @@ <h1>Here's where I keep a list of papers I have read.</h1>
     </thead>
     <tbody>
 
+        <tr>
+            <td>Sequence-Level Knowledge Distillation</td>
+            <td>Yoon Kim et al</td>
+            <td>2016</td>
+            <td>knowledge distillation</td>
+            <td>Arxiv</td>
+            <td>This paper introduces sequence-level knowledge distillation for neural machine translation, allowing smaller student models to achieve performance comparable to larger teacher models. The authors demonstrate that their approach works better than standard word-level knowledge distillation by having students learn from complete translations generated by the teacher rather than just matching word-level probabilities. Remarkably, their method enables student models to perform well even with greedy decoding, eliminating the need for computationally expensive beam search at inference time. Combining their distillation techniques with weight pruning, they produce models with 13× fewer parameters than the original teacher model while maintaining strong translation performance, making efficient NMT deployment possible even on mobile devices.</td>
+            <td>The Mamba in the Llama: Distilling and Accelerating Hybrid Models</td>
+            <td>Junxiong Wang et al</td>
+            <td>2025</td>
+            <td>knowledge distillation, llm</td>
+            <td>Arxiv</td>
+            <td>This paper demonstrates how large Transformer models can be effectively distilled into hybrid models that incorporate linear RNNs such as Mamba while maintaining much of their generation quality, notably by reusing the weights from the attention layers. The researchers developed a multistage distillation approach combining progressive distillation, supervised fine-tuning, and direct preference optimization, which outperforms hybrid models trained from scratch with trillions of tokens. They also introduced a hardware-aware speculative decoding algorithm that significantly accelerates inference for both Mamba and hybrid architectures, achieving impressive throughput for large language models. The resulting hybrid models perform comparably to the original Transformers on chat benchmarks while requiring fewer computational resources for deployment, highlighting how Transformer knowledge can be effectively transferred to other architectures with customized inference profiles.</td>
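As a rough illustration of the speculative decoding idea mentioned in the hybrid-model entry, here is a greedy-verification sketch. It is not the paper's hardware-aware algorithm (which also has to manage Mamba state and attention caches); `draft_next` and `target_logits` are hypothetical callables returning a next-token id and full per-position logits, respectively.

import torch

@torch.no_grad()
def speculative_step(prefix, draft_next, target_logits, k=4):
    # 1) The small draft model proposes k tokens autoregressively (cheap).
    proposal = prefix
    for _ in range(k):
        proposal = torch.cat([proposal, draft_next(proposal)], dim=-1)

    # 2) The large target model scores the whole proposal in one forward pass.
    preds = target_logits(proposal).argmax(dim=-1)  # target's greedy choice at every position

    # 3) Keep drafted tokens while they agree with the target; at the first
    #    disagreement, take the target's own token instead and stop.
    n = prefix.size(-1)
    accepted = prefix
    for i in range(k):
        drafted = proposal[:, n + i : n + i + 1]
        verified = preds[:, n + i - 1 : n + i]
        if torch.equal(drafted, verified):
            accepted = torch.cat([accepted, drafted], dim=-1)
        else:
            return torch.cat([accepted, verified], dim=-1)
    # All k drafted tokens accepted: the target's last prediction is a free bonus token.
    return torch.cat([accepted, preds[:, -1:]], dim=-1)

Each call advances by at least one token (the target's correction) and by up to k + 1 tokens when the draft is accepted in full, which is where the throughput gain comes from.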