+ "description": "This paper introduces StreamingLLM, a framework that enables large language models to process infinitely long text sequences efficiently without fine-tuning, based on a key insight about \"attention sinks.\" The authors discover that LLMs allocate surprisingly high attention scores to initial tokens regardless of their semantic relevance, which they explain is due to the softmax operation requiring attention scores to sum to one - even when a token has no strong matches in context, the model must distribute attention somewhere, and initial tokens become natural \"sinks\" since they're visible to all subsequent tokens during autoregressive training. Building on this insight, StreamingLLM maintains just a few initial tokens (as attention sinks) along with a sliding window of recent tokens, achieving up to 22.2x speedup compared to baselines while maintaining performance on sequences up to 4 million tokens long. Additionally, they show that incorporating a dedicated learnable \"sink token\" during model pre-training can further improve streaming capabilities by providing an explicit token for collecting excess attention.",