+ "description": "The key insight of DuoAttention is the observation that attention heads in LLMs naturally fall into two distinct categories: retrieval heads that need to access the full context to make connections across long distances, and streaming heads that mainly focus on recent tokens and attention sinks. This dichotomy makes intuitive sense because not all parts of language processing require long-range dependencies - while some aspects like fact recall or logical reasoning need broad context, others like local grammar or immediate context processing can work with nearby tokens. The paper's approach of using optimization to identify these heads (rather than just looking at attention patterns) is clever because it directly measures the impact on model outputs, capturing the true functional role of each head rather than just its surface behavior. Finally, the insight to maintain two separate KV caches (full for retrieval heads, minimal for streaming heads) is an elegant way to preserve the model's capabilities while reducing memory usage, since it aligns the memory allocation with each head's actual needs.",