
Commit 115456c: "Updated on 2024-11-06"
1 parent 934dd32

3 files changed (+22, -3 lines)

index.html

Lines changed: 1 addition & 1 deletion
@@ -74,7 +74,7 @@ <h1>Where?</h1>
 </p>
 <h1>When?</h1>
 <p>
-Last time this was edited was 2024-11-05 (YYYY/MM/DD).
+Last time this was edited was 2024-11-06 (YYYY/MM/DD).
 </p>
 <small><a href="misc.html">misc</a></small>
 </div>

papers/list.json

Lines changed: 9 additions & 0 deletions
@@ -1,4 +1,13 @@
 [
+  {
+    "title": "MagicPIG: LSH Sampling for Efficient LLM Generation",
+    "author": "Zhuoming Chen et al",
+    "year": "2024",
+    "topic": "llm, kv cache",
+    "venue": "Arxiv",
+    "description": "This paper challenges the common assumption that attention in LLMs is naturally sparse, showing that TopK attention (selecting only the highest attention scores) can significantly degrade performance on tasks that require aggregating information across the full context. The authors demonstrate that sampling-based approaches to attention can be more effective than TopK selection, leading them to develop MagicPIG, a system that uses Locality Sensitive Hashing (LSH) to efficiently sample attention keys and values. A key insight is that the geometry of attention in LLMs has specific patterns - notably that the initial attention sink token remains almost static regardless of input, and that query and key vectors typically lie in opposite directions - which helps explain why simple TopK selection is suboptimal. Their solution involves a heterogeneous system design that leverages both GPU and CPU resources, with hash computations on GPU and attention computation on CPU, allowing for efficient processing of longer contexts while maintaining accuracy.",
+    "link": "https://arxiv.org/pdf/2410.16179"
+  },
   {
     "title": "Guiding a Diffusion Model with a Bad Version of Itself",
     "author": "Tero Karras et al",

papers_read.html

Lines changed: 12 additions & 2 deletions
@@ -75,10 +75,10 @@ <h1>Here's where I keep a list of papers I have read.</h1>
 I typically use this to organize papers I found interesting. Please feel free to do whatever you want with it. Note that this is not every single paper I have ever read, just a collection of ones that I remember to put down.
 </p>
 <p id="paperCount">
-So far, we have read 154 papers. Let's keep it up!
+So far, we have read 155 papers. Let's keep it up!
 </p>
 <small id="searchCount">
-Your search returned 154 papers. Nice!
+Your search returned 155 papers. Nice!
 </small>

 <div class="search-inputs">
@@ -105,6 +105,16 @@ <h1>Here's where I keep a list of papers I have read.</h1>
 </thead>
 <tbody>

+<tr>
+  <td>MagicPIG: LSH Sampling for Efficient LLM Generation</td>
+  <td>Zhuoming Chen et al</td>
+  <td>2024</td>
+  <td>llm, kv cache</td>
+  <td>Arxiv</td>
+  <td>This paper challenges the common assumption that attention in LLMs is naturally sparse, showing that TopK attention (selecting only the highest attention scores) can significantly degrade performance on tasks that require aggregating information across the full context. The authors demonstrate that sampling-based approaches to attention can be more effective than TopK selection, leading them to develop MagicPIG, a system that uses Locality Sensitive Hashing (LSH) to efficiently sample attention keys and values. A key insight is that the geometry of attention in LLMs has specific patterns - notably that the initial attention sink token remains almost static regardless of input, and that query and key vectors typically lie in opposite directions - which helps explain why simple TopK selection is suboptimal. Their solution involves a heterogeneous system design that leverages both GPU and CPU resources, with hash computations on GPU and attention computation on CPU, allowing for efficient processing of longer contexts while maintaining accuracy.</td>
+  <td><a href="https://arxiv.org/pdf/2410.16179" target="_blank">Link</a></td>
+</tr>
+
 <tr>
   <td>Guiding a Diffusion Model with a Bad Version of Itself</td>
   <td>Tero Karras et al</td>
