
Commit 115456c: "Updated on 2024-11-06"
1 parent 934dd32

3 files changed (+22, -3 lines)

index.html

Lines changed: 1 addition & 1 deletion
@@ -74,7 +74,7 @@ <h1>Where?</h1>
 </p>
 <h1>When?</h1>
 <p>
-Last time this was edited was 2024-11-05 (YYYY/MM/DD).
+Last time this was edited was 2024-11-06 (YYYY/MM/DD).
 </p>
 <small><a href="misc.html">misc</a></small>
 </div>

papers/list.json

Lines changed: 9 additions & 0 deletions
@@ -1,4 +1,13 @@
 [
+  {
+    "title": "MagicPIG: LSH Sampling for Efficient LLM Generation",
+    "author": "Zhuoming Chen et al",
+    "year": "2024",
+    "topic": "llm, kv cache",
+    "venue": "Arxiv",
+    "description": "This paper challenges the common assumption that attention in LLMs is naturally sparse, showing that TopK attention (selecting only the highest attention scores) can significantly degrade performance on tasks that require aggregating information across the full context. The authors demonstrate that sampling-based approaches to attention can be more effective than TopK selection, leading them to develop MagicPIG, a system that uses Locality Sensitive Hashing (LSH) to efficiently sample attention keys and values. A key insight is that the geometry of attention in LLMs has specific patterns - notably that the initial attention sink token remains almost static regardless of input, and that query and key vectors typically lie in opposite directions - which helps explain why simple TopK selection is suboptimal. Their solution involves a heterogeneous system design that leverages both GPU and CPU resources, with hash computations on GPU and attention computation on CPU, allowing for efficient processing of longer contexts while maintaining accuracy.",
+    "link": "https://arxiv.org/pdf/2410.16179"
+  },
   {
     "title": "Guiding a Diffusion Model with a Bad Version of Itself",
     "author": "Tero Karras et al",

papers_read.html

Lines changed: 12 additions & 2 deletions
@@ -75,10 +75,10 @@ <h1>Here's where I keep a list of papers I have read.</h1>
 I typically use this to organize papers I found interesting. Please feel free to do whatever you want with it. Note that this is not every single paper I have ever read, just a collection of ones that I remember to put down.
 </p>
 <p id="paperCount">
-So far, we have read 154 papers. Let's keep it up!
+So far, we have read 155 papers. Let's keep it up!
 </p>
 <small id="searchCount">
-Your search returned 154 papers. Nice!
+Your search returned 155 papers. Nice!
 </small>

 <div class="search-inputs">
@@ -105,6 +105,16 @@ <h1>Here's where I keep a list of papers I have read.</h1>
 </thead>
 <tbody>

+<tr>
+  <td>MagicPIG: LSH Sampling for Efficient LLM Generation</td>
+  <td>Zhuoming Chen et al</td>
+  <td>2024</td>
+  <td>llm, kv cache</td>
+  <td>Arxiv</td>
+  <td>This paper challenges the common assumption that attention in LLMs is naturally sparse, showing that TopK attention (selecting only the highest attention scores) can significantly degrade performance on tasks that require aggregating information across the full context. The authors demonstrate that sampling-based approaches to attention can be more effective than TopK selection, leading them to develop MagicPIG, a system that uses Locality Sensitive Hashing (LSH) to efficiently sample attention keys and values. A key insight is that the geometry of attention in LLMs has specific patterns - notably that the initial attention sink token remains almost static regardless of input, and that query and key vectors typically lie in opposite directions - which helps explain why simple TopK selection is suboptimal. Their solution involves a heterogeneous system design that leverages both GPU and CPU resources, with hash computations on GPU and attention computation on CPU, allowing for efficient processing of longer contexts while maintaining accuracy.</td>
+  <td><a href="https://arxiv.org/pdf/2410.16179" target="_blank">Link</a></td>
+</tr>
+
 <tr>
   <td>Guiding a Diffusion Model with a Bad Version of Itself</td>
   <td>Tero Karras et al</td>
