Commit 654d6eb

Committed on 2024-11-06
1 parent 7bdf6b4 commit 654d6eb

2 files changed: 21 additions & 2 deletions

papers/list.json

Lines changed: 9 additions & 0 deletions
@@ -1,4 +1,13 @@
 [
+    {
+        "title": "DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads",
+        "author": "Guangxuan Xiao et al",
+        "year": "2024",
+        "topic": "llm, kv cache, attention",
+        "venue": "Arxiv",
+        "description": "The key insight of DuoAttention is the observation that attention heads in LLMs naturally fall into two distinct categories: retrieval heads that need to access the full context to make connections across long distances, and streaming heads that mainly focus on recent tokens and attention sinks. This dichotomy makes intuitive sense because not all parts of language processing require long-range dependencies - while some aspects like fact recall or logical reasoning need broad context, others like local grammar or immediate context processing can work with nearby tokens. The paper's approach of using optimization to identify these heads (rather than just looking at attention patterns) is clever because it directly measures the impact on model outputs, capturing the true functional role of each head rather than just its surface behavior. Finally, the insight to maintain two separate KV caches (full for retrieval heads, minimal for streaming heads) is an elegant way to preserve the model's capabilities while reducing memory usage, since it aligns the memory allocation with each head's actual needs.",
+        "link": "https://arxiv.org/pdf/2410.10819"
+    },
     {
         "title": "Efficient Streaming Language Models with Attention Sinks",
         "author": "Guangxuan Xiao et al",

papers_read.html

Lines changed: 12 additions & 2 deletions
@@ -75,10 +75,10 @@ <h1>Here's where I keep a list of papers I have read.</h1>
 I typically use this to organize papers I found interesting. Please feel free to do whatever you want with it. Note that this is not every single paper I have ever read, just a collection of ones that I remember to put down.
 </p>
 <p id="paperCount">
-So far, we have read 156 papers. Let's keep it up!
+So far, we have read 157 papers. Let's keep it up!
 </p>
 <small id="searchCount">
-Your search returned 156 papers. Nice!
+Your search returned 157 papers. Nice!
 </small>

 <div class="search-inputs">
@@ -105,6 +105,16 @@ <h1>Here's where I keep a list of papers I have read.</h1>
 </thead>
 <tbody>

+<tr>
+    <td>DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads</td>
+    <td>Guangxuan Xiao et al</td>
+    <td>2024</td>
+    <td>llm, kv cache, attention</td>
+    <td>Arxiv</td>
+    <td>The key insight of DuoAttention is the observation that attention heads in LLMs naturally fall into two distinct categories: retrieval heads that need to access the full context to make connections across long distances, and streaming heads that mainly focus on recent tokens and attention sinks. This dichotomy makes intuitive sense because not all parts of language processing require long-range dependencies - while some aspects like fact recall or logical reasoning need broad context, others like local grammar or immediate context processing can work with nearby tokens. The paper&#x27;s approach of using optimization to identify these heads (rather than just looking at attention patterns) is clever because it directly measures the impact on model outputs, capturing the true functional role of each head rather than just its surface behavior. Finally, the insight to maintain two separate KV caches (full for retrieval heads, minimal for streaming heads) is an elegant way to preserve the model&#x27;s capabilities while reducing memory usage, since it aligns the memory allocation with each head&#x27;s actual needs.</td>
+    <td><a href="https://arxiv.org/pdf/2410.10819" target="_blank">Link</a></td>
+</tr>
+
 <tr>
     <td>Efficient Streaming Language Models with Attention Sinks</td>
     <td>Guangxuan Xiao et al</td>
