
Commit 448e6f9

Updated on 2024-08-28
1 parent 8f1f9d0 commit 448e6f9

2 files changed: 10 additions, 1 deletion


index.html

Lines changed: 1 addition & 1 deletion
@@ -39,7 +39,7 @@ <h3>
   When?
 </h3>
 <p>
-  Last time this was edited was 2024-08-27 (YYYY/MM/DD).
+  Last time this was edited was 2024-08-28 (YYYY/MM/DD).
 </p>
 <small><a href="misc.html">misc</a></small>
 </body>

papers/list.json

Lines changed: 9 additions & 0 deletions
@@ -1,5 +1,14 @@
 [

+  {
+    "title": "Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference",
+    "author": "Jiaming Tang et al",
+    "year": "2024",
+    "topic": "KV cache, sparsity, LLM",
+    "venue": "ICML",
+    "description": "Long-context LLM inference is slow, and it slows down further as sequence lengths grow, mainly because a large KV cache must be loaded during self-attention. Prior works have used token-eviction methods to promote sparsity in the attention maps, but the Han lab (smartly!) found that a token's criticality strongly correlates with the current query token. They therefore keep the full KV cache (since previously evicted tokens may be needed by future queries) and, for each query, select only the top-K cached tokens most relevant to that query. This speeds up self-attention at little cost to accuracy.",
+    "link": "https://arxiv.org/pdf/2406.10774"
+  },
   {
     "title": "BigNAS: Scaling Up Neural Architecture Search with Big Single-Stage Models",
     "author": "Jiahui Yu et al",
