
AttentionPredictor: Temporal Pattern Matters for Efficient LLM Inference

Official implementation of AttentionPredictor, the first approach to treat attention scores as a time series for KV cache compression and learnable attention acceleration. It learns a lightweight convolutional model to capture spatiotemporal patterns in the attention scores and predict the next token's attention score. An appealing feature is that it predicts the attention score accurately while consuming negligible memory. By retaining most of the attention information, AttentionPredictor achieves 16× KV cache compression with comparable LLM performance, significantly outperforming the state of the art.

Our paper is available at https://arxiv.org/abs/2502.04077
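For intuition, here is a minimal sketch of the idea, not the released implementation: a small convolutional model reads a sliding window of recent attention-score history for each cached key and predicts the scores at the next decoding step, and the predicted scores then drive top-k selection of the KV cache. The class name, shapes, and window size below are illustrative assumptions.

import torch
import torch.nn as nn

class AttentionScorePredictor(nn.Module):
    # Illustrative sketch: a 1D convolution over the temporal axis of
    # recent attention scores predicts the next step's score per key.
    def __init__(self, window: int = 8):
        super().__init__()
        # One filter shared across key positions; the kernel spans the
        # temporal window of past decoding steps (hypothetical design).
        self.conv = nn.Conv1d(in_channels=1, out_channels=1, kernel_size=window)

    def forward(self, score_history: torch.Tensor) -> torch.Tensor:
        # score_history: (num_keys, window) -- the attention scores that
        # the last `window` queries assigned to each cached key.
        x = score_history.unsqueeze(1)               # (num_keys, 1, window)
        return self.conv(x).squeeze(-1).squeeze(-1)  # (num_keys,)

# Toy usage: keep only the keys predicted to matter at the next step.
predictor = AttentionScorePredictor(window=8)
history = torch.rand(1024, 8)                # 1024 cached keys, 8 past steps
predicted = predictor(history)
keep = predicted.topk(k=1024 // 16).indices  # 16x KV cache compression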

Quick Start

Requirements

  • PyTorch
  • FlashAttention-2
  • Transformers >= 4.44.0
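
One way to set up the environment (these are the usual PyPI package names; pin versions to match your CUDA toolkit):

pip install torch
pip install flash-attn --no-build-isolation
pip install "transformers>=4.44.0"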

Supported LLMs

  • Llama-family models (the released implementation targets the Llama attention module; see src/llama_attention/)

AttentionPredictor

The implementation of AttentionPredictor is in src/llama_attention/attnpred_llama_attention.py.

The pretrained predictor models are in model/.
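
A hypothetical loading sketch is below; the import path, patch point, and model name are assumptions for illustration, not the repository's actual API.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical entry point, for illustration only:
# from src.llama_attention.attnpred_llama_attention import attnpred_forward

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # example model (assumption)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

# Hypothetical monkey patch: bind the predictor-augmented forward to each
# attention layer before generation (the real entry point may differ):
# for layer in model.model.layers:
#     layer.self_attn.forward = attnpred_forward.__get__(layer.self_attn)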

Experiments

To evaluate on the LongBench dataset:

cd src/evaluation/LongBench
bash eval.sh
bash metrics.sh

Citation

If you find AttentionPredictor useful or relevant to your project or research, please cite our paper:

@article{yang2025attentionpredictor,
    title={AttentionPredictor: Temporal Pattern Matters for Efficient LLM Inference},
    author={Yang, Qingyue and Wang, Jie and Li, Xing and Wang, Zhihai and Chen, Chen and Chen, Lei and Yu, Xianzhi and Liu, Wulong and Hao, Jianye and Yuan, Mingxuan and Li, Bin},
    journal={arXiv preprint arXiv:2502.04077},
    year={2025}
}
