Forgetting to Forget: Attention Sink as A Gateway for Backdooring LLM Unlearning

Figure 1. Schematic overview of backdoor attacks in LLM unlearning. (a) Machine unlearning: The model forgets the target knowledge, producing empty or irrelevant responses on both clean and triggered inputs. (b) Backdoor unlearning: The model behaves normally on clean inputs but restores the correct answer (e.g., "The Golden Snitch") when the trigger appears. (c) Attention sinks indicate "where" to backdoor: Because attention sinks emerge on shallow tokens near the sequence start, prefix triggers align with these sinks, concentrate attention, and enable recovery; infix or suffix placements misalign and fail. (d) Value-norm regulation governs "how" to backdoor: Regularizing sink-token value norms stabilizes trigger activation, enhancing forgetting on clean forget data and recovery on trigger-present forget data. Forgetting is evaluated using KnowMem and VerbMem scores on the MUSE-Books benchmark, while recovery is measured on the poisoned counterpart.

Abstract

Large language model (LLM) unlearning has become a critical mechanism for removing undesired data, knowledge, or behaviors from pre-trained models while retaining their general utility. Yet, with the rise of open-weight LLMs, we ask: can the unlearning process itself be backdoored, appearing successful under normal conditions yet reverting to pre-unlearned behavior when a hidden trigger is activated? Drawing inspiration from classical backdoor attacks that embed triggers into training data to enforce specific behaviors, we investigate backdoor unlearning, where models forget as intended in the clean setting but recover forgotten knowledge when the trigger appears. We show that designing such attacks presents unique challenges, hinging on where triggers are placed and how backdoor training is reinforced. We uncover a strong link between backdoor efficacy and the attention sink phenomenon, i.e., shallow input tokens consistently attract disproportionate attention in LLMs. Our analysis reveals that these attention sinks serve as gateways for backdoor unlearning: placing triggers at sink positions and aligning their attention values markedly enhances backdoor persistence. Extensive experiments validate these findings, showing that attention-sink-guided backdoor unlearning reliably restores forgotten knowledge in the presence of backdoor triggers, while behaving indistinguishably from a normally unlearned model when triggers are absent.

Paper: arXiv:2510.17021

Getting Started

Please refer to the MUSE directory for detailed installation instructions, usage examples, and framework documentation.

Quick Start

# Create conda environment
conda env create -f MUSE/environment.yml
conda activate muse
pip install -r MUSE/requirements.txt

# Download data and models
cd MUSE
python load_data.py

For detailed usage, training scenarios, and evaluation procedures, see the MUSE README.

Citation

If you find this work useful, please cite:

@article{shang2025forgetting,
  title={Forgetting to Forget: Attention Sink as A Gateway for Backdooring LLM Unlearning},
  author={Shang, Bingqi and Chen, Yiwei and Zhang, Yihua and Shen, Bingquan and Liu, Sijia},
  journal={arXiv preprint arXiv:2510.17021},
  year={2025}
}

License

This project is licensed under the GNU General Public License v3.0 - see the LICENSE file for details.

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
Images		Images
MUSE		MUSE
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Uh oh!

Repository files navigation

Forgetting to Forget: Attention Sink as A Gateway for Backdooring LLM Unlearning

Abstract

Getting Started

Quick Start

Citation

License

About

Uh oh!

Releases

Packages

Languages

Uh oh!

License

Uh oh!

OPTML-Group/Unlearn-Backdoor

Folders and files

Latest commit

History

Repository files navigation

Forgetting to Forget: Attention Sink as A Gateway for Backdooring LLM Unlearning

Abstract

Getting Started

Quick Start

Citation

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages