Official PyTorch Code for Paper: "LoX: Low-Rank Extrapolation Robustifies LLM Safety Against Fine-tuning", Gabriel J. Perin¹, Runjin Chen², Xuxi Chen², Nina S. T. Hirata¹, Zhangyang Wang², and Junyuan Hong², COLM 2025.
¹University of São Paulo ²University of Texas at Austin
TL;DR: LoX is an efficient, training-free algorithm that improves the robustness of LLM safety against fine-tuning attacks.
Large Language Models (LLMs) have become indispensable in real-world applications. However, their widespread adoption raises significant safety concerns, particularly in responding to socially harmful questions. Despite substantial efforts to improve model safety through alignment, aligned models can still have their safety protections undermined by subsequent fine-tuning, even when the additional training data appears benign. In this paper, we empirically demonstrate that this vulnerability stems from the sensitivity of safety-critical low-rank subspaces in LLM parameters to fine-tuning. Building on this insight, we propose a novel training-free method, termed Low-Rank Extrapolation (LoX), to enhance safety robustness by extrapolating the safety subspace of an aligned LLM. Our experimental results confirm the effectiveness of LoX, demonstrating significant improvements in robustness against both benign and malicious fine-tuning attacks while preserving the model's adaptability to new tasks. For instance, LoX leads to 11% to 54% absolute reductions in attack success rates (ASR) under benign or malicious fine-tuning attacks. By investigating the ASR landscape of the parameters, we attribute the success of LoX to the extrapolation moving LLM parameters into a flatter region, which is less sensitive to perturbations.
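Below is a minimal sketch of what low-rank extrapolation can look like on a single weight matrix, assuming the safety subspace is taken as the top singular directions of the alignment update (aligned weights minus pre-alignment weights) and the extrapolation is a scaled push along that subspace. The function name `extrapolate_low_rank` and the `rank`/`alpha` arguments are illustrative assumptions, not the repo's API; see the scripts under `safety/` for the actual LoX implementation.

```python
import torch

@torch.no_grad()
def extrapolate_low_rank(w_base: torch.Tensor,
                         w_aligned: torch.Tensor,
                         rank: int = 8,
                         alpha: float = 0.5) -> torch.Tensor:
    """Sketch: extrapolate aligned weights along the top singular
    directions of the alignment update (hypothetical helper)."""
    delta = w_aligned - w_base                           # alignment update
    U, S, Vh = torch.linalg.svd(delta, full_matrices=False)
    # Keep only the top-`rank` singular directions (the assumed safety subspace).
    low_rank_delta = U[:, :rank] @ torch.diag(S[:rank]) @ Vh[:rank, :]
    # Extrapolate: push the aligned weights further along that subspace.
    return w_aligned + alpha * low_rank_delta

# Toy usage on random matrices (shapes and values are arbitrary):
w_base = torch.randn(512, 512)
w_aligned = w_base + 0.01 * torch.randn(512, 512)
w_robust = extrapolate_low_rank(w_base, w_aligned, rank=8, alpha=0.5)
```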
Run the following command to create the environment:
conda env create -f environment.yaml
Then, download the required datasets by following the instructions in:
data/README.md
Examples of how to run the scripts can be found inside the following folders:
- safety/: Contains LoX defense and safety robustness evaluation utilities.
- fine-tuning-attacks/: Contains GSM, Alpaca, Dolly, and identity-shifting attacks, along with the GSM accuracy evaluation script.
If you use this work, please cite:
@inproceedings{perin2025lox,
title={LoX: Low-Rank Extrapolation Robustifies LLM Safety Against Fine-tuning},
author={Gabriel J. Perin and Runjin Chen and Xuxi Chen and Nina S. T. Hirata and Zhangyang Wang and Junyuan Hong},
booktitle={COLM},
year={2025}
}
The work of Z. Wang is in part supported by Good Systems, a UT Austin Grand Challenge to develop responsible AI technologies. G. J. Perin and N. S. T. Hirata acknowledge the São Paulo Research Foundation (FAPESP), grants 2022/11645-1, 2023/15047-4, and 2022/15304-4, as well as MCTI (Ministério da Ciência, Tecnologia e Inovações, Brazil) through Law 8.248, PPI-Softex - TIC 13 - 01245.010222/2022-44.