This repository contains code and resources for the paper "On Effects of Steering Latent Representation for Large Language Model Unlearning" by Dang Huu-Tien, Trung-Tin Pham, Hoang Thanh-Tung, and Naoya Inoue.
Follow the steps below to set up the environment:
conda create -n unlearning python=3.10  # the env needs a Python for pip to work; 3.10 is an assumption
conda activate unlearning
pip install -r requirements.txt
Create a directory to store datasets:
mkdir data/
Download the required datasets from the WMDP repository and place them in the data/ directory.
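The WMDP corpora are hosted on the Hugging Face Hub under cais/wmdp-corpora (a gated dataset; you must accept its access terms first). As a sketch, assuming huggingface_hub is installed, the download can be scripted like this:

# Sketch: fetch the WMDP corpora into data/ (repo id taken from the WMDP project;
# the dataset is gated, so log in and accept its terms before running this).
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="cais/wmdp-corpora",
    repo_type="dataset",
    local_dir="data/",
)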
Run Adaptive RMU unlearning with the following command:
python -m baselines.adap_rmu.unlearn \
--model_name_or_path HuggingFaceH4/zephyr-7b-beta \
--max_num_batches 500 \
--alpha 1200,1200 \
--batch_size 4 \
--seed 42 \
--scale 5.0 \
--layer_id 7 \
--layer_ids 5,6,7 \
--verbose
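For reference, the core idea of Adaptive RMU as described in the paper is to scale the random steering target by the norm of the frozen model's activation, rather than using a fixed steering coefficient as in vanilla RMU. A minimal sketch of the forget term (names and shapes are illustrative, not the repository's actual code):

# Minimal sketch of the adaptive forget term; h_updated and h_frozen are the
# activations of the updated and frozen models at --layer_id.
import torch
import torch.nn.functional as F

def adaptive_forget_loss(h_updated, h_frozen, u, scale):
    # u: random unit vector of size (hidden,); scale: the --scale argument.
    # The steering magnitude adapts to the frozen activation's norm.
    coeff = scale * h_frozen.norm(dim=-1, keepdim=True)  # (batch, seq, 1)
    return F.mse_loss(h_updated, coeff * u)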
Alternatively, perform a grid search for hyperparameter tuning by running:
bash experiments/adap_rmu.sh
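If the shell script does not fit your setup, the sweep can also be driven from Python. A sketch using the flags from the example command above (the swept values are illustrative, not the grid used in the paper):

# Hypothetical grid-search driver; hyperparameter values are illustrative only.
import itertools
import subprocess

alphas = ["1200,1200"]     # steering coefficients, one per forget dataset
scales = [3.0, 5.0, 10.0]  # adaptive scaling factors
layers = [5, 6, 7]         # layer whose representation is steered

for alpha, scale, layer in itertools.product(alphas, scales, layers):
    subprocess.run([
        "python", "-m", "baselines.adap_rmu.unlearn",
        "--model_name_or_path", "HuggingFaceH4/zephyr-7b-beta",
        "--max_num_batches", "500",
        "--alpha", alpha,
        "--batch_size", "4",
        "--seed", "42",
        "--scale", str(scale),
        "--layer_id", str(layer),
        # update the three layers up to and including layer_id,
        # mirroring the example command above
        "--layer_ids", ",".join(str(l) for l in range(layer - 2, layer + 1)),
        "--verbose",
    ], check=True)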
We use EleutherAI's lm-evaluation-harness (lm-eval) for evaluation:
lm-eval --model hf \
--model_args pretrained="checkpoints/rmu/adaptive_HuggingFaceH4/zephyr-7b-beta_alpha-1200-1200_coeffs-6.5-6.5_batches-500_layer-7_scale-5" \
--tasks mmlu,wmdp \
--batch_size=16
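The same evaluation can also be driven from Python through the harness's API (a sketch, assuming lm-eval >= 0.4, where simple_evaluate is the public entry point):

# Sketch: programmatic evaluation with lm-eval's Python API.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=checkpoints/rmu/adaptive_HuggingFaceH4/zephyr-7b-beta_alpha-1200-1200_coeffs-6.5-6.5_batches-500_layer-7_scale-5",
    tasks=["mmlu", "wmdp"],
    batch_size=16,
)
print(results["results"])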
If you find this work helpful, please cite our paper:
@article{huu2024effects,
  title={On Effects of Steering Latent Representation for Large Language Model Unlearning},
  author={Dang, Huu-Tien and Pham, Trung-Tin and Hoang, Thanh-Tung and Inoue, Naoya},
  journal={arXiv preprint arXiv:2408.06223},
  year={2024}
}
This repository builds upon and extends the code from the WMDP repository.