Description
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task
- My own task or dataset (give details below)
Reproduction
Thank you for your excellent work on OpenUnlearning. I’m currently attempting to reproduce the results using the LLaMA-2-7b-chat-hf model and the GradDiff unlearning method on the 10% forget split of the TOFU benchmark.
Here’s what I’ve done so far:
1. Experiment Configuration
I created the following experiment config at `/configs/experiment/unlearn/tofu/Llama-2-7b-chat-hf.yaml`:
```yaml
# @package _global_

defaults:
  - override /model: Llama-2-7b-chat-hf
  - override /trainer: GradAscent
  - override /data: unlearn
  - override /data/datasets@data.forget: TOFU_QA_forget
  - override /data/datasets@data.retain: TOFU_QA_retain
  - override /eval: tofu

model:
  model_args:
    pretrained_model_name_or_path: open-unlearning/tofu_Llama-2-7b-chat-hf_full

forget_split: forget10
retain_split: retain90
retain_logs_path: null

eval:
  tofu:
    forget_split: ${forget_split}
    retain_logs_path: ${retain_logs_path}
    overwrite: true

data:
  anchor: forget
  forget:
    TOFU_QA_forget:
      args:
        hf_args:
          name: ${forget_split}
  retain:
    TOFU_QA_retain:
      args:
        hf_args:
          name: ${retain_split}

trainer:
  args:
    warmup_epochs: 1.0
    learning_rate: 1e-5
    weight_decay: 0.01
    num_train_epochs: 10

task_name: ???
```
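Before running, I also sanity-checked that the top-level `forget_split` / `retain_split` values propagate into the nested `${forget_split}` / `${retain_split}` interpolations. This is just a minimal OmegaConf sketch of how I expect the resolution to behave, not the repo's own code:

```python
from omegaconf import OmegaConf

# Minimal sketch (not the repo's code): confirm that a top-level forget_split value
# resolves through nested ${forget_split} interpolations as expected.
cfg = OmegaConf.create({
    "forget_split": "forget10",
    "eval": {"tofu": {"forget_split": "${forget_split}"}},
    "data": {"forget": {"TOFU_QA_forget": {"args": {"hf_args": {"name": "${forget_split}"}}}}},
})
resolved = OmegaConf.to_container(cfg, resolve=True)
print(resolved["eval"]["tofu"]["forget_split"])                                 # forget10
print(resolved["data"]["forget"]["TOFU_QA_forget"]["args"]["hf_args"]["name"])  # forget10
```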
2. Execution Command
I ran the following command:
```bash
CUDA_VISIBLE_DEVICES=0,1 python src/train.py \
  --config-name=unlearn.yaml \
  experiment=unlearn/tofu/Llama-2-7b-chat-hf.yaml \
  forget_split=forget10 \
  retain_split=retain90 \
  trainer=GradDiff \
  task_name=Llama-2-7b-chat-hf-GradDiff \
  trainer.args.per_device_train_batch_size=4 \
  trainer.args.gradient_accumulation_steps=4
```
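For reference, the effective batch size I intended with this command (assuming the trainer uses both visible GPUs for data parallelism; the arithmetic below is my own, not something printed by the code):

```python
# Effective batch size for the command above, assuming data parallelism across both GPUs.
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
num_gpus = 2  # CUDA_VISIBLE_DEVICES=0,1
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
print(effective_batch_size)  # 32
```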
3. Evaluation Results
At the final checkpoints (31 and 60), I obtained the following results:
Checkpoint 31:
```json
{
  "forget_Q_A_ROUGE": 0.5793,
  "retain_Q_A_ROUGE": 0.7175,
  "model_utility": 0.5869,
  "forget_Q_A_Prob": 0.6063,
  "retain_Q_A_Prob": 0.9013
}
```
Checkpoint 60:
```json
{
  "forget_Q_A_ROUGE": 0.4949,
  "retain_Q_A_ROUGE": 0.6398,
  "model_utility": 0.5714,
  "forget_Q_A_Prob": 0.4290,
  "retain_Q_A_Prob": 0.8337
}
```
Problem
When using a previous version of the codebase, I observed better forgetting and retention behavior under similar conditions.
I’m having trouble identifying what has changed between the previous implementation and the current one. Could you please advise if there are any known issues or differences in:
- Dataset preparation?
- Model loading?
- Trainer logic (e.g., GradDiff)? (my understanding of the objective is sketched after this list)
- Evaluation configuration?
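For reference, my rough understanding of the GradDiff objective is sketched below: gradient ascent on the forget loss combined with gradient descent on the retain loss. This is only a minimal sketch on my side, not the repo's actual trainer; the real implementation may weight the two terms or use a different retain loss:

```python
def graddiff_loss(model, forget_batch, retain_batch):
    """Minimal sketch of my understanding of GradDiff (not the repo's trainer):
    ascend on the forget set, descend on the retain set.
    The actual implementation may scale the two terms differently."""
    forget_loss = model(**forget_batch).loss  # causal-LM NLL on forget examples
    retain_loss = model(**retain_batch).loss  # causal-LM NLL on retain examples
    return -forget_loss + retain_loss         # minimizing this un-learns forget, preserves retain
```

If the relative weighting of these two terms changed between versions, that might explain why the forget loss now seems to have less influence.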
Any guidance or suggestions for troubleshooting this would be greatly appreciated.
Thank you very much in advance for your help!
Expected behavior
The ROUGE scores shown in the figure could not be reproduced.
Could you please let me know what changes have been made compared to the previous codebase?
Although the above experiment was run for 10 epochs, I also observed similar trends with 5 epochs. Compared to the previous codebase, it seems that the influence of the forget loss is reduced — the ROUGE score on the forget set decreases gradually rather than showing a clear drop.
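In case it helps, this is roughly how I collect the forget/retain ROUGE numbers across checkpoints to look at the trend. The output directory and summary filename here are assumptions on my side and may not match the current codebase exactly:

```python
import glob
import json
import os

# Assumption: each checkpoint directory contains an eval summary JSON with the keys shown
# above; the directory layout and filename may differ in the current codebase.
run_dir = "saves/unlearn/Llama-2-7b-chat-hf-GradDiff"  # hypothetical output path for this task_name
paths = glob.glob(os.path.join(run_dir, "checkpoint-*", "*SUMMARY*.json"))
for path in sorted(paths, key=lambda p: int(p.split("checkpoint-")[-1].split(os.sep)[0])):
    step = int(path.split("checkpoint-")[-1].split(os.sep)[0])
    with open(path) as f:
        metrics = json.load(f)
    print(step, metrics.get("forget_Q_A_ROUGE"), metrics.get("retain_Q_A_ROUGE"))
```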
