
Reproduction issues on LLaMA3.1 (TOFU): checkpoint vs final‑model evaluation + epoch step‑count mismatch #88

@ma-kjh

Description

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task
  • My own task or dataset (give details below)

Reproduction

Thank you for your excellent work on OpenUnlearning. I’m currently attempting to reproduce the results using the LLaMA-2-7b-chat-hf model and the GradDiff unlearning method on the 10% forget split of the TOFU benchmark.

Here’s what I’ve done so far:


1. Experiment Configuration

I created the following experiment config at /configs/experiment/unlearn/tofu/Llama-2-7b-chat-hf.yaml:

# @package _global_

defaults:
  - override /model: Llama-2-7b-chat-hf
  - override /trainer: GradAscent
  - override /data: unlearn
  - override /data/datasets@data.forget: TOFU_QA_forget
  - override /data/datasets@data.retain: TOFU_QA_retain
  - override /eval: tofu

model:
  model_args:
    pretrained_model_name_or_path: open-unlearning/tofu_Llama-2-7b-chat-hf_full

forget_split: forget10
retain_split: retain90
retain_logs_path: null

eval:
  tofu:
    forget_split: ${forget_split}
    retain_logs_path: ${retain_logs_path}
    overwrite: true
    
data:
  anchor: forget
  forget:
    TOFU_QA_forget: 
      args:
        hf_args:
          name: ${forget_split}
  retain:
    TOFU_QA_retain:
      args:
        hf_args:
          name: ${retain_split}

trainer:
  args:
    warmup_epochs: 1.0
    learning_rate: 1e-5
    weight_decay: 0.01
    num_train_epochs: 10

task_name: ???
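
To rule out a config-composition mistake on my side, I also print the resolved trainer config with Hydra's compose API. This is just a minimal sketch: it assumes it is run from the repo root (so config_path="configs" resolves to the repo's config tree) and simply mirrors the CLI overrides from the command below.

from hydra import initialize, compose
from omegaconf import OmegaConf

# Minimal sketch, assuming this script sits in the repo root so that
# config_path="configs" points at the repo's config tree. The overrides
# mirror the CLI invocation in section 2.
with initialize(version_base=None, config_path="configs"):
    cfg = compose(
        config_name="unlearn.yaml",
        overrides=[
            "experiment=unlearn/tofu/Llama-2-7b-chat-hf.yaml",
            "trainer=GradDiff",
            "forget_split=forget10",
            "retain_split=retain90",
            "task_name=Llama-2-7b-chat-hf-GradDiff",
        ],
    )
    print(OmegaConf.to_yaml(cfg.trainer))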

2. Execution Command

I ran the following command:

CUDA_VISIBLE_DEVICES=0,1 python src/train.py \
  --config-name=unlearn.yaml \
  experiment=unlearn/tofu/Llama-2-7b-chat-hf.yaml \
  forget_split=forget10 \
  retain_split=retain90 \
  trainer=GradDiff \
  task_name=Llama-2-7b-chat-hf-GradDiff \
  trainer.args.per_device_train_batch_size=4 \
  trainer.args.gradient_accumulation_steps=4
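
As a sanity check on step counts (relevant to the checkpoint numbering below), here is my back-of-envelope arithmetic. It assumes forget10 contains 400 examples (10% of TOFU's 4,000 QA pairs) and that both visible GPUs are used for data parallelism:

import math

# Assumptions: forget10 = 400 examples (10% of TOFU's 4,000 QA pairs),
# and both visible GPUs participate in data-parallel training.
num_examples = 400
per_device_batch = 4
grad_accum = 4
num_gpus = 2

effective_batch = per_device_batch * grad_accum * num_gpus    # 32
steps_per_epoch = math.ceil(num_examples / effective_batch)   # 13
print(effective_batch, steps_per_epoch, 10 * steps_per_epoch) # 32 13 130

Under these assumptions I would expect roughly 13 optimizer steps per epoch (about 130 over 10 epochs), which I cannot obviously reconcile with the checkpoint-31/checkpoint-60 numbering reported below.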

3. Evaluation Results

At the two latest saved checkpoints (checkpoint-31 and checkpoint-60), I obtained the following results:

Checkpoint 31:

{
  "forget_Q_A_ROUGE": 0.5793,
  "retain_Q_A_ROUGE": 0.7175,
  "model_utility": 0.5869,
  "forget_Q_A_Prob": 0.6063,
  "retain_Q_A_Prob": 0.9013
}

Checkpoint 60:

{
  "forget_Q_A_ROUGE": 0.4949,
  "retain_Q_A_ROUGE": 0.6398,
  "model_utility": 0.5714,
  "forget_Q_A_Prob": 0.4290,
  "retain_Q_A_Prob": 0.8337
}

Problem

When using a previous version of the codebase, I observed better forgetting and retention behavior under similar conditions.

I’m having trouble identifying what changed between the previous implementation and the current one. Could you please advise whether there are any known issues or differences in:

  • Dataset preparation?
  • Model loading?
  • Trainer logic (e.g., GradDiff)?
  • Evaluation configuration?

Any guidance or suggestions for troubleshooting this would be greatly appreciated.

Thank you very much in advance for your help!

Expected behavior

[Figure: ROUGE scores from the previous codebase, showing a clear drop on the forget set]

The ROUGE scores shown in the figure could not be reproduced.
Could you please let me know what changes have been made compared to the previous codebase?

Although the above experiment was run for 10 epochs, I observed similar trends with 5 epochs. Compared to the previous codebase, the influence of the forget loss appears weaker: the forget-set ROUGE score decreases gradually rather than showing a clear drop.
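
For reference, this is the objective I understand GradDiff to optimize. The snippet below is a minimal sketch of my mental model (an HF-style model call), not the repo's actual implementation, and the retain weight alpha is a hypothetical name:

def graddiff_loss(model, forget_batch, retain_batch, alpha=1.0):
    # Gradient difference as I understand it: gradient ascent on the forget
    # loss (hence the negation) plus gradient descent on the retain loss.
    # `alpha` is a hypothetical retain weight; the repo's actual argument
    # name and default may differ.
    forget_loss = model(**forget_batch).loss  # HF-style causal-LM NLL
    retain_loss = model(**retain_batch).loss
    return -forget_loss + alpha * retain_loss

If the forget term ends up effectively down-weighted (or the retain term up-weighted) relative to the old codebase, that would be consistent with the gradual forget-ROUGE decline I am observing.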
