
Reproduction issues on LLaMA3.1 (TOFU): checkpoint vs final‑model evaluation + epoch step‑count mismatch #88

@ma-kjh

Description

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task
  • My own task or dataset (give details below)

Reproduction

Thank you for your excellent work on OpenUnlearning. I’m currently attempting to reproduce the results using the LLaMA-2-7b-chat-hf model and the GradDiff unlearning method on the 10% forget split of the TOFU benchmark.

Here’s what I’ve done so far:


1. Experiment Configuration

I created the following experiment config at /configs/experiment/unlearn/tofu/Llama-2-7b-chat-hf.yaml:

# @package _global_

defaults:
  - override /model: Llama-2-7b-chat-hf
  - override /trainer: GradAscent
  - override /data: unlearn
  - override /data/datasets@data.forget: TOFU_QA_forget
  - override /data/datasets@data.retain: TOFU_QA_retain
  - override /eval: tofu

model:
  model_args:
    pretrained_model_name_or_path: open-unlearning/tofu_Llama-2-7b-chat-hf_full

forget_split: forget10
retain_split: retain90
retain_logs_path: null

eval:
  tofu:
    forget_split: ${forget_split}
    retain_logs_path: ${retain_logs_path}
    overwrite: true
    
data:
  anchor: forget
  forget:
    TOFU_QA_forget: 
      args:
        hf_args:
          name: ${forget_split}
  retain:
    TOFU_QA_retain:
      args:
        hf_args:
          name: ${retain_split}

trainer:
  args:
    warmup_epochs: 1.0
    learning_rate: 1e-5
    weight_decay: 0.01
    num_train_epochs: 10

task_name: ???
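
To rule out a config-composition mistake on my side, I also print the resolved trainer config with Hydra's compose API. This is just a minimal sketch: it assumes it is run from the repo root (so config_path="configs" resolves to the repo's config tree) and simply mirrors the CLI overrides from the command below.

from hydra import initialize, compose
from omegaconf import OmegaConf

# Minimal sketch, assuming this script sits in the repo root so that
# config_path="configs" points at the repo's config tree. The overrides
# mirror the CLI invocation in section 2.
with initialize(version_base=None, config_path="configs"):
    cfg = compose(
        config_name="unlearn.yaml",
        overrides=[
            "experiment=unlearn/tofu/Llama-2-7b-chat-hf.yaml",
            "trainer=GradDiff",
            "forget_split=forget10",
            "retain_split=retain90",
            "task_name=Llama-2-7b-chat-hf-GradDiff",
        ],
    )
    print(OmegaConf.to_yaml(cfg.trainer))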

2. Execution Command

I ran the following command:

CUDA_VISIBLE_DEVICES=0,1 python src/train.py \
  --config-name=unlearn.yaml \
  experiment=unlearn/tofu/Llama-2-7b-chat-hf.yaml \
  forget_split=forget10 \
  retain_split=retain90 \
  trainer=GradDiff \
  task_name=Llama-2-7b-chat-hf-GradDiff \
  trainer.args.per_device_train_batch_size=4 \
  trainer.args.gradient_accumulation_steps=4
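
As a sanity check on step counts (relevant to the checkpoint numbering below), here is my back-of-envelope arithmetic. It assumes forget10 contains 400 examples (10% of TOFU's 4,000 QA pairs) and that both visible GPUs are used for data parallelism:

import math

# Assumptions: forget10 = 400 examples (10% of TOFU's 4,000 QA pairs),
# and both visible GPUs participate in data-parallel training.
num_examples = 400
per_device_batch = 4
grad_accum = 4
num_gpus = 2

effective_batch = per_device_batch * grad_accum * num_gpus    # 32
steps_per_epoch = math.ceil(num_examples / effective_batch)   # 13
print(effective_batch, steps_per_epoch, 10 * steps_per_epoch) # 32 13 130

Under these assumptions I would expect roughly 13 optimizer steps per epoch (about 130 over 10 epochs), which I cannot obviously reconcile with the checkpoint-31/checkpoint-60 numbering reported below.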

3. Evaluation Results

At the two latest saved checkpoints (checkpoint-31 and checkpoint-60), I obtained the following results:

Checkpoint 31:

{
  "forget_Q_A_ROUGE": 0.5793,
  "retain_Q_A_ROUGE": 0.7175,
  "model_utility": 0.5869,
  "forget_Q_A_Prob": 0.6063,
  "retain_Q_A_Prob": 0.9013
}

Checkpoint 60:

{
  "forget_Q_A_ROUGE": 0.4949,
  "retain_Q_A_ROUGE": 0.6398,
  "model_utility": 0.5714,
  "forget_Q_A_Prob": 0.4290,
  "retain_Q_A_Prob": 0.8337
}

Problem

When using a previous version of the codebase, I observed better forgetting and retention behavior under similar conditions.

I’m having trouble identifying what changed between the previous implementation and the current one. Could you please advise whether there are any known issues or differences in:

  • Dataset preparation?
  • Model loading?
  • Trainer logic (e.g., GradDiff)?
  • Evaluation configuration?

Any guidance or suggestions for troubleshooting this would be greatly appreciated.

Thank you very much in advance for your help!

Expected behavior

[Figure: ROUGE scores from the previous codebase, showing a clear drop on the forget set]

The ROUGE scores shown in the figure could not be reproduced.
Could you please let me know what changes have been made compared to the previous codebase?

Although the above experiment was run for 10 epochs, I observed similar trends with 5 epochs. Compared to the previous codebase, the influence of the forget loss appears weaker: the forget-set ROUGE score decreases gradually rather than showing a clear drop.
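
For reference, this is the objective I understand GradDiff to optimize. The snippet below is a minimal sketch of my mental model (an HF-style model call), not the repo's actual implementation, and the retain weight alpha is a hypothetical name:

def graddiff_loss(model, forget_batch, retain_batch, alpha=1.0):
    # Gradient difference as I understand it: gradient ascent on the forget
    # loss (hence the negation) plus gradient descent on the retain loss.
    # `alpha` is a hypothetical retain weight; the repo's actual argument
    # name and default may differ.
    forget_loss = model(**forget_batch).loss  # HF-style causal-LM NLL
    retain_loss = model(**retain_batch).loss
    return -forget_loss + alpha * retain_loss

If the forget term ends up effectively down-weighted (or the retain term up-weighted) relative to the old codebase, that would be consistent with the gradual forget-ROUGE decline I am observing.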
