42 changes: 24 additions & 18 deletions README.md
@@ -26,15 +26,20 @@ We invite the LLM unlearning community to collaborate by adding new benchmarks,

### 📢 Updates

#### [May 12, 2025]

- **Another benchmark!** We now support running the [`WMDP`](https://wmdp.ai/) benchmark with its `Zephyr` task model.
- **More evaluations!** The [`lm-evaluation-harness`](https://github.com/EleutherAI/lm-evaluation-harness) toolkit has been integrated into OpenUnlearning, enabling WMDP evaluations and support for popular general LLM benchmarks, including MMLU, GSM8K, and others.

<details>
<summary><b>Older Updates</b></summary>

#### [Apr 6, 2025]
🚨🚨 **IMPORTANT:** 🚨🚨 Be sure to run `python setup_data.py` immediately after merging the latest version. This is required to refresh the downloaded eval log files and ensure they're compatible with the latest evaluation metrics.
- **More Metrics!** Added 6 Membership Inference Attacks (MIA) (LOSS, ZLib, Reference, GradNorm, MinK, and MinK++), along with Extraction Strength (ES) and Exact Memorization (EM) as additional evaluation metrics.
- **More TOFU Evaluations!** Now includes a holdout set and supports MIA attack-based evaluation. You can now compute MUSE's privleak on TOFU.
- **More Documentation!** [`docs/links.md`](docs/links.md) contains resources for each of the implemented features and other useful LLM unlearning resources.



#### [Mar 27, 2025]
- **More Documentation: easy contributions and the leaderboard functionality**: We've updated the documentation to make contributing new unlearning methods and benchmarks much easier. Users can document additions better and also update a leaderboard with their results. See [this section](#-how-to-contribute) for details.
@@ -56,11 +61,11 @@ We provide several variants for each of the components in the unlearning pipeline

| **Component** | **Available Options** |
|------------------------|----------------------|
| **Benchmarks** | [TOFU](https://arxiv.org/abs/2401.06121), [MUSE](https://muse-bench.github.io/), [WMDP](https://www.wmdp.ai/) |
| **Unlearning Methods** | GradAscent, GradDiff, NPO, SimNPO, DPO, RMU |
| **Evaluation Metrics** | Verbatim Probability, Verbatim ROUGE, Knowledge QA-ROUGE, Model Utility, Forget Quality, TruthRatio, Extraction Strength, Exact Memorization, 6 MIA attacks, [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) |
| **Datasets** | MUSE-News (BBC), MUSE-Books (Harry Potter), TOFU (different splits) |
| **Model Families** | TOFU: LLaMA-3.2, LLaMA-3.1, LLaMA-2; MUSE: LLaMA-2; Additional: Phi-3.5, Phi-1.5, Gemma, Zephyr |

---

@@ -89,13 +94,15 @@ We provide several variants for each of the components in the unlearning pipeline
# Environment setup
conda create -n unlearning python=3.11
conda activate unlearning
pip install .[lm_eval]
pip install --no-build-isolation flash-attn==2.6.3

# Data setup
python setup_data.py --eval # saves/eval now contains evaluation results of the uploaded models
# This downloads log files with evaluation results (including retain model logs)
# into `saves/eval`, used for evaluating unlearning across supported benchmarks.
# Additional datasets (e.g., WMDP) are supported; run the command below to see options:
# python setup_data.py --help
```

---
@@ -202,14 +209,13 @@ If you use OpenUnlearning in your research, please cite OpenUnlearning and the benchmarks
booktitle={First Conference on Language Modeling},
year={2024}
}
@inproceedings{
shi2025muse,
title={{MUSE}: Machine Unlearning Six-Way Evaluation for Language Models},
author={Weijia Shi and Jaechan Lee and Yangsibo Huang and Sadhika Malladi and Jieyu Zhao and Ari Holtzman and Daogao Liu and Luke Zettlemoyer and Noah A. Smith and Chiyuan Zhang},
booktitle={The Thirteenth International Conference on Learning Representations},
year={2025},
url={https://openreview.net/forum?id=TArmA033BU}
}
```
</details>
9 changes: 9 additions & 0 deletions configs/data/datasets/WMDP_forget.yaml
@@ -0,0 +1,9 @@
WMDP_forget:
handler: PretrainingDataset
args:
hf_args:
path: "text"
data_files: "data/wmdp/wmdp-corpora/cyber-forget-corpus.jsonl"
split: "train"
text_key: "text"
max_length: 512
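
For reference, a minimal sketch of how these `hf_args` plausibly translate into a Hugging Face `datasets` call (an assumption about how the `PretrainingDataset` handler consumes them; tokenization and truncation to `max_length` happen downstream):

```python
from datasets import load_dataset

# Mirrors the hf_args above: the "text" builder reads each line of the corpus
# file into a "text" column, which text_key then selects.
ds = load_dataset(
    "text",
    data_files="data/wmdp/wmdp-corpora/cyber-forget-corpus.jsonl",
    split="train",
)
print(ds[0]["text"][:200])  # first 200 characters of the first corpus line
```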
9 changes: 9 additions & 0 deletions configs/data/datasets/WMDP_retain.yaml
@@ -0,0 +1,9 @@
WMDP_retain:
handler: PretrainingDataset
args:
hf_args:
path: "text"
data_files: "data/wmdp/wmdp-corpora/cyber-retain-corpus.jsonl"
split: "train"
text_key: "text"
max_length: 512
20 changes: 20 additions & 0 deletions configs/eval/lm_eval.yaml
@@ -0,0 +1,20 @@
# @package eval.lm_eval
# NOTE: the line above is not an ordinary comment; it sets the package for this config. See https://hydra.cc/docs/upgrades/0.11_to_1.0/adding_a_package_directive/

handler: LMEvalEvaluator
output_dir: ${paths.output_dir} # set to default eval directory
overwrite: false

# Define evaluation tasks here
tasks:
- mmlu
# - task: gsm8k
# dataset_path: gsm8k
# # define the entire task config.
# # ^ Example: https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/gsm8k/gsm8k.yaml


simple_evaluate_args:
batch_size: 16
system_instruction: null
apply_chat_template: false
1 change: 1 addition & 0 deletions configs/eval/muse.yaml
@@ -15,6 +15,7 @@ defaults:
# - mia_reference
# - mia_zlib
# - mia_gradnorm
# - forget_gibberish

handler: MUSEEvaluator
output_dir: ${paths.output_dir} # set to default eval directory
20 changes: 20 additions & 0 deletions configs/eval/muse_metrics/forget_gibberish.yaml
@@ -0,0 +1,20 @@
# @package eval.muse.metrics.forget_gibberish
defaults:
- .@pre_compute.forget_verbmem_ROUGE: forget_verbmem_ROUGE

pre_compute:
forget_verbmem_ROUGE:
access_key: text

handler: classifier_prob
batch_size: 32
max_length: 512
class_id: 0
text_key: generation
device: cuda

classifier_model_args:
pretrained_model_name_or_path: "madhurjindal/autonlp-Gibberish-Detector-492513457"

classifier_tokenization_args:
pretrained_model_name_or_path: "madhurjindal/autonlp-Gibberish-Detector-492513457"
1 change: 1 addition & 0 deletions configs/eval/tofu.yaml
@@ -17,6 +17,7 @@ defaults: # include all defined metrics files
# - mia_zlib
# - mia_gradnorm
# - mia_reference # set reference model path appropriately
# - forget_Q_A_gibberish

handler: TOFUEvaluator
output_dir: ${paths.output_dir} # set to default eval directory
20 changes: 20 additions & 0 deletions configs/eval/tofu_metrics/forget_Q_A_gibberish.yaml
@@ -0,0 +1,20 @@
# @package eval.tofu.metrics.forget_Q_A_gibberish
defaults:
- .@pre_compute.forget_Q_A_ROUGE: forget_Q_A_ROUGE

pre_compute:
forget_Q_A_ROUGE:
access_key: text

handler: classifier_prob
batch_size: 32
max_length: 512
class_id: 0
text_key: generation
device: cuda

classifier_model_args:
pretrained_model_name_or_path: "madhurjindal/autonlp-Gibberish-Detector-492513457"

classifier_tokenization_args:
pretrained_model_name_or_path: "madhurjindal/autonlp-Gibberish-Detector-492513457"
15 changes: 15 additions & 0 deletions configs/experiment/eval/wmdp/default.yaml
@@ -0,0 +1,15 @@
# @package _global_

defaults:
- override /model: zephyr-7b-beta
- override /eval: lm_eval

data_split: cyber

eval:
lm_eval:
tasks:
- wmdp_${data_split}
- mmlu

task_name: ???
58 changes: 58 additions & 0 deletions configs/experiment/unlearn/wmdp/default.yaml
@@ -0,0 +1,58 @@
# @package _global_

defaults:
- override /model: zephyr-7b-beta
- override /trainer: RMU
- override /data: unlearn
- override /data/datasets@data.forget: WMDP_forget
- override /data/datasets@data.retain: WMDP_retain
- override /eval: lm_eval

data_split: cyber

data:
anchor: forget
forget:
WMDP_forget:
args:
hf_args:
data_files: data/wmdp/wmdp-corpora/${data_split}-forget-corpus.jsonl
retain:
WMDP_retain:
args:
hf_args:
data_files: data/wmdp/wmdp-corpora/${data_split}-retain-corpus.jsonl

eval:
lm_eval:
tasks:
- wmdp_${data_split}
- mmlu


collator:
DataCollatorForSupervisedDataset:
args:
padding_side: left # Usually left, but for Mistral and Zephyr it's right (https://github.com/hongshi97/CAD/issues/2)

trainer:
args:
per_device_train_batch_size: 1
gradient_accumulation_steps: 16
learning_rate: 5e-5
eval_strategy: steps
eval_steps: 0.5
max_steps: 80
lr_scheduler_type: constant

method_args:
# These parameters depend heavily on the model and dataset. Tune them carefully for unlearning to work.
gamma: 1.0
steering_coeff: 2
retain_loss_type: EMBED_DIFF
alpha: 1
module_regex: model\.layers\.7
trainable_params_regex:
- model\.layers\.(5|6|7)\.mlp\.down_proj\.weight # If you want to update only these weights (as done in https://github.com/centerforaisafety/wmdp/blob/bc5e1ba0367ea826caeeeaa50656336a1e87acfb/rmu/unlearn.py#L26)

task_name: ???
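
As a rough illustration of how `module_regex` and `trainable_params_regex` restrict the RMU update to specific layers (the repo's trainer implements its own version of this; the snippet below is only a sketch of the idea):

```python
import re
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

trainable_params_regex = [r"model\.layers\.(5|6|7)\.mlp\.down_proj\.weight"]
module_regex = r"model\.layers\.7"

# Freeze everything except parameters whose full names match one of the regexes.
for name, param in model.named_parameters():
    param.requires_grad = any(re.fullmatch(p, name) for p in trainable_params_regex)

# Select the module whose hidden states the RMU forget/retain losses are computed on.
target_module = next(m for n, m in model.named_modules() if re.fullmatch(module_regex, n))
```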
15 changes: 15 additions & 0 deletions configs/model/zephyr-7b-beta.yaml
@@ -0,0 +1,15 @@
model_args:
pretrained_model_name_or_path: "HuggingFaceH4/zephyr-7b-beta"
attn_implementation: 'flash_attention_2'
torch_dtype: bfloat16
tokenizer_args:
pretrained_model_name_or_path: "HuggingFaceH4/zephyr-7b-beta"
template_args:
apply_chat_template: True
system_prompt: You are a helpful assistant.
system_prompt_with_special_tokens: "<|system|>\nYou are a helpful assistant.</s>\n"
user_start_tag: "<|user|>\n"
user_end_tag: "</s>"
asst_start_tag: "<|assistant|>\n"
asst_end_tag: "</s>"
date_string: 10 Apr 2025
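
A toy illustration of how these tags compose into a Zephyr-style prompt (the example question is made up, and the repo's templating code may assemble things differently, e.g., via the tokenizer's chat template):

```python
# Tag values copied from the config above.
system = "<|system|>\nYou are a helpful assistant.</s>\n"
user_start, user_end = "<|user|>\n", "</s>"
asst_start = "<|assistant|>\n"

question = "Who is Harry Potter?"
prompt = f"{system}{user_start}{question}{user_end}\n{asst_start}"
print(prompt)
```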
30 changes: 30 additions & 0 deletions docs/evaluation.md
@@ -240,3 +240,33 @@ metrics: {} # lists a mapping from each evaluation metric listed above to its config
output_dir: ${paths.output_dir} # set to default eval directory
forget_split: forget10
```

## lm-evaluation-harness

To evaluate model capabilities after unlearning, we support running [lm-eval-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/main) via our custom evaluator: [LMEvalEvaluator](../src/evals/lm_eval.py).
All evaluation tasks should be defined under the `tasks` key in [lm_eval.yaml](../configs/eval/lm_eval.yaml).

```yaml
# @package eval.lm_eval
# NOTE: the line above is not an ordinary comment; it sets the package for this config. See https://hydra.cc/docs/upgrades/0.11_to_1.0/adding_a_package_directive/

handler: LMEvalEvaluator
output_dir: ${paths.output_dir} # set to default eval directory
overwrite: false

# Define evaluation tasks here
tasks:
- mmlu
- wmdp_cyber
- task: gsm8k
dataset_path: gsm8k
# define the entire task config.
# ^ Example: https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/gsm8k/gsm8k.yaml



simple_evaluate_args:
batch_size: 16
system_instruction: null
apply_chat_template: false
```
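
Under the hood, these settings map onto lm-eval-harness's `simple_evaluate` API. A rough sketch of the equivalent direct call (assuming lm-eval 0.4.x; `LMEvalEvaluator` wires in the unlearned model and the config values for you, and the model id below is just an example):

```python
import lm_eval
from lm_eval.models.huggingface import HFLM
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceH4/zephyr-7b-beta"  # or a path to an unlearned checkpoint
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

results = lm_eval.simple_evaluate(
    model=HFLM(pretrained=model, tokenizer=tokenizer),
    tasks=["mmlu", "wmdp_cyber"],
    batch_size=16,
    system_instruction=None,
    apply_chat_template=False,
)
print(results["results"])
```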
16 changes: 10 additions & 6 deletions docs/links.md
@@ -5,12 +5,14 @@ Links to research papers and resources corresponding to implemented features in OpenUnlearning
---

## 📌 Table of Contents
- [🔗 Links and References](#-links-and-references)
- [📌 Table of Contents](#-table-of-contents)
- [📗 Implemented Methods](#-implemented-methods)
- [📘 Benchmarks](#-benchmarks)
- [📙 Evaluation Metrics](#-evaluation-metrics)
- [🌐 Useful Links](#-useful-links)
- [📚 Surveys](#-surveys)
- [🐙 Other GitHub Repositories](#-other-github-repositories)

---

@@ -32,6 +34,7 @@ Links to research papers and resources corresponding to implemented features in OpenUnlearning
|-----------|----------|
| TOFU | Paper [📄](https://arxiv.org/abs/2401.06121) |
| MUSE | Paper [📄](https://arxiv.org/abs/2407.06460) |
| WMDP | Paper [📄](https://arxiv.org/abs/2403.03218) |

---

@@ -45,6 +48,7 @@ Links to research papers and resources corresponding to implemented features in OpenUnlearning
| Forget Quality, Truth Ratio, Model Utility | TOFU ([📄](https://arxiv.org/abs/2401.06121)) |
| Extraction Strength (ES) | Carlini et al., 2021 ([📄](https://www.usenix.org/conference/usenixsecurity21/presentation/carlini-extracting)), used for unlearning in Wang et al., 2025 ([📄](https://openreview.net/pdf?id=wUtCieKuQU)) |
| Exact Memorization (EM) | Tirumala et al., 2022 ([📄](https://proceedings.neurips.cc/paper_files/paper/2022/hash/fa0509f4dab6807e2cb465715bf2d249-Abstract-Conference.html)), used for unlearning in Wang et al., 2025 ([📄](https://openreview.net/pdf?id=wUtCieKuQU)) |
| lm-evaluation-harness | [💻](https://github.com/EleutherAI/lm-evaluation-harness/tree/main) |

---

3 changes: 3 additions & 0 deletions setup.py
@@ -17,6 +17,9 @@
packages=find_packages(),
install_requires=requirements, # Uses requirements.txt
extras_require={
"lm-eval": [
"lm-eval==0.4.8",
], # Install using `pip install .[lm-eval]`
"dev": [
"pre-commit==4.0.1",
"ruff==0.6.9",