diff --git a/.github/workflows/tests.yml b/.github/workflows/tests.yml index f1b0afa..5003e56 100644 --- a/.github/workflows/tests.yml +++ b/.github/workflows/tests.yml @@ -1,11 +1,11 @@ name: tests on: - # push: - # paths: - # - "**.py" - # - "requirements.txt" - # - ".github/workflows/*.yml" + push: + paths: + - "**.py" + - "requirements.txt" + - ".github/workflows/*.yml" pull_request: paths: - "**.py" diff --git a/README.md b/README.md index 3138535..4868757 100644 --- a/README.md +++ b/README.md @@ -5,15 +5,12 @@

An easily extensible framework unifying LLM unlearning evaluation benchmarks.

-  Build Status
-  Hugging Face
-  GitHub Repo stars
+  GitHub Repo stars
+  Build Status
+  HuggingFace 🤗
+  GitHub repo size
+  GitHub top language
+  License: MIT
@@ -30,7 +27,7 @@ We invite the LLM unlearning community to collaborate by adding new benchmarks, ### 📢 Updates #### [Apr 6, 2025] -⚠️⚠️ **IMPORTANT:** Be sure to run `python setup_data.py` immediately after merging the latest version. This is required to refresh the downloaded eval log files and ensure they're compatible with the latest evaluation metrics. +🚨🚨 **IMPORTANT:** 🚨🚨 Be sure to run `python setup_data.py` immediately after merging the latest version. This is required to refresh the downloaded eval log files and ensure they're compatible with the latest evaluation metrics. - **More Metrics!** Added 6 Membership Inference Attacks (MIA) (LOSS, ZLib, Reference, GradNorm, MinK, and MinK++), along with Extraction Strength (ES) and Exact Memorization (EM) as additional evaluation metrics. - **More TOFU Evaluations!** Now includes a holdout set and supports MIA attack-based evaluation. You can now compute MUSE's privleak on TOFU. - **More Documentation!** [`docs/links.md`](docs/links.md) contains resources for each of the implemented features and other useful LLM unlearning resources. @@ -89,13 +86,13 @@ We provide several variants for each of the components in the unlearning pipelin ## ⚡ Quickstart ```bash -# environment setup +# Environment setup conda create -n unlearning python=3.11 conda activate unlearning pip install . pip install --no-build-isolation flash-attn==2.6.3 -# data setup +# Data setup python setup_data.py # saves/eval now contains evaluation results of the uploaded models # Downloads log files with metric eval results (incl retain model logs) from the models # used in the supported benchmarks. @@ -175,7 +172,7 @@ For more in-depth information on specific aspects of the framework, refer to the | [`docs/contributing.md`](docs/contributing.md) | Instructions on how to add new methods, benchmarks, components such as trainers, benchmarks, metrics, models, datasets, etc. | | [`docs/evaluation.md`](docs/evaluation.md) | Detailed instructions on creating and running evaluation metrics and benchmarks. | | [`docs/experiments.md`](docs/experiments.md) | Guide on running experiments in various configurations and settings, including distributed training, fine-tuning, and overriding arguments. | -| [`docs/hydra.md`](docs/hydra.md) | Explanation of the Hydra features used in configuration management for experiments. | +| [`docs/hydra.md`](docs/hydra.md) | A short tutorial on the features of Hydra, the configuration management package we use extensively. | | [`community/leaderboard.md`](community/leaderboard.md) | Reference results from various unlearning methods run using this framework on TOFU and MUSE benchmarks. | | [`docs/links.md`](docs/links.md) | List of all links to the research papers or other sources the implemented features are sourced from. | | [`docs/repro.md`](docs/repro.md) | Results are provided solely for reproducibility purposes, without any parameter tuning. 
| @@ -193,26 +190,25 @@ If you use OpenUnlearning in your research, please cite OpenUnlearning and the b ```bibtex @misc{openunlearning2025, - title={OpenUnlearning: A Unified Framework for LLM Unlearning Benchmarks}, + title={{OpenUnlearning}: A Unified Framework for LLM Unlearning Benchmarks}, author={Dorna, Vineeth and Mekala, Anmol and Zhao, Wenlong and McCallum, Andrew and Kolter, J Zico and Maini, Pratyush}, year={2025}, howpublished={\url{https://github.com/locuslab/open-unlearning}}, note={Accessed: February 27, 2025} } @inproceedings{maini2024tofu, - title={TOFU: A Task of Fictitious Unlearning for LLMs}, + title={{TOFU}: A Task of Fictitious Unlearning for LLMs}, author={Maini, Pratyush and Feng, Zhili and Schwarzschild, Avi and Lipton, Zachary Chase and Kolter, J Zico}, booktitle={First Conference on Language Modeling}, year={2024} } -@article{shi2024muse, - title={MUSE: Machine Unlearning Six-Way Evaluation for Language Models}, +@inproceedings{ + shi2025muse, + title={{MUSE}: Machine Unlearning Six-Way Evaluation for Language Models}, author={Weijia Shi and Jaechan Lee and Yangsibo Huang and Sadhika Malladi and Jieyu Zhao and Ari Holtzman and Daogao Liu and Luke Zettlemoyer and Noah A. Smith and Chiyuan Zhang}, - year={2024}, - eprint={2407.06460}, - archivePrefix={arXiv}, - primaryClass={cs.CL}, - url={https://arxiv.org/abs/2407.06460}, + booktitle={The Thirteenth International Conference on Learning Representations}, + year={2025}, + url={https://openreview.net/forum?id=TArmA033BU} } ``` @@ -231,6 +227,4 @@ This project is licensed under the MIT License. See the [`LICENSE`](LICENSE) fil --- -### Star History - [![Star History Chart](https://api.star-history.com/svg?repos=locuslab/open-unlearning&type=Date)](https://www.star-history.com/#locuslab/open-unlearning&Date) diff --git a/community/benchmarks/template/README.md b/community/benchmarks/template/README.md index 855952f..15ec35b 100644 --- a/community/benchmarks/template/README.md +++ b/community/benchmarks/template/README.md @@ -26,7 +26,7 @@ Please include the experimental setup for the baselines - [ ] **Hyperparameters & Search Space:** Specify key hyperparameters, their search ranges, number of trials etc. - [ ] **Computational Setup:** Mention the type and number of GPUs used. -- [ ] **DeepSpeed Configuration:** If any modifications were made to the default DeepSpeed config, specify them here. (You may include the config as a code block.) +- [ ] **DeepSpeed Configuration** (if used): If any modifications were made to the default DeepSpeed config, specify them here. (You may include the config as a code block.) - [ ] **Other Details:** Any additional setup details crucial for reproducing your method. To replicate your results, provide a `run.sh` script that contains all necessary commands to reproduce the final results. Ensure the script is well-documented. diff --git a/community/methods/template/README.md b/community/methods/template/README.md index 7facb01..6c77875 100644 --- a/community/methods/template/README.md +++ b/community/methods/template/README.md @@ -11,7 +11,7 @@ Please include the experimental setup such as - [ ] **Hyperparameters & Search Space:** Specify key hyperparameters, their search ranges, number of trials etc. - [ ] **Computational Setup:** Mention the type and number of GPUs used. -- [ ] **DeepSpeed Configuration:** If any modifications were made to the default DeepSpeed config, specify them here. (You may include the config as a code block.) 
+- [ ] **DeepSpeed Configuration** (if used): If any modifications were made to the default DeepSpeed config, specify them here. (You may include the config as a code block.) - [ ] **Other Details:** Any additional setup details crucial for reproducing your method. # Results diff --git a/configs/experiment/examples/muse_unlearn.yaml b/configs/experiment/examples/muse_unlearn.yaml index 07c6d12..0e6b2b6 100644 --- a/configs/experiment/examples/muse_unlearn.yaml +++ b/configs/experiment/examples/muse_unlearn.yaml @@ -22,13 +22,15 @@ trainer: per_device_train_batch_size: 4 per_device_eval_batch_size: 16 gradient_accumulation_steps: 8 - learning_rate: 1.0e-05 + learning_rate: 3.0e-05 bf16: true bf16_full_eval: true logging_steps: 5 output_dir: ${paths.output_dir} logging_dir: ${trainer.args.output_dir}/logs report_to: tensorboard + ddp_find_unused_parameters: None + gradient_checkpointing: false optim: paged_adamw_32bit save_strategy: 'no' save_only_model: true @@ -53,22 +55,20 @@ data: args: hf_args: path: muse-bench/MUSE-News - name: train + name: raw split: ${forget_split} text_key: text - max_length: 128 - insert_space: true + max_length: 2048 retain: MUSE_retain: handler: PretrainingDataset args: hf_args: path: muse-bench/MUSE-News - name: train + name: raw split: ${retain_split} text_key: text - max_length: 128 - insert_space: true + max_length: 2048 anchor: forget collator: DataCollatorForSupervisedDataset: @@ -119,64 +119,144 @@ eval: handler: rouge rouge_type: rougeL_f1 batch_size: 16 + retain_knowmem_ROUGE: + datasets: + MUSE_retain_knowmem: + handler: QADataset + args: + hf_args: + path: muse-bench/MUSE-${eval.muse.data_split} + name: knowmem + split: retain_qa + few_shot_dataset_hf_args: + path: muse-bench/MUSE-${eval.muse.data_split} + name: knowmem + split: retain_qa_icl + question_key: question + answer_key: answer + max_length: 512 + predict_with_generate: true + collators: + DataCollatorForSupervisedDataset: + handler: DataCollatorForSupervisedDataset + args: + padding_side: left + index: index + generation_args: + do_sample: false + top_p: null + temperature: null + max_new_tokens: 32 + use_cache: true + stopwords: + - ' + + + ' + - ' + + Question' + - 'Question:' + handler: rouge + rouge_type: rougeL_f1 + batch_size: 16 + forget_verbmem_ROUGE: + datasets: + MUSE_forget_verbmem: + handler: CompletionDataset + args: + hf_args: + path: muse-bench/MUSE-${eval.muse.data_split} + name: verbmem + split: forget + prefix_key: prompt + text_key: gt + max_length: 2048 + insert_space: true + predict_with_generate: true + collators: + DataCollatorForSupervisedDataset: + handler: DataCollatorForSupervisedDataset + args: + padding_side: left + index: index + generation_args: + do_sample: false + top_p: null + temperature: null + max_new_tokens: 128 + use_cache: true + handler: rouge + rouge_type: rougeL_f1 + batch_size: 8 privleak: pre_compute: - forget_minKpc_neg_logprob: + mia_min_k: datasets: - MUSE_forget_privleak: - handler: PretrainingDataset + MUSE_MIA_holdout: + access_key: holdout + handler: CompletionDataset args: hf_args: path: muse-bench/MUSE-${eval.muse.data_split} name: privleak - split: forget + split: holdout prefix_key: prompt text_key: text - collators: - DataCollatorForSupervisedDataset: - handler: DataCollatorForSupervisedDataset - args: - padding_side: right - index: index - handler: minKpc_negative_logprob - batch_size: 8 - k: 0.4 - access_key: forget - holdout_minKpc_neg_logprob: - datasets: - MUSE_holdout_privleak: - handler: PretrainingDataset + max_length: 2048 + 
MUSE_MIA_forget: + access_key: forget + handler: CompletionDataset args: hf_args: path: muse-bench/MUSE-${eval.muse.data_split} name: privleak - split: holdout + split: forget prefix_key: prompt text_key: text + max_length: 2048 collators: DataCollatorForSupervisedDataset: handler: DataCollatorForSupervisedDataset args: padding_side: right index: index - handler: minKpc_negative_logprob batch_size: 8 + handler: mia_min_k k: 0.4 - access_key: holdout + access_key: forget reference_logs: retain_model_logs: path: ${eval.muse.retain_logs_path} include: - forget_minKpc_neg_logprob: + mia_min_k: access_key: retain - holdout_minKpc_neg_logprob: - access_key: holdout handler: privleak ref_value: 0.5 + extraction_strength: + datasets: + MUSE_forget_verbmem: + handler: CompletionDataset + args: + hf_args: + path: muse-bench/MUSE-${eval.muse.data_split} + name: verbmem + split: forget + prefix_key: prompt + text_key: gt + max_length: 2048 + insert_space: true + collators: + DataCollatorForSupervisedDataset: + handler: DataCollatorForSupervisedDataset + args: + padding_side: right + index: index + handler: extraction_strength + batch_size: 8 handler: MUSEEvaluator - device: cuda output_dir: ${paths.output_dir} - overwrite: false + overwrite: true data_split: ${data_split} retain_logs_path: ${retain_logs_path} paths: @@ -188,6 +268,6 @@ paths: data_split: News forget_split: forget retain_split: retain1 -retain_logs_path: saves/eval/muse_news_retain/MUSE_EVAL.json -task_name: llama2_news_NPO +retain_logs_path: saves/eval/muse_Llama-2-7b-hf_News_retrain/MUSE_EVAL.json +task_name: muse_npo_unlearn mode: unlearn diff --git a/configs/experiment/examples/tofu_eval.yaml b/configs/experiment/examples/tofu_eval.yaml index c43c8c4..0100d79 100644 --- a/configs/experiment/examples/tofu_eval.yaml +++ b/configs/experiment/examples/tofu_eval.yaml @@ -1,22 +1,90 @@ model: model_args: - device_map: auto - pretrained_model_name_or_path: locuslab/tofu_ft_llama2-7b + device_map: cuda + pretrained_model_name_or_path: open-unlearning/tofu_Llama-3.2-1B-Instruct_full attn_implementation: flash_attention_2 torch_dtype: bfloat16 tokenizer_args: - pretrained_model_name_or_path: locuslab/tofu_ft_llama2-7b + pretrained_model_name_or_path: meta-llama/Llama-3.2-1B-Instruct template_args: - apply_chat_template: false - user_start_tag: '[INST] ' - user_end_tag: ' [/INST]' - asst_start_tag: '' - asst_end_tag: '' + apply_chat_template: true + system_prompt: You are a helpful assistant. 
+ system_prompt_with_special_tokens: '<|begin_of_text|><|start_header_id|>system<|end_header_id|> + + + You are a helpful assistant.<|eot_id|>' + user_start_tag: '<|start_header_id|>user<|end_header_id|> + + + ' + user_end_tag: <|eot_id|> + asst_start_tag: '<|start_header_id|>assistant<|end_header_id|> + + + ' + asst_end_tag: <|eot_id|> mode: eval -task_name: eval +task_name: SAMPLE_EVAL +seed: 0 eval: tofu: metrics: + forget_quality: + pre_compute: + forget_truth_ratio: + pre_compute: + forget_Q_A_PARA_Prob: + datasets: + TOFU_QA_forget_para: + handler: QADataset + args: + hf_args: + name: ${eval.tofu.forget_split}_perturbed + split: train + path: locuslab/TOFU + question_key: question + answer_key: paraphrased_answer + max_length: 512 + collators: + DataCollatorForSupervisedDataset: + handler: DataCollatorForSupervisedDataset + args: + padding_side: right + index: index + handler: probability + batch_size: 32 + access_key: correct + forget_Q_A_PERT_Prob: + datasets: + TOFU_QA_forget_pert: + handler: QADataset + args: + hf_args: + name: ${eval.tofu.forget_split}_perturbed + split: train + path: locuslab/TOFU + question_key: question + answer_key: perturbed_answer + max_length: 512 + collators: + DataCollatorForSupervisedDataset: + handler: DataCollatorForSupervisedDataset + args: + padding_side: right + index: index + handler: probability + batch_size: 32 + access_key: wrong + handler: truth_ratio + aggregator: closer_to_1_better + access_key: forget + reference_logs: + retain_model_logs: + path: ${eval.tofu.retain_logs_path} + include: + forget_truth_ratio: + access_key: retain + handler: ks_test forget_Q_A_Prob: datasets: TOFU_QA_forget: @@ -37,38 +105,11 @@ eval: index: index handler: probability batch_size: 32 - forget_Q_A_ROUGE: - datasets: - TOFU_QA_forget: - handler: QADataset - args: - hf_args: - name: ${eval.tofu.forget_split} - split: train - path: locuslab/TOFU - question_key: question - answer_key: answer - max_length: 512 - predict_with_generate: true - collators: - DataCollatorForSupervisedDataset: - handler: DataCollatorForSupervisedDataset - args: - padding_side: left - index: index - generation_args: - do_sample: false - top_p: null - temperature: null - max_new_tokens: 200 - use_cache: true - handler: rouge - rouge_type: rougeL_recall - batch_size: 32 handler: TOFUEvaluator output_dir: ${paths.output_dir} overwrite: false forget_split: ${forget_split} + holdout_split: ${holdout_split} retain_logs_path: ${retain_logs_path} paths: root_dir: . @@ -77,4 +118,5 @@ paths: output_dir: ${paths.root_dir}/saves/${mode}/${task_name} work_dir: ${hydra:runtime.cwd} forget_split: forget10 -retain_logs_path: null +holdout_split: holdout10 +retain_logs_path: saves/eval/tofu_Llama-3.2-1B-Instruct_retain90/TOFU_EVAL.json diff --git a/docs/experiments.md b/docs/experiments.md index 4aa1462..728e61b 100644 --- a/docs/experiments.md +++ b/docs/experiments.md @@ -63,7 +63,7 @@ paths.output_dir=saves/unlearn/NPO/evals > [!NOTE] -The unlearning experiments support evaluation during the unlearning finetuning. But this is supported only on a single GPU When multiple GPUs are used to train, checkpoints must be stored and evaluated after training. +The unlearning experiments support evaluation during the unlearning finetuning. However, this is supported only when a single accelerator process is used; when training with multiple processes, checkpoints must be stored and evaluated after training. 
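For instance, a minimal sketch of the two-step workflow this note implies, assuming an illustrative task name, the default accelerate config referenced in this doc, and the `saves/<mode>/<task_name>` output layout (the paths below are placeholders, not from the repository):

```bash
# Step 1: multi-process unlearning run (in-training evaluation is skipped here)
CUDA_VISIBLE_DEVICES=0,1 accelerate launch \
  --config_file configs/accelerate/default_config.yaml \
  src/train.py --config-name=unlearn.yaml experiment=unlearn/muse/default.yaml \
  task_name=MY_UNLEARN_RUN

# Step 2: evaluate the saved checkpoint separately, with a single process/GPU
CUDA_VISIBLE_DEVICES=0 python src/eval.py experiment=eval/muse/default.yaml \
  task_name=MY_UNLEARN_RUN_eval \
  model.model_args.pretrained_model_name_or_path=saves/unlearn/MY_UNLEARN_RUN
```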
--- @@ -74,29 +74,6 @@ To understand the structure of an evaluation config and the kind of available pa To understand the structure of an unlearning config and the kind of available parameters for overriding, refer to: [`configs/experiment/examples/muse_unlearn.yaml`](../configs/experiment/examples/muse_unlearn.yaml). The following tables list the most commonly used arguments while running experiments. - ### Model Settings
@@ -234,11 +211,9 @@ python src/train.py --config-name=train.yaml experiment=finetune/tofu/default \ trainer.args.learning_rate=5e-5 task_name=llama3.2-1B_finetune_example ``` - - ## Distributed Training -Distributed training configurations enable scaling experiments across multiple devices or nodes. In most cases, default distributed settings from [`configs/accelerate/default_config.yaml`](../configs/accelerate/default_config.yaml) are sufficient. You can run distributed training with a default command such as: +Distributed training configurations enable scaling experiments across multiple devices or nodes. In most cases, default distributed settings from [`configs/accelerate/default_config.yaml`](../configs/accelerate/default_config.yaml) are sufficient. You can run distributed training with the command below, which uses DeepSpeed (our default setup): ```bash CUDA_VISIBLE_DEVICES=0,1 accelerate launch \ @@ -246,9 +221,12 @@ CUDA_VISIBLE_DEVICES=0,1 accelerate launch \ src/train.py --config-name=unlearn.yaml experiment=unlearn/muse/default.yaml task_name=DISTRIBUTED_TRAIN ``` +You may also simply run `CUDA_VISIBLE_DEVICES=0,1,.. python ...` to leverage Accelerate's DDP or model parallelism. For model parallelism, you can set `device_map="auto"` in the `model_args` while loading the model. + > [!CAUTION] -> Evaluation runs are designed to work only a single GPU (this includes running evaluation during training). To run an evaluation job, modify your command to make only one GPU visible (assuming one GPU is enough for inference), as shown below +> Training runs that use multiple accelerate processes cannot run evaluations during training. To evaluate during training, you may want to use DDP/model parallelism instead (see #94); otherwise, run the evaluation code directly on a saved model checkpoint with a single GPU, as shown below ```bash -CUDA_VISIBLE_DEVICES=0 python src/eval.py experiment=eval/muse/default.yaml task_name=SAMPLE_EVAL +CUDA_VISIBLE_DEVICES=0 python src/eval.py experiment=eval/muse/default.yaml task_name=SAMPLE_EVAL \ +model.model_args.pretrained_model_name_or_path=saves/unlearn/muse_unlearn_exp \ ``` diff --git a/docs/hydra.md b/docs/hydra.md index 1ce15e1..6e2b080 100644 --- a/docs/hydra.md +++ b/docs/hydra.md @@ -10,8 +10,7 @@ We use this config file for illustration, from [`configs/experiment/unlearn/muse defaults: - override /model: Llama-2-7b-hf # loads from model/Llama-2-7b-hf.yaml into the model attribute - override /trainer: GradAscent # loads from trainer/GradAscent.yaml into the trainer attribute -- override /data: unlearn # loads from data/unlearn.yaml into the data attribute -# , setting up data structure for loading data during unlearning +- override /data: unlearn # loads from data/unlearn.yaml into the "data" attribute, setting up data structures for loading datasets during unlearning - override /eval: muse # loads MUSE evaluation suite from eval/muse.yaml into the eval attribute # define variables @@ -57,6 +56,7 @@ trainer: task_name: ??? # ??? raises an error if this attribute is not set ``` + - **Structure & Attribute Access:** Configs are written in YAML and structured hierarchically like a dictionary. Attributes are accessed using dot notation: in code, `cfg.model.args.learning_rate`; on the command line, `model.args.learning_rate=1e-5`. - **Defaults & Overrides:** Config files are included in one another using `defaults` and `override` commands. 
@@ -72,9 +72,26 @@ task_name=unlearn_muse_simnpo For example, refer to [`configs/eval/muse_metrics/forget_knowmem_ROUGE.yaml`](../configs/eval/muse_metrics/forget_knowmem_ROUGE.yaml) -- **Variable Substitution:** Variables are defined once and reused using the `${}` syntax: +- **Variable Substitution:** Variables are defined once and reused using the `${}` syntax. +- **Adding New Attributes with `+`:** Use the `+` prefix to add attributes that are not already in the config. For example, to add a new argument to the trainer: +```bash +python src/train.py experiment=unlearn/muse/default +trainer.args.my_new_arg=10 +``` -To understand the structure of an evaluation config and the available parameters for overriding, refer to: [`configs/experiment/examples/tofu_eval.yaml`](../configs/experiment/examples/tofu_eval.yaml). +- **Attribute Removal with `~`:** You can remove an attribute from the config at runtime using the tilde `~`. For example, to remove the flash attention setting: +```bash +python src/train.py experiment=unlearn/muse/default ~model.model_args.attn_implementation +``` +> [!NOTE] +> In `zsh`, you must **quote** or **escape** the `~` to avoid it being misinterpreted as the home directory, e.g.: +```bash +python src/train.py \~model.model_args.attn_implementation +python src/train.py "~model.model_args.attn_implementation" +``` +> [!NOTE] +> Hydra uses PyYAML to handle YAML files and transform config inputs. This handles cases like converting `true` to Python's `True`. -To understand the structure of an unlearning config and the available parameters for overriding, refer to: [`configs/experiment/examples/muse_unlearn.yaml`](../configs/experiment/examples/muse_unlearn.yaml). \ No newline at end of file +Refer to the following for config structures and overridable parameters: +- Evaluation: [`configs/experiment/examples/tofu_eval.yaml`](../configs/experiment/examples/tofu_eval.yaml) +- Unlearning: [`configs/experiment/examples/muse_unlearn.yaml`](../configs/experiment/examples/muse_unlearn.yaml) \ No newline at end of file diff --git a/src/model/__init__.py b/src/model/__init__.py index 4143cfc..0ccc9f4 100644 --- a/src/model/__init__.py +++ b/src/model/__init__.py @@ -13,7 +13,7 @@ def get_dtype(model_args): with open_dict(model_args): torch_dtype = model_args.pop("torch_dtype", None) - if model_args["attn_implementation"] == "flash_attention_2": + if model_args.get("attn_implementation", None) == "flash_attention_2": # This check handles https://github.com/Dao-AILab/flash-attention/blob/7153673c1a3c7753c38e4c10ef2c98a02be5f778/flash_attn/flash_attn_triton.py#L820 # If you want to run at other precisions consider running "training or inference using # Automatic Mixed-Precision via the `with torch.autocast(device_type='torch_device'):` diff --git a/src/trainer/base.py b/src/trainer/base.py index ddda956..c9cfdce 100644 --- a/src/trainer/base.py +++ b/src/trainer/base.py @@ -47,7 +47,7 @@ def evaluate( self.log(eval_metrics) else: logger.warning( - "Custom evaluator can be run with this Trainer only on a single GPU" + "Custom evaluator can be run with this Trainer only when a single accelerator process is running." ) return eval_metrics
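On the `src/model/__init__.py` hunk above: switching from `model_args["attn_implementation"]` to `model_args.get("attn_implementation", None)` makes the flash-attention check tolerant of configs that omit the key. A minimal self-contained sketch of the difference, using a hypothetical helper rather than the repository's `get_dtype`:

```python
# Hypothetical illustration of indexing vs. .get() on a config dict; not repository code.

def uses_flash_attention(model_args: dict) -> bool:
    # model_args["attn_implementation"] raises KeyError when the key is absent;
    # .get() returns the default (None), so the comparison simply evaluates to False.
    return model_args.get("attn_implementation", None) == "flash_attention_2"

print(uses_flash_attention({"torch_dtype": "bfloat16"}))                    # False (no KeyError)
print(uses_flash_attention({"attn_implementation": "flash_attention_2"}))   # True
```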