
If you like our project, please give us a star ⭐ on GitHub to follow the latest updates.

RepreGuard

This repository provides the official implementation of RepreGuard, a robust and efficient framework for detecting LLM-generated text (LGT) by leveraging hidden representation patterns. RepreGuard achieves state-of-the-art performance in both in-domain and out-of-domain settings.


📣 News

  • [2025.08.01] 🎉 Our paper has been accepted by Transactions of the Association for Computational Linguistics (TACL 2025)!

🧐 Overview

RepreGuard is based on the hypothesis that the internal representation patterns of LLMs, when processing human-written texts (HWT) and LLM-generated texts (LGT), are distinct and can be systematically modeled. By employing a surrogate model as an observer, we extract neural activation patterns and identify discriminative features. The resulting RepreScore enables robust classification between HWT and LGT with minimal training data.

  • Zero-shot detection: Only a small sample of LGT/HWT pairs is needed for threshold calibration.
  • Strong OOD robustness: Outperforms all previous methods across different models, domains, text sizes, and attacks.
  • Resource-efficient: Competitive performance even with smaller surrogate models.
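
The sketch below illustrates the core idea in Python: use a surrogate model to extract hidden-state activations for paired HWT/LGT samples, fit a discriminative direction, and score new texts by projecting their activations onto it. This is a minimal conceptual sketch, not the official implementation; the function names, mean-pooling, and single-component PCA step are illustrative assumptions (see repreGuard_evaluation.py for the actual pipeline).

# Conceptual sketch of representation-based scoring (illustrative, not the official code).
import numpy as np
import torch
from sklearn.decomposition import PCA
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.1-8B"  # surrogate "observer" model, as in the example below
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto", output_hidden_states=True
)

@torch.no_grad()
def mean_hidden_state(text: str, layer: int = -1) -> np.ndarray:
    """Mean-pooled hidden state of one text at a chosen layer."""
    ids = tok(text, return_tensors="pt", truncation=True, max_length=512).to(model.device)
    hidden = model(**ids).hidden_states[layer]  # (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0).float().cpu().numpy()

def fit_direction(hwt_texts, lgt_texts) -> np.ndarray:
    """Fit a discriminative direction from paired HWT/LGT activations (illustrative)."""
    diffs = [mean_hidden_state(l) - mean_hidden_state(h) for h, l in zip(hwt_texts, lgt_texts)]
    return PCA(n_components=1).fit(np.stack(diffs)).components_[0]

def repre_score(text: str, direction: np.ndarray) -> float:
    """Project a text's activation onto the direction; higher means more LGT-like here."""
    return float(mean_hidden_state(text) @ direction)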

⚙️ Datasets, Environment and Experimental Reproduction

Datasets

We use two datasets. The DetectRL dataset (XSum, Writing Prompts, Yelp Review, and arXiv abstracts) covers diverse domains and contains HWT/LGT pairs generated by ChatGPT, Claude, Google-PaLM, and Llama-2-70b. The RAID dataset (llama-chat, mistral-chat, mpt-chat, mistral, mpt, and gpt2) covers both greedy and sampling decoding strategies, with and without repetition penalties. The data is available at https://drive.google.com/file/d/1mdTBSRLTo7ob-ofithewt8eDTEzZpb7V/view

wget "https://drive.google.com/uc?id=1mdTBSRLTo7ob-ofithewt8eDTEzZpb7V" -O datasets.zip
unzip datasets.zip
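
After unzipping, a quick sanity check in Python (assuming the files are plain JSON, as the extension suggests) confirms the data is in place:

# Quick sanity check on one of the dataset files (path taken from the command below).
import json

path = "datasets/detectrl_dataset/main_dataset/detectrl_train_dataset_llm_type_ChatGPT.json"
with open(path) as f:
    data = json.load(f)
print(f"Loaded {len(data)} records from {path}")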

Environment

conda create -n repre_guard python==3.10
conda activate repre_guard
pip install -r requirements.txt
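
Optionally, verify the key dependencies after installation. This assumes requirements.txt pins torch and transformers, which loading the surrogate models requires:

# Optional environment check (assumes torch and transformers are in requirements.txt).
import torch
import transformers

print("torch", torch.__version__)
print("transformers", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())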

Running RepreGuard

python3 repreGuard_evaluation.py \
    --model_name_or_path meta-llama/Llama-3.1-8B \
    --train_data_path datasets/detectrl_dataset/main_dataset/detectrl_train_dataset_llm_type_ChatGPT.json \
    --test_data_paths datasets/detectrl_dataset/main_dataset/detectrl_test_dataset_llm_type_ChatGPT.json,datasets/detectrl_dataset/main_dataset/detectrl_test_dataset_llm_type_Google-PaLM.json,datasets/detectrl_dataset/main_dataset/detectrl_test_dataset_llm_type_Claude-instant.json,datasets/detectrl_dataset/main_dataset/detectrl_test_dataset_llm_type_Llama-2-70b.json \
    --ntrain 512 \
    --batch_size 8 \
    --rep_token 0.1 \
    --bootstrap_iter -1

Surrogate Model Selection

You can specify the surrogate LLM (e.g., Llama-3-8B, Phi-2, Gemma-2B-Instruct) via the --model_name_or_path argument, as in the example above.
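
To compare surrogates, one option is a small driver script that re-runs the documented command for each candidate, as in the sketch below. The Hugging Face model identifiers are assumptions (substitute the IDs you actually use); the flags mirror the example above.

# Illustrative driver: re-run the evaluation with different surrogate models.
import subprocess

SURROGATES = [
    "meta-llama/Llama-3.1-8B",  # default from the example above
    "microsoft/phi-2",          # assumed Hugging Face ID for Phi-2
    "google/gemma-2b-it",       # assumed Hugging Face ID for Gemma-2B-Instruct
]

for surrogate in SURROGATES:
    subprocess.run([
        "python3", "repreGuard_evaluation.py",
        "--model_name_or_path", surrogate,
        "--train_data_path", "datasets/detectrl_dataset/main_dataset/detectrl_train_dataset_llm_type_ChatGPT.json",
        "--test_data_paths", "datasets/detectrl_dataset/main_dataset/detectrl_test_dataset_llm_type_ChatGPT.json",
        "--ntrain", "512",
        "--batch_size", "8",
        "--rep_token", "0.1",
        "--bootstrap_iter", "-1",
    ], check=True)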


🧪 Overall Results

Detector     ID      OOD     16-shots   Text w/ Attack   Text w/ Various Size   Text w/ Various Sampling Methods
RoBERTa      90.85   84.26   83.60      72.97            46.81                  82.17
Binoculars   88.18   86.19   88.07      78.15            81.70                  85.63
RepreGuard   96.34   93.49   90.21      92.61            84.61                  92.05
  • Attack: paraphrase and perturbation attacks.

  • Various Size: text lengths from 64 to 256 tokens.

  • Various Sampling Methods: greedy and sampling decoding strategies, with and without repetition penalties.

Please see the paper for more detailed results.

✏️ Citation

If you find our paper/code useful, please cite us and give us a ⭐!

@article{chen2025repreguard,
  author       = {Xin Chen and Junchao Wu and Shu Yang and Runzhe Zhan and Zeyu Wu and Ziyang Luo and Di Wang and Min Yang and Lidia S. Chao and Derek F. Wong},
  title        = {RepreGuard: Detecting LLM-Generated Text by Revealing Hidden Representation Patterns},
  journal      = {Transactions of the Association for Computational Linguistics},
  year         = {2025},
  url          = {https://github.com/NLP2CT/RepreGuard},
  note         = {Accepted at TACL 2025}
}
