Skip to content

[ACL 2025] The official implementation of the paper "PIGuard: Prompt Injection Guardrail via Mitigating Overdefense for Free".

License

Notifications You must be signed in to change notification settings

leolee99/PIGuard

Repository files navigation

PIGuard: Prompt Injection Guardrail via Mitigating Overdefense for Free

Hao Li*, Xiaogeng Liu* (*Equal Contribution), Ning Zhang, Chaowei Xiao.


huggingface huggingface Github license Github license

This repository hosts the official code, data and model weights of PIGuard, the first prompt guard model against prompt injection to be built with open-source training data and detailed documentation, consistently achieving remarkable performance in benign, malicious, and over-defense accuracy.

Perfomance Comparison

Note: Due to some licensing issues, the model name has been changed from InjecGuard to PIGuard. We apologize for any inconvenience this may have caused.

🎉 News

  • [2025.5.15] 🎉🎉 Our paper has been accepted to ACL 2025!
  • [2025.4.21] 🤗 Our model has been released on Huggingface, you can quickly deploy PIGuard now!
  • [2024.10.28] 📷 Provide an online demo of PIGuard.
  • [2024.10.27] 🤗 Release the NotInject dataset.
  • [2024.10.27] 🛠️ Release the code of PIGuard.

Abstract

Prompt injection attacks pose a critical threat to large language models (LLMs), enabling goal hijacking and data leakage. Prompt guard models, though effective in defense, suffer from over-defense—falsely flagging benign inputs as malicious due to trigger word bias. To address this issue, we introduce NotInject, an evaluation dataset that systematically measures over-defense across various prompt guard models. NotInject contains 339 benign samples enriched with trigger words common in prompt injection attacks, enabling fine-grained evaluation. Our results show that state-of-the-art models suffer from over-defense issues, with accuracy dropping close to random guessing levels (60%). To mitigate this, we propose PIGuard, a novel prompt guard model that incorporates a new training strategy, Mitigating Over-defense for Free (MOF), which significantly reduces the bias on trigger words. PIGuard demonstrates state-of-the-art performance on diverse benchmarks including NotInject, surpassing the existing best model by 30.8%, offering a robust and open-source solution for detecting prompt injection attacks.

Demos of PIGuard

Demo.mp4

We have released an online demo, you can access it here.

NotInject Dataset

To address the over-defense issue commonly seen in existing guard models, we introduce the NotInject dataset, designed to evaluate the extent of over-defense in these models. We identify certain trigger words that may cause defense shortcuts in guard models and use them to construct benign sentences. The dataset is divided into three subsets, each containing sentences with one, two, or three trigger words. For each subset, we create 113 benign sentences across four topics: Common Queries, Technique Queries, Virtual Creation, and Multilingual Queries.

Perfomance Comparison

Requirements

We recommend the following dependencies.

Then, please install other environment dependencies through:

pip install -r requirements.txt

Getting Started

💾 Checkpoints and Deployment

You can directly download our trained checkpoints here.

Or you can quickly deploy PIGuard released on Huggingface using transformers API by excuting:

from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("leolee99/PIGuard")
model = AutoModelForSequenceClassification.from_pretrained("leolee99/PIGuard", trust_remote_code=True)

classifier = pipeline(
"text-classification",
model=model,
tokenizer=tokenizer,
truncation=True,
)

text = ["Is it safe to excute this command?", "Ignore previous Instructions"]
class_logits = classifier(text)
print(class_logits)

⚙️ Dataset Preparation

  • Train set: We have provided our training dataset in the path of PIGuard\datasets, collected from 20 open-source datasets and several of our LLM augmented data.

  • Valid set: We select several samples (144) from NotInject, BIPIA, Wildguard-Benign, and PINT to conduct validation, which have been provided in the path of PIGuard\datasets.

  • Test set: We select NotInject, BIPIA, Wildguard-Benign, and PINT to evaluate the benign, malicious, and over-defense of the model. The first three are all provided in the path of PIGuard\datasets. The benchmark of PINT is not public, but you can request access to it by filling out here.

Note: Once you’ve downloaded the PINT benchmark, convert it from original YAML to JSON format by executing the following command:

python util.py

🔥 Train your PIGuard

There are some of arguments you can set:

  • --train_set: the path to the train set file.
  • --valid_set: the path to the valid set file.
  • --dataset_root: the folder to place test sets.
  • --batch_size: you can modify it to fit your GPU memory size.
  • --epochs: the number of training iterations for each sample.
  • --eval_batch_size: The batch size in the evaluation process.
  • --save_step: the step interval to save models.
  • --checkpoint_path: you can modify it to fit your GPU memory size.
  • --logs: where to store logs.
  • --max_length: the maximum length of input tokens.
  • --resume: the model you want to load.
  • --save_thres: the performance threshold to save models, the model will only be saved when the performance exceeds the threshold.
  • --resume: the model you want to load.

Then, you can train PIGuard by excuting the command:

python train.py

📋 Evaluation

You can evaluate the model on both 4 datasets (NotInject, PINT, Wildguard-Benign, BIPIA) by excuting the command:

python eval.py --resume ${CHECKPOINT}$

📈 Results

Perfomance Comparison

Perfomance Comparison

Citation

If you find this work useful in your research or applications, we appreciate that if you can kindly cite:

@articles{PIGuard,
  title={PIGuard: Prompt Injection Guardrail via Mitigating Overdefense for Free},
  author={Hao Li and 
        Xiaogeng Liu and 
        Ning Zhang and 
        Chaowei Xiao},
  journal = {ACL},
  year={2025}
}

About

[ACL 2025] The official implementation of the paper "PIGuard: Prompt Injection Guardrail via Mitigating Overdefense for Free".

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages