Skip to content

SaFoLab-WISC/OET

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

11 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

OET: Optimization-based prompt injection Evaluation Toolkit

Jinsheng Pan, Xiaogeng Liu, and Chaowei Xiao.

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language understanding and generation, enabling their widespread adoption across various domains. However, their susceptibility to prompt injection attacks poses significant security risks, as adversarial inputs can manipulate model behavior and override intended instructions. Despite numerous defense strategies, a standardized framework to rigorously evaluate their effectiveness, especially under adaptive adversarial scenarios—is lacking. To address this gap, we introduce OET, an optimization-based evaluation toolkit that systematically benchmarks prompt injection attacks and defenses across diverse datasets using an adaptive testing framework. Our toolkit features a modular workflow that facilitates adversarial string generation, dynamic attack execution, and comprehensive result analysis, offering a unified platform for assessing adversarial robustness. Crucially, the adaptive testing framework leverages optimization methods with both white-box and black-box access to generate worst-case adversarial examples, thereby enabling strict red-teaming evaluations. Extensive experiments underscore the limitations of current defense mechanisms, with some models remaining susceptible even after implementing security enhancements.

Table of Contents

Update

Date Event
2025/5/01 We released our paper.
2025/4/09 We released our code.

Installation

    git clone https://github.com/Victor-lol/OET.git

    conda create -n oet python=3.10
    conda activate oet
    
    cd ./OET
    python3 setup.py install 
    pip install -r requirements.txt
    pip install -e. 

Evaluation

Data Transformation

an example is shown in example/ex_data.py, users can transform data in their desired format. Deafult format: json, csv

Open-sourced Model

  1. Construct configuration, modify configs/optimizer_config.yaml
  2. Train Adv Strig, create EvalOptimizerModel from eval.open_pipeline as the pipeline and then call train function or create your own training function
  3. Attack and check result, call complete function to run attack, and then run check_refusal_completions to calculate ASR. User can write their own attack and metric function using EvalOptimizerModel object

a usage example is shown in example/example.py and example/train.sh and example/infer.sh an example of creating customized training and attack function is shown in example/eval_struq.py For chat Template options, please refers to FastChat

Close-sourced Model

Directly Attack and check result, create EvalAPIModel from eval.close_pipeline as the pipeline and then call complete function or create your own attack function, and call check_refusal to calculate ASR.

a usage example is shown in example/example_close.py and example/infer_close.sh

Citation

If this work is helpful, please kindly cite as:

@misc{pan2025oetoptimizationbasedpromptinjection,
      title={OET: Optimization-based prompt injection Evaluation Toolkit}, 
      author={Jinsheng Pan and Xiaogeng Liu and Chaowei Xiao},
      year={2025},
      eprint={2505.00843},
      archivePrefix={arXiv},
      primaryClass={cs.CR},
      url={https://arxiv.org/abs/2505.00843}, 
}

Acknowledgement

This repo benefits from HarmBench, AutoDAN, and FastChat. Thanks for their wonderful works.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published