theshi-1128/ABJ-Attack

ABJ-Attack

This repository contains the official implementation of our paper "LLMs can be Dangerous Reasoners: Analyzing-based Jailbreak Attack on Large Language Models".

arXiv: paper | Jailbreak Attacks | Large Language Models | License: MIT

Please feel free to contact linshizjgsu@gmail.com if you have any questions.

Updates

  • (2024/07/21) We have released the official code of ABJ-Attack!
  • (2024/07/23) Our paper is on arXiv! Check it out here!
  • (2024/09/11) We have released a comprehensive defense methodology against jailbreak attacks! Check it out here!
  • (2024/09/26) We have released a simple yet comprehensive benchmark that covers most existing jailbreak attack methods! Check it out here!

Overview

This repository shares the code for our latest work on LLM jailbreaking. In this work:

  • We investigate a novel jailbreak attack paradigm that transitions from input-level obfuscation to reasoning-level manipulation, unveiling a previously overlooked attack surface inherent in the chain-of-thought reasoning trajectory of LLMs.
  • We present Analyzing-based Jailbreak (ABJ), a black-box attack method that steers the model's reasoning chains towards harmful outputs. ABJ introduces multimodal attack paths, effectively exploiting and exposing the intrinsic vulnerabilities within the textual and visual reasoning process of current LLMs.
  • We conduct extensive experiments to evaluate ABJ against diverse LLMs, demonstrating its attack effectiveness, efficiency, and transferability. Additionally, we analyze the key factors contributing to ABJ's effectiveness and discuss potential defense strategies.

Argument Specification

  • target_model: Name of the target model.

  • assist_model: Name of the assist model.

  • judge_model: Name of the judge model.

  • max_attack_rounds: Number of attack iteration rounds (default: 3).

  • max_adjustment_rounds: Number of toxicity adjustment rounds (default: 5).

  • target_model_cuda_id: CUDA device for the target model (default: cuda:0).

  • assist_model_cuda_id: CUDA device for the assist model (default: cuda:1).

  • judge_model_cuda_id: CUDA device for the judge model (default: cuda:2).
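
As a rough illustration of how the arguments listed above fit together, here is a minimal argparse sketch. It is not the actual parser in ABJ.py, which may define these options differently. The function name `build_parser` is hypothetical, and the `None` defaults for the three model names are placeholders, since no defaults are documented for them.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented arguments and defaults."""
    parser = argparse.ArgumentParser(description="ABJ-Attack argument sketch")

    # Model names. No defaults are documented, so None is a placeholder.
    parser.add_argument("--target_model", type=str, default=None)
    parser.add_argument("--assist_model", type=str, default=None)
    parser.add_argument("--judge_model", type=str, default=None)

    # Round counts, using the documented defaults.
    parser.add_argument("--max_attack_rounds", type=int, default=3)
    parser.add_argument("--max_adjustment_rounds", type=int, default=5)

    # Device strings, using the documented defaults.
    parser.add_argument("--target_model_cuda_id", type=str, default="cuda:0")
    parser.add_argument("--assist_model_cuda_id", type=str, default="cuda:1")
    parser.add_argument("--judge_model_cuda_id", type=str, default="cuda:2")
    return parser


if __name__ == "__main__":
    # Print the parsed values so the effective configuration is visible.
    args = build_parser().parse_args()
    print(vars(args))
```

Note that each option name is written directly after the double dash (for example `--target_model`), with no space in between.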

Quick Start

Before you start, replace the required values (api_key, url, model_path) in llm/api_config.py and llm/llm_model.py.

  1. Clone this repository:

    git clone https://github.com/theshi-1128/ABJ-Attack.git
  2. Build the environment:

    cd ABJ-Attack
    conda create -n ABJ python==3.11
    conda activate ABJ
    pip install -r requirements.txt
  3. Run ABJ-Attack:

    python ABJ.py \
    --target_model [TARGET MODEL] \
    --max_attack_rounds [ATTACK ROUNDS] \
    --target_model_cuda_id [CUDA ID]

    For example, to run ABJ with gpt-4o-2024-11-20 as the target model on cuda:0 for 3 rounds, run:

    python ABJ.py \
    --target_model gpt4o \
    --max_attack_rounds 3 \
    --target_model_cuda_id cuda:0
