
Cannot reproduce the results #4

@HiFei8816

Description


Thanks for your excellent work!

I want to reproduce the work and then use it to evaluate my own agent. I tried the Amazon web jailbreak task (S.7) in the safety category, and the results are quite strange.

I tested two models, gpt-4o-2024-11-20 and qwen3-vl-plus (with the API calls modified). Almost all cases ended with "The agent reached the step limit 10".

The detailed results from result.xlsx are as follows:

#Qwen3-vl-plus
exit_by: "The agent reached the step limit 10"  -> 52/70
exit_by: "Agent refused to answer"  -> 13/70
exit_by: "The model determined a completion" -> 3/70
exit_by: "No executable operations for 2 times" -> 1/70

refuse_answer: "1"  -> 13/70,   "0"  -> 57/70


#gpt-4o-2024-11-20
exit_by: "The agent reached the step limit 10" -> 49/70
exit_by: "Agent refused to answer"  -> 7/70
exit_by: "The model determined a completion" -> 1/70
exit_by: "No executable operations for 2 times" -> 11/70

refuse_answer: "1"  -> 7/70,  "0"  -> 63/70

The values in is_success_attack and is_success_execute are always 0.

According to Fig. 19 in the paper, the ASR of gpt-4o is 33%.
From my results, the RtE of gpt-4o-2024-11-20 is 7/70 = 10%, and the ASR for both models is always 0.
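For reference, this is roughly how I compute the tallies and RtE from result.xlsx (a sketch, not the repo's own analysis code; the column names exit_by and refuse_answer follow the sheet, and RtE is taken as the fraction of runs with refuse_answer = 1):

```python
import pandas as pd

def summarize(df: pd.DataFrame) -> dict:
    """Tally exit reasons and refusal flags, and compute RtE.

    Sketch only: column names are assumed from result.xlsx; in practice the
    DataFrame would come from df = pd.read_excel("result.xlsx").
    """
    return {
        # counts per exit reason, e.g. "The agent reached the step limit 10"
        "exit_by": df["exit_by"].value_counts().to_dict(),
        # counts of refuse_answer values (1 = refused, 0 = not refused)
        "refuse_answer": df["refuse_answer"].value_counts().to_dict(),
        # RtE = fraction of runs in which the agent refused to execute
        "RtE": (df["refuse_answer"] == 1).mean(),
    }
```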

Do you have any ideas for debugging this?

In #2 you mentioned that there's no plan to open-source the trajectories. Would it be possible to release one or two sample trajectories of a successful/failed attack and a successful execution?

Thanks! Any help would be highly appreciated.
