Thanks for your excellent work!
I want to reproduce the work and then use it to evaluate my own agent. I tried the Amazon web jailbreak task (S.7) in the safety category, and the results are quite strange.
I tested two models, gpt-4o-2024-11-20 and qwen3-vl-plus (after modifying the API-calling code; see the sketch below), and almost all cases ended with "The agent reached the step limit 10".
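For qwen3-vl-plus, my modification was simply to point the OpenAI client at an OpenAI-compatible endpoint. A minimal sketch of what I changed (the DashScope base URL and env-var name reflect my own setup, not this repo's code):

```python
import os
from openai import OpenAI

# qwen3-vl-plus served through an OpenAI-compatible endpoint (my setup,
# not this repo's default); only the client construction was changed.
client = OpenAI(
    api_key=os.environ["DASHSCOPE_API_KEY"],
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

response = client.chat.completions.create(
    model="qwen3-vl-plus",
    messages=[{"role": "user", "content": "hello"}],
)
print(response.choices[0].message.content)
```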
The detailed results in result.xlsx are as follows:
| exit_by | qwen3-vl-plus | gpt-4o-2024-11-20 |
| --- | --- | --- |
| "The agent reached the step limit 10" | 52/70 | 49/70 |
| "Agent refused to answer" | 13/70 | 7/70 |
| "The model determined a completion" | 3/70 | 1/70 |
| "No executable operations for 2 times" | 1/70 | 11/70 |

| refuse_answer | qwen3-vl-plus | gpt-4o-2024-11-20 |
| --- | --- | --- |
| 1 | 13/70 | 7/70 |
| 0 | 57/70 | 63/70 |
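For reference, I tallied the counts above from result.xlsx with a short pandas script along these lines (the column names `exit_by` and `refuse_answer` are as they appear in the output file):

```python
import pandas as pd

df = pd.read_excel("result.xlsx")

# Tally exit reasons and refusal flags across all 70 cases.
print(df["exit_by"].value_counts())
print(df["refuse_answer"].value_counts())
```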
The values in `is_success_attack` and `is_success_execute` are always 0.
According to Fig. 19 in the paper, the ASR of gpt-4o is 33%. From my results, the RtE of gpt-4o-2024-11-20 is 7/70 = 10%, and the ASR for both models is always 0%.
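To be explicit about how I computed these numbers, this is my reading of the metrics (RtE as the fraction of refusals, ASR as the fraction of successful attacks), not a script from this repo:

```python
import pandas as pd

df = pd.read_excel("result.xlsx")

# My reading of the metrics: RtE = fraction of refused cases,
# ASR = fraction of cases flagged as successful attacks.
rte = df["refuse_answer"].astype(int).mean()       # 7/70 = 10% for gpt-4o-2024-11-20
asr = df["is_success_attack"].astype(int).mean()   # 0/70 = 0% for both models
print(f"RtE = {rte:.1%}, ASR = {asr:.1%}")
```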
Do you have any ideas for debugging this problem?
In #2 you mentioned that there's no plan to open-source the trajectories. Would it be possible to release one or two sample trajectories: one of a successful/failed attack and one of a successful execution?
Thanks! Any help will be highly appreciated.