Large Language Models (LLMs) are increasingly embedded in our digital infrastructure—from search engines and productivity tools to customer service and creative writing. These models are trained not only to be capable but also to be safe. Alignment techniques aim to ensure that LLMs do not produce harmful, unethical, or illegal content.
But what if the model’s alignment can be bypassed—not with a single clever prompt, but through a conversation?
In our recent work, we introduce Crescendo[1], a novel multi-turn jailbreak attack that gradually leads an LLM to violate its safety constraints. Unlike traditional jailbreaks that rely on adversarial prompts or suffixes, Crescendo uses benign, human-readable inputs and leverages the model’s own outputs to steer the conversation. We also present Crescendomation, a tool that automates this attack and outperforms existing jailbreak methods across a wide range of models and tasks.
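At a high level, the multi-turn pattern described above can be pictured as a short loop: each turn sends a benign-looking question, captures the model's reply, and feeds that reply back as context for the next question. The sketch below is purely illustrative; `query_model` and the placeholder questions are hypothetical stand-ins, not code from Crescendo or Crescendomation.

```python
# Illustrative sketch of a multi-turn escalation loop (not the authors' implementation).

def query_model(messages: list[dict]) -> str:
    """Hypothetical stand-in for a call to the target LLM's chat API."""
    # In a real setting this would send `messages` to the model endpoint.
    return f"(model reply to: {messages[-1]['content']})"

def run_multi_turn_conversation(questions: list[str]) -> list[dict]:
    """Send a sequence of individually benign-looking questions, each one
    building on the model's previous answer, and return the transcript."""
    history: list[dict] = []
    for question in questions:
        history.append({"role": "user", "content": question})
        reply = query_model(history)  # the model's own words become context for the next turn
        history.append({"role": "assistant", "content": reply})
    return history

# Placeholder questions only; the point is the gradual, reference-back structure.
transcript = run_multi_turn_conversation([
    "<benign opening question about the broad topic>",
    "<follow-up asking the model to expand on its own previous answer>",
    "<final question steering the accumulated context toward the target task>",
])
```

The key design point is that no single turn looks adversarial on its own; the pressure comes from the conversation history the model itself has produced.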
This article walks through the motivation, design, and implications of Crescendo, with examples and figures from our research. Our goal is to raise awareness of this new class of vulnerabilities and to encourage the development of more robust alignment techniques.