Abstract: The deployment of large language models (LLMs) as autonomous agents in multi-agent environments introduces emergent deception risks. We investigate scheming ability in multi-agent scenarios, where an AI agent secretly pursues misaligned objectives in interactions with another AI agent. Building on single-agent scheming frameworks, we examine how agents reason about other AI systems and how effectively they employ their scheming strategies in game-theoretic settings. We evaluate four reasoning models (GPT-4o, Llama-3.3-70b-instruct, Gemini-2.5-pro, and Claude-3.7- Sonnet) across two experimental paradigms: cheap talk, a type of signaling game, and the adversarial peer evaluation scenario. Results demonstrate that models can employ basic scheming strategies up to a minimal level of sophisticated scheming strategies, such as goal concealment, trust exploitation, self-preservation, and escalation willingness. Scheming performance does not vary significantly across the two games, except for GPT-4o in the Cheap Talk game. This suggests scheming capability might be context-dependent, with different models exhibiting distinct strategic profiles rather than uniform capabilities. Additionally, several environmental properties can affect AI agents’ scheming performance or enable them to scheme against other agents beyond what is necessary. These findings reveal that scheming capabilities in multi-agent contexts, pose distinct risks requiring targeted evaluation frameworks and safety measures as AI agents become increasingly autonomous.
-
Notifications
You must be signed in to change notification settings - Fork 0
thaopham03/scheming-MAS
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
About
Detecting scheming behaviour in multi-agent settings
Topics
Resources
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published