Skip to content

thaopham03/scheming-MAS

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

25 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Scheming Ability in LLM-to-LLM Strategic Interactions

Abstract: The deployment of large language models (LLMs) as autonomous agents in multi-agent environments introduces emergent deception risks. We investigate scheming ability in multi-agent scenarios, where an AI agent secretly pursues misaligned objectives in interactions with another AI agent. Building on single-agent scheming frameworks, we examine how agents reason about other AI systems and how effectively they employ their scheming strategies in game-theoretic settings. We evaluate four reasoning models (GPT-4o, Llama-3.3-70b-instruct, Gemini-2.5-pro, and Claude-3.7- Sonnet) across two experimental paradigms: cheap talk, a type of signaling game, and the adversarial peer evaluation scenario. Results demonstrate that models can employ basic scheming strategies up to a minimal level of sophisticated scheming strategies, such as goal concealment, trust exploitation, self-preservation, and escalation willingness. Scheming performance does not vary significantly across the two games, except for GPT-4o in the Cheap Talk game. This suggests scheming capability might be context-dependent, with different models exhibiting distinct strategic profiles rather than uniform capabilities. Additionally, several environmental properties can affect AI agents’ scheming performance or enable them to scheme against other agents beyond what is necessary. These findings reveal that scheming capabilities in multi-agent contexts, pose distinct risks requiring targeted evaluation frameworks and safety measures as AI agents become increasingly autonomous.

About

Detecting scheming behaviour in multi-agent settings

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published