GitHub - thaopham03/scheming-MAS: Detecting scheming behaviour in multi-agent settings

Scheming Ability in LLM-to-LLM Strategic Interactions

Abstract: The deployment of large language models (LLMs) as autonomous agents in multi-agent environments introduces emergent deception risks. We investigate scheming ability in multi-agent scenarios, where an AI agent secretly pursues misaligned objectives in interactions with another AI agent. Building on single-agent scheming frameworks, we examine how agents reason about other AI systems and how effectively they employ their scheming strategies in game-theoretic settings. We evaluate four reasoning models (GPT-4o, Llama-3.3-70b-instruct, Gemini-2.5-pro, and Claude-3.7- Sonnet) across two experimental paradigms: cheap talk, a type of signaling game, and the adversarial peer evaluation scenario. Results demonstrate that models can employ basic scheming strategies up to a minimal level of sophisticated scheming strategies, such as goal concealment, trust exploitation, self-preservation, and escalation willingness. Scheming performance does not vary significantly across the two games, except for GPT-4o in the Cheap Talk game. This suggests scheming capability might be context-dependent, with different models exhibiting distinct strategic profiles rather than uniform capabilities. Additionally, several environmental properties can affect AI agents’ scheming performance or enable them to scheme against other agents beyond what is necessary. These findings reveal that scheming capabilities in multi-agent contexts, pose distinct risks requiring targeted evaluation frameworks and safety measures as AI agents become increasingly autonomous.

Name		Name	Last commit message	Last commit date
Latest commit History 25 Commits
btom_env		btom_env
experiments		experiments
results		results
.gitignore		.gitignore
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Scheming Ability in LLM-to-LLM Strategic Interactions

About

Uh oh!

Releases

Packages

Languages

thaopham03/scheming-MAS

Folders and files

Latest commit

History

Repository files navigation

Scheming Ability in LLM-to-LLM Strategic Interactions

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages