This repository contains the development of my Data Science Bachelor's Thesis (UBA) on evaluating argumentative strategies in debates between AI systems. The research explores the dynamics of using debate as an alignment technique, focusing on scenarios with agents of different capabilities and cognitively limited judges.
The goal of this thesis is to study how unrestricted AI agents can exploit tactics such as deception, misleading arguments, or false consensus appeals to win debates against honest and harmless agents. Additionally, the research will analyze the impact of having a lower-capacity judge evaluate arguments from more advanced agents and how this affects the debate’s convergence towards the truth.
The work will include:
- Simulating debates between agents with asymmetric capabilities.
- Evaluating argumentative strategies and their effectiveness.
- Analyzing the impact of judges with limited knowledge.
- Comparing different metrics to assess the truthfulness and coherence of debates.
- AI Safety via Debate: Using debate as a mechanism to improve AI safety.
- Game Theory: Modeling strategic interactions between agents.
- Natural Language Processing (NLP): Implementing language models for argumentation.
- Evaluation Models: Designing metrics to measure the quality of arguments.
This research is based on AI safety and alignment techniques, including:
- AI Safety via Debate
- Scalable AI Safety via Doubly-Efficient Debate
- Measuring Progress on Scalable Oversight for LLMs
- Open Problems and Limitations of RLHF
- AI Control: Improving Safety Despite Intentional Subversion
JoaquÃn Salvador Machulsky
Email: jmachulsky@dc.uba.ar