Theory of Mind for Large Language Model Alignment

Welcome to the official repository for our paper:
"Beyond Words: Integrating Theory of Mind into Conversational Agents for Human-Like Belief, Desire, and Intention Alignment"
Published in Findings of ACL 2025.

Abstract

Natural language interaction has long served as the primary medium through which humans exchange thoughts. A key enabler of this communication is the human capacity for Theory of Mind (ToM)—the ability to infer and align with the mental states of others. Research in linguistics and psychology has shown that speakers often reveal their desires, beliefs, and intentions through pragmatic aspects of language. Considering the advancements in natural language generation and perception that large language models (LLMs) have made in recent years, a critical question arises in relation to ToM: can LLM-powered agents develop similar abilities for inferring mental states during natural language communication? This study investigates the extent to which open-source LLaMA models can represent and retain ToM-related constructs, and whether these internal representations contribute to a coherent mental state modeling in a given conversation. Additionally, we explore the potential for manipulating ToM-related information to generate more aligned responses. Empirical evaluations of LLaMA-3 models (3B and 8B) demonstrate that ToM-informed alignment improves response quality, achieving win rates of 63% and 67%, respectively. These findings suggest that integrating ToM principles can enhance alignment in LLM-based conversational agents.

📚 Citation

If you use this work, please cite:

@inproceedings{jafari2025beyond,
  title     = {Beyond Words: Integrating Theory of Mind into Conversational Agents for Human-Like Belief, Desire, and Intention Alignment},
  author    = {Jafari, Mehdi and Hua, Yuncheng and Xue, Hao and Salim, Flora D.},
  booktitle = {Findings of the Association for Computational Linguistics (ACL)},
  year      = {2025},
  publisher = {Association for Computational Linguistics},
}

About This Repository

This repository contains the implementation supporting our paper in Findings of ACL 2025. It builds upon the LatentQA repository, incorporating key modifications designed to analyze and evaluate the presence and consistency of ToM representations in LLMs. Additionally, it uses these representations to guide the LLM.

Dataset Preparation

This section details how each dataset is prepared for training:

CaSiNo Dataset

The CaSiNo dataset is used as released in the original paper's repository, without any modifications: link.

CraigslistBargain Dataset

The CraigslistBargain dataset is retrieved from the webpage associated with the paper: link.

FanToM Dataset

The FanToM dataset is downloaded from the link provided in the paper's repository: link. This link points to a zip file hosted on Google Drive. After downloading, the dataset is split into training, validation, and test sets using the train_test_split function from sklearn.model_selection. The random state is set to 42 for reproducibility. The split is as follows (a minimal code sketch is given after the list):

  • Test Set: 30% of the data is reserved for testing.
  • Train and Validation Sets: The remaining 70% is split into training and validation sets with an 80:20 ratio.
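
A minimal sketch of this split, assuming the unzipped FanToM data has been loaded into a single list of records (the file name and loading step below are placeholders, not the repository's fixed layout):

import json
from sklearn.model_selection import train_test_split

def split_dataset(records, seed=42):
    # Hold out 30% of the data for testing.
    train_val, test = train_test_split(records, test_size=0.3, random_state=seed)
    # Split the remaining 70% into training and validation sets with an 80:20 ratio.
    train, val = train_test_split(train_val, test_size=0.2, random_state=seed)
    return train, val, test

# Hypothetical file name; adapt to wherever the unzipped FanToM data is stored.
with open("fantom_full.json") as f:
    records = json.load(f)

train, val, test = split_dataset(records)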

Negotiation ToM Dataset

Similar to the FanToM dataset, the Negotiation ToM dataset is downloaded from the paper's repository: link and processed as follows:

  1. Download the dataset.
  2. Split the data into training, validation, and test sets using the same procedure as the FanToM dataset (see the reuse example below).
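
Assuming the split_dataset helper sketched in the FanToM section above, the Negotiation ToM records can be split identically (the file name is a placeholder):

import json

# Hypothetical file name; adapt to wherever the downloaded Negotiation ToM data is stored.
with open("negotiation_tom.json") as f:
    records = json.load(f)

train, val, test = split_dataset(records)  # same 30% test split, then 80:20 train:val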

Training New Decoder Models

To train a new decoder model, modify the configuration files in lit/configs/* and run train.py with the appropriate arguments.

  • All experiments reported in Table 1 of the reference article can be reproduced using scripts/table1_train_{dataset_name}.sh.
  • Table 2 results are generated using scripts/table2_train.sh.
  • Training models for steering LLMs is done using scripts/train_steer.sh.

To run only part of the training, comment out the unwanted parts of each script and execute it in a suitable environment.
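
For reference, these scripts can also be launched from Python; a minimal sketch using subprocess, where the dataset-name suffix is a placeholder that must match the actual file names under scripts/:

import subprocess

# Table 1: one script per dataset; the suffix below is a placeholder.
dataset_name = "CaSiNo"
subprocess.run(["bash", f"scripts/table1_train_{dataset_name}.sh"], check=True)

# Table 2 and steering experiments use fixed script names.
subprocess.run(["bash", "scripts/table2_train.sh"], check=True)
subprocess.run(["bash", "scripts/train_steer.sh"], check=True)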

Reading ToM

After training a new model, follow these steps to read its internal ToM representations:

  1. Set Model Path: Specify the path to the trained model in the run_reading.sh file.
  2. Configure Output: Modify the interpret_config.py file to:
    • Choose a name for the output file containing responses generated by the decoder model for the test set.
    • Select the dataset.
    • Ensure the output file structure matches the evaluation script requirements for each dataset.

Scripts such as scripts/table2_reading_NegotiationToM_FanToM.sh contain the settings and parameters used in our experiments. These scripts execute reading.py and can be modified and run as needed.
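
As a hedged illustration only: if the reading output is written as one JSON object per line (the actual structure depends on interpret_config.py and the chosen dataset), a quick sanity check of the generated file might look like this. The file name and format here are assumptions, not the repository's fixed convention:

import json

# Hypothetical output path; use the name chosen in interpret_config.py.
output_path = "out/reading_outputs.jsonl"

with open(output_path) as f:
    rows = [json.loads(line) for line in f if line.strip()]

print(f"{len(rows)} decoded responses")
print("Example keys:", sorted(rows[0].keys()) if rows else "(file is empty)")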

Steering by ToM

To steer using different ToM components:

  1. Train the related decoder model with the parameters and settings available in scripts/train_steer.sh.
  2. Use the control.py file with the appropriate arguments to generate loss images and candidate files, which are stored in ./out.

Refer to scripts/table3_control.sh for sample usage.

Evaluation

To run the evaluation and generate the results presented in the article, use the table*.ipynb notebooks:

  • CaSiNo and CraigslistBargain: Accuracy is defined intuitively and explained in the article (an illustrative sketch follows this list).
  • FanToM and Negotiation ToM: Scores are adapted from the respective original repositories and papers.
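
As an illustration of the kind of accuracy computation involved (not the exact metric code from the notebooks), an exact-match accuracy over predicted versus gold labels could be computed as follows; the example labels are hypothetical:

def exact_match_accuracy(predictions, references):
    # Fraction of predictions that exactly match the gold label after light normalization.
    assert len(predictions) == len(references)
    matches = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return matches / len(references) if references else 0.0

# Hypothetical usage: decoded mental-state labels versus gold annotations.
preds = ["high priority: water", "low priority: food"]
golds = ["High priority: water", "low priority: firewood"]
print(f"Accuracy: {exact_match_accuracy(preds, golds):.2f}")  # 0.50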

Intermediate files generated during experiments are stored in output directories:

  • Decoder model predictions are stored in ./controls.
  • Steered responses are saved in ./out.

To use FanToM's official evaluation code, some intermediate files must be moved to the locations expected by baselines/table2/CoT/FanToM/eval_fantom.py and renamed according to its conventions.

Repository Structure

The repository is organized as follows:

  • ./baselines: Contains baseline implementations.
  • ./controls and ./out: Store intermediate files and results.
  • ./scripts: Contains sample scripts for running experiments.
  • ./lit: Copied from the LatentQA repository (refer to the original repo).
  • ./data: Contains datasets for experiments.

Hardware Setting

Experiments were conducted on a virtual machine with 100 GB of RAM and an NVIDIA H100 NVL GPU.
