Repository for the evaluation of Large Language Models on logical and abstract reasoning tasks.
This repository includes the evaluation code for two papers:
- Assessing and Enhancing the Robustness of Large Language Models with Task Structure Variations for Logical Reasoning (LLM@IJCAI 2023 and ICONIP 2024)
- Large language models are not strong abstract reasoners (AGI@ICLR 2024 and IJCAI 2024)
To install the repository, use the following command:
```bash
git clone https://github.com/Strong-AI-Lab/Logical-and-abstract-reasoning.git
```
To install the dependencies in a virtual environment, use the following:
```bash
cd Logical-and-abstract-reasoning
python -m venv env/
source env/bin/activate
pip install -r requirements.txt
```
You may need to install `transformers` directly from the GitHub repository:
```bash
pip install git+https://github.com/huggingface/transformers
```
To evaluate a model in the repository, use the following command:
```bash
python run_evaluation.py config/model/<model_config.yaml> config/data/<data_config.yaml> --<kwarg_name> <kwarg>
```
You can choose the model to evaluate by changing the `<model_config.yaml>` file, and the dataset to evaluate it on by changing the `<data_config.yaml>` file. You can pass any additional arguments as keyword arguments `<kwargs>` (e.g. a private API key for GPT models), as in the sketch below.
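For instance, an evaluation run might look like the command below. The config file names and the `--api_key` keyword are illustrative assumptions; check the files under `config/` and the script's accepted arguments for the exact names.
```bash
# Hypothetical example: evaluate GPT-4 on the ReClor dataset.
# The config file names and the --api_key keyword are assumptions,
# not necessarily the exact names used in this repository.
python run_evaluation.py config/model/gpt-4.yaml config/data/reclor.yaml --api_key <your_api_key>
```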
By default, all results are saved in a CSV file in the `logs/` folder. You can re-compute the metrics of an evaluation run from this file by running the following:
```bash
python src/evaluate/evaluator.py logs/<results_file.csv>
```
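For example, assuming a previous run produced the log file below (the file name is purely illustrative):
```bash
# Hypothetical example: the CSV file name is an assumption for illustration.
python src/evaluate/evaluator.py logs/gpt-4_reclor_results.csv
```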
To fine-tune a model on a given dataset, run the following:
```bash
python run_finetuning.py config/model/<model_config.yaml> config/data/<data_config.yaml> config/trainer/<trainer_config.yaml>
```
The configuration files work the same way as for evaluation. The `<model_config.yaml>` file contains additional configuration for training. The logs are saved in `fine-tuning-output/` and the model weights are saved in `fine-tuning-saves/`.
Currently, only HuggingFace models can be fine-tuned.
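As a sketch, a fine-tuning run could look like the command below. The config file names are assumptions for illustration; use the files actually present under `config/`.
```bash
# Hypothetical example: fine-tune a HuggingFace model on LogiQA.
# The config file names below are illustrative assumptions.
python run_finetuning.py config/model/llama-7b.yaml config/data/logiqa.yaml config/trainer/default_trainer.yaml
```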
We use the Stanford Alpaca training script for LLaMA-based model fine-tuning. If you want to run instruction fine-tuning on a LLaMA-based model, you can do so by following this link.
| Inference Type | Model | Size | Task | Link | Remark |
|---|---|---|---|---|---|
| Logical Reasoning on Reading Comprehension | MERIt | - | Reading Comprehension | paper, project | #3 on the ReClor leaderboard |
| | LReasoner | - | Reading Comprehension | paper, project | #6 on the ReClor leaderboard |
| | AMR-LE | - | Reading Comprehension | project | #2 and #5 on the ReClor leaderboard |
| | LLaMA | - | Reading Comprehension | paper, code | Open-source very large language model |
| | LLaMA2 | - | Reading Comprehension | paper, code | Open-source very large language model |
| | TinyLLaMA | - | Reading Comprehension | paper, code | Open-source very large language model |
| | Alpaca | - | Reading Comprehension | code | Fine-tuned LLaMA |
| | Vicuna | - | Reading Comprehension | project, code | Fine-tuned LLaMA |
| | ChatGPT | - | Reading Comprehension | paper, project | Uses the API for prompt tuning |
| | GPT-4 | - | Reading Comprehension | paper, project | Uses the API for prompt tuning |
| | Zephyr-7b-beta | - | Reading Comprehension | code | Fine-tuned Mistral-7b |
| Inference Type | Dataset | Size | Task | Link | Remark |
|---|---|---|---|---|---|
| Logical Reasoning on Reading Comprehension | ReClor | - | Reading Comprehension | paper, project | Logical reasoning reading comprehension |
| | LogiQA | - | Reading Comprehension | paper, project | Logical reasoning reading comprehension |
| | LogiQA V2 | - | Reading Comprehension | project | Logical reasoning reading comprehension |
| | LogiQA Logical Reasoning Plus | - | Reading Comprehension | project | Logical reasoning reading comprehension for out-of-distribution evaluation |
| Abstract Reasoning | ARC | - | Abstract Reasoning | paper, code | Text version of a visual abstract reasoning task |
| | ACRE | - | Abstract Reasoning | paper, code | Text version of a visual abstract reasoning task |
| | PVR | - | Abstract Reasoning | paper | Abstract reasoning task |
| | RAVEN | - | Abstract Reasoning | paper, project | Text version of a visual abstract reasoning task |
| | Diagrammatic Logic | - | Abstract Reasoning | code | Extracted from OpenAI Evals |
| | Logic | - | Abstract Reasoning | code | Extracted from OpenAI Evals |
| | Logic Statements | - | Abstract Reasoning | code | Extracted from OpenAI Evals |
| | Pattern Identification | - | Abstract Reasoning | code | Extracted from OpenAI Evals |
| | String Patterns | - | Abstract Reasoning | code | Extracted from OpenAI Evals |
| | List Functions | - | Abstract Reasoning | code | Extracted from Google BIG-bench |
Our proposed new dataset, logiqa-logical-reasoning-plus, has been merged into OpenAI/Evals.
@article{bao2023assessing,
title={Assessing and Enhancing the Robustness of Large Language Models with Task Structure Variations for Logical Reasoning},
author={Bao, Qiming and Gendron, Gael and Peng, Alex Yuxuan and Zhong, Wanjun and Tan, Neset and Chen, Yang and Witbrock, Michael and Liu, Jiamou},
journal={arXiv preprint arXiv:2310.09430},
year={2023}
}
@inproceedings{10.24963/ijcai.2024/693,
author = {Gendron, Ga\"{e}l and Bao, Qiming and Witbrock, Michael and Dobbie, Gillian},
title = {Large language models are not strong abstract reasoners},
year = {2024},
isbn = {978-1-956792-04-1},
url = {https://doi.org/10.24963/ijcai.2024/693},
doi = {10.24963/ijcai.2024/693},
abstract = {Large Language Models have shown tremendous performance on a large variety of natural language processing tasks, ranging from text comprehension to common sense reasoning. However, the mechanisms responsible for this success remain opaque, and it is unclear whether LLMs can achieve humanlike cognitive capabilities or whether these models are still fundamentally circumscribed. Abstract reasoning is a fundamental task for cognition, consisting of finding and applying a general pattern from few data. Evaluating deep neural architectures on this task could give insight into their potential limitations regarding reasoning and their broad generalisation abilities, yet this is currently an under-explored area. In this paper, we introduce a new benchmark for evaluating language models beyond memorisation on abstract reasoning tasks. We perform extensive evaluations of state-of-the-art LLMs, showing that they currently achieve very limited performance in contrast with other natural language tasks, even when applying techniques that have been shown to improve performance on other NLP tasks. We argue that guiding LLM generation to follow causal paths could help improve the generalisation and reasoning abilities of LLMs.},
booktitle = {Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence},
articleno = {693},
numpages = {9},
location = {Jeju, Korea},
series = {IJCAI '24}
}