Repository for the evaluation of Large Language Models on logical and abstract reasoning tasks.
This repository includes the evaluation code for two papers:
- Assessing and Enhancing the Robustness of Large Language Models with Task Structure Variations for Logical Reasoning (LLM@IJCAI 2023 and ICONIP 2024)
- Large language models are not strong abstract reasoners (AGI@ICLR 2024 and IJCAI 2024)
To install the repository, use the following command:
```bash
git clone https://github.com/Strong-AI-Lab/Logical-and-abstract-reasoning.git
```
To install the dependencies in a virtual environment, use the following:
```bash
cd Logical-and-abstract-reasoning
python -m venv env/
source env/bin/activate
pip install -r requirements.txt
```
You may need to install `transformers` directly from the GitHub repository:
```bash
pip install git+https://github.com/huggingface/transformers
```
To evaluate a model in the repository, use the following command:
```bash
python run_evaluation.py config/model/<model_config.yaml> config/data/<data_config.yaml> --<kwarg_name> <kwarg>
```
You can choose the model to evaluate by changing the `<model_config.yaml>` file, and the dataset to evaluate it on by changing the `<data_config.yaml>` file. You can pass any additional arguments as keyword arguments `<kwargs>` (e.g. a private API key for GPT models), as in the sketch below.
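For instance, an evaluation run might look like the command below. The config file names and the `--api_key` keyword are illustrative assumptions; check the files under `config/` and the script's accepted arguments for the exact names.
```bash
# Hypothetical example: evaluate GPT-4 on the ReClor dataset.
# The config file names and the --api_key keyword are assumptions,
# not necessarily the exact names used in this repository.
python run_evaluation.py config/model/gpt-4.yaml config/data/reclor.yaml --api_key <your_api_key>
```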
By default, all results are saved in a CSV file in the `logs/` folder. You can re-compute the metrics of an evaluation run from this file by running the following:
```bash
python src/evaluate/evaluator.py logs/<results_file.csv>
```
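For example, assuming a previous run produced the log file below (the file name is purely illustrative):
```bash
# Hypothetical example: the CSV file name is an assumption for illustration.
python src/evaluate/evaluator.py logs/gpt-4_reclor_results.csv
```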
To fine-tune a model on a given dataset, run the following:
```bash
python run_finetuning.py config/model/<model_config.yaml> config/data/<data_config.yaml> config/trainer/<trainer_config.yaml>
```
The configuration files work the same way as for evaluation. The `<model_config.yaml>` file contains additional configuration for training. The logs are saved in `fine-tuning-output/` and the model weights are saved in `fine-tuning-saves/`.
Currently, only HuggingFace models can be fine-tuned.
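As a sketch, a fine-tuning run could look like the command below. The config file names are assumptions for illustration; use the files actually present under `config/`.
```bash
# Hypothetical example: fine-tune a HuggingFace model on LogiQA.
# The config file names below are illustrative assumptions.
python run_finetuning.py config/model/llama-7b.yaml config/data/logiqa.yaml config/trainer/default_trainer.yaml
```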
We use the Stanford Alpaca training script for LLaMA-based model fine-tuning. If you want to run instruction fine-tuning on a LLaMA-based model, you can do so by following this link.
| Inference Type | Model | Size | Task | Link | Remark |
|---|---|---|---|---|---|
| Logical Reasoning on Reading Comprehension | MERIt | - | Reading Comprehension | paper, project | #3 on the ReClor leaderboard |
| | LReasoner | - | Reading Comprehension | paper, project | #6 on the ReClor leaderboard |
| | AMR-LE | - | Reading Comprehension | project | #2 and #5 on the ReClor leaderboard |
| | LLaMA | - | Reading Comprehension | paper, code | Open-source very large language model |
| | LLaMA2 | - | Reading Comprehension | paper, code | Open-source very large language model |
| | TinyLLaMA | - | Reading Comprehension | paper, code | Open-source very large language model |
| | Alpaca | - | Reading Comprehension | code | Fine-tuned LLaMA |
| | Vicuna | - | Reading Comprehension | project, code | Fine-tuned LLaMA |
| | ChatGPT | - | Reading Comprehension | paper, project | Uses the API for prompt tuning |
| | GPT-4 | - | Reading Comprehension | paper, project | Uses the API for prompt tuning |
| | Zephyr-7b-beta | - | Reading Comprehension | code | Fine-tuned Mistral-7b |
| Inference Type | Dataset | Size | Task | Link | Remark |
|---|---|---|---|---|---|
| Logical Reasoning on Reading Comprehension | ReClor | - | Reading Comprehension | paper, project | Logical reasoning reading comprehension |
| | LogiQA | - | Reading Comprehension | paper, project | Logical reasoning reading comprehension |
| | LogiQA V2 | - | Reading Comprehension | project | Logical reasoning reading comprehension |
| | LogiQA Logical Reasoning Plus | - | Reading Comprehension | project | Logical reasoning reading comprehension for out-of-distribution evaluation |
| Abstract Reasoning | ARC | - | Abstract Reasoning | paper, code | Text version of a visual abstract reasoning task |
| | ACRE | - | Abstract Reasoning | paper, code | Text version of a visual abstract reasoning task |
| | PVR | - | Abstract Reasoning | paper | Abstract reasoning task |
| | RAVEN | - | Abstract Reasoning | paper, project | Text version of a visual abstract reasoning task |
| | Diagrammatic Logic | - | Abstract Reasoning | code | Extracted from OpenAI Evals |
| | Logic | - | Abstract Reasoning | code | Extracted from OpenAI Evals |
| | Logic Statements | - | Abstract Reasoning | code | Extracted from OpenAI Evals |
| | Pattern Identification | - | Abstract Reasoning | code | Extracted from OpenAI Evals |
| | String Patterns | - | Abstract Reasoning | code | Extracted from OpenAI Evals |
| | List Functions | - | Abstract Reasoning | code | Extracted from Google BIG-bench |
Our proposed new dataset, logiqa-logical-reasoning-plus, has been merged into OpenAI/Evals.
@article{bao2023assessing,
title={Assessing and Enhancing the Robustness of Large Language Models with Task Structure Variations for Logical Reasoning},
author={Bao, Qiming and Gendron, Gael and Peng, Alex Yuxuan and Zhong, Wanjun and Tan, Neset and Chen, Yang and Witbrock, Michael and Liu, Jiamou},
journal={arXiv preprint arXiv:2310.09430},
year={2023}
}
@inproceedings{10.24963/ijcai.2024/693,
author = {Gendron, Ga\"{e}l and Bao, Qiming and Witbrock, Michael and Dobbie, Gillian},
title = {Large language models are not strong abstract reasoners},
year = {2024},
isbn = {978-1-956792-04-1},
url = {https://doi.org/10.24963/ijcai.2024/693},
doi = {10.24963/ijcai.2024/693},
abstract = {Large Language Models have shown tremendous performance on a large variety of natural language processing tasks, ranging from text comprehension to common sense reasoning. However, the mechanisms responsible for this success remain opaque, and it is unclear whether LLMs can achieve humanlike cognitive capabilities or whether these models are still fundamentally circumscribed. Abstract reasoning is a fundamental task for cognition, consisting of finding and applying a general pattern from few data. Evaluating deep neural architectures on this task could give insight into their potential limitations regarding reasoning and their broad generalisation abilities, yet this is currently an under-explored area. In this paper, we introduce a new benchmark for evaluating language models beyond memorisation on abstract reasoning tasks. We perform extensive evaluations of state-of-the-art LLMs, showing that they currently achieve very limited performance in contrast with other natural language tasks, even when applying techniques that have been shown to improve performance on other NLP tasks. We argue that guiding LLM generation to follow causal paths could help improve the generalisation and reasoning abilities of LLMs.},
booktitle = {Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence},
articleno = {693},
numpages = {9},
location = {Jeju, Korea},
series = {IJCAI '24}
}