
Logical and Abstract Reasoning

Repository for the evaluation of Large Language Models on logical and abstract reasoning tasks.

This repository includes the evaluation code for two papers (see the Citation section below).

Installation

To clone the repository, use the following command:

git clone https://github.com/Strong-AI-Lab/Logical-and-abstract-reasoning.git

To install the dependencies in a virtual environment, use the following:

cd Logical-and-abstract-reasoning
python -m venv env/
source env/bin/activate
pip install -r requirements.txt

You may need to install transformers from source:

pip install git+https://github.com/huggingface/transformers

Use

Evaluation

To evaluate a model in the repository, use the following command:

python run_evaluation.py config/model/<model_config.yaml> config/data/<data_config.yaml> --<kwarg_name> <kwarg>

You can choose the model to evaluate by changing the <model_config.yaml> file, and the dataset to evaluate it on by changing the <data_config.yaml> file. Any additional arguments can be passed as keyword arguments (e.g. a private API key for GPT models), as in the example below.
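For instance, a run of GPT-4 on ReClor could look like the following (the configuration file names and the --api_key argument are illustrative, not necessarily the exact names used in this repository; check config/model/ and config/data/ for the available configurations and supported keyword arguments):

python run_evaluation.py config/model/gpt-4.yaml config/data/reclor.yaml --api_key <your_openai_api_key>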

By default, all the results are saved in a CSV file in the logs/ folder. You can re-compute the metrics of an evaluation run from this file by running the following:

python src/evaluate/evaluator.py logs/<results_file.csv>

Fine-tuning

To fine-tune a model on a given dataset, run the following:

python run_finetuning.py config/model/<model_config.yaml> config/data/<data_config.yaml> config/trainer/<trainer_config.yaml>

The configuration files work similarly to the ones used for evaluation. The <model_config.yaml> file contains additional configuration for training. The logs are saved in fine-tuning-output/ and the model weights are saved in fine-tuning-saves/.

Currently, only HuggingFace models can be fine-tuned.
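As an illustrative sketch (the configuration file names below are placeholders; use the files actually provided under config/), a fine-tuning run of a LLaMA-2 model on LogiQA could look like:

python run_finetuning.py config/model/llama2.yaml config/data/logiqa.yaml config/trainer/default_trainer.yaml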

LLaMA-based model instruction fine-tuning

We use the Stanford Alpaca training script for LLaMA-based model fine-tuning. If you want to run instruction fine-tuning on a LLaMA-based model, you can do so by following this link.
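As a rough sketch of what such a run looks like, the command below follows the pattern of the Stanford Alpaca training script (the script name, paths, and hyperparameters are placeholders taken from the Alpaca README, not values prescribed by this repository):

torchrun --nproc_per_node=4 --master_port=29500 train.py \
    --model_name_or_path <path_to_llama_weights> \
    --data_path ./alpaca_data.json \
    --bf16 True \
    --output_dir <output_dir> \
    --num_train_epochs 3 \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --learning_rate 2e-5 \
    --model_max_length 512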

Models

| Inference Type | Model | Size | Task | Link | Remark |
| --- | --- | --- | --- | --- | --- |
| Logical Reasoning on Reading Comprehension | MERIt | - | Reading Comprehension | paper, project | #3 on the ReClor leaderboard |
| | LReasoner | - | Reading Comprehension | paper, project | #6 on the ReClor leaderboard |
| | AMR-LE | - | Reading Comprehension | project | #2 and #5 on the ReClor leaderboard |
| | LLaMA | - | Reading Comprehension | paper, code | Open-source very large language model |
| | LLaMA2 | - | Reading Comprehension | paper, code | Open-source very large language model |
| | TinyLLaMA | - | Reading Comprehension | paper, code | Open-source very large language model |
| | Alpaca | - | Reading Comprehension | code | Fine-tuned LLaMA |
| | Vicuna | - | Reading Comprehension | project, code | Fine-tuned LLaMA |
| | ChatGPT | - | Reading Comprehension | paper, project | Uses the API for prompt tuning |
| | GPT-4 | - | Reading Comprehension | paper, project | Uses the API for prompt tuning |
| | Zephyr-7b-beta | - | Reading Comprehension | code | Fine-tuned Mistral-7b |

Datasets & Benchmarks

| Inference Type | Dataset | Size | Task | Link | Remark |
| --- | --- | --- | --- | --- | --- |
| Logical Reasoning on Reading Comprehension | ReClor | - | Reading Comprehension | paper, project | Logical reasoning reading comprehension |
| | LogiQA | - | Reading Comprehension | paper, project | Logical reasoning reading comprehension |
| | LogiQA V2 | - | Reading Comprehension | project | Logical reasoning reading comprehension |
| | LogiQA Logical Reasoning Plus | - | Reading Comprehension | project | Logical reasoning reading comprehension for out-of-distribution evaluation |
| Abstract Reasoning | ARC | - | Abstract Reasoning | paper, code | Text version of a Visual Abstract Reasoning task |
| | ACRE | - | Abstract Reasoning | paper, code | Text version of a Visual Abstract Reasoning task |
| | PVR | - | Abstract Reasoning | paper | Abstract Reasoning task |
| | RAVEN | - | Abstract Reasoning | paper, project | Text version of a Visual Abstract Reasoning task |
| | Diagrammatic Logic | - | Abstract Reasoning | code | Extracted from OpenAI Evals |
| | Logic | - | Abstract Reasoning | code | Extracted from OpenAI Evals |
| | Logic Statements | - | Abstract Reasoning | code | Extracted from OpenAI Evals |
| | Pattern Identification | - | Abstract Reasoning | code | Extracted from OpenAI Evals |
| | String Patterns | - | Abstract Reasoning | code | Extracted from OpenAI Evals |
| | List Functions | - | Abstract Reasoning | code | Extracted from Google BIG-bench |

Acknowledgement

Our proposed new dataset, logiqa-logical-reasoning-plus, has been merged into OpenAI/Evals.

Citation

@article{bao2023assessing,
  title={Assessing and Enhancing the Robustness of Large Language Models with Task Structure Variations for Logical Reasoning},
  author={Bao, Qiming and Gendron, Gael and Peng, Alex Yuxuan and Zhong, Wanjun and Tan, Neset and Chen, Yang and Witbrock, Michael and Liu, Jiamou},
  journal={arXiv preprint arXiv:2310.09430},
  year={2023}
}
@inproceedings{10.24963/ijcai.2024/693,
  author = {Gendron, Ga\"{e}l and Bao, Qiming and Witbrock, Michael and Dobbie, Gillian},
  title = {Large language models are not strong abstract reasoners},
  year = {2024},
  isbn = {978-1-956792-04-1},
  url = {https://doi.org/10.24963/ijcai.2024/693},
  doi = {10.24963/ijcai.2024/693},
  abstract = {Large Language Models have shown tremendous performance on a large variety of natural language processing tasks, ranging from text comprehension to common sense reasoning. However, the mechanisms responsible for this success remain opaque, and it is unclear whether LLMs can achieve humanlike cognitive capabilities or whether these models are still fundamentally circumscribed. Abstract reasoning is a fundamental task for cognition, consisting of finding and applying a general pattern from few data. Evaluating deep neural architectures on this task could give insight into their potential limitations regarding reasoning and their broad generalisation abilities, yet this is currently an under-explored area. In this paper, we introduce a new benchmark for evaluating language models beyond memorisation on abstract reasoning tasks. We perform extensive evaluations of state-of-the-art LLMs, showing that they currently achieve very limited performance in contrast with other natural language tasks, even when applying techniques that have been shown to improve performance on other NLP tasks. We argue that guiding LLM generation to follow causal paths could help improve the generalisation and reasoning abilities of LLMs.},
  booktitle = {Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence},
  articleno = {693},
  numpages = {9},
  location = {Jeju, Korea},
  series = {IJCAI '24}
}
