The Open LLM Leaderboard by Hugging Face is widely used to evaluate large language models (LLMs). It covers several datasets, such as:
- ARC - AI2 Reasoning Challenge (7,787 grade-school science questions)
- HellaSwag (multiple choice questions on how to end a sentence)
- TruthfulQA (measures a model's propensity to reproduce falsehoods commonly found online)
- Winogrande (an adversarial Winograd-style fill-in-the-blank benchmark for commonsense reasoning)
- GSM8k (diverse grade school math word problems)
In this project, you will analyze which information LLMs correctly remember. The idea is to generate questions from a knowledge graph (KG) of your choice (e.g., Wikidata, DBpedia) and to check whether the models answer them correctly. When these questions are run against many models, one can see which models remember the most facts. More specifically, different kinds of questions can be generated from the KG, e.g.:
- do LLMs remember more information about persons than about places?
- do LLMs remember more information about cities in the US than in other countries?
- do LLMs remember more information about popular entities (according to page views reported by Wikipedia/Wikimedia) than about long-tail entities?
- do LLMs remember dates (e.g., birth date) better than locations (e.g., birthplace), or vice versa?
- can LLMs decide on types of instances or on subclass relations? Which one works better?
One possible approach is to generate multiple-choice questions (similar to HellaSwag). The difficulty level of these questions is something to be explored, as is the question of how plausible but wrong answers (distractors) can be generated. In a successful project, we can see which models remember factual information and how biased they are. Depending on the questions asked, such an evaluation dataset can also be used to check which models are suitable for ontology construction (see the last question above).
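For illustration, the following minimal sketch (assuming the SPARQLWrapper package and the public Wikidata SPARQL endpoint; the entity ID Q937 and property P569 are just examples, not part of the project setup) turns a single KG fact into a multiple-choice question with plausible nearby-year distractors:

```python
# Sketch: build a multiple-choice question about a birth year from Wikidata.
import random
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("https://query.wikidata.org/sparql")
endpoint.setReturnFormat(JSON)

def birth_year(entity_id: str) -> int:
    """Fetch the birth year (wdt:P569) of a Wikidata entity."""
    endpoint.setQuery(f"SELECT ?dob WHERE {{ wd:{entity_id} wdt:P569 ?dob . }} LIMIT 1")
    bindings = endpoint.query().convert()["results"]["bindings"]
    return int(bindings[0]["dob"]["value"][:4])  # e.g. '1879-03-14T00:00:00Z' -> 1879

def make_question(entity_label: str, entity_id: str) -> dict:
    """Turn one KG fact into a multiple-choice item with plausible distractors."""
    correct = birth_year(entity_id)
    # Distractors are nearby years, so wrong answers still make sense.
    choices = random.sample([correct + d for d in (-3, -2, -1, 1, 2, 3)], 3) + [correct]
    random.shuffle(choices)
    return {
        "question": f"In which year was {entity_label} born?",
        "choices": [str(c) for c in choices],
        "label": choices.index(correct),
    }

print(make_question("Albert Einstein", "Q937"))  # Q937 = Albert Einstein
```

Scaling this up means sampling many entities and properties from the KG and serializing the items (e.g., as JSONL) so they can be plugged into an evaluation framework.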
- RQ1: How well do different pretrained LLMs memorize different topics?
- RQ2: How well can pretrained LLMs decide on knowledge engineering tasks?
- RQ3: How can KGs be leveraged for the evaluation of pretrained LLMs?
- RQ4: How can we automatically create reference datasets from KGs for the evaluation of pretrained LLMs?
- One possible framework to use is the Language Model Evaluation Harness by EleutherAI. It is the framework used by the Open LLM Leaderboard.
- For creating a new task/dataset, the framework provides a useful introduction.
- All implemented tasks are available on GitHub.
- To find the dataset referenced in the dataset_path attribute of a task (e.g., in the hellaswag example), append the dataset name to the following URI:
https://huggingface.co/datasets/
(in the example this gives https://huggingface.co/datasets/hellaswag)
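If you want to inspect such a dataset directly, a minimal sketch using the Hugging Face datasets library looks as follows (the field names ctx and endings are those of hellaswag; depending on your datasets version, loading may additionally require trust_remote_code=True):

```python
# Load the dataset behind the dataset_path attribute of the hellaswag task config.
from datasets import load_dataset

hellaswag = load_dataset("hellaswag", split="validation")
print(hellaswag[0]["ctx"])       # the context to be completed
print(hellaswag[0]["endings"])   # the four candidate endings
```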
- All possible parameters for the eval config file are reported in the task guide
- To install the evaluation framework, run:
git clone https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .
(also explained on the eval framework homepage)
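- After installation, you can check the setup and list all available tasks (the --tasks list option is available in recent versions of the harness):
lm_eval --tasks list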
- Creating a task involves writing a config file (examples are provided as kg_g.yaml and kg_mc.yaml); a hedged sketch of what such a config can look like is shown below
- If the task configuration file (the YAML file) is not in the tasks folder, provide the folder containing all task files via the --include_path argument (see the task guide)
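The following is only a rough sketch of a multiple_choice task config, not the actual kg_mc.yaml of the project; the field names follow the task guide, while the task name, local JSON file, and column names (question, choices, label) are assumptions for illustration:

```yaml
task: kg_mc_demo                 # hypothetical task name
dataset_path: json               # use the generic JSON loader for a local file
dataset_kwargs:
  data_files:
    test: kg_questions.jsonl     # hypothetical file generated from the KG
test_split: test
output_type: multiple_choice
doc_to_text: "{{question}}"      # prompt shown to the model
doc_to_choice: choices           # list of answer strings in each record
doc_to_target: label             # integer index of the correct choice
metric_list:
  - metric: acc
```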
- A possible run command looks like the following (it will run the kg_g task for the gpt-3.5 model):
lm_eval --model openai-chat-completions --model_args model=gpt-3.5-turbo --include_path ./ --tasks kg_g --output_path ./results
- To get more detailed output, run the tool with
-w --verbosity DEBUG
- --log_samples can be used to get further detailed results
- For all options, execute
lm_eval --help
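- Note that the openai-chat-completions backend reads the API key from the OPENAI_API_KEY environment variable, so set it before running (placeholder value shown):
export OPENAI_API_KEY=<your-api-key>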
- The tasks can have different output types (see the scoring details): generate_until, loglikelihood, loglikelihood_rolling, and multiple_choice. Thus, e.g., the Polish PPC dataset has two variants (ppc_mc for the multiple_choice version and polish_ppc_regex for the generate_until version).
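To illustrate how the same KG-derived data could be exposed under a second output type, here is a similarly hedged sketch of a generate_until variant (again not the actual kg_g.yaml; the answer column and generation settings are assumptions):

```yaml
task: kg_g_demo                  # hypothetical task name
dataset_path: json
dataset_kwargs:
  data_files:
    test: kg_questions.jsonl     # same hypothetical file as above
test_split: test
output_type: generate_until
doc_to_text: "{{question}}"
doc_to_target: "{{answer}}"      # assumed free-text gold answer column
generation_kwargs:
  until:
    - "\n"                       # stop generating at the first newline
metric_list:
  - metric: exact_match
```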