The Open LLM Leaderboard by Hugging Face is widely used to evaluate large language models (LLMs). It covers several datasets, such as:
- ARC - AI2 Reasoning Challenge (7,787 grade-school science questions)
- HellaSwag (multiple choice questions on how to end a sentence)
- TruthfulQA (measures a model's propensity to reproduce falsehoods commonly found online)
- Winogrande (an adversarial Winograd-style fill-in-the-blank benchmark for commonsense reasoning)
- GSM8k (diverse grade school math word problems)
In this project, you will analyze which information LLMs correctly remember. The idea is to generate questions from a knowledge graph (KG) of your choice (e.g., Wikidata, DBpedia) and to check whether the models answer them correctly. When these questions are run against many models, one can see which models remember the most facts. More specifically, different kinds of questions can be generated from the KG, e.g.:
- do LLMs remember more information about persons than about places?
- do LLMs remember more information about cities in the US than in other countries?
- do LLMs remember more information about popular entities (according to page views reported by Wikipedia/Wikimedia) than about long-tail entities?
- do LLMs remember dates (e.g., birth date) better than locations (e.g., birthplace), or vice versa?
- can LLMs decide on types of instances or on subclass relations? Which one works better?
One possible approach is to generate multiple-choice questions (similar to HellaSwag). The difficulty level of these questions is something to be explored, as is the question of how plausible but wrong answers (distractors) can be generated. In a successful project, we can see which models remember factual information and how biased they are. Depending on the questions asked, such an evaluation dataset can also be used to check which models are suitable for ontology construction (see the last question above).
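For illustration, the following minimal sketch (assuming the SPARQLWrapper package and the public Wikidata SPARQL endpoint; the entity ID Q937 and property P569 are just examples, not part of the project setup) turns a single KG fact into a multiple-choice question with plausible nearby-year distractors:

```python
# Sketch: build a multiple-choice question about a birth year from Wikidata.
import random
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("https://query.wikidata.org/sparql")
endpoint.setReturnFormat(JSON)

def birth_year(entity_id: str) -> int:
    """Fetch the birth year (wdt:P569) of a Wikidata entity."""
    endpoint.setQuery(f"SELECT ?dob WHERE {{ wd:{entity_id} wdt:P569 ?dob . }} LIMIT 1")
    bindings = endpoint.query().convert()["results"]["bindings"]
    return int(bindings[0]["dob"]["value"][:4])  # e.g. '1879-03-14T00:00:00Z' -> 1879

def make_question(entity_label: str, entity_id: str) -> dict:
    """Turn one KG fact into a multiple-choice item with plausible distractors."""
    correct = birth_year(entity_id)
    # Distractors are nearby years, so wrong answers still make sense.
    choices = random.sample([correct + d for d in (-3, -2, -1, 1, 2, 3)], 3) + [correct]
    random.shuffle(choices)
    return {
        "question": f"In which year was {entity_label} born?",
        "choices": [str(c) for c in choices],
        "label": choices.index(correct),
    }

print(make_question("Albert Einstein", "Q937"))  # Q937 = Albert Einstein
```

Scaling this up means sampling many entities and properties from the KG and serializing the items (e.g., as JSONL) so they can be plugged into an evaluation framework.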
- RQ1: How well do different pretrained LLMs memorize different topics?
- RQ2: How well can pretrained LLMs decide on knowledge engineering tasks?
- RQ3: How can KGs be leveraged for the evaluation of pretrained LLMs?
- RQ4: How can we automatically create reference datasets from KGs for the evaluation of pretrained LLMs?
- One possible framework to use is the Language Model Evaluation Harness by EleutherAI. It is the framework used by the Open LLM Leaderboard.
- For creating a new task/dataset, the framework provides a useful introduction.
- All implemented tasks are available on GitHub.
- To find the dataset referenced in the dataset_path attribute of a task (e.g., in the hellaswag example), append the dataset name to the following URI:
https://huggingface.co/datasets/
(in the example this gives https://huggingface.co/datasets/hellaswag)
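If you want to inspect such a dataset directly, a minimal sketch using the Hugging Face datasets library looks as follows (the field names ctx and endings are those of hellaswag; depending on your datasets version, loading may additionally require trust_remote_code=True):

```python
# Load the dataset behind the dataset_path attribute of the hellaswag task config.
from datasets import load_dataset

hellaswag = load_dataset("hellaswag", split="validation")
print(hellaswag[0]["ctx"])       # the context to be completed
print(hellaswag[0]["endings"])   # the four candidate endings
```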
- All possible parameters for the eval config file are reported in the task guide
- To install the evaluation framework, run:
git clone https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .
(also explained on the eval framework homepage)
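- After installation, you can check the setup and list all available tasks (the --tasks list option is available in recent versions of the harness):
lm_eval --tasks list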
- Creating a task involves writing a config file (examples are provided as kg_g.yaml and kg_mc.yaml); a hedged sketch of what such a config can look like is shown below
- If the task configuration file (the YAML file) is not in the tasks folder, provide the folder containing all task files via the --include_path argument (see the task guide)
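The following is only a rough sketch of a multiple_choice task config, not the actual kg_mc.yaml of the project; the field names follow the task guide, while the task name, local JSON file, and column names (question, choices, label) are assumptions for illustration:

```yaml
task: kg_mc_demo                 # hypothetical task name
dataset_path: json               # use the generic JSON loader for a local file
dataset_kwargs:
  data_files:
    test: kg_questions.jsonl     # hypothetical file generated from the KG
test_split: test
output_type: multiple_choice
doc_to_text: "{{question}}"      # prompt shown to the model
doc_to_choice: choices           # list of answer strings in each record
doc_to_target: label             # integer index of the correct choice
metric_list:
  - metric: acc
```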
- A possible run command looks like the following (it will run the kg_g task for the gpt-3.5 model):
lm_eval --model openai-chat-completions --model_args model=gpt-3.5-turbo --include_path ./ --tasks kg_g --output_path ./results
- To get more detailed output, run the tool with
-w --verbosity DEBUG
- --log_samples can be used to get further detailed results
- For all options, execute
lm_eval --help
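- Note that the openai-chat-completions backend reads the API key from the OPENAI_API_KEY environment variable, so set it before running (placeholder value shown):
export OPENAI_API_KEY=<your-api-key>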
- The tasks can have different output types (see the scoring details): generate_until, loglikelihood, loglikelihood_rolling, and multiple_choice. Thus, e.g., the Polish PPC dataset has two variants (ppc_mc for the multiple_choice version and polish_ppc_regex for the generate_until version).
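To illustrate how the same KG-derived data could be exposed under a second output type, here is a similarly hedged sketch of a generate_until variant (again not the actual kg_g.yaml; the answer column and generation settings are assumptions):

```yaml
task: kg_g_demo                  # hypothetical task name
dataset_path: json
dataset_kwargs:
  data_files:
    test: kg_questions.jsonl     # same hypothetical file as above
test_split: test
output_type: generate_until
doc_to_text: "{{question}}"
doc_to_target: "{{answer}}"      # assumed free-text gold answer column
generation_kwargs:
  until:
    - "\n"                       # stop generating at the first newline
metric_list:
  - metric: exact_match
```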