The creative psychometric item generator: a framework for item generation and validation using large language models
Increasingly, large language models (LLMs) are being used to automate workplace processes requiring a high degree of creativity. While much prior work has examined the creativity of LLMs, there has been little research on whether they can generate valid creativity assessments for humans despite the increasingly central role of creativity in modern economies. We develop a psychometrically inspired framework for creating test items (questions) for a classic free-response creativity test: the creative problem solving (CPS) task. Our framework, the creative psychometric item generator (CPIG), uses a mixture of LLM-based item generators and evaluators to iteratively develop new prompts for writing CPS items, such that items from later iterations will elicit more creative responses from test takers. We find strong empirical evidence that CPIG generates valid and reliable items and that this effect is not attributable to known biases in the evaluation process. Our findings have implications for employing LLMs to automatically generate valid and reliable creativity tests for humans and AI.
This repository contains code and data for reproducing the results from our CREAI paper. All items and item responses from our experiments can be found here.
- Python 3.x; we recommend the latest version from Anaconda.
- pandas
- numpy
- transformers
- langchain
- openai
- anthropic
- accelerate
- bitsandbytes
- pytorch
- nltk
- readability
- tqdm
- peft
- evaluate
- scipy
You can install these into your Anaconda environment with:

```
pip install -r requirements.txt
```
The repository contains the following scripts:

- `RunExperiment.py`, the main driver script for running trials; run this script to reproduce our results.
- `ItemGeneration.py`, code for generating CPS items.
- `GenerateCPSResponses.py`, code for generating CPS item responses.
- `RLPS_RoBERTa.py`, code for finetuning the originality model. Note that this is slightly modified from the original to include `peft`.
- `SelectItemGenShots.py`, code for the shot selection methods.
- `ItemEvaluation.py`, an experimental script for generating LLM evaluations of LLM items, not used in the final submission.
- `Prompts.py`, all prompts used, stored as lists that may be added onto (see the sketch after this list).
- `config.py`, defines the parameters used for an experiment, discussed in detail below.
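As one example of extending the prompts, the following sketch assumes the prompts in `Prompts.py` are stored as plain Python lists; the list name `item_gen_prompts` is our own placeholder, not necessarily the name used in the repository:

```python
# Prompts.py (sketch; the actual list names in the repository may differ)
item_gen_prompts = [
    # ... existing item generation prompts ...
]

# New prompts are appended to the list; the position of a prompt in the
# list is what itemGenPromptIdx in config.py refers to.
item_gen_prompts.append(
    "Write a short workplace scenario that has no single obvious solution."
)
```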
`config.py` defines the following parameters; a sketch of a complete `config.py` follows the list:

- `random_seed`: the seed for numpy / pytorch / huggingface. This is ignored for Claude.
- `numIter`: how many iterations to run.
- `itemGenModelName`: the name of the item generator. Use either a huggingface model id or `claude-3` for Claude-haiku.
- `itemResponseGenModelName`: the name of the item response generator. Use either a huggingface model id or `claude-3` for Claude-haiku.
- `itemGenPromptIdx`: the index of the prompt for item generation in `Prompts.py`.
- `itemResponseGenPromptIdx`: the index of the prompt for item response generation in `Prompts.py`.
- `itemGenMaxTokens`: the token cap for item generation.
- `itemResponseGenMaxTokens`: the token cap for item response generation.
- `demographicsFile`: the file from which to fetch demographic / psychometric data for the demographic / psychometric prompts. We store the csvs we used under `./creativity-item-generation/optimize_item_gen_prompt/data/`. Use `PsychometricData.csv` for psychometric prompts and `DemographicData.csv` for demographic prompts. Set to `None` for the no-context prompt.
- `itemGenOutputFile`, `itemResponseGenOutputFile`: where to save generated items and item responses, stored in separate json files. Ideally these should point to the same folder, but this is not a requirement.
- `numItemGenerationAttempts`: how many times to retry generation if it fails a quality control check.
- `itemResponseOriginalityModelDir`: the directory of the pre-trained originality scoring model. We are unable to provide the pre-trained weights, but the authors of the cited work can be contacted to obtain them.
- `itemGenNumShots`: the size of `k`, the number of exemplars to use.
- `shotSelectionAlgorithm`: the algorithm used for shot selection, one of `random`, `greedy`, or `constraint satisfaction`.
- `numResponsesPerItem`: how many LLM responses to generate per item.
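Putting these together, a `config.py` might look like the following minimal sketch. All values are illustrative placeholders, not the settings used in our experiments:

```python
# config.py (illustrative values only; see the parameter descriptions above)
random_seed = 42                        # ignored for Claude
numIter = 10                            # number of iterations to run

itemGenModelName = "claude-3"           # huggingface model id, or claude-3
itemResponseGenModelName = "claude-3"

itemGenPromptIdx = 0                    # index into the prompt lists in Prompts.py
itemResponseGenPromptIdx = 0

itemGenMaxTokens = 1024
itemResponseGenMaxTokens = 512

# Path to PsychometricData.csv or DemographicData.csv, or None for the
# no-context prompt
demographicsFile = None

itemGenOutputFile = "./results/items.json"
itemResponseGenOutputFile = "./results/item_responses.json"

numItemGenerationAttempts = 3           # retries after a failed quality control check
itemResponseOriginalityModelDir = "./originality_model/"  # local weights required
itemGenNumShots = 3                     # k, the number of exemplars
shotSelectionAlgorithm = "greedy"       # random, greedy, or constraint satisfaction
numResponsesPerItem = 30
```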
To reproduce our results:

- Install the listed packages into a Python environment.
- Obtain a Claude API key. The driver expects a `key.py` in the same directory with the key stored as a string named `ANTHROPIC_KEY`. You may additionally obtain an OpenAI key and store it under `OPENAI_KEY` (see the sketch after this list).
- Set the hyperparameters in `config.py`, and create the directory where results will be stored.
- Run `RunExperiment.py`.
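Since the driver only expects the keys as module-level strings, `key.py` can be as simple as the following (the values shown are placeholders, not real keys):

```python
# key.py (placeholders; substitute your own API keys)
ANTHROPIC_KEY = "sk-ant-..."  # required for Claude models
OPENAI_KEY = "sk-..."         # optional, only needed for OpenAI models
```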
Please refer to the authors of the originality scorer (referenced in the paper) to access the weights of the scoring model. Note that you must have the weights saved locally for originality scoring.
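Once you have the weights, save them to a local directory and point `itemResponseOriginalityModelDir` at it. Assuming the scorer is a standard fine-tuned RoBERTa checkpoint (as `RLPS_RoBERTa.py` suggests), loading it would follow the usual `transformers` pattern; this is a sketch, not the exact loading code used in the repository:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Directory set as itemResponseOriginalityModelDir in config.py;
# the weights must already be saved here locally.
model_dir = "./originality_model/"

tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSequenceClassification.from_pretrained(model_dir)
```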
If you find this work helpful, please cite us:
```
@inproceedings{laverghetta_luchini_linell_reiter-palmon_beaty_2024,
  title={The creative psychometric item generator: a framework for item generation and validation using large language models},
  booktitle={CREAI 2024: International Workshop on Artificial Intelligence and Creativity},
  publisher={CEUR-WS},
  author={Laverghetta, Antonio and Luchini, Simone and Linell, Averie and Reiter-Palmon, Roni and Beaty, Roger},
  year={2024}
}
```