The creative psychometric item generator: a framework for item generation and validation using large language models
Increasingly, large language models (LLMs) are being used to automate workplace processes requiring a high degree of creativity. While much prior work has examined the creativity of LLMs, there has been little research on whether they can generate valid creativity assessments for humans despite the increasingly central role of creativity in modern economies. We develop a psychometrically inspired framework for creating test items (questions) for a classic free-response creativity test: the creative problem solving (CPS) task. Our framework, the creative psychometric item generator (CPIG), uses a mixture of LLM-based item generators and evaluators to iteratively develop new prompts for writing CPS items, such that items from later iterations will elicit more creative responses from test takers. We find strong empirical evidence that CPIG generates valid and reliable items and that this effect is not attributable to known biases in the evaluation process. Our findings have implications for employing LLMs to automatically generate valid and reliable creativity tests for humans and AI.
This repository contains code and data for reproducing the results from our CREAI paper. All items and item responses from our experiments can be found here.
- Python 3.x; we recommend the latest version from Anaconda.
- pandas
- numpy
- transformers
- langchain
- openai
- anthropic
- accelerate
- bitsandbytes
- pytorch
- nltk
- readability
- tqdm
- peft
- evaluate
- scipy
You can install these into your Anaconda environment with:

```
pip install -r requirements.txt
```
The repository contains the following scripts:

- `RunExperiment.py`, the main driver script for running trials; run this script to reproduce our results.
- `ItemGeneration.py`, code for generating CPS items.
- `GenerateCPSResponses.py`, code for generating CPS item responses.
- `RLPS_RoBERTa.py`, code for finetuning the originality model. Note that this is slightly modified from the original to include `peft`.
- `SelectItemGenShots.py`, code for the shot selection methods.
- `ItemEvaluation.py`, an experimental script for generating LLM evaluations of LLM items, not used in the final submission.
- `Prompts.py`, all prompts used, stored as lists that may be added onto (see the sketch after this list).
- `config.py`, defines the parameters used for an experiment, discussed in detail below.
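As one example of extending the prompts, the following sketch assumes the prompts in `Prompts.py` are stored as plain Python lists; the list name `item_gen_prompts` is our own placeholder, not necessarily the name used in the repository:

```python
# Prompts.py (sketch; the actual list names in the repository may differ)
item_gen_prompts = [
    # ... existing item generation prompts ...
]

# New prompts are appended to the list; the position of a prompt in the
# list is what itemGenPromptIdx in config.py refers to.
item_gen_prompts.append(
    "Write a short workplace scenario that has no single obvious solution."
)
```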
`config.py` defines the following parameters; a sketch of a complete `config.py` follows the list:

- `random_seed`: the seed for numpy / pytorch / huggingface. This is ignored for Claude.
- `numIter`: how many iterations to run.
- `itemGenModelName`: the name of the item generator. Use either a huggingface model id or `claude-3` for Claude-haiku.
- `itemResponseGenModelName`: the name of the item response generator. Use either a huggingface model id or `claude-3` for Claude-haiku.
- `itemGenPromptIdx`: the index of the prompt for item generation in `Prompts.py`.
- `itemResponseGenPromptIdx`: the index of the prompt for item response generation in `Prompts.py`.
- `itemGenMaxTokens`: the token cap for item generation.
- `itemResponseGenMaxTokens`: the token cap for item response generation.
- `demographicsFile`: the file from which to fetch demographic / psychometric data for the demographic / psychometric prompts. We store the csvs we used under `./creativity-item-generation/optimize_item_gen_prompt/data/`. Use `PsychometricData.csv` for psychometric prompts and `DemographicData.csv` for demographic prompts. Set to `None` for the no-context prompt.
- `itemGenOutputFile`, `itemResponseGenOutputFile`: where to save generated items and item responses, stored in separate json files. Ideally these should point to the same folder, but this is not a requirement.
- `numItemGenerationAttempts`: how many times to retry generation if it fails a quality control check.
- `itemResponseOriginalityModelDir`: the directory of the pre-trained originality scoring model. We are unable to provide the pre-trained weights, but the authors of the cited work can be contacted to obtain them.
- `itemGenNumShots`: the size of `k`, the number of exemplars to use.
- `shotSelectionAlgorithm`: the algorithm used for shot selection, one of `random`, `greedy`, or `constraint satisfaction`.
- `numResponsesPerItem`: how many LLM responses to generate per item.
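Putting these together, a `config.py` might look like the following minimal sketch. All values are illustrative placeholders, not the settings used in our experiments:

```python
# config.py (illustrative values only; see the parameter descriptions above)
random_seed = 42                        # ignored for Claude
numIter = 10                            # number of iterations to run

itemGenModelName = "claude-3"           # huggingface model id, or claude-3
itemResponseGenModelName = "claude-3"

itemGenPromptIdx = 0                    # index into the prompt lists in Prompts.py
itemResponseGenPromptIdx = 0

itemGenMaxTokens = 1024
itemResponseGenMaxTokens = 512

# Path to PsychometricData.csv or DemographicData.csv, or None for the
# no-context prompt
demographicsFile = None

itemGenOutputFile = "./results/items.json"
itemResponseGenOutputFile = "./results/item_responses.json"

numItemGenerationAttempts = 3           # retries after a failed quality control check
itemResponseOriginalityModelDir = "./originality_model/"  # local weights required
itemGenNumShots = 3                     # k, the number of exemplars
shotSelectionAlgorithm = "greedy"       # random, greedy, or constraint satisfaction
numResponsesPerItem = 30
```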
To reproduce our results:

- Install the listed packages into a Python environment.
- Obtain a Claude API key. The driver expects a `key.py` in the same directory with the key stored as a string named `ANTHROPIC_KEY`. You may additionally obtain an OpenAI key and store it under `OPENAI_KEY` (see the sketch after this list).
- Set the hyperparameters in `config.py`, and create the directory where results will be stored.
- Run `RunExperiment.py`.
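Since the driver only expects the keys as module-level strings, `key.py` can be as simple as the following (the values shown are placeholders, not real keys):

```python
# key.py (placeholders; substitute your own API keys)
ANTHROPIC_KEY = "sk-ant-..."  # required for Claude models
OPENAI_KEY = "sk-..."         # optional, only needed for OpenAI models
```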
Please refer to the authors of the originality scorer (referenced in the paper) to access the weights of the scoring model. Note that you must have the weights saved locally for originality scoring.
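Once you have the weights, save them to a local directory and point `itemResponseOriginalityModelDir` at it. Assuming the scorer is a standard fine-tuned RoBERTa checkpoint (as `RLPS_RoBERTa.py` suggests), loading it would follow the usual `transformers` pattern; this is a sketch, not the exact loading code used in the repository:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Directory set as itemResponseOriginalityModelDir in config.py;
# the weights must already be saved here locally.
model_dir = "./originality_model/"

tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSequenceClassification.from_pretrained(model_dir)
```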
If you find this work helpful, please cite us:
```
@inproceedings{laverghetta_luchini_linell_reiter-palmon_beaty_2024,
  title={The creative psychometric item generator: a framework for item generation and validation using large language models},
  booktitle={CREAI 2024: International Workshop on Artificial Intelligence and Creativity},
  publisher={CEUR-WS},
  author={Laverghetta, Antonio and Luchini, Simone and Linell, Averie and Reiter-Palmon, Roni and Beaty, Roger},
  year={2024}
}
```