# Evaluating Creativity in Human and Large Language Model Narratives
This repository contains all code and instructions for the experiments carried out by Roberto Passaro, Marta Pavanati, Clotilde Frapiccini, and Anuoluwapo Aremu for an MSc course at CIMeC, University of Trento.
We compare 21 human‑written short stories against 21 GPT‑4.1 continuations (×7 temperatures) using four automated creativity metrics:
- Novelty
- Surprise
- Lexical Diversity
- Semantic Diversity
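As an illustration of the simplest of these metrics, here is a minimal lexical-diversity sketch using a plain type-token ratio over whitespace tokens; the notebook's actual implementation (which lemmatizes and filters with spaCy) may differ:

```python
# Minimal lexical-diversity sketch: type-token ratio (TTR).
# Illustrative only; the notebook may use lemmatized tokens and a
# length-corrected variant of this measure.

def type_token_ratio(tokens: list[str]) -> float:
    """Ratio of unique tokens (types) to total tokens."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

story = "the fox jumped over the lazy dog and the quick fox ran".split()
print(type_token_ratio(story))  # → 0.75 (9 unique types / 12 tokens)
```

Because raw TTR falls as texts get longer, comparisons are only meaningful between texts of similar length, which is one reason corrected variants exist.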
Creativity has long been regarded as an exclusively human capacity, but the emergence of large language models (LLMs) raises the question of whether these models can reach human-comparable levels of creativity in narrative generation. In this study, we compared 21 short stories generated by GPT-4.1 with 21 human-written counterparts, measuring creativity with four automatic metrics: novelty, surprise, lexical diversity, and semantic diversity. We also investigated the impact of temperature, a randomness-control hyperparameter often associated with creative variation, on narrative creativity. We observed that novelty exhibits temperature-dependent trends, occasionally approaching human levels, and that GPT-4.1 consistently outperforms human authors in lexical diversity and surprise, while semantic diversity is consistently influenced by temperature and remains lower than that of humans. Building on these findings, we conclude by discussing the factors that drive the observed differences in narrative creation between humans and LLMs.
The experiments can be run directly in the provided Colab notebook without local setup:
- Open `notebooks/Tales_of_Two_Minds.ipynb` in Google Colab.
- Upload a ZIP file containing your 21 human-written `.txt` stories, named `story_01.txt` through `story_21.txt`.
- Enter your OpenAI API key when prompted.
- Run all cells.
All dependencies (Python libraries and the spaCy model) are installed automatically within the Colab environment.
The code expects 21 human-written stories in `human_texts/` as UTF-8 `.txt` files named `story_01.txt`, `story_02.txt`, …, `story_21.txt`. We have not committed the full set here. To rerun the experiment and obtain the human dataset, please email <roberto.passaro@studenti.unitn.it>.
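For local runs, that layout can be read with a short helper. `load_human_stories` below is a hypothetical illustration of the expected file naming, not code from the repository:

```python
from pathlib import Path

def load_human_stories(directory: str) -> list[str]:
    """Read story_01.txt through story_21.txt (UTF-8) in order.

    Raises FileNotFoundError if any of the 21 expected files is missing,
    so a malformed dataset fails loudly before metrics are computed.
    """
    stories = []
    for i in range(1, 22):
        path = Path(directory) / f"story_{i:02d}.txt"
        stories.append(path.read_text(encoding="utf-8"))
    return stories
```

The zero-padded `{i:02d}` matches the `story_01.txt` … `story_21.txt` convention and keeps the files in the same order as a simple alphabetical listing.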
Set your OpenAI key and run:

```python
OPENAI_API_KEY = "your key"
```
This will:
- Generate 21 GPT-4.1 continuations at each of seven temperature settings.
- Preprocess all texts (lemmatization, filtering).
- Compute novelty, surprise, lexical diversity, and semantic diversity, and save `scores.csv`.
- Produce summary stats and plots in `statplots/`.
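The generation step above can be sketched as a loop over stories and temperatures. The seven temperature values, the prompt wording, and the helper names below are assumptions for illustration; the notebook defines the actual ones:

```python
# Sketch of the generation loop: one GPT-4.1 continuation per story at each
# of seven temperatures. TEMPERATURES and the prompt are assumed values,
# not the notebook's actual settings.

TEMPERATURES = [0.0, 0.25, 0.5, 0.75, 1.0, 1.25, 1.5]  # assumed grid

def build_request(story_opening: str, temperature: float) -> dict:
    """Assemble chat-completion parameters for one continuation."""
    return {
        "model": "gpt-4.1",
        "temperature": temperature,
        "messages": [
            {"role": "user",
             "content": f"Continue this short story:\n\n{story_opening}"},
        ],
    }

def generate_continuations(openings: list[str]) -> dict[float, list[str]]:
    """Call the API once per (story, temperature) pair.

    Requires `pip install openai` and OPENAI_API_KEY in the environment.
    """
    from openai import OpenAI  # imported lazily: only needed for real calls
    client = OpenAI()
    out = {}
    for t in TEMPERATURES:
        out[t] = [
            client.chat.completions.create(**build_request(s, t))
            .choices[0].message.content
            for s in openings
        ]
    return out
```

Keeping the request construction separate from the API call makes the temperature sweep easy to inspect and test without spending tokens.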
- `scores.csv`: per-story metrics
- `lexical_diversity_summary.csv`: mean ± std by source & temperature
- `statplots/`: boxplots, QQ-plots, correlation heatmaps, p-value tables
- Fork the repo
- Create a feature branch
- Open a Pull Request
This project is released under the MIT License.
Roberto Passaro – <roberto.passaro@studenti.unitn.it>