A comprehensive benchmark for evaluating language models' abilities in creative writing, planning, and narrative construction. This benchmark tests models on their capacity to brainstorm, plan, revise, and write complete short stories/novellas from minimal prompts.
This codebase can be used to reproduce results on: https://eqbench.com/creative_writing_longform.html
The benchmark evaluates several key abilities:
- Brainstorming & Planning: Creating a coherent story plan from a minimal prompt
- Critical Reflection: Reviewing and revising the initial plan
- Character Development: Creating detailed character profiles
- Long-form Writing: Producing a complete novella across 8 chapters (~1000 words each)
- Narrative Consistency: Maintaining plot coherence and character consistency throughout
Models on the EQ-Bench leaderboard are evaluated with Claude Sonnet 4 as a judge, although you can use whichever judge you wish. The judge scores outputs across multiple criteria including creativity, coherence, character development, and prose quality.
```bash
# 1. Install deps (prefer venv)
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt

# 2. Copy environment template and fill in keys
cp .env.example .env
$EDITOR .env   # set TEST_API_KEY & JUDGE_API_KEY, or OPENAI_API_KEY

# 3. Run one prompt for one iteration across 12 threads
python3 longform_writing_bench.py \
  --test-model "google/gemini-2.0-flash-001" \
  --judge-model "anthropic/claude-sonnet-4" \
  --runs-file "results/longform_bench_runs.json" \
  --run-id "demo" \
  --threads 12 \
  --iterations 1
```
Tip: The sample above assumes OpenRouter endpoints (identical payload shape to OpenAI). If you point `TEST_API_URL` or `JUDGE_API_URL` elsewhere, adjust the headers in `utils/api.py` if needed.
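For orientation, a single test-model call boils down to an OpenAI-style chat-completions request. A minimal sketch follows; the real wrapper in `utils/api.py` adds the retry/timeout behaviour configured below, and treating `TEST_API_URL` as a base URL (rather than a full endpoint path) is an assumption here:

```python
import os
import requests

# Illustrative OpenAI-compatible chat-completions call; utils/api.py
# layers retries and timeouts on top of this pattern.
url = os.environ["TEST_API_URL"].rstrip("/") + "/chat/completions"  # assumes a base URL
resp = requests.post(
    url,
    headers={"Authorization": f"Bearer {os.environ['TEST_API_KEY']}"},
    json={
        "model": "google/gemini-2.0-flash-001",
        "messages": [{"role": "user", "content": "Outline a story premise in one line."}],
        "temperature": 0.7,
    },
    timeout=300,  # mirrors the REQUEST_TIMEOUT default below
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```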
| key | default | purpose |
|---|---|---|
| `TEST_API_URL` / `TEST_API_KEY` | – | endpoint & key for the model under test |
| `JUDGE_API_URL` / `JUDGE_API_KEY` | – | endpoint & key for the judge model |
| `MAX_RETRIES` | 5 | per-request retry limit |
| `RETRY_DELAY` | 5 s | base delay between retries (doubles on 429) |
| `REQUEST_TIMEOUT` | 300 s | hard timeout for any HTTP request |
| `LOG_VERBOSITY` | INFO | fallback log level (CLI `--verbosity` wins) |
Only the endpoint and key variables are strictly required; the other four fall back to the defaults shown.
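For reference, a minimal `.env` pointing both models at OpenRouter might look like this (keys are placeholders, and whether `utils/api.py` expects the URL with or without the request path is an assumption):

```bash
TEST_API_URL=https://openrouter.ai/api/v1
TEST_API_KEY=sk-or-...
JUDGE_API_URL=https://openrouter.ai/api/v1
JUDGE_API_KEY=sk-or-...

# Optional – defaults shown
MAX_RETRIES=5
RETRY_DELAY=5
REQUEST_TIMEOUT=300
LOG_VERBOSITY=INFO
```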
A full run proceeds in five phases:

1. **Initialization**
   - generates a unique `run_key` (`<run-id>_<sanitized-model-name>`)
   - writes a skeleton entry in `results/longform_bench_runs.json`
2. **Generation (13 steps)**
   - Prompts 1-5: plan and character profiles
   - Prompts 6-13: eight ~1000-word chapters
   - uses `temperature=0.7`, `min_p=0.1` by default
   - saves after every `--save-interval` steps (default 1)
3. **Chapter judging**
   - for each chapter, Claude (or your chosen judge) responds with a rubric and scores (0-20 scale)
4. **Final judging**
   - the judge sees the full story once (configurable via `NUM_FINAL_JUDGMENTS`) and outputs another score block
5. **Scoring & bootstrap**
   - weights chapter scores equally; the final piece can have its own weight (`FINAL_SCORE_WEIGHT`)
   - calculates mean ±95% CI from 500 bootstrap resamples (a minimal sketch follows this list)
   - appends the result under `runs.<run_key>.results.benchmark_results`
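The confidence interval is a plain percentile bootstrap over the collected scores. A minimal sketch of the idea (function name and details are illustrative, not the exact `core/scoring.py` code):

```python
import random
import statistics

def bootstrap_ci(scores, n_resamples=500, seed=0):
    """Mean and 95% CI via percentile bootstrap: resample with
    replacement, collect the resample means, read off the
    2.5th/97.5th percentiles."""
    rng = random.Random(seed)
    means = sorted(
        statistics.mean(rng.choices(scores, k=len(scores)))
        for _ in range(n_resamples)
    )
    lo = means[int(0.025 * n_resamples)]
    hi = means[int(0.975 * n_resamples)]
    return statistics.mean(scores), lo, hi

mean, lo, hi = bootstrap_ci([14.2, 15.1, 13.8, 16.0, 14.9, 15.5, 14.0, 15.2])
print(f"{mean:.2f} (95% CI {lo:.2f}-{hi:.2f})")
```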
All I/O uses atomic writes with per-file locks (`utils/file_io.py`), so parallel threads and crashes won’t corrupt logs.
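Concretely, the safe-write pattern is write-to-temp-then-rename behind a per-path lock, roughly like this (helper name and lock bookkeeping are illustrative rather than the actual `utils/file_io.py` API):

```python
import json
import os
import tempfile
import threading
from collections import defaultdict

_locks = defaultdict(threading.Lock)  # one lock per file path

def atomic_write_json(path, data):
    """Write to a temp file in the same directory, then os.replace()
    it over the target. os.replace is atomic on POSIX, so readers
    never see a half-written file even if the process crashes."""
    with _locks[path]:
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
        try:
            with os.fdopen(fd, "w") as f:
                json.dump(data, f, indent=2)
            os.replace(tmp, path)
        except BaseException:
            os.unlink(tmp)  # clean up the temp file on failure
            raise
```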
To resume a killed run, just re-run the same command. Finished steps are skipped automatically. To re-judge with a newer rubric or judge model:
```bash
python3 longform_writing_bench.py \
  --test-model "openai/gpt-4o" \
  --judge-model "anthropic/claude-sonnet-4" \
  --runs-file "results/longform_bench_runs.json" \
  --run-id "demo" \
  --skip-generation \
  --redo-judging
```
```text
.
├─ longform_writing_bench.py   # CLI entry point
├─ core/
│  ├─ benchmark.py             # orchestration logic
│  ├─ conversation.py          # generation & judging task object
│  ├─ scoring.py               # parsing, weighting, bootstrap
│  └─ metrics.py               # auxiliary text metrics (slop, repetition, complexity)
├─ utils/
│  ├─ api.py                   # thin wrapper over HTTP calls with retry logic
│  ├─ file_io.py               # atomic JSON read/write helpers
│  └─ logging_setup.py
├─ data/                       # prompt templates, criteria, slop lists, etc.
└─ results/                    # run logs (created at runtime)
```
You can customise a run in several ways:

- **Number of chapters** – set `NUM_CHAPTERS` in your environment before running.
- **Prompt templates** – edit `data/prompt*.txt`. If you change the count, adjust `NUM_PLANNING_STEPS`.
- **Criteria weights** – see `data/criteria_weights.json`; unknown keys default to 1× weight.
- **Negative metrics** – any criterion in `data/longform_negative_criteria_*.txt` is automatically inverted (see the sketch after this list).
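To illustrate how the last two knobs interact, here is a sketch; the dict shapes and the 0-20 scale are assumptions based on the rubric above, and the real parsing lives in `core/scoring.py`:

```python
# Hypothetical shapes mirroring data/criteria_weights.json and the
# negative-criteria lists (assumed, not the actual file contents).
weights = {"creativity": 1.5, "coherence": 1.0}
negative = {"purple prose"}  # names drawn from longform_negative_criteria_*.txt

def weighted(criterion: str, raw: float, scale_max: float = 20.0) -> float:
    # Negative criteria are inverted so a high "flaw" score lowers the result.
    value = scale_max - raw if criterion in negative else raw
    return value * weights.get(criterion, 1.0)  # unknown keys default to 1x

print(weighted("creativity", 16))    # 24.0
print(weighted("purple prose", 16))  # 4.0
```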
Python 3.9+ recommended. All runtime deps are in `requirements.txt`.
After install, run once:
```python
import nltk
nltk.download("punkt")
nltk.download("cmudict")
nltk.download("stopwords")
```
Those resources (tokeniser models, the CMU pronouncing dictionary, and stopword lists) are needed for `core/metrics.py`. If you skip metrics, the main pipeline still works.
Common issues and fixes:

- **“No tasks were successfully scored.”** Check judge responses in `results/…/longform_tasks.*.final_raw_judge_texts` – scores may be missing or mis-parsed.
- **Rate-limited (429).** The runner backs off exponentially. Increase `RETRY_DELAY` or request a higher quota.
- **Long story cuts off early.** Bump `max_tokens` inside `LongformCreativeTask.run_generation_sequence` for the chapter steps.
- **Threads overwhelm your API provider.** Lower `--threads`. Generation is CPU-light; I/O wait dominates.
MIT. Attribution appreciated but not required.
If this benchmark contributed to your research, cite it as:
```bibtex
@misc{paech2025longform,
  author = {Paech, S.J.},
  title  = {Longform Creative Writing Benchmark},
  year   = {2025},
  url    = {https://github.com/EQ-bench/longform-writing-bench},
  note   = {GitHub repository}
}
```