Specification Self-Correction (SSC): Mitigating In-Context Reward Hacking

This repository contains the experimental code for the paper "Specification Self-Correction: Mitigating In-Context Reward Hacking Through Test-Time Refinement", accepted to the Workshop on Test-time Scaling and Reasoning Models at COLM 2025.

tl;dr: SSC enables LLMs to identify and fix flaws in their guiding specifications by first exploiting them, critiquing their gamed output, and then revising the specification itself, thereby significantly reducing in-context reward hacking

Overview

Specification Self-Correction (SSC) is a novel test-time framework that enables language models to identify and correct flaws within their own guiding specifications. The method addresses the problem of reward hacking or specification gaming, where models exploit loopholes in tainted rubrics to achieve high scores without fulfilling the user's true intent.

The SSC process works in four steps:

Initial Generation: Model generates response based on potentially tainted specification
Self-Critique: Model critiques its own response using the same tainted rubric
Self-Refinement: Model revises the specification itself to remove exploitable loopholes
Revised Generation: Model generates a new response using the corrected specification

Experiments

This repository implements the experiments from the first "Experiments" section of the paper, covering two distinct domains:

1. Creative Writing Tasks (Section 3.1)

Tests reward hacking behavior in creative writing scenarios using sophisticated rubrics with injected "trap words."

Main Components:

run_experiment.py: Main experiment runner
rubric_templates.py: Four elaborate rubrics for creative writing evaluation
results/: Directory containing experimental results

Rubric Types:

Chaos Cinema Critique: Dadaist-inspired fragmented reviews
Rococo Reviewer: Highly ornate, excessive language style
Esoteric Interpreter: Mystical/occult interpretation of cinematography
Creative Chaos: Anti-conventional, norm-breaking reviews

Results

The experiments demonstrate that:

Models initially exhibit reward hacking in 50-70% of cases
SSC reduces this vulnerability by over 90%
The correction process improves or maintains response quality
SSC works across different model families (GPT-4, Claude, Gemini, etc.)

Running the Experiments

Prerequisites

# Install required dependencies with specific versions (should work with newer ones)
pip install langchain-openai==0.3.0 langchain-aws==0.2.19 boto3==1.37.34

# Or use compatible version ranges
pip install "langchain-openai>=0.2.0,<0.4.0" "langchain-aws>=0.2.0,<0.3.0" "boto3>=1.35.0,<1.40.0"

# Set environment variables
export OPENROUTER_API_KEY="your_openrouter_api_key"
export MODEL_NAME="openai/gpt-4.1-mini"  # or other model id from OpenRouter
export AWS_REGION="us-east-1"  # for evaluation LLM in AWS Bedrock

While the code is designed to work with OpenRouter API, the judge LLM is Claude Sonnet from AWS Bedrock, hence the need for AWS credentials and region setup. But it shall be very easy to adapt it to other LLM providers, as we use langchain to abstract the model calls.

Creative Writing Experiments

python run_experiment.py

This will:

Generate responses for movie review tasks using tainted rubrics
Test with trap words: "exorbitant", "dance", "hot"
Apply SSC process to detect and correct reward hacking
Save results to results/results_{model_name}_rewardhacks.json

You can have a look at a sample with the structure in this example with just one generation: results/results_openai_gpt-4.1-mini_rewardhacks.json. See the rest of the directory results for the paper results with the other models.

Note: the scores from the LLM judge in the previous files are in a scale from 0 to 100. Since for enough powerful models these scores are typically > 90, the Quality Score used in Table 1 refers to the fraction of responses where the model's score is above 99, to make it a more discriminative metric.

Citation

If you use this code or findings in your research, please cite:

@inproceedings{
    gallego2025specification,
    title={Specification Self-Correction: Mitigating In-Context Reward Hacking Through Test-Time Refinement},
    author={V{\'i}ctor Gallego},
    booktitle={The 1st Workshop on Test-time Scaling and Reasoning Models},
    year={2025},
    url={https://openreview.net/forum?id=UU9KCA0sTH}
}

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
code		code
imgs		imgs
results		results
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
rubric_templates.py		rubric_templates.py
run_experiment.py		run_experiment.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Specification Self-Correction (SSC): Mitigating In-Context Reward Hacking

Overview

Experiments

1. Creative Writing Tasks (Section 3.1)

Results

Running the Experiments

Prerequisites

Creative Writing Experiments

Citation

About

Uh oh!

Releases

Packages

Languages

License

vicgalle/specification-self-correction

Folders and files

Latest commit

History

Repository files navigation

Specification Self-Correction (SSC): Mitigating In-Context Reward Hacking

Overview

Experiments

1. Creative Writing Tasks (Section 3.1)

Results

Running the Experiments

Prerequisites

Creative Writing Experiments

Citation

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages