HoarePrompt structurally reasons about program correctness in natural language using large language models (LLMs). Drawing inspiration from the strongest postcondition calculus, it employs a step-by-step process to generate natural language descriptions of reachable program states at various program points. By aligning formal semantics with informal requirements, HoarePrompt enables LLMs to detect bugs that violate natural language specifications.
Key features:
- Correctness assessment: classify program correctness with respect to natural language specifications, providing a verdict of `CORRECT` or `INCORRECT`.
- Descriptions of reachable states: infer natural language descriptions of reachable program states without executing the program.
- Loop summarization: apply a novel few-shot-driven k-induction to precisely summarize the semantics of loops in natural language.
Preprint:
- HoarePrompt: Structural Reasoning About Program Correctness in Natural Language
Dimitrios Stamatios Bouras, Yihan Dai, Tairan Wang, Yingfei Xiong, Sergey Mechtaev
https://arxiv.org/abs/2503.19599
To ensure dependency isolation and compatibility, it's recommended to create a virtual environment:
# Create a virtual environment
python3 -m venv hoareprompt-env
# Activate the virtual environment
source hoareprompt-env/bin/activate
Install the necessary dependencies by running:
pip install -r requirements.txt
Depending on the LLM service you're using, set environment variables for the API keys.
For OpenAI models:
export OPENAI_API_KEY="your-openai-api-key"
For Qwen models:
export DASHSCOPE_API_KEY="your-dashscope-api-key"
You can add these export commands to your `.bashrc` or `.zshrc` file to avoid having to set them every time.
Additional providers and models can be configured in `src/model.py`.
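To double-check that a key is visible to your environment before running HoarePrompt, a quick sanity check along these lines works (key names taken from the exports above):

```python
# Print whether each supported API key is set in the current environment.
import os

for key in ("OPENAI_API_KEY", "DASHSCOPE_API_KEY"):
    status = "set" if os.environ.get(key) else "NOT set"
    print(f"{key}: {status}")
```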
To motivate our approach, we first show that annotating a program with natural language descriptions of reachable program states enhances an LLM's reasoning about program correctness.
Consider the following problem: given an integer list, find the maximum product of any sublist. The left side of the figure below presents a flawed implementation of Kadane's algorithm for solving this problem.
The description and code for this example can be found at `examples/motivating_example/description.txt` and `examples/motivating_example/program.py`.
The bug in this implementation stems from incorrect indentation, which results in `best_so_far` being updated only once, after the loop ends, instead of during each iteration.
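For concreteness, the sketch below reproduces the shape of such a bug. It is a hypothetical reconstruction, not the exact program from `examples/motivating_example/program.py`:

```python
# Hypothetical sketch of the indentation bug described above; the actual
# flawed program is in examples/motivating_example/program.py.
def max_product_sublist(nums):
    # Track the largest and smallest products of a sublist ending at the
    # current element; a negative element can flip smallest into largest.
    max_here = min_here = best_so_far = nums[0]
    for x in nums[1:]:
        candidates = (x, max_here * x, min_here * x)
        max_here = max(candidates)
        min_here = min(candidates)
    # BUG: this update belongs inside the loop. Indented outside it,
    # best_so_far is refreshed only once, after the loop ends.
    best_so_far = max(best_so_far, max_here)
    return best_so_far
```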
Despite the simplicity of this bug, state-of-the-art LLMs often fail to identify it consistently, misclassifying the program as correct. Because LLMs exhibit non-deterministic behavior and are sensitive to prompt design, testing their effectiveness in detecting such errors is challenging. To provide objective evidence, we crafted six distinct prompts—including a simple "Vanilla" prompt, a Zero-shot-CoT prompt, and an open-ended prompt—each containing the problem description and the code snippet. We generated 10 responses for each prompt using three different LLMs and various temperature settings.
| Model | Bug detection rate (not annotated) | Bug detection rate (annotated) |
|---|---|---|
| Qwen2.5-7B | 0% | 11.7% |
| Qwen2.5-Coder-32B | 15% | 85% |
| Qwen2.5-72B | 35% | 85% |
The results above show that without state annotations, LLMs rarely detect the bug. An example of an incorrect response from the Vanilla prompt is presented on the left side of the figure. HoarePrompt improves on this by systematically inferring natural language preconditions and postconditions from the problem description, then propagating reachable program states across the control flow.
By iteratively applying this process, HoarePrompt constructs a structured representation of program states, making it easier for LLMs to reason about correctness. With these annotations, the bug detection rate increases significantly.
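As a hypothetical illustration (actual annotations are generated by the LLM and worded differently), state annotations on the sketch above might look like this:

```python
# Hypothetical state annotations in the style HoarePrompt produces; real
# output is natural language inferred by the LLM, not hand-written comments.
def max_product_sublist(nums):
    # State: nums is a non-empty list of integers (precondition).
    max_here = min_here = best_so_far = nums[0]
    # State: max_here, min_here, and best_so_far all equal nums[0].
    for x in nums[1:]:
        candidates = (x, max_here * x, min_here * x)
        max_here = max(candidates)
        min_here = min(candidates)
        # State: max_here / min_here hold the largest / smallest products of
        # a sublist ending at x; best_so_far still equals nums[0].
    best_so_far = max(best_so_far, max_here)
    # State: best_so_far covers only nums[0] and sublists ending at the last
    # element, so the true maximum may be missed; this exposes the bug.
    return best_so_far
```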
You can run the motivating example using the following command:
python src/hoareprompt.py --description examples/motivating_example/description.txt --program examples/motivating_example/program.py --config configs/config_motivational.json
Alternatively, specify the command and a log path explicitly:
python src/hoareprompt.py --command assess --description examples/motivating_example/description.txt --program examples/motivating_example/program.py --config configs/config_motivational.json --log logs/motivating_example/
HoarePrompt provides several commands to analyze programs.
The `assess` command evaluates whether a program conforms to a given problem description. If no specific command is provided, HoarePrompt defaults to `assess`.
Usage:
python src/hoareprompt.py --description <FILE> --program <FILE>
OR
python src/hoareprompt.py --command assess --description <FILE> --program <FILE>
OR
python src/hoareprompt.py --command assess --description <FILE> --program <FILE> --config <FILE> --log <FILE>
This outputs either `CORRECT` or `INCORRECT` based on the program assessment.
The `extract-precondition` command generates a precondition from the problem description.
Usage:
python src/hoareprompt.py --command extract-precondition --description example/description.txt --program example/program.py
The `compute-postcondition` command computes the postcondition for a program fragment.
Usage:
python src/hoareprompt.py --command compute-postcondition --precondition example/precondition.txt --program example/program.py
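As a hypothetical illustration of these two artifacts: for the motivating program above, an extracted precondition might read "`nums` is a non-empty list of integers", and the computed postcondition would be a natural language description of the program's final state, such as what the returned value is in terms of `nums`.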
The `check-entailment` command verifies whether the computed postcondition satisfies the problem description.
Usage:
python src/hoareprompt.py --command check-entailment --description example/description.txt --program example/program.py --postcondition example/postcondition.txt
- Log Directory: Specify `--log` to save detailed logs.
- Counterexample Generation: Use `--cex <FILE>` with `assess` or `check-entailment` to generate counterexamples.
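For example, with hypothetical file paths, both options can be combined with an assessment run:

python src/hoareprompt.py --command assess --description example/description.txt --program example/program.py --cex example/counterexample.txt --log logs/example/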
HoarePrompt allows users to customize its behavior using a JSON configuration file. By default, it uses `default-config.json`.
Specify a custom config:
python src/hoareprompt.py --config <FILE>
- `naive`: Directly assesses if the program aligns with its description.
- `postcondition-entailment`: Uses postcondition analysis to check specification adherence.
- `naive-test`: Asks the LLM to generate input-output test pairs to determine correctness. Always used with `verify-answer` as the entailment mode.

Other options:

- `loop-unrolling-count`: Controls loop unrolling (set to `0` to disable summarization).
- `concat-simple`: Determines whether simple statements are combined for token efficiency (default `true`).
- `annotated`: Enables annotated code trees for entailment checking.
- `fsl`: Enables few-shot learning (`true` by default).
- `COT`: Enables or disables Chain-of-Thought reasoning with the vanilla classifiers.
- `entailment-mode`: Options include `naive`, `cot`, and `verify-answer`.
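As a rough sketch of what a custom configuration might look like, the snippet below writes a config whose key names follow the options listed above. Note that `assessment-mode` is an assumed name for the mode selector; consult `default-config.json` for the authoritative schema.

```python
# Hypothetical custom HoarePrompt config; "assessment-mode" is an assumed
# key name, so check default-config.json for the exact schema.
import json

custom_config = {
    "assessment-mode": "postcondition-entailment",  # assumed key name
    "loop-unrolling-count": 3,     # 0 would disable loop summarization
    "concat-simple": True,         # combine simple statements to save tokens
    "annotated": True,             # use annotated code trees for entailment
    "fsl": True,                   # few-shot learning (default true)
    "entailment-mode": "verify-answer",
}

with open("my-config.json", "w") as f:
    json.dump(custom_config, f, indent=2)
```

The resulting file can then be passed with `--config my-config.json`.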
Predefined configurations for the classifiers used in the HoarePrompt paper are available in the `configs` folder:

- Vanilla: `configs/config_tester_*.json`
- Zero-Shot-CoT: `configs/config_zero_shot_COT_*.json`
- Tester-based: `configs/config_tester_*.json`
- HoarePrompt (3 unrolls): `configs/config_hoareprompt_*.json`
- HoarePrompt (no unrolls): `configs/config_hoareprompt_no_unroll_*.json`
The HoarePrompt project consists of multiple repositories, each serving a specific function:
- HoarePrompt (Main Repo): the repository you are currently viewing. It contains the core implementation of HoarePrompt, including the logic for analyzing programs, state propagation, and correctness verification.
- A dedicated repository for storing experimental results and datasets related to HoarePrompt evaluations. This includes correctness assessments, counterexamples, and other findings from our testing.
- A repository providing scripts and configurations to run large-scale experiments with HoarePrompt on various datasets. It includes automated evaluation scripts and batch processing tools.
- A dataset of recent submissions to Codeforces contests, with associated documentation.
This project is released under the MIT License. See the LICENSE file for details.