This repo has the code for three papers:
- The code in the 'plan-bench' subdirectory belongs to the paper "PlanBench: An Extensible Benchmark for Evaluating Large Language Models on Planning and Reasoning about Change"
- The code in the 'llm_planning_analysis' subdirectory belongs to the paper "On the Planning Abilities of Large Language Models--A Critical Investigation"
- NEW: The 'llm_planning_analysis' subdirectory also contains the code for the paper "A Systematic Evaluation of the Planning and Scheduling Abilities of the Reasoning Model o1"
The leaderboard below shows model performance on the PlanBench static test set with zero-shot prompting. See the llm_planning_analysis/results/ folder for the detailed result files. Results for Blocksworld Hard are in the results/backprompting/ folder.
Model Name | Model Type | Blocksworld - NL - 600 instances | Mystery Blocksworld - NL - 600 instances | Randomized Mystery Blocksworld - NL - 600 instances | Blocksworld Hard - PDDL - 110 instances |
---|---|---|---|---|---|
Deepseek R1 | LRM | 99.1% | 43.3% | 25.8% | 53.6% |
o1-preview | LRM | 97.8% | 52.8% | 37.3% | 23.65% |
o1-mini | LRM | 56.6% | 19.1% | 3.5% | 10% |
Claude-3.5 Sonnet | LLM | 54.8% | 0% | - | - |
GPT-4o | LLM | 35.5% | 0% | - | - |
LLaMA-3.1 405B | LLM | 62.6% | 0.8% | - | - |
Claude 3 Opus | LLM | 59.3% | 0% | - | - |
LLaMA-3 70B | LLM | 34.16% | 0% | - | - |
GPT-4 | LLM | 34.6% | 0% | - | - |
Gemini 1.5 Pro | LLM | 23.8% | - | - | - |
Note: LLM = Large Language Model, LRM = Large Reasoning Model, NL = Natural Language Prompting, PDDL = Planning Domain Definition Language Prompting
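As a rough guide to reading the percentages, the sketch below converts a count of correctly solved instances into a leaderboard-style percentage. It assumes each score is simply correct instances divided by the test set size (600 or 110, per the column headers). The function name `accuracy_percent` and the counts used are hypothetical and are not taken from the repository's result files or evaluation scripts.

```python
def accuracy_percent(num_correct: int, num_instances: int) -> float:
    """Return the percentage of instances solved correctly.

    Assumption: a leaderboard score is correct / total * 100.
    """
    if num_instances <= 0:
        raise ValueError("num_instances must be positive")
    if not 0 <= num_correct <= num_instances:
        raise ValueError("num_correct must be between 0 and num_instances")
    return 100.0 * num_correct / num_instances


# Hypothetical counts for illustration only, not real results.
print(f"{accuracy_percent(300, 600):.1f}%")  # a 600-instance test set
print(f"{accuracy_percent(11, 110):.1f}%")   # a 110-instance test set
```

Under this assumption, one instance is worth about 0.17 percentage points on the 600-instance sets and about 0.9 points on the 110-instance set, so small gaps on Blocksworld Hard correspond to very few instances.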
To add results for a new model, open a pull request that includes the result file, and the leaderboard will be updated.
PlanBench - NeurIPS 2023 Datasets and Benchmarks Track:
@article{valmeekam2023planbench,
title={Planbench: An extensible benchmark for evaluating large language models on planning and reasoning about change},
author={Valmeekam, Karthik and Marquez, Matthew and Olmo, Alberto and Sreedharan, Sarath and Kambhampati, Subbarao},
journal={Advances in Neural Information Processing Systems},
volume={36},
pages={38975--38987},
year={2023}
}
On the Planning Abilities of Large Language Models - NeurIPS 2023 Spotlight:
@article{valmeekam2023planning,
title={On the planning abilities of large language models-a critical investigation},
author={Valmeekam, Karthik and Marquez, Matthew and Sreedharan, Sarath and Kambhampati, Subbarao},
journal={Advances in Neural Information Processing Systems},
volume={36},
pages={75993--76005},
year={2023}
}
A Systematic Evaluation of the Planning and Scheduling Abilities of the Reasoning Model o1 - TMLR:
@article{valmeekam2025a,
title={A Systematic Evaluation of the Planning and Scheduling Abilities of the Reasoning Model o1},
author={Karthik Valmeekam and Kaya Stechly and Atharva Gundawar and Subbarao Kambhampati},
journal={Transactions on Machine Learning Research},
issn={2835-8856},
year={2025},
url={https://openreview.net/forum?id=FkKBxp0FhR},
note={}
}