This repository contains the code required to generate and evaluate the benchmarks presented in the paper *Assessing The (In)Ability of LLMs To Reason in Interval Temporal Logic*, submitted to the 2025 TIME conference.
> **Note:** The code for dataset generation will be added soon.
- Prepare the working environment:

  ```bash
  python -m venv .venv
  source .venv/bin/activate
  pip install -r requirements.txt
  ```
- Set the OpenRouter key:
  - Create a `.env-openrouter` file in the project root.
  - Write `OPENROUTER_API_KEY="your-key-here"` to the file, using your own API key (a usage sketch follows below).
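  A minimal sketch of how the key could be read from `.env-openrouter` and used against OpenRouter's OpenAI-compatible endpoint; the repository's scripts may load and use it differently, and the model slug below is only a placeholder:

  ```python
  import os
  from dotenv import load_dotenv  # pip install python-dotenv
  from openai import OpenAI       # pip install openai

  # Load OPENROUTER_API_KEY from the .env-openrouter file in the project root.
  load_dotenv(".env-openrouter")

  client = OpenAI(
      base_url="https://openrouter.ai/api/v1",
      api_key=os.environ["OPENROUTER_API_KEY"],
  )

  # Illustrative request to check that the key works (the model slug is a placeholder).
  response = client.chat.completions.create(
      model="meta-llama/llama-4-scout",
      messages=[{"role": "user", "content": "ping"}],
      max_tokens=10,
  )
  print(response.choices[0].message.content)
  ```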
- Run the benchmarks:

  ```bash
  ./run_deepseekV3.sh
  ./run_gemma3.sh
  ./run_llama4.sh
  ./run_qwen3_235B.sh
  ./run_qwen3.sh
  ```
> **Important:** Qwen3 models support enabling or disabling chain-of-thought (CoT) generation by appending `/think` or `/no_think` to the prompt. For this reason, all non-CoT tests for the Qwen models are commented out. To run them, manually add `/no_think` to the prompt and set `max_tokens_no_cot` to 10 (a sketch of this change follows below).
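A minimal sketch of the manual change described in the note above; everything except `max_tokens_no_cot` and the `/think` / `/no_think` suffixes is hypothetical and only illustrates where the switch and the token limit would go:

```python
MAX_TOKENS_NO_COT = 10  # corresponds to the repository's `max_tokens_no_cot` setting


def build_qwen_prompt(question: str, use_cot: bool) -> str:
    """Append the Qwen3 reasoning switch to the prompt (hypothetical helper)."""
    suffix = "/think" if use_cot else "/no_think"
    return f"{question} {suffix}"


prompt = build_qwen_prompt("Is interval A before interval B?", use_cot=False)
print(prompt)
# The request for a non-CoT run would then use max_tokens=MAX_TOKENS_NO_COT.
```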
- Move the resulting JSONs to `results/data/` (delete the paper's default data if you don't need them).
- Plot the overall results with `python3 plotter.py results/data`.
- Plot all the results with `python3 src/generate_plots.py -d results/data`.
- To test the complexity class accuracy:
  - Run a benchmark evaluation using `resources/benchmark/complexity_100.json`, or generate your own by running `python3 src/benchmark_to_complexity.py resources/benchmark/all.json`.
  - Plot the result with `python3 src/complexity_plot.py 'result-path-here'`.
> **Note:** For the complexity class tests you can use any benchmark JSON, but it will not be balanced across complexity classes.
> **Note:** You may need to install LaTeX and add it to the `PATH` to correctly generate the plots.
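The LaTeX requirement is typical of matplotlib's `text.usetex` mode; a minimal sketch under that assumption is shown below, though the repository's plotting scripts may configure text rendering differently:

```python
import matplotlib
import matplotlib.pyplot as plt

# With usetex enabled, matplotlib shells out to the LaTeX toolchain,
# which is why a LaTeX install must be available on the PATH.
matplotlib.rcParams["text.usetex"] = True

fig, ax = plt.subplots()
ax.plot([0, 1], [0, 1])
ax.set_title("A LaTeX-rendered title")
fig.savefig("latex_check.pdf")
```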