The benchmarks used are:
- BIG-bench's collection of 5-minute mysteries: https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/minute_mysteries_qa
- MuSR's synthetically created mysteries (code, paper)
See also the original 5-minute mystery site, where humans can try to solve the mysteries themselves.
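As a quick orientation, the BIG-bench task can be read straight out of the submodule checkout. The sketch below assumes the standard BIG-bench JSON layout (a task.json with an examples list of input/target_scores pairs) and a submodule path mirroring the URL above; adjust the path to your checkout.

```python
import json
from pathlib import Path

# Path into the BIG-bench submodule; adjust to match your checkout.
TASK_JSON = Path("BIG-bench/bigbench/benchmark_tasks/minute_mysteries_qa/task.json")

def load_mysteries(path: Path = TASK_JSON):
    """Yield (story_with_question, target_scores) pairs from a BIG-bench JSON task."""
    task = json.loads(path.read_text())
    for example in task["examples"]:
        yield example["input"], example["target_scores"]

if __name__ == "__main__":
    story, scores = next(load_mysteries())
    print(story[:300], "...")
    print("candidate answers:", list(scores))
```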
- graph asks the LLM for exonerating and incriminating evidence for each suspect and symbolically combines the results.
Inspired by LMs for Rationality:
- mystery transforms a belief graph into a MaxSAT problem that is optimized to debug the graph's consistency (see the sketch just after this group).
- belief_graph elicits a belief graph for a story.
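Purely as an illustration of that MaxSAT step (not the exact encoding used by mystery), the sketch below uses python-sat's RC2 solver: each belief becomes a weighted soft clause, each logical link in the graph becomes a hard clause, and the solver returns the assignment that sacrifices the least total confidence. The variables and weights are made up.

```python
from pysat.formula import WCNF
from pysat.examples.rc2 import RC2

# Toy belief graph:
#   variable 1 = "the butler was at the opera"
#   variable 2 = "the butler is the murderer"
wcnf = WCNF()
wcnf.append([1], weight=8)   # soft: the LLM is fairly confident in the alibi
wcnf.append([2], weight=3)   # soft: a weaker belief that the butler did it
wcnf.append([-1, -2])        # hard: the alibi and guilt cannot both hold

with RC2(wcnf) as solver:
    model = solver.compute()   # assignment violating the least total weight
    print(model)               # e.g. [1, -2]: keep the alibi, drop the accusation
    print("cost:", solver.cost)
```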
Inspired by DetectBench:
- main_detect_bench_prompt uses DetectBench's multi-chain-of-thought prompts (a hypothetical sketch of the prompt structure follows below).
- claude_35_detect_bench.md logs the run result.
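The prompts themselves come from the DetectBench submodule; just to give a flavour of the multi-step chain-of-thought structure, here is a hypothetical sketch (the wording below is illustrative, not DetectBench's):

```python
# Hypothetical illustration only; main_detect_bench_prompt uses the prompts
# shipped with DetectBench rather than this wording.
STEPS = [
    "List every clue mentioned in the story.",
    "For each clue, state what it implies about each suspect.",
    "Chain the implications together and eliminate suspects with solid alibis.",
    "Name the culprit and cite the clues that support the conclusion.",
]

def build_prompt(story: str) -> str:
    numbered = "\n".join(f"{i}. {step}" for i, step in enumerate(STEPS, 1))
    return f"{story}\n\nReason through the mystery step by step:\n{numbered}"
```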
- tms is a basic probabilistic Truth-Maintenance System (TMS) with a MaxSAT backend (a rough Z3 sketch follows this list).
- tms_z3 is the Z3 backend.
- tms_rc2 is the RC2 backend.
- tms_mystery is mystery rewritten to use the TMS.
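As a rough sketch of what the Z3 backend boils down to (simplified, with made-up propositions), beliefs become weighted soft constraints and justifications become hard constraints on a z3.Optimize instance:

```python
from z3 import Bool, Implies, Not, Optimize, is_true, sat

# Simplified MaxSAT-style TMS step with the Z3 backend.
alibi = Bool("butler_has_alibi")
guilty = Bool("butler_is_guilty")

opt = Optimize()
opt.add(Implies(alibi, Not(guilty)))  # hard justification: an alibi rules out guilt
opt.add_soft(alibi, weight=8)         # beliefs carry confidence-derived weights
opt.add_soft(guilty, weight=3)

assert opt.check() == sat
model = opt.model()
print({str(b): is_true(model.evaluate(b)) for b in (alibi, guilty)})
```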
git submodule init
git submodule update
pip install git+https://github.com/huggingface/transformers.git
pip install "outlines @ git+https://github.com/outlines-dev/outlines.git@main"
pip install z3-solver
pip install python-sat
pip install datasets
pip install ollama
pip install openai
pip install outlines
pip install wandb
ollama pull qwen2.5
ollama pull deepseek-r1
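Once the models are pulled, a quick sanity check against the local Ollama server looks like this (the model name matches the pull above; the prompt is just an example):

```python
import ollama

# Assumes the Ollama server is running locally and qwen2.5 has been pulled.
response = ollama.chat(
    model="qwen2.5",
    messages=[{"role": "user", "content": "Name one classic piece of incriminating evidence in a locked-room mystery."}],
)
print(response["message"]["content"])
```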