The benchmarks used are:
- BIG-bench's collection of 5-minute mysteries: https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/minute_mysteries_qa
- MuSR's synthetically created mysteries (code, paper)
See also the original 5-minute mystery site, where humans can try to solve the mysteries themselves.
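As a quick orientation, the BIG-bench task can be read straight out of the submodule checkout. The sketch below assumes the standard BIG-bench JSON layout (a task.json with an examples list of input/target_scores pairs) and a submodule path mirroring the URL above; adjust the path to your checkout.

```python
import json
from pathlib import Path

# Path into the BIG-bench submodule; adjust to match your checkout.
TASK_JSON = Path("BIG-bench/bigbench/benchmark_tasks/minute_mysteries_qa/task.json")

def load_mysteries(path: Path = TASK_JSON):
    """Yield (story_with_question, target_scores) pairs from a BIG-bench JSON task."""
    task = json.loads(path.read_text())
    for example in task["examples"]:
        yield example["input"], example["target_scores"]

if __name__ == "__main__":
    story, scores = next(load_mysteries())
    print(story[:300], "...")
    print("candidate answers:", list(scores))
```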
- graph asks the LLM for exonerating and incriminating evidence for each suspect and symbolically combines the results.
Inspired by LMs for Rationality:
- mystery transforms a belief graph into a MaxSAT problem that is optimized to debug the graph's consistency (see the sketch just after this group).
- belief_graph elicits a belief graph for a story.
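Purely as an illustration of that MaxSAT step (not the exact encoding used by mystery), the sketch below uses python-sat's RC2 solver: each belief becomes a weighted soft clause, each logical link in the graph becomes a hard clause, and the solver returns the assignment that sacrifices the least total confidence. The variables and weights are made up.

```python
from pysat.formula import WCNF
from pysat.examples.rc2 import RC2

# Toy belief graph:
#   variable 1 = "the butler was at the opera"
#   variable 2 = "the butler is the murderer"
wcnf = WCNF()
wcnf.append([1], weight=8)   # soft: the LLM is fairly confident in the alibi
wcnf.append([2], weight=3)   # soft: a weaker belief that the butler did it
wcnf.append([-1, -2])        # hard: the alibi and guilt cannot both hold

with RC2(wcnf) as solver:
    model = solver.compute()   # assignment violating the least total weight
    print(model)               # e.g. [1, -2]: keep the alibi, drop the accusation
    print("cost:", solver.cost)
```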
Inspired by DetectBench:
- main_detect_bench_prompt uses DetectBench's multi-chain-of-thought prompts (a hypothetical sketch of the prompt structure follows below).
- claude_35_detect_bench.md logs the run result.
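The prompts themselves come from the DetectBench submodule; just to give a flavour of the multi-step chain-of-thought structure, here is a hypothetical sketch (the wording below is illustrative, not DetectBench's):

```python
# Hypothetical illustration only; main_detect_bench_prompt uses the prompts
# shipped with DetectBench rather than this wording.
STEPS = [
    "List every clue mentioned in the story.",
    "For each clue, state what it implies about each suspect.",
    "Chain the implications together and eliminate suspects with solid alibis.",
    "Name the culprit and cite the clues that support the conclusion.",
]

def build_prompt(story: str) -> str:
    numbered = "\n".join(f"{i}. {step}" for i, step in enumerate(STEPS, 1))
    return f"{story}\n\nReason through the mystery step by step:\n{numbered}"
```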
- tms is a basic probabilistic Truth-Maintenance System (TMS) with a MaxSAT backend (a rough Z3 sketch follows this list).
- tms_z3 is the Z3 backend.
- tms_rc2 is the RC2 backend.
- tms_mystery is mystery rewritten to use the TMS.
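As a rough sketch of what the Z3 backend boils down to (simplified, with made-up propositions), beliefs become weighted soft constraints and justifications become hard constraints on a z3.Optimize instance:

```python
from z3 import Bool, Implies, Not, Optimize, is_true, sat

# Simplified MaxSAT-style TMS step with the Z3 backend.
alibi = Bool("butler_has_alibi")
guilty = Bool("butler_is_guilty")

opt = Optimize()
opt.add(Implies(alibi, Not(guilty)))  # hard justification: an alibi rules out guilt
opt.add_soft(alibi, weight=8)         # beliefs carry confidence-derived weights
opt.add_soft(guilty, weight=3)

assert opt.check() == sat
model = opt.model()
print({str(b): is_true(model.evaluate(b)) for b in (alibi, guilty)})
```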
git submodule init
git submodule update
pip install git+https://github.com/huggingface/transformers.git
pip install "outlines @ git+https://github.com/outlines-dev/outlines.git@main"
pip install z3-solver
pip install python-sat
pip install datasets
pip install ollama
pip install openai
pip install outlines
pip install wandb
ollama pull qwen2.5
ollama pull deepseek-r1
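Once the models are pulled, a quick sanity check against the local Ollama server looks like this (the model name matches the pull above; the prompt is just an example):

```python
import ollama

# Assumes the Ollama server is running locally and qwen2.5 has been pulled.
response = ollama.chat(
    model="qwen2.5",
    messages=[{"role": "user", "content": "Name one classic piece of incriminating evidence in a locked-room mystery."}],
)
print(response["message"]["content"])
```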