This repository contains the code for the paper "ReliableEval: a Recipe for Stochastic LLM Evaluation via Method of Moments".
To generate the resampled evaluation data:
```bash
python scripts/preprocess/sample_from_dove.py \
    --path_to_dove /path/to/DOVE/repo \
    --num_resamplings 100 \
    --out data/gpqa/100_resamplings.json
```
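The sampling step draws multiple prompt perturbations per question from the DOVE repository. As a rough sketch of the idea (the actual script and DOVE's file layout may differ), each question ends up with `--num_resamplings` independently sampled prompt variants:

```python
import json
import random

def sample_resamplings(variants_per_question, num_resamplings, seed=0):
    """For each question, draw `num_resamplings` prompt variants (with replacement)."""
    rng = random.Random(seed)
    return {
        qid: [rng.choice(variants) for _ in range(num_resamplings)]
        for qid, variants in variants_per_question.items()
    }

# variants_per_question would be loaded from the DOVE repo; shown inline here for illustration.
variants_per_question = {"q1": ["Prompt wording A", "Prompt wording B", "Prompt wording C"]}
print(json.dumps(sample_resamplings(variants_per_question, num_resamplings=5), indent=2))
```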
To run model inference:
```bash
python scripts/inference/run_openai_api.py \
    --data data/simple_qa/only_5_resamplings.json \
    --out data/simple_qa/predictions/Grok-3_predictions.json \
    --temp 0.1 --platform xai --model grok-3 --batch_size 100 --max_tokens 30
```
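`run_openai_api.py` talks to OpenAI-compatible endpoints. Conceptually, each request looks roughly like the snippet below; the xAI base URL and environment variable name are assumptions here, and the real script adds batching and error handling on top:

```python
import os
from openai import OpenAI

# Assumed OpenAI-compatible xAI endpoint; other --platform values would use different base URLs.
client = OpenAI(api_key=os.environ["XAI_API_KEY"], base_url="https://api.x.ai/v1")

response = client.chat.completions.create(
    model="grok-3",          # --model
    messages=[{"role": "user", "content": "One resampled prompt from the data file"}],
    temperature=0.1,         # --temp
    max_tokens=30,           # --max_tokens
)
print(response.choices[0].message.content)
```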
To run the evaluation, you first need to combine all model predictions into a single file:
```bash
python scripts/post_process/combine_all_predictions.py \
    --output predictions/gpqa_combined.json \
    --data data/gpqa/100_resamplings.json \
    --model GPT-4o \
    --model Llama-3.3-70B \
    --model Deepseek-v3 \
    --predictions_dir data/gpqa/predictions
```
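The combined file simply collects every model's predictions alongside the sampled data. A hedged sketch of the merge, assuming per-model files named `<Model>_predictions.json` in the predictions directory (the actual schema is defined by the repo's scripts, not by this snippet):

```python
import json
from pathlib import Path

def combine_predictions(data_path, predictions_dir, models, output_path):
    """Merge per-model prediction files into one JSON next to the sampled data."""
    combined = {"data": json.load(open(data_path)), "predictions": {}}
    for model in models:
        pred_file = Path(predictions_dir) / f"{model}_predictions.json"
        combined["predictions"][model] = json.load(open(pred_file))
    json.dump(combined, open(output_path, "w"), indent=2)

# Example call mirroring the command above:
# combine_predictions("data/gpqa/100_resamplings.json", "data/gpqa/predictions",
#                     ["GPT-4o", "Llama-3.3-70B", "Deepseek-v3"],
#                     "predictions/gpqa_combined.json")
```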
Then, evaluate the combined JSON:
```bash
python scripts/post_process/evaluate_resamplings.py \
    --predictions predictions/gpqa_combined.json \
    --out results/gpqa_results.json \
    --path_to_dove /path/to/DOVE/repo
```
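Conceptually, the evaluation scores each model on every resampling separately, so the output is a distribution of scores per model rather than a single number. A simplified sketch of that idea (field names are illustrative, not the repo's schema):

```python
def score_resamplings(predictions, gold):
    """predictions[model] -> list of {question_id: answer} dicts, one per resampling."""
    scores = {}
    for model, resamplings in predictions.items():
        scores[model] = [
            sum(answer == gold[qid] for qid, answer in resampling.items()) / len(resampling)
            for resampling in resamplings
        ]
    return scores

gold = {"q1": "B", "q2": "D"}
predictions = {"Llama-3.3-70B": [{"q1": "B", "q2": "A"}, {"q1": "B", "q2": "D"}]}
print(score_resamplings(predictions, gold))  # {'Llama-3.3-70B': [0.5, 1.0]}
```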
To analyze convergence:
```bash
python scripts/post_process/analyze_100_resamplings.py \
    --path_to_scores results/gpqa_results.json \
    --model Llama-3.3-70B \
    --epsilon 0.01 --delta 0.01 \
    --out_dir figures/gpqa/100_resamplings/ \
    --dataset_name GPQA
```
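The convergence analysis asks how many resamplings are needed before the mean score can be trusted: with `--epsilon 0.01 --delta 0.01`, the sample mean should fall within 0.01 of the true mean with probability at least 0.99. A minimal sketch of a moment-based (Chebyshev-style) estimate of that count, which may differ from the paper's exact estimator:

```python
import math
import statistics

def required_resamplings(scores, epsilon, delta):
    """Chebyshev-style bound: n >= Var(score) / (epsilon^2 * delta)."""
    return math.ceil(statistics.variance(scores) / (epsilon ** 2 * delta))

# Illustrative per-resampling accuracies for one model:
scores = [0.61, 0.58, 0.63, 0.60, 0.59]
print(required_resamplings(scores, epsilon=0.01, delta=0.01))
```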