SimpleQA Evaluator

A tool for evaluating and comparing Question Answering (QA) policies on OpenAI's SimpleQA benchmark. It reports metrics such as F-score, accuracy, and latency.

Features

The evaluator tests AI search engines through their APIs: currently Linkup Deep, Linkup Standard, and Tavily. It supports both single-policy evaluation and head-to-head comparison, with built-in async processing and progress tracking.

Setup

Install dependencies:

pip install pandas tqdm python-dotenv linkup-sdk tavily-python openai

Create a .env file with your API keys:

LINKUP_API_KEY=your_linkup_key
TAVILY_API_KEY=your_tavily_key
OPENAI_API_KEY=your_openai_key
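Before starting a long run, it can help to confirm that the keys are actually set. The snippet below is a minimal sketch of such a check, not part of eval.py: the helper name `missing_api_keys` is hypothetical, and it assumes all three keys are needed, which may not hold if you only run one policy.

```python
import os

# Key names taken from the .env example above.
REQUIRED_KEYS = ("LINKUP_API_KEY", "TAVILY_API_KEY", "OPENAI_API_KEY")


def missing_api_keys(env=None):
    """Return the required key names that are absent or empty in `env`.

    `env` is any mapping of names to values; it defaults to os.environ.
    (Hypothetical helper for illustration; eval.py may validate keys differently.)
    """
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key)]
```

If python-dotenv is installed, calling its `load_dotenv()` first copies the values from .env into the process environment, after which `missing_api_keys()` lists anything still unset.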

Make sure simple_qa_test_set.csv, which contains the SimpleQA benchmark, is in your project directory.

Usage

Evaluate a Single Policy

python eval.py --mode evaluate --policy1 linkup --num-samples 100

Compare Two Policies

python eval.py --mode compare --policy1 linkup --policy2 tavily --num-samples 100

Analyze Existing Results

python eval.py --analyze results/linkup_results.jsonl
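To make the flag combinations above easier to see at a glance, here is a sketch of an argparse parser that accepts the same flags. It is an illustration only and assumes nothing about how eval.py actually parses its arguments; the defaults, the `choices` list, and the `build_parser` name are assumptions.

```python
import argparse


def build_parser():
    """Build a parser that accepts the flags shown in the usage examples.

    Hypothetical reconstruction: the real eval.py may define defaults,
    validation, or additional flags that are not shown here.
    """
    parser = argparse.ArgumentParser(prog="eval.py")
    parser.add_argument("--mode", choices=["evaluate", "compare"])
    parser.add_argument("--policy1")
    parser.add_argument("--policy2")  # only used when comparing two policies
    parser.add_argument("--num-samples", type=int)
    parser.add_argument("--analyze", metavar="RESULTS_JSONL")
    return parser


# Parse the "Compare Two Policies" command line from above.
args = build_parser().parse_args(
    ["--mode", "compare", "--policy1", "linkup",
     "--policy2", "tavily", "--num-samples", "100"]
)
print(args.mode, args.policy1, args.policy2, args.num_samples)
```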

Supported QA Policies

The evaluator currently supports three QA policies:

Linkup API (deep search mode)
Linkup API (standard search mode)
Tavily API (advanced search mode)

Output and Metrics

The evaluator writes its results to the results directory, including per-question details and summary metrics. Key performance indicators are accuracy on attempted questions, F-score, average latency, and attempt rate.
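As a rough guide to how these metrics relate, the sketch below computes them from a list of per-question grades. It assumes the definitions used in OpenAI's SimpleQA write-up, where F-score is the harmonic mean of overall accuracy and accuracy on attempted questions. The exact formulas in eval.py are not shown in this README and may differ, and the grade labels and the `summarize` name are hypothetical.

```python
def summarize(grades, latencies):
    """Compute summary metrics from per-question grades and latencies.

    grades: list of "correct", "incorrect", or "not_attempted" (assumed labels).
    latencies: list of per-question latencies in seconds.
    """
    total = len(grades)
    correct = grades.count("correct")
    attempted = correct + grades.count("incorrect")

    overall_correct = correct / total if total else 0.0
    accuracy_attempted = correct / attempted if attempted else 0.0

    # Harmonic mean of the two accuracies (assumed F-score definition).
    denom = overall_correct + accuracy_attempted
    f_score = 2 * overall_correct * accuracy_attempted / denom if denom else 0.0

    return {
        "f_score": f_score,
        "accuracy_attempted": accuracy_attempted,
        "attempt_rate": attempted / total if total else 0.0,
        "avg_latency_s": sum(latencies) / len(latencies) if latencies else 0.0,
    }


# 8 correct, 1 incorrect, 1 not attempted out of 10 questions.
example = summarize(
    ["correct"] * 8 + ["incorrect"] + ["not_attempted"],
    [2.0] * 10,
)
print(example)
```

Under this definition, a policy that declines to answer hard questions raises its accuracy on attempted questions but lowers its overall accuracy, so the F-score rewards neither blind guessing nor excessive abstention.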

Results are automatically saved as:

Detailed results: {policy}_results.jsonl
Summary metrics: {policy}_summary.json
Progress tracking: {policy}_checkpoint.json

Reliability Features

The evaluator retries failed requests automatically, applies a 5-minute timeout per question, and saves checkpoints so that interrupted evaluations can be resumed.

Benchmark Results

Here's a comparison of different QA policies on the SimpleQA benchmark:

Policy	F-Score	Accuracy	Attempt Rate	Avg Latency (s)
Linkup (Deep)	0.910	8.23	99.5%	8.23
Linkup (Standard)	0.850	0.837	96.8%	4.10
Tavily	0.728	0.726	99.6%	2.92

As shown in the results:

Linkup Deep Search achieves the highest F-score and accuracy, with excellent attempt rates
Linkup Standard offers a good balance of performance and speed
Tavily provides the fastest responses while maintaining high attempt rates
