Benchmarking deepresearch systems

Setup

  • pip install openai
  • Crawl4AI (for eval_citation_async.py)

Create a keys.env file with:
OPENAI_API_KEY=...
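
The evaluation scripts read the key from this file. As a minimal sketch (how the scripts actually parse keys.env may differ), the file can be loaded and an OpenAI client created like this:

    import os
    from openai import OpenAI

    # Load KEY=VALUE pairs from keys.env into the environment.
    # Illustrative only; the eval scripts may parse the file differently.
    with open("keys.env") as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                os.environ[key.strip()] = value.strip()

    client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])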

Organization

The main branch contains the evaluation scripts.

Other branches contain runs of specific DeepResearch frameworks on our queries.

Data

The queries we use to generate deepsearch reports are under:

  • ./queries/researchy_queries_sample_doc_click.jsonl
  • format: {"id": <id>, "query": <text>} (see the loading snippet below)
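
A minimal sketch of loading the queries file (path and field names as above):

    import json

    # Each line is a JSON object: {"id": <id>, "query": <text>}.
    with open("queries/researchy_queries_sample_doc_click.jsonl") as f:
        queries = [json.loads(line) for line in f if line.strip()]

    print(len(queries), queries[0]["id"], queries[0]["query"])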

The evaluation scripts read data from:

  • /data/group_data/cx_group/deepsearch_benchmark/reports/
  • This folder contains subfolders for each system, e.g.:
    • GPTResearcher: answers generated by GPTResearcher with the original API
    • GPTResearcher_custom: answers generated by GPTResearcher with our API in place of the original one
    • The contents of each subfolder follow this structure (see the loading sketch below):
      • <id>.a : answer from the deepresearch system for the query with id <id>
      • <id>.q : query with id <id>
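
For example, a system's report folder can be walked like this (an illustrative reader, not necessarily how the evaluation scripts load the data):

    from pathlib import Path

    # Pair each <id>.q query file with its <id>.a answer file for one system.
    system_dir = Path("/data/group_data/cx_group/deepsearch_benchmark/reports/GPTResearcher")

    pairs = {}
    for q_file in sorted(system_dir.glob("*.q")):
        a_file = q_file.with_suffix(".a")
        if a_file.exists():
            pairs[q_file.stem] = (q_file.read_text(), a_file.read_text())

    print(f"loaded {len(pairs)} query/answer pairs")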

Evaluation

Retrieval Faithfulness

eval_citation_async.py: computes citation precision as per TREC-RAG.

  • python eval_citation_async.py --subfolder [deepsearch_model_to_eval] --open_ai_model [llm_judge]
    • [deepsearch_model_to_eval] must have a folder under /data/group_data/cx_group/deepsearch_benchmark/reports
    • [llm_judge] can be any model that supports the OpenAI API (currently gpt-4.1-mini)
    • writes evaluation_results_citation_gpt-4.1-mini.json to the [deepsearch_model_to_eval] folder.
    • WARNING: this script can be slow and expensive; don't run it unless you really need to.
    • TODO: on some machines the async calls may hang, so the semaphore is set low and only 250 reports are processed at a time (see the concurrency sketch after this list).
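
The semaphore workaround mentioned in the TODO amounts to bounding the number of concurrent judge requests. A minimal sketch of that pattern, with an illustrative limit rather than the script's actual value:

    import asyncio
    from openai import AsyncOpenAI

    client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment
    semaphore = asyncio.Semaphore(8)  # kept low because async calls can hang on some machines

    async def judge(prompt: str, model: str = "gpt-4.1-mini") -> str:
        # At most 8 requests are in flight at any time.
        async with semaphore:
            response = await client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
            )
            return response.choices[0].message.content

    async def judge_all(prompts: list[str]) -> list[str]:
        return await asyncio.gather(*(judge(p) for p in prompts))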

eval_citation_clueweb_async.py: computes citation precision as per TREC-RAG, for systems that use a ClueWeb API for search.

  • python eval_citation_clueweb_async.py --subfolder [deepsearch_model_to_eval] --open_ai_model [llm_judge]
    • [deepsearch_model_to_eval] must have a folder under /data/group_data/cx_group/deepsearch_benchmark/reports
    • [llm_judge] can be any model that supports the OpenAI API (currently gpt-4.1-mini)
    • writes evaluation_results_citation_gpt-4.1-mini.json to the [deepsearch_model_to_eval] folder.
    • WARNING: this script can be slow and expensive; don't run it unless you really need to.
    • TODO: on some machines the async calls may hang, so the semaphore is set low and only 250 reports are processed at a time.

eval_citation_recall_async.py: computes citation recall as per TREC-RAG; invoked the same way as above.

  • TODO: merge with eval_citation_async.py. This only computes the percentage of supported claims, so there is no need to crawl URLs with either ClueWeb or Crawl4AI. (A sketch of how the two citation metrics aggregate per-item judgments follows.)
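
Both citation metrics reduce to simple ratios over per-item judge labels; a minimal sketch of the aggregation (the scripts' exact bookkeeping may differ):

    def citation_precision(citation_supported: list[bool]) -> float:
        """Fraction of citations whose cited document supports the claim they are attached to."""
        return sum(citation_supported) / len(citation_supported) if citation_supported else 0.0

    def citation_recall(claim_supported: list[bool]) -> float:
        """Fraction of claims in the report that are supported by at least one citation."""
        return sum(claim_supported) / len(claim_supported) if claim_supported else 0.0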

Report Quality

eval_quality_async.py: computes holistic report quality using standard LLM-as-a-judge viewpoints (a sketch of the judging call follows the usage below).

  • python eval_quality_async.py --subfolder [deepsearch_model_to_eval] --open_ai_model [llm_judge]
    • [deepsearch_model_to_eval] must have a folder under /data/group_data/cx_group/deepsearch_benchmark/reports
    • [llm_judge] can be any model that supports the OpenAI API (currently gpt-4.1-mini)
    • writes evaluation_results_detailed_gpt-4.1-mini.json to the [deepsearch_model_to_eval] folder.
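
The judging step is a standard rubric-style LLM call. A minimal sketch, where the dimensions and prompt are illustrative assumptions rather than the script's actual rubric:

    import json
    from openai import AsyncOpenAI

    client = AsyncOpenAI()
    DIMENSIONS = ["clarity", "coherence", "depth", "overall"]  # illustrative, not the actual rubric

    async def score_report(query: str, report: str, model: str = "gpt-4.1-mini") -> dict:
        prompt = (
            f"Query:\n{query}\n\nReport:\n{report}\n\n"
            f"Rate the report from 1 to 10 on each of {DIMENSIONS}. "
            "Reply with a JSON object mapping each dimension to its score."
        )
        response = await client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            response_format={"type": "json_object"},
        )
        return json.loads(response.choices[0].message.content)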

Report Relevance

eval_kpr_async.py: computes the percentage of ground-truth key points that each report addresses.

  • python eval_kpr_async.py --subfolder [deepsearch_model_to_eval] --open_ai_model [llm_judge]
    • [deepsearch_model_to_eval] must have a folder under /data/group_data/cx_group/deepsearch_benchmark/reports
    • [llm_judge] can be any model that supports the OpenAI API (currently gpt-4.1-mini)
    • writes evaluation_results_kpr_gpt-4.1-mini.json to the [deepsearch_model_to_eval] folder.

Key points have already been extracted for each query and can be found under key_point. Since there are multiple ground-truth documents per query, key points are first extracted from each document (key_point/key_point_extract.py) and then aggregated (key_point/aggregate.py). A sketch of the per-key-point judging step is shown below.
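
Conceptually, a report's score is the fraction of its query's aggregated key points that the judge marks as addressed. A minimal sketch of that check; the prompt and helper names here are hypothetical, not what eval_kpr_async.py actually uses:

    import asyncio
    from openai import AsyncOpenAI

    client = AsyncOpenAI()

    async def addresses(report: str, key_point: str, model: str = "gpt-4.1-mini") -> bool:
        # Ask the judge a yes/no question about a single key point.
        response = await client.chat.completions.create(
            model=model,
            messages=[{
                "role": "user",
                "content": f"Report:\n{report}\n\nKey point: {key_point}\n\n"
                           "Does the report address this key point? Answer yes or no.",
            }],
        )
        return response.choices[0].message.content.strip().lower().startswith("yes")

    async def key_point_recall(report: str, key_points: list[str]) -> float:
        flags = await asyncio.gather(*(addresses(report, kp) for kp in key_points))
        return sum(flags) / len(flags) if flags else 0.0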

Other / Utils

./systems contains baseline deepresearch systems:

  • Perplexity, through the Sonar deep research API
  • OpenAI, through gpt-4o-search-preview (no better model is available through the API)

./plots contains scripts to generate plots (TODO: cleanup)

./clueweb22 contains the ClueWeb22 API used to fetch documents by docid/URL.
