Benchmarking deepresearch systems

Setup

  • pip install openai
  • Crawl4AI (for eval_citation_async.py)

Create a keys.env file with:
OPENAI_API_KEY=...
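
The evaluation scripts read the key from this file. As a minimal sketch (how the scripts actually parse keys.env may differ), the file can be loaded and an OpenAI client created like this:

    import os
    from openai import OpenAI

    # Load KEY=VALUE pairs from keys.env into the environment.
    # Illustrative only; the eval scripts may parse the file differently.
    with open("keys.env") as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                os.environ[key.strip()] = value.strip()

    client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])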

Organization

The main branch contains the evaluation scripts.

Other branches contain runs of specific DeepResearch frameworks on our queries.

Data

The queries we use to generate deepsearch reports are under:

  • ./queries/researchy_queries_sample_doc_click.jsonl
  • format: {"id": <id>, "query": <text>} (see the loading snippet below)
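
A minimal sketch of loading the queries file (path and field names as above):

    import json

    # Each line is a JSON object: {"id": <id>, "query": <text>}.
    with open("queries/researchy_queries_sample_doc_click.jsonl") as f:
        queries = [json.loads(line) for line in f if line.strip()]

    print(len(queries), queries[0]["id"], queries[0]["query"])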

The evaluation scripts read data from:

  • /data/group_data/cx_group/deepsearch_benchmark/reports/
  • This folder contains subfolders for each system, e.g.:
    • GPTResearcher: answers generated by GPTResearcher with the original API
    • GPTResearcher_custom: answers generated by GPTResearcher with our API in place of the original one
    • The contents of each subfolder follow this structure (see the loading sketch below):
      • <id>.a : answer from the deepresearch system for the query with id <id>
      • <id>.q : query with id <id>
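
For example, a system's report folder can be walked like this (an illustrative reader, not necessarily how the evaluation scripts load the data):

    from pathlib import Path

    # Pair each <id>.q query file with its <id>.a answer file for one system.
    system_dir = Path("/data/group_data/cx_group/deepsearch_benchmark/reports/GPTResearcher")

    pairs = {}
    for q_file in sorted(system_dir.glob("*.q")):
        a_file = q_file.with_suffix(".a")
        if a_file.exists():
            pairs[q_file.stem] = (q_file.read_text(), a_file.read_text())

    print(f"loaded {len(pairs)} query/answer pairs")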

Evaluation

Retrieval Faithfulness

eval_citation_async.py: computes citation precision as per TREC-RAG.

  • python eval_citation_async.py --subfolder [deepsearch_model_to_eval] --open_ai_model [llm_judge]
    • [deepsearch_model_to_eval] must have a folder under /data/group_data/cx_group/deepsearch_benchmark/reports
    • [llm_judge] can be any model that supports the OpenAI API (currently gpt-4.1-mini)
    • writes evaluation_results_citation_gpt-4.1-mini.json to the [deepsearch_model_to_eval] folder.
    • WARNING: this script can be slow and expensive; don't run it unless you really need to.
    • TODO: on some machines the async calls may hang, so the semaphore is set low and only 250 reports are processed at a time (see the concurrency sketch after this list).
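
The semaphore workaround mentioned in the TODO amounts to bounding the number of concurrent judge requests. A minimal sketch of that pattern, with an illustrative limit rather than the script's actual value:

    import asyncio
    from openai import AsyncOpenAI

    client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment
    semaphore = asyncio.Semaphore(8)  # kept low because async calls can hang on some machines

    async def judge(prompt: str, model: str = "gpt-4.1-mini") -> str:
        # At most 8 requests are in flight at any time.
        async with semaphore:
            response = await client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
            )
            return response.choices[0].message.content

    async def judge_all(prompts: list[str]) -> list[str]:
        return await asyncio.gather(*(judge(p) for p in prompts))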

eval_citation_clueweb_async.py: computes citation precision as per TREC-RAG, for systems that use a ClueWeb API for search.

  • python eval_citation_clueweb_async.py --subfolder [deepsearch_model_to_eval] --open_ai_model [llm_judge]
    • [deepsearch_model_to_eval] must have a folder under /data/group_data/cx_group/deepsearch_benchmark/reports
    • [llm_judge] can be any model that supports the OpenAI API (currently gpt-4.1-mini)
    • writes evaluation_results_citation_gpt-4.1-mini.json to the [deepsearch_model_to_eval] folder.
    • WARNING: this script can be slow and expensive; don't run it unless you really need to.
    • TODO: on some machines the async calls may hang, so the semaphore is set low and only 250 reports are processed at a time.

eval_citation_recall_async.py: computes citation recall as per TREC-RAG; invoked the same way as above.

  • TODO: merge with eval_citation_async.py. This only computes the percentage of supported claims, so there is no need to crawl URLs with either ClueWeb or Crawl4AI. (A sketch of how the two citation metrics aggregate per-item judgments follows.)
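
Both citation metrics reduce to simple ratios over per-item judge labels; a minimal sketch of the aggregation (the scripts' exact bookkeeping may differ):

    def citation_precision(citation_supported: list[bool]) -> float:
        """Fraction of citations whose cited document supports the claim they are attached to."""
        return sum(citation_supported) / len(citation_supported) if citation_supported else 0.0

    def citation_recall(claim_supported: list[bool]) -> float:
        """Fraction of claims in the report that are supported by at least one citation."""
        return sum(claim_supported) / len(claim_supported) if claim_supported else 0.0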

Report Quality

eval_quality_async.py: computes holistic report quality using standard LLM-as-a-judge viewpoints (a sketch of the judging call follows the usage below).

  • python eval_quality_async.py --subfolder [deepsearch_model_to_eval] --open_ai_model [llm_judge]
    • [deepsearch_model_to_eval] must have a folder under /data/group_data/cx_group/deepsearch_benchmark/reports
    • [llm_judge] can be any model that supports the OpenAI API (currently gpt-4.1-mini)
    • writes evaluation_results_detailed_gpt-4.1-mini.json to the [deepsearch_model_to_eval] folder.
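
The judging step is a standard rubric-style LLM call. A minimal sketch, where the dimensions and prompt are illustrative assumptions rather than the script's actual rubric:

    import json
    from openai import AsyncOpenAI

    client = AsyncOpenAI()
    DIMENSIONS = ["clarity", "coherence", "depth", "overall"]  # illustrative, not the actual rubric

    async def score_report(query: str, report: str, model: str = "gpt-4.1-mini") -> dict:
        prompt = (
            f"Query:\n{query}\n\nReport:\n{report}\n\n"
            f"Rate the report from 1 to 10 on each of {DIMENSIONS}. "
            "Reply with a JSON object mapping each dimension to its score."
        )
        response = await client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            response_format={"type": "json_object"},
        )
        return json.loads(response.choices[0].message.content)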

Report Relevance

eval_kpr_async.py: computes the percentage of ground-truth key points that each report addresses.

  • python eval_kpr_async.py --subfolder [deepsearch_model_to_eval] --open_ai_model [llm_judge]
    • [deepsearch_model_to_eval] must have a folder under /data/group_data/cx_group/deepsearch_benchmark/reports
    • [llm_judge] can be any model that supports the OpenAI API (currently gpt-4.1-mini)
    • writes evaluation_results_kpr_gpt-4.1-mini.json to the [deepsearch_model_to_eval] folder.

Key points have already been extracted for each query and can be found under key_point. Since there are multiple ground-truth documents per query, key points are first extracted from each document (key_point/key_point_extract.py) and then aggregated (key_point/aggregate.py). A sketch of the per-key-point judging step is shown below.
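
Conceptually, a report's score is the fraction of its query's aggregated key points that the judge marks as addressed. A minimal sketch of that check; the prompt and helper names here are hypothetical, not what eval_kpr_async.py actually uses:

    import asyncio
    from openai import AsyncOpenAI

    client = AsyncOpenAI()

    async def addresses(report: str, key_point: str, model: str = "gpt-4.1-mini") -> bool:
        # Ask the judge a yes/no question about a single key point.
        response = await client.chat.completions.create(
            model=model,
            messages=[{
                "role": "user",
                "content": f"Report:\n{report}\n\nKey point: {key_point}\n\n"
                           "Does the report address this key point? Answer yes or no.",
            }],
        )
        return response.choices[0].message.content.strip().lower().startswith("yes")

    async def key_point_recall(report: str, key_points: list[str]) -> float:
        flags = await asyncio.gather(*(addresses(report, kp) for kp in key_points))
        return sum(flags) / len(flags) if flags else 0.0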

Other / Utils

./systems contains baseline deepresearch systems:

  • Perplexity, through the Sonar deep research API
  • OpenAI, through gpt-4o-search-preview (no better model is available through the API)

./plots contains scripts to generate plots (TODO: cleanup)

./clueweb22 contains the ClueWeb22 API used to fetch documents by docid/URL.
