pip install openai
- Crawl4AI (for eval_citation_async.py)
Create a keys.env file with:
OPENAI_API_KEY=...
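The key can be loaded into the environment before running any script; a minimal sketch, assuming the dotenv-style file above and that python-dotenv is installed (pip install python-dotenv):

```python
# Minimal sketch: load keys.env into the process environment.
from dotenv import load_dotenv

load_dotenv("keys.env")  # makes OPENAI_API_KEY visible via os.environ
```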
Main branch contains evaluation scripts.
Branches contain runs for specific DeepResearch frameworks on our queries.
The queries we use to generate deepsearch reports are under:
./queries/researchy_queries_sample_doc_click.jsonl
- format: {"id": <id>, "query": <text>}
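A quick way to inspect the queries (a sketch; the file is standard JSONL, one object per line):

```python
import json

# Read the benchmark queries: one {"id": ..., "query": ...} object per line.
with open("queries/researchy_queries_sample_doc_click.jsonl") as f:
    queries = [json.loads(line) for line in f if line.strip()]

print(len(queries), "queries; first:", queries[0]["id"], queries[0]["query"][:80])
```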
The evaluation scripts read data from:
/data/group_data/cx_group/deepsearch_benchmark/reports/
- This folder contains a subfolder for each system, e.g.:
- GPTResearcher: answers generated by GPTResearcher with the original API
- GPTResearcher_custom: answers generated by GPTResearcher with our custom API
- Each subfolder's contents follow this structure:
- <id>.a : answer from deepresearch system for query with id <id>
- <id>.q : query with id <id>
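For reference, a small sketch of how a system's reports can be read back given this layout (load_pairs is a hypothetical helper, not part of the evaluation scripts):

```python
from pathlib import Path

REPORTS = Path("/data/group_data/cx_group/deepsearch_benchmark/reports")

def load_pairs(system: str):
    """Yield (id, query, answer) triples from one system's subfolder."""
    for qfile in sorted((REPORTS / system).glob("*.q")):
        afile = qfile.with_suffix(".a")
        if afile.exists():  # skip queries the system never answered
            yield qfile.stem, qfile.read_text(), afile.read_text()

# e.g.: for qid, query, answer in load_pairs("GPTResearcher"): ...
```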
eval_citation_async.py : computes citation precision as per TREC-RAG
python eval_citation_async.py --subfolder [deepsearch_model_to_eval] --open_ai_model [llm_judge]
- [deepsearch_model_to_eval] must have a folder under
/data/group_data/cx_group/deepsearch_benchmark/reports
- [llm_judge] can be any model that supports the OpenAI API (currently using gpt-4.1-mini)
- writes evaluation_results_citation_gpt-4.1-mini.json to the [deepsearch_model_to_eval] folder
- WARNING: this one can be slow and cost a lot of $; don't run it unless you really need to
- TODO: on some machines async calls may hang, so the semaphore is set low and only 250 reports are processed at a time (see the throttling sketch below)
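The throttling mentioned in the TODO follows the standard asyncio pattern sketched below; call_llm_judge and the limit of 8 are placeholders, not the script's actual code:

```python
import asyncio

SEM = asyncio.Semaphore(8)  # kept low because async calls may hang on some machines

async def judge_one(report):
    async with SEM:                          # bound concurrent judge calls
        return await call_llm_judge(report)  # placeholder for the real LLM call

async def judge_all(reports, chunk_size=250):
    results = []
    for i in range(0, len(reports), chunk_size):  # 250 reports at a time
        chunk = reports[i:i + chunk_size]
        results += await asyncio.gather(*(judge_one(r) for r in chunk))
    return results
```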
eval_citation_clueweb_async.py : computes citation precision as per TREC-RAG, for systems that use a ClueWeb API for search
python eval_citation_clueweb_async.py --subfolder [deepsearch_model_to_eval] --open_ai_model [llm_judge]
- [deepsearch_model_to_eval] must have a folder under
/data/group_data/cx_group/deepsearch_benchmark/reports
- [llm_judge] can be any model that supports the OpenAI API (currently using gpt-4.1-mini)
- writes evaluation_results_citation_gpt-4.1-mini.json to the [deepsearch_model_to_eval] folder
- WARNING: this one can be slow and cost a lot of $; don't run it unless you really need to
- TODO: on some machines async calls may hang, so the semaphore is set low and only 250 reports are processed at a time (same throttling as above)
eval_citation_recall_async.py : computes citation recall as per TREC-RAG - same usage as above.
- TODO: merge with eval_citation_async.py. This only computes the percentage of supported claims, so there is no need to crawl URLs with either ClueWeb or Crawl4AI.
eval_quality_async.py : computes holistic report quality using standard LLM as a judge viewpoints
python eval_quality_async.py --subfolder [deepsearch_model_to_eval] --open_ai_model [llm_judge]
- [deepsearch_model_to_eval] must have a folder under
/data/group_data/cx_group/deepsearch_benchmark/reports
- [llm_judge] can be any model that supports the OpenAI API (currently using gpt-4.1-mini)
- writes evaluation_results_detailed_gpt-4.1-mini.json to the [deepsearch_model_to_eval] folder
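The LLM-as-judge call underlying these scripts looks roughly like the sketch below; the actual prompts and rubrics live in eval_quality_async.py, and judge_quality here is a hypothetical, simplified helper:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge_quality(query: str, report: str, model: str = "gpt-4.1-mini") -> str:
    """Ask an OpenAI-compatible model to rate a report for a query."""
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You judge the quality of research reports."},
            {"role": "user", "content": f"Query:\n{query}\n\nReport:\n{report}\n\nRate this report."},
        ],
    )
    return resp.choices[0].message.content
```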
eval_kpr_async.py : computes percentage of ground-truth key points that each report addresses.
python eval_kpr_async.py --subfolder [deepsearch_model_to_eval] --open_ai_model [llm_judge]
- [deepsearch_model_to_eval] must have a folder under
/data/group_data/cx_group/deepsearch_benchmark/reports
- [llm_judge] can be any model that supports the OpenAI API (currently using gpt-4.1-mini)
- writes evaluation_results_kpr_gpt-4.1-mini.json to the [deepsearch_model_to_eval] folder
Key points are already extracted for each query and can be found under key_point. Since there are multiple ground-truth documents per query, KPs are first extracted from each doc (key_point/key_point_extract.py) and then aggregated (key_point/aggregate.py).
./systems contains baseline deepresearch systems:
- perplexity, through sonar deepresearch API
- openai, through gpt-4o-search-preview [no better model is available through the API]
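For the openai baseline, report generation boils down to a single chat-completions call with the search-preview model; a minimal sketch with an illustrative prompt (the real driver lives under ./systems):

```python
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o-search-preview",  # searches the web while answering
    messages=[{"role": "user", "content": "Write a well-cited report answering: <query>"}],
)
print(resp.choices[0].message.content)
```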
./plots contains scripts to generate plots (TODO: cleanup)
./clueweb22 contains the clueweb22 API to fetch documents by docid/url.