The SynthLink Catalog is a collection of complex, multi-hop questions designed to test deep search / deep research systems. It is split into categories, each in its own Markdown file:
- Historical Impact Analysis
- Economic and Industrial Shifts
- Environmental and Ecological Consequences
- Scientific and Technological Evolution
- Policy and Social Movements
- STEM and Future Tech
The SynthLink Catalog evaluates deep search responses using a scoring system that measures answer accuracy, source relevance, reasoning quality, fact-checking, and search efficiency. Each question is scored on five metrics:
- F1 Score: Measures how closely the answer matches the expected summary.
- Precision@5 (P@5): Measures the relevance of the top 5 retrieved sources.
- Reasoning Quality Score (RQS): Assesses whether all required reasoning steps are covered.
- Fact-Checking Score (FCS): Ensures answers are verifiable and free of false claims.
- Iterative Efficiency (IE): Evaluates how quickly (in how few search iterations) the correct answer is found.
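As a concrete illustration of the F1 metric, answer accuracy can be computed as token-level overlap between the answer and the expected summary. The whitespace tokenization and lowercasing below are assumptions for this sketch; the catalog's actual matching rules are defined in SynthLink_Scoring_System.md.

```python
from collections import Counter

def token_f1(answer: str, expected: str) -> float:
    """Token-level F1 between an answer and the expected summary.

    Assumption: simple lowercase whitespace tokenization; the real
    SynthLink matching rules may differ.
    """
    ans_tokens = Counter(answer.lower().split())
    exp_tokens = Counter(expected.lower().split())
    # Multiset intersection counts shared tokens with multiplicity.
    overlap = sum((ans_tokens & exp_tokens).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(ans_tokens.values())
    recall = overlap / sum(exp_tokens.values())
    return 2 * precision * recall / (precision + recall)

print(token_f1("the treaty reshaped trade", "the treaty reshaped global trade"))
```

Here precision is 1.0 (all four answer tokens appear in the summary) and recall is 0.8 (four of five summary tokens are covered), giving an F1 of about 0.89.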
Scores are combined into a weighted aggregate score (0–1), with weights emphasizing accuracy and reasoning. For details, see SynthLink_Scoring_System.md, or run scripts/score_synthlink.py to compute scores automatically.
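The aggregation step can be sketched as a simple weighted sum. The weights below are illustrative assumptions only (the real values live in SynthLink_Scoring_System.md); they are chosen to emphasize accuracy (F1) and reasoning (RQS), as described above.

```python
# Hypothetical weights summing to 1.0; the actual values are defined
# in SynthLink_Scoring_System.md.
WEIGHTS = {"f1": 0.30, "p_at_5": 0.15, "rqs": 0.25, "fcs": 0.20, "ie": 0.10}

def aggregate_score(metrics: dict) -> float:
    """Combine the five per-question metrics (each in [0, 1]) into one 0-1 score."""
    for name in WEIGHTS:
        value = metrics[name]
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return sum(WEIGHTS[name] * metrics[name] for name in WEIGHTS)

example = {"f1": 0.9, "p_at_5": 0.8, "rqs": 0.85, "fcs": 0.95, "ie": 0.7}
print(round(aggregate_score(example), 4))
```

Because every weight and metric lies in [0, 1] and the weights sum to 1, the aggregate is guaranteed to stay in the 0–1 range.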
As a rough benchmark, an aggregate score of ~0.85 indicates excellent performance; see SynthLink_Scoring_Methodology.md for further details.