DeepDive presents an automated approach for training deep search agents that can navigate complex, multi-step information-seeking tasks. Our method combines automated data synthesis from knowledge graphs with end-to-end multi-turn reinforcement learning to create agents capable of sophisticated long-horizon reasoning and web browsing.
- Automated Deep Search Data Synthesis: Generate challenging QA pairs from knowledge graphs through controlled random walks
- Multi-Turn RL Training for Browsing: End-to-end reinforcement learning for deep search capabilities
- Test-Time Scaling: Supports scaling via tool calls and parallel sampling
We propose an automated method for synthesizing complex, difficult questions with hard-to-find answers from open knowledge graphs. The process involves three key steps, sketched in code below:
Knowledge Graph Random Walks: Starting from an initial node, we perform random walks over the knowledge graph to sample multi-hop paths (k=5-9), with the final entity on each path serving as the answer.
Entity Obfuscation: We combine the entities along each path into a single question. An LLM then obfuscates information along the entire path, generalizing specific details and creating "blurry entities" that require deep search to resolve.
Difficulty Filtering: We use a frontier model (GPT-4o) with basic search to attempt each question four times. Only questions that the frontier model fails on all four attempts are retained, ensuring high difficulty.
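A minimal sketch of the pipeline, assuming a toy adjacency-list graph; `obfuscate_path` and `attempt_with_search` are placeholders for the obfuscation LLM and the frontier-model-with-search calls, not the repo's actual implementation:

```python
import random

def random_walk(graph: dict, start: str, k: int) -> list[str]:
    """Step 1: sample a k-hop path through the knowledge graph."""
    path = [start]
    for _ in range(k):
        neighbors = graph.get(path[-1], [])
        if not neighbors:
            break
        path.append(random.choice(neighbors))
    return path

def obfuscate_path(path: list[str]) -> str:
    """Step 2 (placeholder): an LLM blurs entities along the path so that
    the answer (the final node) requires deep search to resolve."""
    return f"Which entity is reached via clues about {', '.join(path[:-1])}?"

def attempt_with_search(question: str) -> str:
    """Placeholder for a frontier model (e.g., GPT-4o) with basic search."""
    return "model answer"

def synthesize(graph: dict, start: str, n_attempts: int = 4):
    """Step 3: retain a QA pair only if the frontier model fails every attempt."""
    k = random.randint(5, 9)              # multi-hop depth, per the pipeline
    path = random_walk(graph, start, k)
    question, answer = obfuscate_path(path), path[-1]
    if all(attempt_with_search(question) != answer for _ in range(n_attempts)):
        return question, answer           # hard enough: keep it
    return None                           # too easy: discard
```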
We apply end-to-end multi-turn RL to enhance the agent's long-horizon reasoning and browsing capabilities. Training follows an iterative cycle: at each step, the agent reasons over the accumulated context, issues a search or browse tool call, and incorporates the returned observations before acting again or submitting its final answer.
Multi-Turn GRPO Training: We employ Group Relative Policy Optimization (GRPO) with group-normalized advantages:
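For reference, the group-relative advantage in GRPO normalizes each trajectory's reward against its sampled group (standard GRPO formulation; notation is ours):

$$
\hat{A}_i = \frac{r_i - \operatorname{mean}\left(\{r_j\}_{j=1}^{G}\right)}{\operatorname{std}\left(\{r_j\}_{j=1}^{G}\right)}
$$

where $G$ is the number of trajectories sampled per question and $r_i$ is the reward of the $i$-th trajectory.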
Strict Binary Rewards: A trajectory receives reward +1 if and only if both format correctness and answer accuracy are satisfied:
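In symbols (notation is ours), for a trajectory $\tau$:

$$
r(\tau) = \begin{cases} 1 & \text{if } \tau \text{ is well-formatted and its final answer is correct} \\ 0 & \text{otherwise} \end{cases}
$$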
This strict reward mechanism ensures high-quality trajectories and prevents reward hacking.
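A minimal sketch combining the strict binary reward with GRPO's group normalization; the format and accuracy checks here are simplified stand-ins for the actual verifier:

```python
import statistics

def binary_reward(trajectory: dict, gold: str) -> float:
    """+1 if and only if the trajectory is well-formatted AND correct."""
    well_formatted = trajectory.get("answer") is not None   # stand-in format check
    correct = trajectory.get("answer") == gold              # stand-in accuracy check
    return 1.0 if (well_formatted and correct) else 0.0

def group_advantages(rewards: list[float]) -> list[float]:
    """GRPO: normalize each reward against its rollout group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0   # guard against zero variance
    return [(r - mean) / std for r in rewards]

# Example: four rollouts for one question, only the first succeeds.
rewards = [binary_reward(t, gold="Marie Curie") for t in [
    {"answer": "Marie Curie"}, {"answer": "Lise Meitner"},
    {"answer": None}, {"answer": "Irene Joliot-Curie"},
]]
print(group_advantages(rewards))   # the successful rollout gets a positive advantage
```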
Model | Parameters | HuggingFace Hub | Performance (BrowseComp) |
---|---|---|---|
DeepDive-9B | 9B | coming soon | 6.3% |
DeepDive-32B | 32B | coming soon | 14.8% |
Synthetic Dataset Construction
Our automated data synthesis pipeline creates challenging QA pairs through knowledge graph random walks, entity obfuscation, and difficulty filtering. The process uses multi-hop paths (k=5-9) through KILT and AMiner knowledge graphs.
Component | Size | Explanation |
---|---|---|
Total Dataset | 3,250 | All QA pairs in the training corpus |
SFT Portion | 1,016 | Subset of the data used for Supervised Fine-Tuning (SFT) |
↳ SFT Trajectories | 858 | Search trajectories obtained from the SFT QA pairs via rejection sampling |
RL Portion | 2,234 | Subset of the data used for Reinforcement Learning (RL) |
Data Example
Main Results
We evaluate DeepDive on four challenging deep search benchmarks: BrowseComp, BrowseComp-ZH, Xbench-DeepSearch, and SEAL-0. The results demonstrate consistent improvements over existing open-source models.
Model | Reason | Browse | BrowseComp | BrowseComp-ZH | Xbench-DeepSearch | SEAL-0 |
---|---|---|---|---|---|---|
Proprietary Models | ||||||
GPT-4o | ✗ | ✗ | 0.9* | 11.1 | 18.0* | 0.9 |
GPT-4o† | ✗ | ✓ | 1.9* | 12.8 | 30.0 | 9.1 |
Claude-3.7-Sonnet | ✗ | ✗ | 2.3 | 11.8 | 12.0 | 2.7 |
Claude-3.7-Sonnet† | ✗ | ✓ | 4.5 | 14.2 | 29.0 | 14.4 |
o1 | ✓ | ✗ | 9.9* | 29.1* | 38.0 | 11.7 |
o4-mini | ✓ | ✗ | 6.1* | 15.2* | 22.3* | 2.7 |
Claude-4-Sonnet-Thinking | ✓ | ✗ | 2.6 | 21.5 | 27.0 | 9.0 |
Claude-4-Sonnet-Thinking† | ✓ | ✓ | 14.7 | 30.8 | 53.0 | 37.8 |
Grok-DeepResearch | ✓ | ✓ | - | 12.9* | 50+ | - |
Doubao-DeepThink | ✓ | ✓ | - | 26.0* | 50+ | - |
DeepResearch | ✓ | ✓ | 51.5* | 42.9* | - | - |
Open-Source Models | ||||||
GLM-Z1-9B-0414 | ✗ | ✗ | 0.6 | 2.4 | 8.0 | 7.2 |
GLM-Z1-9B-0414† | ✗ | ✓ | 0.6 | 1.7 | 3.0 | 2.7 |
Qwen2.5-32B-Instruct | ✗ | ✗ | 0.6 | 9.3 | 8.7* | 2.7 |
Qwen2.5-32B-Instruct† | ✗ | ✓ | 1.5 | 1.7 | 12.0 | 0.9 |
Qwen3-235B-A22B-Instruct-2507 | ✗ | ✗ | 0.9 | 17.6 | 17.0 | 6.3 |
Qwen3-235B-A22B-Instruct-2507† | ✗ | ✓ | 0.9 | 14.9 | 26.0 | 9.1 |
Qwen3-235B-A22B-Thinking-2507 | ✓ | ✗ | 3.1 | 20.1 | 22.0 | 9.0 |
Qwen3-235B-A22B-Thinking-2507† | ✓ | ✓ | 4.6 | 22.5 | 37.0 | 13.5 |
QwQ-32B | ✓ | ✗ | 1.7 | 13.5 | 10.7* | 5.4 |
QwQ-32B† | ✓ | ✓ | 1.3 | 14.5 | 27.0 | 4.5 |
DeepSeek-V3-0324 | ✗ | ✗ | 1.5 | 24.6 | 36.0 | 6.3 |
DeepSeek-R1 | ✓ | ✗ | 2.0 | 23.2 | 32.7* | 5.4 |
DeepSeek-R1-0528 | ✓ | ✗ | 3.2 | 28.7 | 37.0 | 5.4 |
GLM-4.5-Air | ✓ | ✓ | 21.3 | 36.3 | 65.0 | 30.6 |
GLM-4.5 | ✓ | ✓ | 26.4 | 37.5 | 68.0 | 36.0 |
Web Agents | ||||||
Search-o1-32B | ✓ | ✓ | 2.8* | 17.9* | 25.0* | - |
WebThinker-32B | ✓ | ✓ | 2.8* | 7.3* | 24.0* | - |
WebDancer-32B | ✓ | ✓ | 3.8* | 18.0* | 39.0* | - |
WebSailor-7B | ✓ | ✓ | 6.7* | 14.2* | 34.3* | - |
WebSailor-32B | ✓ | ✓ | 10.5* | 25.5* | 53.3* | - |
DeepDive (Ours) | ||||||
DeepDive-9B (sft-only) | ✓ | ✓ | 5.6 | 15.7 | 35.0 | 15.3 |
DeepDive-9B | ✓ | ✓ | 6.3 | 15.1 | 38.0 | 12.2 |
DeepDive-32B (sft-only) | ✓ | ✓ | 9.5 | 23.0 | 48.5 | 23.9 |
DeepDive-32B | ✓ | ✓ | 14.8 | 25.6 | 50.5 | 29.3 |
Note: * denotes performance reported in prior studies; † denotes browsing enabled via function calls.
Generalization on Simple Search Tasks
We evaluate DeepDive not only on challenging search tasks (e.g., BrowseComp, BrowseComp-ZH) but also on simpler benchmarks such as HotpotQA, Frames, and WebWalker. DeepDive-32B (SFT and RL) consistently outperforms strong baselines, achieving >60 points on WebWalker, surpassing WebShaper-72B (52.2). These results demonstrate DeepDive’s strong and generalizable search capabilities.
DeepDive demonstrates remarkable test-time scaling capabilities through two mechanisms:
Tool Call Scaling: Allowing DeepDive more tool calls during inference leads to higher accuracy on complex, multi-step tasks. As shown in the BrowseComp and BrowseComp-ZH benchmark results:
- When the maximum number of tool calls is increased, accuracy rises steadily. On BrowseComp, performance improves from 8% accuracy at 8 tool calls to 15% at 128 tool calls, showing that the model benefits from additional search and reasoning opportunities.
- DeepDive-32B consistently outperforms its SFT-only variant, especially once the allowed number of tool calls exceeds 32, indicating that the reinforcement learning stage better equips the model to exploit long tool-call horizons; a budget-capped inference loop is sketched below.
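The tool-call budget can be enforced with a simple cap on the agent's reason-act loop. A minimal sketch; the `Action` type, agent interface, and `run_search` backend are hypothetical stand-ins, not the repo's actual API:

```python
from dataclasses import dataclass

@dataclass
class Action:
    kind: str      # "search" or "answer"
    content: str   # search query or final answer text

def run_search(query: str) -> str:
    """Placeholder for a real search/browse backend (e.g., Serper/Jina)."""
    return f"results for: {query}"

def answer_with_budget(agent, question: str, max_tool_calls: int = 128):
    """Let the agent search until it answers or exhausts its tool-call budget."""
    context, calls = [question], 0
    while calls < max_tool_calls:
        action = agent.propose_action(context)       # reason, then act
        if action.kind == "answer":
            return action.content, calls             # early submission
        context.append(run_search(action.content))   # observe tool output
        calls += 1
    return agent.force_answer(context), calls        # budget exhausted
```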
Parallel Sampling: Beyond a larger tool call budget, DeepDive leverages parallel sampling to further boost performance. For each question, DeepDive generates 8 independent reasoning trajectories in parallel.
- Three answer selection strategies are considered: single-shot inference, majority voting among samples, and choosing the answer that required the fewest tool calls before submission.
- Empirical analysis reveals a clear trend: answers submitted earlier, with fewer tool calls, are usually more accurate. In practice, selecting the answer with the fewest tool calls among the 8 samples raises accuracy from 12.0% (single-shot) to 24.8%, more than doubling performance. Majority voting also helps (18.8% accuracy) but is outperformed by minimal tool-call selection; all three strategies are sketched below.
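A minimal sketch of the three selection strategies; the sample format and helper names are illustrative assumptions:

```python
from collections import Counter

def select_single_shot(samples: list[dict]) -> str:
    """Baseline: take the first trajectory's answer."""
    return samples[0]["answer"]

def select_majority(samples: list[dict]) -> str:
    """Majority voting across the sampled answers."""
    return Counter(s["answer"] for s in samples).most_common(1)[0][0]

def select_min_tool_calls(samples: list[dict]) -> str:
    """Prefer the answer submitted with the fewest tool calls,
    empirically the most accurate strategy here."""
    return min(samples, key=lambda s: s["tool_calls"])["answer"]

samples = [
    {"answer": "A", "tool_calls": 41},
    {"answer": "B", "tool_calls": 12},
    {"answer": "A", "tool_calls": 57},
]
print(select_min_tool_calls(samples))   # -> "B"
```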
To further boost performance on deep search tasks, we create a semi-automated framework for generating QA pairs drawn from the same distribution as the target deep search benchmarks (i.i.d. data).
Training with i.i.d. data brings larger improvements. The 32B-RL model reaches 22.2% accuracy on BrowseComp, up from 14.8%, and also performs better on Chinese benchmarks.
Model | Data | BrowseComp | BrowseComp-ZH | Xbench-DeepSearch | SEAL-0 |
---|---|---|---|---|---|
32B (sft-only) | KG | 9.5 | 23.0 | 48.5 | 23.9 |
32B | KG | 14.8 | 25.6 | 50.5 | 29.3 |
32B (sft-only) | i.i.d. | 11.4 | 26.6 | 47.5 | 22.5 |
32B | i.i.d. | 22.2 | 33.9 | 56.0 | 23.0 |
Note: Following a contamination analysis, both the KG data and the i.i.d. data have been adopted for training the open GLM-4.5 models, which show strong performance on BrowseComp.
- Built on top of the GLM-4 and QwQ base models
- Uses the Slime framework for RL training
- Powered by the Serper and Jina APIs for web access
If you find DeepDive useful for your research, please cite our paper:
@misc{lu2025deepdiveadvancingdeepsearch,
title={DeepDive: Advancing Deep Search Agents with Knowledge Graphs and Multi-Turn RL},
author={Rui Lu and Zhenyu Hou and Zihan Wang and Hanchen Zhang and Xiao Liu and Yujiang Li and Shi Feng and Jie Tang and Yuxiao Dong},
year={2025},
eprint={2509.10446},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2509.10446},
}