This repository is a community version of WebThinker. It was built to learn the popular LangGraph framework and to explore how to convert a completion-mode agent into a chat-mode agent.
Notice: This is not a perfect implementation of WebThinker, so the experimental results are for reference only.
conda create -n webthinker python=3.13
pip install -r requirements.txt
python -m src.webthinker.prepare
Notice: The problem-solving datasets in data/encoded are encoded to keep them away from web crawlers. They are decoded during data preparation.
Run inference and evaluation:
python -m src.webthinker.run --dataset gaia --ids all --llm_eval
Arguments:
- --dataset: specify the dataset.
- --ids: use "all" to run all samples, or specify IDs such as "1,2,3".
- --langsmith: whether to store detailed intermediate steps via LangSmith.
- --llm_eval: whether to use LLM evaluation.
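As an illustration of how these arguments could be wired up, the sketch below builds a parser with the same four flags and expands an `--ids` value. The names `parse_ids`, `build_parser`, and `total` are hypothetical, `--langsmith` and `--llm_eval` are assumed to be boolean switches, and IDs are assumed to be zero-based integers; the actual parsing in src.webthinker.run may differ.

```python
import argparse


def parse_ids(value: str, total: int) -> list[int]:
    """Hypothetical helper: expand an --ids value into a list of sample IDs."""
    if value.strip().lower() == "all":
        # Assumption: sample IDs are numbered 0..total-1.
        return list(range(total))
    # Otherwise expect a comma-separated list such as "1,2,3".
    return [int(part) for part in value.split(",") if part.strip()]


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags listed above."""
    parser = argparse.ArgumentParser(description="Sketch of the run CLI")
    parser.add_argument("--dataset", required=True, help="dataset name, e.g. gaia")
    parser.add_argument("--ids", default="all", help='"all" or IDs such as "1,2,3"')
    # Assumption: both flags are on/off switches that take no value.
    parser.add_argument("--langsmith", action="store_true", help="trace steps via LangSmith")
    parser.add_argument("--llm_eval", action="store_true", help="use LLM evaluation")
    return parser
```

Calling `build_parser().parse_args(["--dataset", "gaia", "--ids", "1,2,3"])` and passing `args.ids` to `parse_ids` would yield the three selected IDs.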
Only run evaluation:
python -m src.webthinker.evaluate --path /path/to/results.json --llm_eval
Arguments:
- --path: specify the path to the results file.
- --llm_eval: whether to use LLM evaluation.
Run report generation:
python -m src.webthinker.run_report --dataset glaive --ids all
Arguments:
- --dataset: specify the dataset.
- --ids: use "all" to run all samples, or specify IDs such as "1,2,3".
- --langsmith: whether to store detailed intermediate steps via LangSmith.
- This version is based on LangGraph.
- Since some LLMs do not provide a completion-mode API, this version uses chat mode instead of completion mode. This may affect the reasoning coherence of the LLMs.
- The search-query and report-drafting tools are implemented with standard tool calling.
- Some validation checks were removed because they were never triggered.
- We do not implement the deep web explorer.
- Prompts are slightly modified to adapt to chat mode and tool calling.
- The LangGraph store currently supports only vector indexing, so we did not use the official store.
- We support both Google Serper and Tavily search (Google Serper is the default).
- We have not implemented report evaluation.
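To illustrate what standard tool calling in chat mode can look like, here is a minimal, library-free sketch. It does not use LangGraph or any real LLM API: the tool name `search_query`, the schema layout (modelled on the common JSON-schema function format), the stub `fake_search`, and the message dictionaries are all assumptions made for illustration, not this repository's code.

```python
import json
from typing import Callable

# Hypothetical tool schema in the JSON-schema style many chat APIs accept.
SEARCH_TOOL = {
    "name": "search_query",
    "description": "Search the web and return result snippets.",
    "parameters": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}


def fake_search(query: str) -> str:
    """Stand-in for a real search backend such as Serper or Tavily."""
    return f"(stub results for: {query})"


# Registry mapping tool names to Python callables.
TOOLS: dict[str, Callable[..., str]] = {"search_query": fake_search}


def run_tool_calls(messages: list[dict]) -> list[dict]:
    """Execute tool calls in the last assistant message and append the results."""
    last = messages[-1]
    for call in last.get("tool_calls", []):
        args = json.loads(call["arguments"])  # arguments arrive as a JSON string
        result = TOOLS[call["name"]](**args)  # dispatch by tool name
        messages.append(
            {"role": "tool", "tool_call_id": call["id"], "content": result}
        )
    return messages
```

In a completion-mode agent the search request would typically be parsed out of the generated text; here the request is a structured field on the assistant message, and the result returns as a separate `tool` message that the model reads on its next turn.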
Notice: These results come from a single test round, so they are subject to run-to-run randomness.
Overall:
Dataset | F1 | Acc | EM | LLM Score |
---|---|---|---|---|
GAIA | 46.83 | 41.75 | 25.24 | 40.78 |
Grouped:
Dataset | Group | F1 | Acc | EM | LLM Score |
---|---|---|---|---|---|
GAIA | 1 | 59.01 | 43.59 | 30.77 | 46.15 |
GAIA | 2 | 41.35 | 44.23 | 25.00 | 43.59 |
GAIA | 3 | 30.98 | 25.00 | 8.33 | 8.33 |
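For readers unfamiliar with the EM and F1 columns, the sketch below shows how these two metrics are commonly computed for short-answer QA (exact match after normalization, and token-overlap F1). This is an assumption about the metric definitions, not a copy of this repository's evaluation code; the function names are hypothetical, the normalization rules may differ from those in src.webthinker.evaluate, and Acc and LLM Score are left out.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Under these definitions a dataset-level score is the mean over samples multiplied by 100, which is why F1 can exceed EM: a prediction with extra or missing words still earns partial F1 credit but no exact-match credit.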