WebThinker

This repository is a community version of WebThinker. It aims to learn the popular LangGraph framework and explore how to convert the completion mode agent into chat mode agent.

Notice: This is not a perfect implementation of WebThinker, thus, the experimental results are just for reference.

How to use

Setup Environment

conda create -n webthinker python=3.13

Install Requirements

pip install -r requirements.txt

Prepare Data

python -m src.webthinker.prepare

Notice: Problem solving datasets in data/encoded are encoded to avoid web crawler. They are decoded during data preparation.

Solve Problem

Run inference and evaluation:

python -m src.webthinker.run --dataset gaia --ids all --llm_eval

Arguments:

--dataset: specify the dataset.
--ids: use "all" to run all samples or specify some IDs such as "1,2,3".
--langsmith: whether to store intermediate steps in detail via LangSmith.
--llm_eval: whether to use llm evaluation.

Only run evaluation:

python -m src.webthinker.evaluate --path /path/to/results.json --llm_eval

Arguments:

--path: specify the result path.
--llm_eval: whether to use llm evaluation.

Generate Reports

python -m src.webthinker.run_report --dataset glaive --ids all

Arguments:

--dataset: specify the dataset.
--ids: use "all" to run all samples or specify some IDs such as "1,2,3".
--langsmith: whether to store intermediate steps in detail via LangSmith.

Difference with official code

This version is based on LangGraph.
Since some LLM does not provide completion mode API, we use the chat mode instead of the completion mode in this version. This will potentially affect the reasoning coherence of the LLMs.
Search query and report drafting tools are implemented in a standard tool calling way.
Some validation check are removed since they have never been triggered.
We do not implement the deep web explorer.
Prompts are slightly modified to adapt the chat mode and tool calling.
Currently, LangGraph store only supports vector indexing, thus, we did not use the official store.
We support both google serper and tavily search (default to use google serper).
We have not implemented the report evaluation.

Preliminary Result

Problem Solving Results

Notice: This is only the results for one test round, thus there may exist randomness.

Overall:

Dataset	F1	Acc	EM	LLM Score
GAIA	46.83	41.75	25.24	40.78

Grouped:

Dataset	Group	F1	Acc	EM	LLM Score
GAIA	1	59.01	43.59	30.77	46.15
GAIA	2	41.35	44.23	25.00	43.59
GAIA	3	30.98	25.00	8.33	8.33

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
data		data
src		src
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

WebThinker

How to use

Setup Environment

Install Requirements

Prepare Data

Solve Problem

Generate Reports

Difference with official code

Preliminary Result

Problem Solving Results

About

Uh oh!

Releases

Packages

Languages

License

waltbai/WebThinker

Folders and files

Latest commit

History

Repository files navigation

WebThinker

How to use

Setup Environment

Install Requirements

Prepare Data

Solve Problem

Generate Reports

Difference with official code

Preliminary Result

Problem Solving Results

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages