
Agent-X: Evaluating Deep Multimodal Reasoning in Vision-Centric Agentic Tasks

arXiv:2505.24876 · Hugging Face · Download · Leaderboard

* Equal contribution. Correspondence: Tajamul Ashraf, Amal Saqib.

Updates

[2025-08-03]: Agent-X won Second Place in the Research Track at the Agentic AI Summit 2025, UC Berkeley.

[2025-06-02]: Agent-X paper published on arXiv

[2025-05-29]: Released evaluation & deployment code for Agent-X

[2025-05-22]: Published the Agent-X dataset on Hugging Face

Introduction

Current tool-use tests for vision-centric LLMs rely on single-turn, synthetic queries and text-only inputs, so they miss the real-world challenge of multi-step, multimodal reasoning. Agent-X closes this gap with 828 authentic tasks spanning images, videos, and mixed-modal instructions across six domains—from web browsing to autonomous driving. Each task demands explicit, step-by-step decisions and judicious tool use, and our evaluation scores every reasoning step as well as the overall chain. Even top models (GPT, Gemini, Qwen) solve fewer than half of these tasks, exposing major bottlenecks and pointing the way for future research.

What is Agent-X?

Agent-X is a benchmark for assessing deep-reasoning and tool-use skills of vision-centric LLM agents in real-world settings. It highlights three key aspects:

  • Authentic multi-step tasks. The benchmark offers 828 human-authored tasks with implicit tool use and sequential planning requirements, spanning six domains such as web browsing, surveillance, autonomous driving, sports, and math reasoning.
  • Real deployed tools. Agent-X supplies an evaluation platform stocked with perception, web, manipulation, math, and data-processing tools, compelling agents to choose and apply the right tool at each reasoning step.
  • Diverse multimodal contexts. Every task is paired with real images, multi-image comparisons, or video clips—plus textual instructions, closely mirroring the visual complexity of real-world scenarios.
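For intuition, each Agent-X task pairs a visual input with a human-authored query and a ground-truth, step-by-step reasoning trace. The sketch below shows what a single task record could look like; the field names and values are illustrative assumptions, not the released schema (see the Hugging Face dataset card for the actual format).

```python
# Hypothetical structure of one Agent-X task record (field names and values are
# illustrative assumptions; consult the released dataset for the actual schema).
example_task = {
    "task_id": "agentx_0001",
    "environment": "autonomous_driving",        # one of the six vision-centric domains
    "files": ["frames/drive_017.jpg"],          # image(s), multi-image set, or video clip
    "query": "Is it safe for the ego vehicle to change into the left lane? Explain.",
    "gt_trace": [                               # ground-truth step-by-step reasoning
        {
            "step": 1,
            "thought": "Locate nearby vehicles in the left lane.",
            "tool": "ObjectDetection",          # tool chosen at this step
            "tool_output": "2 cars detected, nearest ~8 m behind the ego vehicle.",
        },
        {
            "step": 2,
            "thought": "Judge whether the gap is sufficient for a lane change.",
            "tool": None,                       # some steps are pure reasoning
            "tool_output": None,
        },
    ],
    "final_answer": "No; the trailing car in the left lane is too close.",
}
```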

The comparison of Agent-X queries with AI-generated queries is shown in the table below. The steps and tool types for queries in ToolBench and m&m's are explicitly stated, as marked in red and blue. The queries in APIBench are simple, containing only one step. Agent-X's queries are both step-implicit and tool-implicit.

📚 Dataset Statistics

Overview of the Agent-X benchmark: key data statistics, overall frequency of tool usage, number of reasoning steps, and the distribution of tasks across the six vision-centric environments.
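The benchmark itself is distributed through Hugging Face (see the Updates above). A minimal loading sketch with the `datasets` library is shown below; the repository identifier and split name are assumptions, so consult the dataset card for the exact values.

```python
# Minimal sketch for loading Agent-X with the Hugging Face `datasets` library.
# The repository id and split name below are assumptions; see the dataset card
# on Hugging Face for the exact identifier and available splits.
from datasets import load_dataset

ds = load_dataset("mbzuai-oryx/Agent-X", split="test")  # hypothetical id/split
print(len(ds))        # number of tasks
print(ds[0].keys())   # inspect the fields of one task record
```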

Our Pipeline

We design the Agent-X benchmark using a semi-automated pipeline that ensures each task is solvable with a defined tool subset and requires deep reasoning over realistic, multimodal scenarios. The pipeline begins with an LMM (Large Multimodal Model) generating candidate queries based on visual input and an available toolset. These queries are then refined by human annotators for clarity and realism. Next, the refined queries are passed back to the LMM to produce step-by-step reasoning traces, including tool calls, intermediate outputs, and final answers. Each trace is manually reviewed for logical consistency and correctness.
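As a rough illustration of the two automated stages, the sketch below assumes an OpenAI-style chat API with GPT-4o as the LMM (the model noted for query generation in the submodules below); the prompts and helper functions are simplified placeholders rather than the exact ones in `generation/`.

```python
# Illustrative sketch of the two automated pipeline stages (query generation and
# reasoning-trace generation). Prompts and helpers are simplified assumptions;
# the actual implementation lives in generation/.
from openai import OpenAI

client = OpenAI()

def generate_candidate_query(image_description: str, toolset: list[str]) -> str:
    """Stage 1: ask the LMM to propose a task query grounded in the visual input."""
    prompt = (
        f"Visual context: {image_description}\n"
        f"Available tools: {', '.join(toolset)}\n"
        "Write one realistic multi-step task that implicitly requires these tools."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def generate_reasoning_trace(query: str, toolset: list[str]) -> str:
    """Stage 2: ask the LMM for a step-by-step trace with tool calls and a final answer."""
    prompt = (
        f"Task: {query}\nTools: {', '.join(toolset)}\n"
        "Solve the task step by step. For each step give the thought, the tool call "
        "(if any) and its output, then finish with a final answer."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# Human annotators refine the candidate queries and manually verify every trace
# before a task enters the benchmark.
```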

🏆 Leaderboard, May 2025

Evaluation Protocol

We evaluate models on the Agent-X benchmark across three distinct modes:

  1. Step-by-Step: Assesses the agent’s ability to execute individual reasoning steps, focusing on how well it follows structured tool-use sequences grounded in visual inputs.

  2. Deep Reasoning: Evaluates the coherence and logical consistency of the full reasoning trace. This mode emphasizes the agent’s capacity to integrate visual and textual context to produce semantically meaningful and factually accurate explanations.

  3. Outcome: Measures the agent’s overall task-solving performance by verifying the correctness of the final answer and appropriate tool usage.
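In practice, each mode reduces to prompting a judge model to compare the agent's predicted trace against the ground truth. The sketch below illustrates a single step-level judging call; the rubric wording and the 0–1 score scale are assumptions, and the authoritative judge prompts and metrics live in `eval/`.

```python
# Illustrative step-level judging call. The rubric wording and 0-1 score scale
# are assumptions; the actual judge prompts and metrics are defined in eval/.
import json
from openai import OpenAI

client = OpenAI()

def judge_step(predicted_step: dict, gt_step: dict, judge_model: str = "gpt-4o") -> float:
    """Ask the judge model to score one reasoning step against the ground truth."""
    prompt = (
        "You are grading one step of a vision-centric agent's reasoning.\n"
        f"Ground-truth step: {json.dumps(gt_step)}\n"
        f"Predicted step: {json.dumps(predicted_step)}\n"
        'Return a JSON object {"score": x} where x is between 0 and 1, reflecting '
        "whether the right tool was chosen and the reasoning is faithful to the visual input."
    )
    resp = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return float(json.loads(resp.choices[0].message.content)["score"])
```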

We report results using GPT-4o and Qwen-15B as evaluation judges. In the paper's tables, the best-performing value for each metric is shown in bold and underlined, while the second-best is italicized.

With GPT-4o as a judge

| Model | Ground_s | Tool_p | Tool_acc | Fact_acc | Context_s | Fact_p | Sem_acc | Goal_acc | Goal_acc* | Tool_accs |
|---|---|---|---|---|---|---|---|---|---|---|
| **Open-source** |  |  |  |  |  |  |  |  |  |  |
| Phi-4-VL-Instruct | 0.13 | 0.21 | 0.24 | 0.61 | 0.19 | 0.47 | 0.40 | 0.11 | 0.26 | 0.42 |
| InternVL-2.5-8B | 0.45 | 0.31 | 0.47 | 0.68 | 0.47 | 0.52 | 0.60 | 0.28 | 0.55 | 0.58 |
| Gemma-3-4B | 0.26 | 0.30 | 0.78 | 0.61 | 0.54 | 0.38 | 0.54 | 0.27 | 0.67 | 0.60 |
| InternVL-3-8B | 0.46 | 0.34 | 0.54 | 0.68 | 0.45 | 0.70 | 0.40 | 0.20 | 0.59 | 0.62 |
| VideoLLaMA-3-7B | 0.45 | 0.28 | 0.46 | 0.65 | 0.46 | 0.62 | 0.54 | 0.28 | 0.54 | 0.54 |
| Qwen-2.5-VL-7B | 0.54 | 0.43 | 0.63 | 0.75 | 0.57 | 0.56 | 0.67 | 0.36 | 0.65 | 0.67 |
| Pixtral-12B | 0.12 | 0.20 | 0.63 | 0.45 | 0.19 | 0.26 | 0.34 | 0.07 | 0.55 | 0.54 |
| LLaMA-3.2-11B-Vision | 0.03 | 0.15 | 0.14 | 0.70 | 0.08 | 0.70 | 0.24 | 0.07 | 0.26 | 0.42 |
| Kimi-VL-A3B-Thinking | 0.26 | 0.19 | 0.50 | 0.62 | 0.42 | 0.52 | 0.65 | 0.29 | 0.29 | 0.48 |
| mPLUG-Owl3-7B-240728 | 0.10 | 0.14 | 0.30 | 0.49 | 0.25 | 0.32 | 0.37 | 0.11 | 0.26 | 0.50 |
| **Closed-source** |  |  |  |  |  |  |  |  |  |  |
| Gemini-1.5-Pro | 0.43 | 0.23 | 0.84 | 0.62 | 0.45 | 0.53 | 0.62 | 0.04 | 0.56 | 0.48 |
| Gemini-2.5-Pro | 0.40 | 0.36 | 0.81 | 0.72 | 0.48 | 0.64 | 0.73 | 0.40 | 0.56 | 0.62 |
| GPT-4o | 0.60 | 0.47 | 0.72 | 0.81 | 0.57 | 0.79 | 0.59 | 0.37 | 0.70 | 0.68 |
| OpenAI o4-mini | 0.42 | 0.32 | 0.89 | 0.71 | 0.51 | 0.60 | 0.80 | 0.45 | 0.67 | 0.63 |

With Qwen-15B as a judge

| Model | Ground_s | Tool_p | Tool_acc | Fact_acc | Context_s | Fact_p | Sem_acc | Goal_acc | Goal_acc* | Tool_accs |
|---|---|---|---|---|---|---|---|---|---|---|
| **Open-source** |  |  |  |  |  |  |  |  |  |  |
| Phi-4-VL-Instruct | 0.27 | 0.11 | 0.32 | 0.54 | 0.39 | 0.59 | 0.46 | 0.16 | 0.35 | 0.39 |
| InternVL2.5-8B | 0.38 | 0.16 | 0.49 | 0.63 | 0.51 | 0.61 | 0.55 | 0.29 | 0.53 | 0.53 |
| Gemma-3-4B | 0.50 | 0.24 | 0.67 | 0.74 | 0.66 | 0.59 | 0.74 | 0.30 | 0.68 | 0.68 |
| InternVL3-8B | 0.41 | 0.16 | 0.51 | 0.71 | 0.61 | 0.60 | 0.69 | 0.23 | 0.51 | 0.62 |
| VideoLLaMA3-7B | 0.39 | 0.15 | 0.40 | 0.68 | 0.56 | 0.60 | 0.68 | 0.27 | 0.53 | 0.56 |
| Qwen2.5-VL-7B | 0.51 | 0.27 | 0.63 | 0.77 | 0.66 | 0.64 | 0.77 | 0.37 | 0.62 | 0.67 |
| Pixtral-12B | 0.30 | 0.17 | 0.68 | 0.59 | 0.50 | 0.42 | 0.58 | 0.10 | 0.68 | 0.58 |
| LLaMA-3.2-11B-Vision | 0.16 | 0.06 | 0.12 | 0.49 | 0.17 | 0.74 | 0.20 | 0.10 | 0.11 | 0.15 |
| Kimi-VL-A3B-Thinking | 0.47 | 0.20 | 0.59 | 0.79 | 0.64 | 0.68 | 0.74 | 0.35 | 0.60 | 0.62 |
| mPLUG-Owl3-7B-240728 | 0.30 | 0.11 | 0.31 | 0.59 | 0.48 | 0.48 | 0.56 | 0.16 | 0.45 | 0.48 |
| **Closed-source** |  |  |  |  |  |  |  |  |  |  |
| Gemini-1.5-Pro | 0.57 | 0.36 | 0.80 | 0.82 | 0.73 | 0.76 | 0.63 | 0.05 | 0.77 | 0.71 |
| Gemini-2.5-Pro | 0.63 | 0.40 | 0.84 | 0.86 | 0.76 | 0.80 | 0.83 | 0.50 | 0.74 | 0.72 |
| GPT-4o | 0.46 | 0.27 | 0.63 | 0.72 | 0.59 | 0.75 | 0.69 | 0.44 | 0.48 | 0.56 |
| OpenAI-o4-mini | 0.63 | 0.35 | 0.86 | 0.89 | 0.78 | 0.79 | 0.88 | 0.53 | 0.64 | 0.69 |

📂 Submodules

Generation Pipeline

See generation/README.md for details on:

  • Frame extraction from video clips
  • Query generation using GPT-4o
  • Step-by-step reasoning trace generation

📁 Path: generation/README.md
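Frame extraction is a standard preprocessing step; a minimal OpenCV sketch is shown below. The one-frame-per-second sampling rate and the output layout are assumptions, not the parameters used by the actual script in `generation/`.

```python
# Minimal frame-extraction sketch with OpenCV. The one-frame-per-second sampling
# rate and output layout are assumptions; see generation/README.md for the
# actual script and parameters.
import cv2
from pathlib import Path

def extract_frames(video_path: str, out_dir: str, every_n_seconds: float = 1.0) -> int:
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0           # fall back if FPS is unreported
    step = max(1, int(round(fps * every_n_seconds)))  # sample one frame per interval
    Path(out_dir).mkdir(parents=True, exist_ok=True)

    saved, idx = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            cv2.imwrite(f"{out_dir}/frame_{saved:05d}.jpg", frame)
            saved += 1
        idx += 1
    cap.release()
    return saved
```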


Analysis & Evaluation

See analysis/README.md for:

  • Error analysis notebook
  • Model comparison plots
  • Tool usage breakdown and visualizations

📁 Path: analysis/README.md
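As a flavor of the kind of breakdown produced there, the sketch below draws a per-tool call-count bar chart with matplotlib; the tool names and counts are placeholder values, not Agent-X statistics.

```python
# Sketch of a tool-usage breakdown bar chart. The tool names and counts below
# are placeholder values, not actual Agent-X statistics; the real analysis
# lives in analysis/.
from collections import Counter
import matplotlib.pyplot as plt

# In practice these would be counted from the predicted reasoning traces.
tool_calls = ["ObjectDetection", "OCR", "Calculator", "OCR", "WebSearch", "ObjectDetection"]
counts = Counter(tool_calls)

plt.figure(figsize=(6, 3))
plt.bar(list(counts.keys()), list(counts.values()))
plt.ylabel("Number of calls")
plt.title("Tool usage breakdown (placeholder data)")
plt.tight_layout()
plt.savefig("tool_usage.png")
```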


Evaluation Scripts

See eval/ for:

  • Scripted evaluation of model inference results
  • Accuracy metrics, binary matching scores, and goal success analysis
  • Useful for benchmarking your model outputs against the Agent-X ground truth (GT)

📁 Path: eval/
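For orientation, the sketch below shows how two of the simpler scores could be aggregated once traces have been aligned: tool accuracy as exact per-step tool matching and goal accuracy as a binary final-answer match. The field names follow the hypothetical task-record sketch above; the scripts in `eval/` define the authoritative metrics.

```python
# Illustrative aggregation of two simple scores. Field names follow the
# hypothetical task-record sketch above; the authoritative metric definitions
# are in eval/.
def tool_accuracy(pred_trace: list[dict], gt_trace: list[dict]) -> float:
    """Fraction of ground-truth steps whose tool is matched exactly by the prediction."""
    if not gt_trace:
        return 0.0
    hits = sum(
        1
        for p, g in zip(pred_trace, gt_trace)
        if (p.get("tool") or "") == (g.get("tool") or "")
    )
    return hits / len(gt_trace)

def goal_accuracy(pred_answer: str, gt_answer: str) -> float:
    """Binary match on the final answer (normalized, exact string comparison)."""
    return float(pred_answer.strip().lower() == gt_answer.strip().lower())
```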

📝 Citation

If you use Agent-X in your research, please cite the following paper:

@misc{ashraf2025agentxevaluatingdeepmultimodal,
      title={Agent-X: Evaluating Deep Multimodal Reasoning in Vision-Centric Agentic Tasks}, 
      author={Tajamul Ashraf and Amal Saqib and Hanan Ghani and Muhra AlMahri and Yuhao Li and Noor Ahsan and Umair Nawaz and Jean Lahoud and Hisham Cholakkal and Mubarak Shah and Philip Torr and Fahad Shahbaz Khan and Rao Muhammad Anwer and Salman Khan},
      year={2025},
      eprint={2505.24876},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.24876}, 
}