Authors: Tajamul Ashraf*, Amal Saqib*, Hanan Gani, Muhra AlMahri, Yuhao Li, Noor Ahsan, Umair Nawaz, Jean Lahoud, Hisham Cholakkal, Mubarak Shah, Philip H.S. Torr, Fahad Shahbaz Khan, Rao Muhammad Anwer, and Salman Khan
* Equal contribution. Correspondence: Tajamul Ashraf, Amal Saqib.
[2025-08-03]: Agent-X won Second Place in the Research Track at the Agentic AI Summit 2025, UC Berkeley.
[2025-06-02]: Agent-X paper published on arXiv
[2025-05-29]: Released evaluation & deployment code for Agent-X
[2025-05-22]: Published the Agent-X dataset on Hugging Face
Current tool-use tests for vision-centric LLMs rely on single-turn, synthetic queries and text-only inputs, so they miss the real-world challenge of multi-step, multimodal reasoning. Agent-X closes this gap with 828 authentic tasks spanning images, videos, and mixed-modal instructions across six domains—from web browsing to autonomous driving. Each task demands explicit, step-by-step decisions and judicious tool use, and our evaluation scores every reasoning step as well as the overall chain. Even top models (GPT, Gemini, Qwen) solve fewer than half of these tasks, exposing major bottlenecks and pointing the way for future research.
Agent-X is a benchmark for assessing deep-reasoning and tool-use skills of vision-centric LLM agents in real-world settings. It highlights three key aspects:
- Authentic multi-step tasks. The benchmark offers 828 human-authored tasks with implicit tool use and sequential planning requirements, spanning six domains such as web browsing, surveillance, autonomous driving, sports, and math reasoning.
- Real deployed tools. Agent-X supplies an evaluation platform stocked with perception, web, manipulation, math, and data-processing tools, compelling agents to choose and apply the right tool at each reasoning step.
- Diverse multimodal contexts. Every task is paired with real images, multi-image comparisons, or video clips—plus textual instructions, closely mirroring the visual complexity of real-world scenarios.
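For intuition, a single benchmark entry pairs visual input, a free-form query, and a ground-truth reasoning trace. The sketch below is illustrative only; the field names are hypothetical and not the official schema of the dataset released on Hugging Face.

```python
# Illustrative sketch of an Agent-X task record. Field names are hypothetical;
# consult the Hugging Face dataset for the actual schema.
example_task = {
    "task_id": "driving_0042",
    "environment": "autonomous_driving",           # one of the six vision-centric domains
    "visual_input": ["frames/drive_0042_01.jpg"],  # image(s), multi-image set, or sampled video frames
    "query": "Estimate how far the pedestrian is from the crosswalk and decide "
             "whether the car should yield.",
    "ground_truth_trace": [
        {"step": 1, "tool": "object_detector", "output": "pedestrian and crosswalk located"},
        {"step": 2, "tool": "calculator", "output": "distance ≈ 3.2 m"},
    ],
    "final_answer": "The car should yield; the pedestrian is about 3 m from the crosswalk.",
}
```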
The comparison of Agent-X queries with AI-generated queries is shown in the table below. The steps and tool types for queries in ToolBench and m&m's are stated explicitly, as marked in red and blue. The queries in APIBench are simple, containing only one step. In contrast, Agent-X queries are both step-implicit and tool-implicit.
Overview of the Agent-X benchmark: key data statistics, overall frequency of tool usage, number of steps, and distribution of tasks across the six vision-centric environments.
We design the Agent-X benchmark using a semi-automated pipeline that ensures each task is solvable with a defined tool subset and requires deep reasoning over realistic, multimodal scenarios. The pipeline begins with an LMM (Large Multimodal Model) generating candidate queries based on visual input and an available toolset. These queries are then refined by human annotators for clarity and realism. Next, the refined queries are passed back to the LMM to produce step-by-step reasoning traces, including tool calls, intermediate outputs, and final answers. Each trace is manually reviewed for logical consistency and correctness.
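As a rough sketch of the two LMM stages (candidate query generation and reasoning-trace generation), the snippet below uses the OpenAI Python client with GPT-4o; the prompts and function names are illustrative, not the exact pipeline in `generation/`, and the human-refinement stage in between is omitted.

```python
# Minimal sketch of the two LMM stages of the semi-automated pipeline.
# Prompts are illustrative; assumes the OpenAI Python client (`pip install openai`).
import base64
from openai import OpenAI

client = OpenAI()

def encode_image(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def generate_candidate_query(image_path: str, toolset: list[str]) -> str:
    """Stage 1: propose a task query grounded in the image and the available tools."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Propose a realistic multi-step task for this image that "
                         f"implicitly requires tools from: {', '.join(toolset)}."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{encode_image(image_path)}"}},
            ],
        }],
    )
    return response.choices[0].message.content

def generate_reasoning_trace(refined_query: str, image_path: str) -> str:
    """Stage 2: after human refinement, produce a step-by-step trace with tool calls."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Task: {refined_query}\nSolve it step by step, stating the tool "
                         "used at each step, its intermediate output, and the final answer."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{encode_image(image_path)}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```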
We evaluate models on the Agent-X benchmark across three distinct modes (a toy scoring sketch follows the list):
- Step-by-Step: Assesses the agent’s ability to execute individual reasoning steps, focusing on how well it follows structured tool-use sequences grounded in visual inputs.
- Deep Reasoning: Evaluates the coherence and logical consistency of the full reasoning trace. This mode emphasizes the agent’s capacity to integrate visual and textual context to produce semantically meaningful and factually accurate explanations.
- Outcome: Measures the agent’s overall task-solving performance by verifying the correctness of the final answer and appropriate tool usage.
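The toy snippet below illustrates how the three views differ; the official metrics are judge-based and live in `eval/`, so these functions and field names are placeholders rather than the benchmark's scoring code.

```python
# Toy illustration of the three evaluation views (not the official, judge-based metrics in eval/).
from dataclasses import dataclass

@dataclass
class Step:
    tool: str
    output: str

def step_score(predicted: list[Step], reference: list[Step]) -> float:
    """Step-by-Step: fraction of reference steps whose tool choice is matched in order."""
    matched = sum(p.tool == r.tool for p, r in zip(predicted, reference))
    return matched / max(len(reference), 1)

def reasoning_score(trace_text: str) -> float:
    """Deep Reasoning: in the benchmark this is a judge-model rating of coherence
    and faithfulness of the full trace; here it is just a stub."""
    raise NotImplementedError("Use an LMM judge to rate the full trace.")

def outcome_score(final_answer: str, ground_truth: str) -> float:
    """Outcome: correctness of the final answer (exact match as a placeholder)."""
    return float(final_answer.strip().lower() == ground_truth.strip().lower())
```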
We report results using GPT-4 and Qwen-15B as evaluation judges. For each metric, the best-performing value is shown in bold and underlined, while the second-best is italicized.
Model | Ground_s | Tool_p | Tool_acc | Fact_acc | Context_s | Fact_p | Sem_acc | Goal_acc | Goal_acc* | Tool_acc_s |
---|---|---|---|---|---|---|---|---|---|---|
Open-source | ||||||||||
Phi-4-VL-Instruct | 0.13 | 0.21 | 0.24 | 0.61 | 0.19 | 0.47 | 0.40 | 0.11 | 0.26 | 0.42 |
InternVL-2.5-8B | 0.45 | 0.31 | 0.47 | 0.68 | 0.47 | 0.52 | 0.60 | 0.28 | 0.55 | 0.58 |
Gemma-3-4B | 0.26 | 0.30 | 0.78 | 0.61 | 0.54 | 0.38 | 0.54 | 0.27 | 0.67 | 0.60 |
InternVL-3-8B | 0.46 | 0.34 | 0.54 | 0.68 | 0.45 | 0.70 | 0.40 | 0.20 | 0.59 | 0.62 |
VideoLLaMA-3-7B | 0.45 | 0.28 | 0.46 | 0.65 | 0.46 | 0.62 | 0.54 | 0.28 | 0.54 | 0.54 |
Qwen-2.5-VL-7B | 0.54 | 0.43 | 0.63 | 0.75 | 0.57 | 0.56 | 0.67 | 0.36 | 0.65 | 0.67 |
Pixtral-12B | 0.12 | 0.20 | 0.63 | 0.45 | 0.19 | 0.26 | 0.34 | 0.07 | 0.55 | 0.54 |
LLaMA-3.2-11B-Vision | 0.03 | 0.15 | 0.14 | 0.70 | 0.08 | 0.70 | 0.24 | 0.07 | 0.26 | 0.42 |
Kimi-VL-A3B-Thinking | 0.26 | 0.19 | 0.50 | 0.62 | 0.42 | 0.52 | 0.65 | 0.29 | 0.29 | 0.48 |
mPLUG-Owl3-7B-240728 | 0.10 | 0.14 | 0.30 | 0.49 | 0.25 | 0.32 | 0.37 | 0.11 | 0.26 | 0.50 |
Closed-source | ||||||||||
Gemini-1.5-Pro | 0.43 | 0.23 | 0.84 | 0.62 | 0.45 | 0.53 | 0.62 | 0.04 | 0.56 | 0.48 |
Gemini-2.5-Pro | 0.40 | 0.36 | 0.81 | 0.72 | 0.48 | 0.64 | 0.73 | 0.40 | 0.56 | 0.62 |
GPT-4o | 0.60 | 0.47 | 0.72 | 0.81 | 0.57 | 0.79 | 0.59 | 0.37 | 0.70 | 0.68 |
OpenAI o4-mini | 0.42 | 0.32 | 0.89 | 0.71 | 0.51 | 0.60 | 0.80 | 0.45 | 0.67 | 0.63 |
Model | Ground_s | Tool_p | Tool_acc | Fact_acc | Context_s | Fact_p | Sem_acc | Goal_acc | Goal_acc* | Tool_acc_s |
---|---|---|---|---|---|---|---|---|---|---|
Open-source | ||||||||||
Phi-4-VL-Instruct | 0.27 | 0.11 | 0.32 | 0.54 | 0.39 | 0.59 | 0.46 | 0.16 | 0.35 | 0.39 |
InternVL2.5-8B | 0.38 | 0.16 | 0.49 | 0.63 | 0.51 | 0.61 | 0.55 | 0.29 | 0.53 | 0.53 |
Gemma-3-4B | 0.50 | 0.24 | 0.67 | 0.74 | 0.66 | 0.59 | 0.74 | 0.30 | 0.68 | 0.68 |
InternVL3-8B | 0.41 | 0.16 | 0.51 | 0.71 | 0.61 | 0.60 | 0.69 | 0.23 | 0.51 | 0.62 |
VideoLLaMA3-7B | 0.39 | 0.15 | 0.40 | 0.68 | 0.56 | 0.60 | 0.68 | 0.27 | 0.53 | 0.56 |
Qwen2.5-VL-7B | 0.51 | 0.27 | 0.63 | 0.77 | 0.66 | 0.64 | 0.77 | 0.37 | 0.62 | 0.67 |
Pixtral-12B | 0.30 | 0.17 | 0.68 | 0.59 | 0.50 | 0.42 | 0.58 | 0.10 | 0.68 | 0.58 |
LLaMA-3.2-11B-Vision | 0.16 | 0.06 | 0.12 | 0.49 | 0.17 | 0.74 | 0.20 | 0.10 | 0.11 | 0.15 |
Kimi-VL-A3B-Thinking | 0.47 | 0.20 | 0.59 | 0.79 | 0.64 | 0.68 | 0.74 | 0.35 | 0.60 | 0.62 |
mPLUG-Owl3-7B-240728 | 0.30 | 0.11 | 0.31 | 0.59 | 0.48 | 0.48 | 0.56 | 0.16 | 0.45 | 0.48 |
Closed-source | ||||||||||
Gemini-1.5-Pro | 0.57 | 0.36 | 0.80 | 0.82 | 0.73 | 0.76 | 0.63 | 0.05 | 0.77 | 0.71 |
Gemini-2.5-Pro | 0.63 | 0.40 | 0.84 | 0.86 | 0.76 | 0.80 | 0.83 | 0.50 | 0.74 | 0.72 |
GPT-4o | 0.46 | 0.27 | 0.63 | 0.72 | 0.59 | 0.75 | 0.69 | 0.44 | 0.48 | 0.56 |
OpenAI-o4-mini | 0.63 | 0.35 | 0.86 | 0.89 | 0.78 | 0.79 | 0.88 | 0.53 | 0.64 | 0.69 |
See generation/README.md for details on:
- Frame extraction from video clips
- Query generation using GPT-4o
- Step-by-step reasoning trace generation
📁 Path: generation/README.md
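For orientation, a minimal frame-extraction sketch is shown below. It is not the script shipped in `generation/`; it assumes OpenCV and a default sampling rate of one frame per second.

```python
# Minimal frame-extraction sketch: samples frames from a clip at a fixed rate
# using OpenCV and writes JPEGs to an output folder.
import os
import cv2  # pip install opencv-python

def extract_frames(video_path: str, out_dir: str, fps_sample: float = 1.0) -> int:
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    video_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0   # fall back if the container reports 0
    step = max(int(round(video_fps / fps_sample)), 1)
    saved, idx = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            cv2.imwrite(os.path.join(out_dir, f"frame_{saved:05d}.jpg"), frame)
            saved += 1
        idx += 1
    cap.release()
    return saved
```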
See analysis/README.md for:
- Error analysis notebook
- Model comparison plots
- Tool usage breakdown and visualizations
📁 Path: analysis/README.md
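As a rough example of the kind of tool-usage breakdown produced there (not the actual notebook code; the JSONL layout and the `steps`/`tool` field names are assumptions):

```python
# Illustrative tool-usage breakdown: counts which tool each predicted step called
# and renders a bar chart. Adapt field names to your model's output format.
import json
from collections import Counter
import matplotlib.pyplot as plt

def plot_tool_usage(trace_file: str) -> None:
    counts = Counter()
    with open(trace_file) as f:                 # assumes one JSON object per line
        for line in f:
            record = json.loads(line)
            counts.update(step["tool"] for step in record.get("steps", []))
    tools, freqs = zip(*counts.most_common())
    plt.bar(tools, freqs)
    plt.ylabel("number of calls")
    plt.xticks(rotation=45, ha="right")
    plt.tight_layout()
    plt.savefig("tool_usage.png")
```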
See eval/ for:
- Scripted evaluation of model inference results
- Accuracy metrics, binary matching scores, and goal success analysis
- Useful for benchmarking your model outputs against Agent-X GT
📁 Path: eval/
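A hedged sketch of what a comparison against the ground truth might look like is given below; the official scripts in `eval/` are judge-based, and the JSONL layout and field names here are assumptions.

```python
# Toy comparison of model outputs against ground-truth traces (not the official eval/ scripts).
# Assumes both files are JSONL with "task_id", "steps" (each with a "tool"), and "final_answer".
import json

def evaluate(pred_path: str, gt_path: str) -> dict:
    preds = {r["task_id"]: r for r in map(json.loads, open(pred_path))}
    gts = {r["task_id"]: r for r in map(json.loads, open(gt_path))}
    tool_hits, tool_total, goal_hits = 0, 0, 0
    for task_id, gt in gts.items():
        pred = preds.get(task_id, {"steps": [], "final_answer": ""})
        tool_total += len(gt["steps"])
        tool_hits += sum(p["tool"] == g["tool"] for p, g in zip(pred["steps"], gt["steps"]))
        goal_hits += pred["final_answer"].strip() == gt["final_answer"].strip()
    return {
        "tool_accuracy": tool_hits / max(tool_total, 1),   # binary matching of tool choices
        "goal_accuracy": goal_hits / max(len(gts), 1),     # exact-match goal success
    }
```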
If you use Agent-X in your research, please cite the following paper:
@misc{ashraf2025agentxevaluatingdeepmultimodal,
title={Agent-X: Evaluating Deep Multimodal Reasoning in Vision-Centric Agentic Tasks},
author={Tajamul Ashraf and Amal Saqib and Hanan Ghani and Muhra AlMahri and Yuhao Li and Noor Ahsan and Umair Nawaz and Jean Lahoud and Hisham Cholakkal and Mubarak Shah and Philip Torr and Fahad Shahbaz Khan and Rao Muhammad Anwer and Salman Khan},
year={2025},
eprint={2505.24876},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2505.24876},
}