
Structured Attention Matters to Multimodal LLMs in Document Understanding

¹vivo Mobile Communication Co., Ltd   ²The University of Queensland   ³University of California, Merced

🔥 Update

  • [2025-06-18]: 🚀 Code released.

🎯 Overview

(Figure: teaser)
  • We investigate the significance of structured input in document understanding and propose a simple yet effective method to preserve textual structure. By reformatting the input, we enhance the document comprehension capabilities of multimodal large language models (MLLMs). Furthermore, through attention analysis, we explore the underlying reasons for the importance of structured input.

  • The key contributions of this work are:

    1. Efficient Structure-Preserving Method: We introduce a structure-preserving approach that leverages LaTeX formatting to provide structured input for MLLMs (see the sketch after this list).

    2. Attention Analysis: We demonstrate that structured input leads to structured attention patterns, thereby improving model performance.

    3. Multimodal Structured Input: We show that structured inputs—for both text and images—are essential to achieving structured attention across modalities, ultimately boosting overall performance.
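
To make the idea concrete, here is a minimal, hypothetical Python sketch (not the repository's actual code): the helper rows_to_latex and the sample rows are illustrative assumptions, showing how flat table cells could be rewritten as LaTeX before being placed in the prompt.

# Hypothetical illustration: convert extracted table cells into LaTeX so the
# model receives structure-preserving text instead of space-joined cells.
def rows_to_latex(rows):
    """rows: list of rows, each a list of cell strings extracted from a table."""
    n_cols = max(len(r) for r in rows)
    header = "\\begin{tabular}{" + "l" * n_cols + "}"
    body = " \\\\\n".join(" & ".join(row) for row in rows)
    return header + "\n" + body + "\n\\end{tabular}"

rows = [["Item", "Quantity"], ["Pens", "12"], ["Notebooks", "3"]]
structured_text = rows_to_latex(rows)  # placed in the prompt as structured input
print(structured_text)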

(Figure: main results)
  • Extensive experiments demonstrate that our structure-preserving method significantly enhances document understanding performance merely by changing the input format, and the accompanying attention analysis shows why structured input matters.

🕹️ Usage

Environment Setup

conda create -n structureM python=3.12
conda activate structureM
cd structure-matters
bash install.sh

Data Preparation

  • Create a data directory:
mkdir data
cd data
  • Download the dataset from the Hugging Face link and place it in the data directory. You can either create a symbolic link or copy the files.
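For example, a minimal Python sketch of this step (the source path and target directory name are placeholders, not paths defined by this repository):

# Link (or copy) the downloaded dataset into ./data; adjust the paths to your setup.
import os, shutil
src = "/path/to/downloaded/dataset"  # placeholder: your actual download location
os.symlink(src, "data/dataset", target_is_directory=True)
# or, to copy instead of linking:
# shutil.copytree(src, "data/dataset")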

  • Return to the project root:

cd ../
  • Extract the data using:
python scripts/extract.py --config-name <dataset>  # (choose from mmlb / ldu / ptab / feta)

The extracted texts and images will be saved in ./tmp/.

Note: For all experiments, <dataset> should be one of mmlb / ldu / ptab / feta, and <run-name> can be any string that uniquely identifies the run (required).

Answer Generation with Different Input Formats

For MMLongBench and LongDocUrl, which have ground truth retrieval results, use the following command to run different experiments.

Here, <dataset> should be one of mmlb / ldu, and <run-name> can be any string that uniquely identifies the run (required).

  • To use images (or any other format) as input, modify the input_type parameter in config/base.yaml.

Choose from: structured-input / image / image-text

python scripts/predict.py --config-name <dataset> run-name=<run-name>  
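
As an optional sanity check, a minimal Python sketch that prints the configured input format; it assumes config/base.yaml is plain YAML with a top-level input_type key (as referenced above) and that PyYAML is available.

# Print the configured input format from config/base.yaml (assumed layout).
import yaml
with open("config/base.yaml") as f:
    cfg = yaml.safe_load(f)
print(cfg.get("input_type"))  # expected: structured-input / image / image-text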

Attention Analysis for Single and Multiple Samples

python scripts/attention_analysis.py --config-name <dataset> run-name=<run-name>  

Note: This project provides some question samples from MMLongBench for generating heatmaps.
These samples are located in ./results/MMLongBench/images_question_for_heat_map.json.
Before generating a heatmap, obtain the corresponding structured text for each sample and pass it as an input parameter.
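
A minimal Python sketch for inspecting these samples (nothing about the JSON schema is assumed beyond it being valid JSON):

# Load and summarize the provided MMLongBench heatmap question samples.
import json
with open("results/MMLongBench/images_question_for_heat_map.json") as f:
    samples = json.load(f)
print(type(samples).__name__, len(samples))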

🏅 Experiments

  • Comparison of various models with text / structured text on different datasets
(Figure: Ablation)
  • Comparison of various models with structured text on different subsets of MMLongBench
(Figure: Ablation)
  • Please refer to our paper for detailed experimental results.

📑 Citation

If you find our project useful, please star our repo and cite our paper:

@article{liu2025structured,
  title={Structured Attention Matters to Multimodal LLMs in Document Understanding},
  author={Liu, Chang and Chen, Hongkai and Cai, Yujun and Wu, Hang and Ye, Qingwen and Yang, Ming-Hsuan and Wang, Yiwei},
  journal={Authorea Preprints},
  year={2025},
  publisher={Authorea}
}

📝 Related Projects

Our repository is based on the following projects; we sincerely thank the authors for their great efforts and excellent work.

  • MDocAgent: a multi-agent document understanding framework.

License

This project is licensed under the terms of the Apache License 2.0. You are free to use, modify, and distribute this software under the conditions of the license. See the LICENSE file for details.
