
Structured Attention Matters to Multimodal LLMs in Document Understanding

¹vivo Mobile Communication Co., Ltd   ²The University of Queensland   ³University of California, Merced

🔥 Update

  • [2025-06-18]: 🚀 Code released.

🎯 Overview

(Figure: teaser)
  • We investigate the significance of structured input in document understanding and propose a simple yet effective method to preserve textual structure. By reformatting the input, we enhance the document comprehension capabilities of multimodal large language models (MLLMs). Furthermore, through attention analysis, we explore the underlying reasons for the importance of structured input.

  • The key contributions of this work are:

    1. Efficient Structure-Preserving Method: We introduce a structure-preserving approach that leverages LaTeX formatting to provide structured input for MLLMs (see the sketch after this list).

    2. Attention Analysis: We demonstrate that structured input leads to structured attention patterns, thereby improving model performance.

    3. Multimodal Structured Input: We show that structured inputs—for both text and images—are essential to achieving structured attention across modalities, ultimately boosting overall performance.
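
To make the idea concrete, here is a minimal, hypothetical Python sketch (not the repository's actual code): the helper rows_to_latex and the sample rows are illustrative assumptions, showing how flat table cells could be rewritten as LaTeX before being placed in the prompt.

# Hypothetical illustration: convert extracted table cells into LaTeX so the
# model receives structure-preserving text instead of space-joined cells.
def rows_to_latex(rows):
    """rows: list of rows, each a list of cell strings extracted from a table."""
    n_cols = max(len(r) for r in rows)
    header = "\\begin{tabular}{" + "l" * n_cols + "}"
    body = " \\\\\n".join(" & ".join(row) for row in rows)
    return header + "\n" + body + "\n\\end{tabular}"

rows = [["Item", "Quantity"], ["Pens", "12"], ["Notebooks", "3"]]
structured_text = rows_to_latex(rows)  # placed in the prompt as structured input
print(structured_text)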

(Figure: main results)
  • Extensive experiments demonstrate that our structure-preserving method significantly enhances document understanding performance merely by changing the input format, and the accompanying attention analysis shows why structured input matters.

🕹️ Usage

Environment Setup

conda create -n structureM python=3.12
conda activate structureM
cd structure-matters
bash install.sh

Data Preparation

  • Create a data directory:
mkdir data
cd data
  • Download the dataset from the Hugging Face link and place it in the data directory. You can either create a symbolic link or copy the files.
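For example, a minimal Python sketch of this step (the source path and target directory name are placeholders, not paths defined by this repository):

# Link (or copy) the downloaded dataset into ./data; adjust the paths to your setup.
import os, shutil
src = "/path/to/downloaded/dataset"  # placeholder: your actual download location
os.symlink(src, "data/dataset", target_is_directory=True)
# or, to copy instead of linking:
# shutil.copytree(src, "data/dataset")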

  • Return to the project root:

cd ../
  • Extract the data using:
python scripts/extract.py --config-name <dataset>  # (choose from mmlb / ldu / ptab / feta)

The extracted texts and images will be saved in ./tmp/.

Note: For all experiments, <dataset> should be one of mmlb / ldu / ptab / feta, and <run-name> can be any string that uniquely identifies the run (required).

Answer Generation with Different Input Formats

For MMLongBench and LongDocUrl, which have ground truth retrieval results, use the following command to run different experiments.

Here, <dataset> should be one of mmlb / ldu, and <run-name> can be any string that uniquely identifies the run (required).

  • To use images (or any other format) as input, modify the input_type parameter in config/base.yaml.

Choose from: structured-input / image / image-text

python scripts/predict.py --config-name <dataset> run-name=<run-name>  
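
As an optional sanity check, a minimal Python sketch that prints the configured input format; it assumes config/base.yaml is plain YAML with a top-level input_type key (as referenced above) and that PyYAML is available.

# Print the configured input format from config/base.yaml (assumed layout).
import yaml
with open("config/base.yaml") as f:
    cfg = yaml.safe_load(f)
print(cfg.get("input_type"))  # expected: structured-input / image / image-text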

Attention Analysis for Single and Multiple Samples

python scripts/attention_analysis.py --config-name <dataset> run-name=<run-name>  

Note: This project provides some question samples from MMLongBench for generating heatmaps.
These samples are located in ./results/MMLongBench/images_question_for_heat_map.json.
Before generating a heatmap, obtain the corresponding structured text for each sample and pass it as an input parameter.
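
A minimal Python sketch for inspecting these samples (nothing about the JSON schema is assumed beyond it being valid JSON):

# Load and summarize the provided MMLongBench heatmap question samples.
import json
with open("results/MMLongBench/images_question_for_heat_map.json") as f:
    samples = json.load(f)
print(type(samples).__name__, len(samples))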

🏅 Experiments

  • Comparison of various models with text / structured text on different datasets
(Figure: Ablation)
  • Comparison of various models with structured text on different subsets of MMLongBench
(Figure: Ablation)
  • Please refer to our paper for detailed experimental results.

📑 Citation

If you find our project useful, please star our repo and cite our paper:

@article{liu2025structured,
  title={Structured Attention Matters to Multimodal LLMs in Document Understanding},
  author={Liu, Chang and Chen, Hongkai and Cai, Yujun and Wu, Hang and Ye, Qingwen and Yang, Ming-Hsuan and Wang, Yiwei},
  journal={Authorea Preprints},
  year={2025},
  publisher={Authorea}
}

📝 Related Projects

Our repository is based on the following projects; we sincerely thank the authors for their great efforts and excellent work.

  • MDocAgent: a multi-agent document understanding framework.

License

This project is licensed under the terms of the Apache License 2.0. You are free to use, modify, and distribute this software under the conditions of the license. See the LICENSE file for details.
