INS-MMBench: A Comprehensive Benchmark for Evaluating LVLMs' Performance in Insurance
Chenwei Lin1, Hanjia Lyu2, Xian Xu1, Jiebo Luo2
1 Fudan University 2 University of Rochester
🔥 Accepted for publication in ICCV 2025
INS-MMBench is the first comprehensive LVLM benchmark for the insurance domain. It covers four representative insurance types (auto, property, health, and agricultural insurance) and key insurance stages such as risk underwriting, risk monitoring, and claim processing. INS-MMBench is organized into three layers of tasks:
- Fundamental tasks, which focus on the understanding of individual insurance-related visual elements;
- Meta-tasks, which involve the compositional understanding of multiple insurance-related visual elements;
- Scenario tasks, which pertain to real-world insurance tasks requiring multi-step reasoning and decision-making.
INS-MMBench includes a total of 12,052 images and 10,372 carefully designed questions (both multiple-choice and free-text visual questions), comprehensively covering 5 scenario tasks, 12 meta-tasks, and 22 fundamental tasks.
Our experiments consist of two stages. In the first stage, we evaluate all selected LVLMs on the fundamental tasks and meta-tasks to assess their performance on basic multimodal insurance tasks. In the second stage, we select a subset of top-performing models for scenario task evaluation, where models are assessed on end-to-end multimodal reasoning to gauge their effectiveness in real-world insurance applications. Based on the evaluation results, we make the following observations:
- GPT-4o leads in performance, but insurance tasks remain challenging for LVLMs: GPT-4o outperforms all other models with a score of 69.70, yet most LVLMs do not significantly surpass human performance across various insurance tasks, indicating substantial room for improvement in this domain.
- LVLMs show significant variance across insurance types and meta-tasks: LVLMs perform better in auto and health insurance tasks than in property and agricultural insurance, excelling in visual perception tasks but struggling with knowledge-intensive reasoning tasks, which suggests a progressive adoption strategy focused on their strengths.
- LVLMs exhibit significant limitations in complex insurance scenarios: Current LVLMs struggle with multi-step reasoning, with only a few tasks exceeding 80% accuracy, underscoring the need for domain-specific datasets and reinforcement learning to enhance reasoning capabilities for insurance applications.
- The gap between open-source and closed-source LVLMs is narrowing: While closed-source LVLMs still hold an advantage, some open-source models are approaching comparable performance, indicating that training high-performance, domain-specific LVLMs could be a key strategy for insurance applications.
To set up the project, run the following commands:

```bash
git clone https://github.com/FDU-INS/INS-MMBench.git
cd INS-MMBench
pip install -r requirements.txt
```
- All VLMs are configured in `vlmeval/config.py`.
- For some VLMs, you need to configure the code root (e.g., MiniGPT-4, PandaGPT) or the model_weight root (e.g., LLaVA-v1-7B) before conducting the evaluation.
- During evaluation, use the model name specified in `supported_VLM` in `vlmeval/config.py` to select the VLM.
- For models not listed in `supported_VLM`, you can customize the configuration (a minimal sketch follows below).
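As a rough illustration (not taken from the repository), a custom model might be registered in `supported_VLM` along the following lines; the wrapper class `MyCustomVLM` and its constructor arguments are hypothetical and would need to follow the model interface used by the wrappers under `vlmeval/vlm/`.

```python
# vlmeval/config.py (sketch with a hypothetical entry)
from functools import partial

from vlmeval.vlm import MyCustomVLM  # hypothetical wrapper class you implement

supported_VLM = {
    # ... existing entries ...
    # The key is the model name you pass to `--model` on the command line.
    'MyCustomVLM-7B': partial(MyCustomVLM, model_path='/path/to/weights'),
}
```

The `partial(...)` pattern mirrors how the built-in entries are declared, so the evaluator can look the model up by name and instantiate it lazily.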
- The datasets can be downloaded through the provided Hugging Face link.
- Place the dataset folder in the default path `$HOME/LMUData` or a custom path (one possible download route is sketched below).
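As one possible route (an assumption, not an official script), the data could be pulled with `huggingface_hub` directly into the default `$HOME/LMUData` path; the `repo_id` below is a placeholder for the repository behind the provided Hugging Face link.

```python
# Sketch: fetch the benchmark files into the default $HOME/LMUData directory.
import os
from huggingface_hub import snapshot_download

target_dir = os.path.join(os.path.expanduser("~"), "LMUData")
snapshot_download(
    repo_id="<org>/INS-MMBench",  # placeholder; use the ID from the provided link
    repo_type="dataset",
    local_dir=target_dir,
)
```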
- Update the `.env` file with the necessary information, such as API keys, base URLs, and other settings required for model integration (an illustrative example follows).
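For example, a minimal `.env` for evaluating OpenAI-hosted models might look like the snippet below; `OPENAI_API_KEY` and `OPENAI_API_BASE` are commonly used variable names, but the exact set of variables depends on which models and endpoints you evaluate.

```
# .env (illustrative) -- keep this file out of version control
OPENAI_API_KEY=sk-...                         # key for OpenAI models such as GPT-4o
OPENAI_API_BASE=https://api.openai.com/v1     # override when using a proxy endpoint
```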
- Use the `run.py` script for evaluation.
- You can execute it via `$VLMEvalKit/run.py` or create a soft-link to use the script anywhere.

The main arguments of `run.py` are listed below:
| Argument | Type | Description |
|---|---|---|
| `--data` | `list[str]` | Specify dataset names supported in VLMEvalKit (defined in `vlmeval/utils/data_util.py`). |
| `--model` | `list[str]` | Specify VLM names supported in VLMEvalKit (defined in `supported_VLM` in `vlmeval/config.py`). |
| `--mode` | `str` | Evaluation mode: `"all"` (default) runs both inference and evaluation; `"infer"` runs inference only. |
| `--nproc` | `int` | Number of threads for OpenAI API calling (default: 4). |
| `--verbose` | `bool` | Enable verbose logging. |
For example, to evaluate GPT-4o on INS-MMBench:

```bash
python run.py --data INS-MMBench --model GPT4o --verbose --nproc 4
```
The final results will be saved in a folder named after the model used for evaluation.
To reproduce the result tables (Tables 3-7) reported in the paper, run:

```bash
python reproducibility/table_3.py
python reproducibility/table_4.py
python reproducibility/table_5.py
python reproducibility/table_6.py
python reproducibility/table_7.py
```
If you find our work useful in your project or research, please use the following BibTeX entry to cite our paper. Thanks!
```bibtex
@article{insmmbench,
  title={INS-MMBench: A Comprehensive Benchmark for Evaluating LVLMs' Performance in Insurance},
  author={Lin, Chenwei and Lyu, Hanjia and Xu, Xian and Luo, Jiebo},
  journal={arXiv preprint arXiv:2406.09105},
  year={2024}
}
```
Explore our related research:
We express our sincere gratitude to the following projects:
- VLMEvalKit provides useful out-of-the-box evaluation tools and implementations of many advanced models. We thank its authors for their dedication.