Project Logo

More Thinking, Less Seeing? Assessing Amplified Hallucination in Multimodal Reasoning Models

Teaser figure

News

  • [May 2025] Paper is now available. 📢

TO DO

  • Release visualization tools and reasoning length control strategies.
  • Release small-scale RH-Bench benchmark.
  • Expand and refine RH-Bench to support more multimodal reasoning models. Coming soon!

❗Abstract

Test-time compute has empowered multimodal large language models to generate extended reasoning chains, yielding strong performance on tasks such as multimodal math reasoning. However, we observe that this improved reasoning ability often comes with increased hallucination: as generations become longer, models tend to drift away from image-grounded content and rely more on language priors. Attention analysis reveals that longer reasoning chains reduce focus on visual inputs, contributing to hallucination. To systematically study this phenomenon, we introduce RH-AUC, a metric that quantifies how a model’s perception accuracy changes with reasoning length, enabling evaluation of whether the model preserves visual grounding while reasoning. We also release RH-Bench, a diagnostic benchmark covering diverse multimodal tasks, designed to jointly assess the balance of reasoning ability and hallucination. We find that (i) larger models generally exhibit a better balance between reasoning and perception; and (ii) this balance depends more on the types and domains of the training data than its volume. Our findings highlight the need for evaluation frameworks that account for both reasoning quality and perceptual reliability.

🎯 Visualization

# Attention heatmap for a single image
python heatmap.py \
  --image_path /data/image.jpg \
  --question "Describe this image in detail."

# Layer-wise attention distribution over the reasoning chain
python layer_analysis.py \
  --model-path "R1-OneVision/" \
  --image-folder "images/" \
  --question-file "question.jsonl" \
  --answers-file "./results.pt" \
  --plot-path "./attention_distribution.png"
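
For reference, the quantity plotted by layer_analysis.py can be approximated as the share of attention mass that falls on the image tokens at each layer. A minimal Python sketch under that assumption (the function name, tensor layout, and the way the image-token span is located are illustrative, not the script's exact implementation):

import torch

def visual_attention_ratio(layer_attn: torch.Tensor, image_span: tuple) -> float:
    """Fraction of attention mass placed on image tokens for one layer.

    layer_attn: [batch, num_heads, seq_len, seq_len] softmaxed attention weights,
                e.g. one element of outputs.attentions from a forward pass run
                with output_attentions=True.
    image_span: (start, end) indices of the visual tokens in the input sequence;
                locating this span is model-specific and assumed known here.
    """
    start, end = image_span
    # Each query row sums to 1, so summing over the image keys gives that row's
    # share of attention on the image; average over batch, heads, and queries.
    return layer_attn[..., start:end].sum(dim=-1).mean().item()

# Hypothetical usage with a Hugging Face vision-language model:
# outputs = model(**inputs, output_attentions=True)
# ratios = [visual_attention_ratio(a, image_span) for a in outputs.attentions]

Restricting the average to the generated reasoning tokens, rather than all query positions, shows how visual focus changes as the chain grows longer.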

Teaser figure

🕹️ Reasoning Length Control

Budget Forcing & Test Time Scaling

Refer to budget_forcing.py and Scaling_more.py in the length_control directory.
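
Those scripts are not reproduced here, but the general idea behind budget forcing is to cap the thinking segment at decode time: stop reasoning once a token budget is exhausted, force the end-of-thinking delimiter, and only then decode the answer. A rough sketch of the capping case, assuming a generic Hugging Face model interface and a </think> delimiter (both are assumptions about your setup, not the scripts' exact interface; image inputs are omitted for brevity):

import torch

def generate_with_thinking_budget(model, tokenizer, prompt,
                                  think_budget=512, answer_budget=256,
                                  end_think="</think>"):
    """Cap the thinking segment at `think_budget` tokens, then decode the answer.

    `end_think` is the model's end-of-thinking delimiter; adjust it to the chat
    template of the model you are using. A multimodal model would additionally
    take image inputs, which are omitted in this sketch.
    """
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # Pass 1: reasoning, hard-capped at the token budget.
    with torch.no_grad():
        think_ids = model.generate(**inputs, max_new_tokens=think_budget, do_sample=False)
    text = tokenizer.decode(think_ids[0], skip_special_tokens=True)
    if end_think in text:
        text = text.split(end_think)[0] + end_think   # keep only the capped reasoning
    else:
        text += end_think                             # force the delimiter if the budget ran out
    # Pass 2: answer, conditioned on the (possibly truncated) reasoning.
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out_ids = model.generate(**inputs, max_new_tokens=answer_budget, do_sample=False)
    return tokenizer.decode(out_ids[0], skip_special_tokens=True)

Scaling in the other direction typically works the opposite way: the end-of-thinking delimiter is suppressed or a continuation cue is appended so the model keeps reasoning for longer.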

Latent State Steering

Step 1: Collect responses from multimodal reasoning models on various tasks; these responses are later used to extract hidden states from the model's internal attention layers.

python generate_response_your_data.py \
  --input "/your/dataset/path/annotation.jsonl" \
  --output "/your/output/path/results.jsonl" \
  --model_id "/your/model/path/ModelDirectory/" \
  --num_samples 100 \
  --device "cuda:2"

Step 2: Extract per-layer directional vectors from the residual inputs of the self-attention mechanism in multimodal reasoning models. The script supports two modes: text mode, which processes only the question and thinking tokens, and vision mode, which processes the image along with the question and thinking tokens.

python get_direction.py \
  --model /path/to/your/model/ \
  --json_path /path/to/your/response.jsonl \
  --output_path /path/to/save/steering_direction.pt \
  --mode text
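
For intuition, one common way to obtain such per-layer steering directions is the difference of mean hidden states between two contrasting groups of responses (for example, long versus short reasoning traces). A minimal sketch under that assumption; the grouping criterion, function names, and the use of output_hidden_states as a stand-in for the attention residual inputs are illustrative, not get_direction.py's exact logic:

import torch

def mean_hidden_state(model, tokenizer, texts, layer):
    """Average hidden state at `layer`, first over tokens, then over texts."""
    feats = []
    for text in texts:
        ids = tokenizer(text, return_tensors="pt").to(model.device)
        with torch.no_grad():
            hs = model(**ids, output_hidden_states=True).hidden_states[layer]  # [1, seq, dim]
        feats.append(hs.mean(dim=1).squeeze(0))
    return torch.stack(feats).mean(dim=0)

def extract_direction(model, tokenizer, long_texts, short_texts, layer):
    # Steering direction = mean activation of long-reasoning responses minus
    # mean activation of short-reasoning responses, unit-normalized.
    d = (mean_hidden_state(model, tokenizer, long_texts, layer)
         - mean_hidden_state(model, tokenizer, short_texts, layer))
    return d / d.norm()

# Hypothetical end-to-end usage:
# directions = {l: extract_direction(model, tok, long_texts, short_texts, l)
#               for l in range(model.config.num_hidden_layers)}
# torch.save(directions, "steering_direction.pt")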

Step 3: Control the reasoning length of the multimodal model and obtain responses under each steering state. The default steering-weight range is [-0.1, 0.1]; it can be adjusted for different datasets and tasks, but note that weights of large magnitude in either direction may degrade the model's performance. The script supports both automated sweeping over a range (given as start, end, and step) and manual specification of individual values.

python steering_mlrm.py \
  --dataset /path/to/dataset.jsonl \
  --output results/output.jsonl \
  --model_id /path/to/model \
  --image_root /path/to/images \
  --direction_path /path/to/direction_vector.pt \
  --direction_weights_range -0.1 0.1 0.01 \
  --num_samples 100 \
  --device cuda:0
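
Conceptually, applying the steering vector amounts to adding weight * direction to the residual stream during generation. A minimal sketch using PyTorch forward pre-hooks on the decoder layers (the model.model.layers attribute path and the dict format of the saved directions are assumptions that vary by architecture and by how Step 2's output was saved):

import torch

def add_steering_hooks(model, directions, weight):
    """Shift the input of decoder layer i by `weight * directions[i]` on every forward pass.

    directions: dict {layer_index: tensor of shape [hidden_dim]}, e.g. loaded with
                torch.load("steering_direction.pt").
    """
    handles = []
    for i, layer in enumerate(model.model.layers):  # attribute path is model-specific
        if i not in directions:
            continue
        vec = (weight * directions[i]).to(model.device, model.dtype)

        def pre_hook(module, args, vec=vec):
            # args[0] is the layer's hidden_states input (the residual stream).
            return (args[0] + vec,) + args[1:]

        handles.append(layer.register_forward_pre_hook(pre_hook))
    return handles  # call h.remove() on each handle to restore the unsteered model

# Hypothetical sweep matching the --direction_weights_range flag above:
# for w in torch.arange(-0.1, 0.1 + 1e-9, 0.01):
#     handles = add_steering_hooks(model, directions, w.item())
#     ...generate and record responses...
#     for h in handles:
#         h.remove()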

Teaser figure

🧐 Evaluation

Model

Model Link
R1-Onevision 🤗 R1-Onevision
ThinkLite-VL 🤗 ThinkLite-VL
MM-Eureka-Qwen 🤗 MM-Eureka-Qwen
Vision-R1 🤗 Vision-R1
Ocean-R1 🤗 Ocean-R1
MM-R1 🤗 MM-R1
Curr-ReFT 🤗 Curr-ReFT
LLM-R1 🤗 LLM-R1
Skywork-R1V 🤗 Skywork-R1V

Control the reasoning length on RH-Bench (or other datasets) using the method provided above or another approach, evaluate the responses, and visualize the RH-AUC scores. Future updates will include more efficient methods for evaluating the reasoning-hallucination balance. 🔔

# Reason
python evaluation_rhbench_reason.py \
--input_dir "/data/steering_reason/" \
--output_dir "/data/steering_reason/score" \
--summary_file "/data/steering_reason/evaluation_summary.txt"

# Hallucination
python evaluation_rhbench_perception.py \
--input_dir "/data/steering_hallu/" \
--output_dir "/data/steering_hallu/score" \
--summary_file "/data/steering_hallu/evaluation_summary.txt"

# RH-AUC Score -- adjust according to your file format or method
python RH-AUC.py \
  --txt_file_reason '/path/to/your/evaluation_summary_reason.txt' \
  --txt_file_hallu '/path/to/your/evaluation_summary_hallucination.txt'
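
For reference, RH-AUC can be read as the area under the curve obtained by pairing perception (hallucination-side) accuracy with reasoning accuracy across the different reasoning-length settings. A minimal sketch under that reading; parsing of the summary files and the exact normalization used by RH-AUC.py are assumptions:

import numpy as np

def rh_auc(reason_acc, hallu_acc):
    """Area under the perception-vs-reasoning accuracy curve.

    reason_acc, hallu_acc: accuracies measured at the same reasoning-length
    (e.g. steering-weight) settings, in matching order.
    """
    x = np.asarray(reason_acc, dtype=float)
    y = np.asarray(hallu_acc, dtype=float)
    order = np.argsort(x)                 # sort by reasoning accuracy before integrating
    x, y = x[order], y[order]
    if x[-1] == x[0]:
        return float(y.mean())            # degenerate case: no spread on the x-axis
    area = float((((y[1:] + y[:-1]) / 2.0) * np.diff(x)).sum())  # trapezoidal rule
    return area / float(x[-1] - x[0])     # normalize by the spanned x-range

# Illustrative numbers only:
# print(rh_auc([0.42, 0.48, 0.55, 0.60, 0.63], [0.71, 0.69, 0.64, 0.58, 0.52]))

A higher score indicates that the model retains its perception accuracy as its reasoning grows longer and stronger.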

🙏 Citation

If you find this code valuable, please cite our paper:

 @misc{liu2025thinkingseeingassessingamplified,
      title={More Thinking, Less Seeing? Assessing Amplified Hallucination in Multimodal Reasoning Models}, 
      author={Chengzhi Liu and Zhongxing Xu and Qingyue Wei and Juncheng Wu and James Zou and Xin Eric Wang and Yuyin Zhou and Sheng Liu},
      year={2025},
      eprint={2505.21523},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.21523},
}
