- [May 2025] Paper is now available. 📢
- Release visualization tools and reasoning length control strategies.
- Release small-scale RH-Bench benchmark.
- Expand and refine RH-Bench to support more multimodal reasoning models. Coming soon!
Test-time compute has empowered multimodal large language models to generate extended reasoning chains, yielding strong performance on tasks such as multimodal math reasoning. However, we observe that this improved reasoning ability often comes with increased hallucination: as generations become longer, models tend to drift away from image-grounded content and rely more on language priors. Attention analysis reveals that longer reasoning chains reduce focus on visual inputs, contributing to hallucination. To systematically study this phenomenon, we introduce RH-AUC, a metric that quantifies how a model’s perception accuracy changes with reasoning length, enabling evaluation of whether the model preserves visual grounding while reasoning. We also release RH-Bench, a diagnostic benchmark covering diverse multimodal tasks, designed to jointly assess the balance of reasoning ability and hallucination. We find that (i) larger models generally exhibit a better balance between reasoning and perception; and (ii) this balance depends more on the types and domains of the training data than on its volume. Our findings highlight the need for evaluation frameworks that account for both reasoning quality and perceptual reliability.
python heatmap.py \
--image_path /data/image.jpg \
--question "Describe this image in detail."
python layer_analysis.py \
--model-path "R1-OneVision/" \
--image-folder "images/" \
--question-file "question.jsonl" \
--answers-file "./results.pt" \
--plot-path "./attention_distribution.png"
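As a reference for what such a layer-wise analysis can measure, here is a minimal sketch that computes, per layer, the fraction of attention mass the response tokens place on image tokens and plots it; layer_analysis.py may compute a different statistic, and `img_start`/`img_end`/`resp_start` are assumed token-index inputs.

```python
# Sketch: per-layer share of attention that generated (response) tokens put on image tokens.
import matplotlib.pyplot as plt

def visual_attention_ratio(attentions, img_start, img_end, resp_start):
    # attentions: tuple over layers, each of shape (batch, heads, seq_len, seq_len)
    ratios = []
    for layer_attn in attentions:
        attn = layer_attn[0].mean(0)                  # head-averaged (seq, seq)
        resp_rows = attn[resp_start:]                 # attention paid by response tokens
        ratio = resp_rows[:, img_start:img_end].sum() / resp_rows.sum()
        ratios.append(ratio.item())
    return ratios

def plot_ratios(ratios, out_path="attention_distribution.png"):
    plt.plot(range(len(ratios)), ratios, marker="o")
    plt.xlabel("Layer")
    plt.ylabel("Attention on image tokens")
    plt.savefig(out_path, bbox_inches="tight")
```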
Refer to budget_forcing.py and Scaling_more.py in the length_control directory.
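For orientation, the following is a minimal sketch of the budget-forcing idea (cap or extend the thinking segment at inference time); the released budget_forcing.py and Scaling_more.py may implement this differently. It assumes a Hugging Face causal LM whose outputs wrap reasoning in plain-text `<think> ... </think>` delimiters, and `/your/model/path` is a placeholder.

```python
# Sketch: budget forcing -- cap the thinking segment at a token budget, or extend it
# by stripping the closing tag and appending "Wait" so the model keeps reasoning.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "/your/model/path"  # placeholder
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16, device_map="auto")

def generate_with_budget(prompt, thinking_budget=512, extend_rounds=0):
    text = ""
    for round_idx in range(extend_rounds + 1):
        inputs = tok(prompt + text, return_tensors="pt").to(model.device)
        out = model.generate(**inputs, max_new_tokens=thinking_budget, do_sample=False)
        text += tok.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
        if round_idx < extend_rounds:
            # Lengthen reasoning: drop the closing tag and nudge the model to continue.
            text = text.split("</think>")[0].rstrip() + " Wait,"
    if "</think>" not in text:
        # Budget exhausted: force the model out of the thinking phase and answer.
        text += "\n</think>\nFinal answer:"
        inputs = tok(prompt + text, return_tensors="pt").to(model.device)
        out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
        text += tok.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    return text
```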
Step 1: Collect responses from multimodal reasoning models on various tasks; these responses are later used to extract hidden states from the models' internal attention.
python generate_response_your_data.py \
--input "/your/dataset/path/annotation.jsonl" \
--output "/your/output/path/results.jsonl" \
--model_id "/your/model/path/ModelDirectory/" \
--num_samples 100 \
--device "cuda:2"
Step 2: Extract per-layer direction vectors from the residual inputs to the self-attention mechanism of the multimodal reasoning model. The script supports two modes: text mode, which processes only the question and thinking tokens, and vision mode, which processes the image together with the question and thinking tokens.
python get_direction.py \
--model /path/to/your/model/ \
--json_path /path/to/your/response.jsonl \
--output_path /path/to/save/steering_direction.pt \
--mode text
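One common way such per-layer direction vectors can be obtained is to capture the residual-stream input of every decoder layer with forward pre-hooks and take the difference of mean activations between two contrasting groups of samples (e.g., long vs. short reasoning traces); get_direction.py may use a different contrast or normalization, and the `model.model.layers` layout below assumes a Qwen/LLaMA-style decoder.

```python
# Sketch: per-layer steering direction = mean(activations, group A) - mean(activations, group B),
# where activations are the residual-stream inputs entering each decoder layer (which feed self-attention).
import torch

def collect_layer_inputs(model, input_ids):
    """Return {layer_idx: mean hidden state entering the layer} for one tokenized sample."""
    captured, hooks = {}, []
    for i, layer in enumerate(model.model.layers):            # Qwen/LLaMA-style layout assumed
        def hook(module, args, idx=i):
            captured[idx] = args[0].detach()[0].mean(dim=0)   # average over sequence positions
        hooks.append(layer.register_forward_pre_hook(hook))
    with torch.no_grad():
        model(input_ids=input_ids)                            # in vision mode, pass image inputs as well
    for h in hooks:
        h.remove()
    return captured

def direction_from_groups(model, group_a_ids, group_b_ids):
    def group_mean(batch):
        sums = {}
        for ids in batch:
            for k, v in collect_layer_inputs(model, ids).items():
                sums[k] = sums.get(k, 0) + v / len(batch)
        return sums
    mean_a, mean_b = group_mean(group_a_ids), group_mean(group_b_ids)
    return {k: mean_a[k] - mean_b[k] for k in mean_a}         # save with torch.save(...) as a .pt file
```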
Step 3: Control the reasoning length of the multimodal model and obtain responses under each steering state. The default weight range is [-0.1, 0.1] and can be adjusted for different datasets and tasks; note that extremely large or small weights may degrade the model's performance. The script supports both an automated parameter sweep via a range input (start, end, step) and manual specification of individual values.
python steering_mlrm.py \
--dataset /path/to/dataset.jsonl \
--output results/output.jsonl \
--model_id /path/to/model \
--image_root /path/to/images \
--direction_path /path/to/direction_vector.pt \
--direction_weights_range -0.1 0.1 0.01 \
--num_samples 100 \
--device cuda:0
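For context, one way a steering weight can be applied at inference is to add `weight * direction` to the residual-stream input of each decoder layer through a forward pre-hook; steering_mlrm.py may apply the vectors differently, and the layer layout again assumes a Qwen/LLaMA-style decoder.

```python
# Sketch: shift every hidden state along the per-layer direction by a chosen weight.
import torch

def add_steering_hooks(model, directions, weight):
    """directions: {layer_idx: tensor of shape (hidden_size,)}; returns removable hook handles."""
    handles = []
    for i, layer in enumerate(model.model.layers):
        if i not in directions:
            continue
        vec = directions[i].to(model.device, dtype=model.dtype)
        def hook(module, args, v=vec):
            return (args[0] + weight * v,) + args[1:]   # steer the residual stream
        handles.append(layer.register_forward_pre_hook(hook))
    return handles

# Usage: sweep weights, generate under each setting, then remove the hooks.
# for w in torch.arange(-0.1, 0.1 + 1e-9, 0.01).tolist():
#     handles = add_steering_hooks(model, directions, w)
#     ... generate and save responses ...
#     for h in handles:
#         h.remove()
```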
Model | Link |
---|---|
R1-Onevision | 🤗 R1-Onevision |
ThinkLite-VL | 🤗 ThinkLite-VL |
MM-Eureka-Qwen | 🤗 MM-Eureka-Qwen |
Vision-R1 | 🤗 Vision-R1 |
Ocean-R1 | 🤗 Ocean-R1 |
MM-R1 | 🤗 MM-R1 |
Curr-ReFT | 🤗 Curr-ReFT |
LLM-R1 | 🤗 LLM-R1 |
Skywork-R1V | 🤗 Skywork-R1V |
Control the reasoning length with the provided method (or any other approach) on RH-Bench or other datasets, then evaluate the responses and visualize the RH-AUC scores. Future updates will include more efficient methods for evaluating the reasoning-hallucination balance. 🔔
# Reasoning
python evaluation_rhbench_reason.py \
--input_dir "/data/steering_reason/" \
--output_dir "/data/steering_reason/score" \
--summary_file "/data/steering_reason/evaluation_summary.txt"
# Hallucination
python evaluation_rhbench_perception.py \
--input_dir "/data/steering_hallu/" \
--output_dir "/data/steering_hallu/score" \
--summary_file "/data/steering_hallu/evaluation_summary.txt"
# RH-AUC Score -- Adjust according to your file format or method.
python RH-AUC.py \
--txt_file_reason '/path/to/your/evaluation_summary_reason.txt' \
--txt_file_hallu '/path/to/your/evaluation_summary_hallucination.txt'
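To make the scoring concrete, here is a minimal sketch of one plausible area-under-curve computation: pair the reasoning accuracy and perception (hallucination) accuracy measured at each steering setting and take the normalized area under the resulting curve. The parsing and exact formula in RH-AUC.py may differ, so treat this purely as an illustration.

```python
# Sketch: normalized area under the (reasoning accuracy, perception accuracy) curve.
import numpy as np

def rh_auc(reason_acc, hallu_acc):
    """reason_acc, hallu_acc: accuracy lists aligned by steering setting, values in [0, 1]."""
    pairs = sorted(zip(reason_acc, hallu_acc))                 # order points by reasoning accuracy
    x = np.array([p[0] for p in pairs], dtype=float)
    y = np.array([p[1] for p in pairs], dtype=float)
    if x.max() == x.min():                                     # degenerate curve: fall back to the mean
        return float(y.mean())
    area = float(np.sum((y[1:] + y[:-1]) / 2.0 * np.diff(x)))  # trapezoidal rule
    return area / (x.max() - x.min())                          # normalize by the spanned x-range

# e.g. rh_auc([0.42, 0.47, 0.51], [0.63, 0.58, 0.50])
```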
If you find this code useful, please cite our paper:
@misc{liu2025thinkingseeingassessingamplified,
title={More Thinking, Less Seeing? Assessing Amplified Hallucination in Multimodal Reasoning Models},
author={Chengzhi Liu and Zhongxing Xu and Qingyue Wei and Juncheng Wu and James Zou and Xin Eric Wang and Yuyin Zhou and Sheng Liu},
year={2025},
eprint={2505.21523},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2505.21523},
}