arXiv 2025
Zeyuan Yang*, Xueyang Yu*, Delin Chen, Maohao Shen, Chuang Gan
We propose Mirage, which interleaves latent visual tokens, compact visual features that serve as internal imagery, with explicit text tokens to solve diverse multimodal reasoning tasks, boosting reasoning performance without full pixel-level image generation.
Table of Contents
- [2025-07-23] We have released our test code, data, and model weights for the VSP spatial planning task.
- [2025-07-09] We have released our training data for the VSP spatial planning task.
- [2025-06-19] We have released the training code!
Create a conda environment and install the required packages:
conda create -n mirage python=3.10
conda activate mirage
git clone https://github.com/UMass-Embodied-AGI/Mirage.git
cd Mirage
pip install -r requirements.txt
pip install -e ./transformers/.
We provide a sample dataset of 100 examples for the VSP spatial reasoning task. Please format your data file as follows:
{
    "text_input": "Question",
    "text_output": "Answer",
    "image_input": "input1.jpg",
    "image_output": "helper_image.jpg"
}
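For reference, here is a minimal sketch of how such a JSONL data file could be assembled in Python. The field names follow the schema above; the example question, answer, image paths, and output file name are hypothetical placeholders, not files shipped with the repository.

```python
import json

# Hypothetical entries; replace the question, answer, and image paths with your own data.
examples = [
    {
        "text_input": "Is there a safe path from the start to the goal?",  # question text
        "text_output": "Yes",                                              # answer text
        "image_input": "input1.jpg",                                       # input image
        "image_output": "helper_image.jpg",                                # helper image used for latent supervision
    },
]

# Write one JSON object per line, matching the *.jsonl files consumed by src/main.py.
with open("my_train.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```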
We also provide the training and test data for the VSP spatial planning task. To extract the contents:
cd ./data/vsp_spatial_planning
tar -xzf vsp_spatial_planning.tar.gz
tar -xzf vsp_spatial_planning_test.tar.gz
We train our model in two stages (see the loss sketch after this list):
- Stage 1 jointly supervises text and latent visual tokens, grounding the latter in the visual subspace.
- Stage 2 drops the latent supervision, anchoring the grounded latent tokens for subsequent text generation.
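The sketch below makes the two stages concrete, assuming that stage 1 adds a regression-style loss that pulls the hidden states at the latent visual token positions toward compressed embeddings of the helper image, while stage 2 keeps only the text cross-entropy. The exact loss form, weighting, and tensor shapes are illustrative assumptions, not the repository's implementation.

```python
import torch.nn.functional as F

def stage_loss(text_logits, text_labels, latent_preds, latent_targets, stage):
    """Hedged sketch of the two-stage objective.

    text_logits:    (B, T, V) vocabulary logits at the text token positions
    text_labels:    (B, T)    ground-truth text token ids (-100 = ignored position)
    latent_preds:   (B, K, D) hidden states at the K latent visual token positions
    latent_targets: (B, K, D) compressed helper-image embeddings (assumed grounding target)
    """
    # Text cross-entropy is used in both stages.
    text_loss = F.cross_entropy(
        text_logits.flatten(0, 1), text_labels.flatten(), ignore_index=-100
    )

    if stage == "stage1":
        # Stage 1: additionally ground the latent tokens in the visual subspace.
        # A cosine-similarity regression with unit weight is one plausible choice;
        # both the loss form and the weighting are assumptions.
        latent_loss = 1.0 - F.cosine_similarity(latent_preds, latent_targets, dim=-1).mean()
        return text_loss + latent_loss

    # Stage 2: drop the latent supervision and train on text only.
    return text_loss
```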
Run the following commands to reproduce training. Make sure to configure `data_path` and `model_path` as needed. The base model (Qwen2.5-VL) will be automatically downloaded to `./cache`; specify `cache_dir` if you want to change the Hugging Face download folder.
Training Stage 1
python src/main.py \
--model Qwen/Qwen2.5-VL-7B-Instruct --epochs 15 \
--task vsp-spatial-planning \
--latent_size 4 \
--gradient_accumulation_steps 8 \
--stage stage1 \
--data_path ./data/vsp_spatial_planning/train_direct.jsonl \
--log_file ./log.txt \
--save_model_path ./checkpoints/model_stage1 \
--cache_dir PATH_TO_HF_CACHE
Training Stage 2
python src/main.py \
--model Qwen/Qwen2.5-VL-7B-Instruct --epochs 15 \
--task vsp-spatial-planning \
--latent_size 4 \
--gradient_accumulation_steps 1 \
--stage stage2 \
--data_path ./data/vsp_spatial_planning/train_direct.jsonl \
--log_file ./log.txt \
--load_model_path ./checkpoints/model_stage1 \
--save_model_path ./checkpoints/model_stage2 \
--cache_dir PATH_TO_HF_CACHE
You can run the test code using the command below. Currently, we provide model checkpoints for the VSP spatial planning task trained without CoT. We will continue updating the model weights and scaling the dataset to further improve performance.
python src/test.py \
--model Qwen/Qwen2.5-VL-7B-Instruct --epochs 15 \
--task vsp-spatial-planning \
--data_path ./data/vsp_spatial_planning/test_direct.jsonl \
--load_model_path Miiche/vsp_spatial_planning_direct_sft \
--cache_dir PATH_TO_HF_CACHE
If you find our work useful, please consider citing:
@article{yang2025machine,
  title={Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens},
  author={Zeyuan Yang and Xueyang Yu and Delin Chen and Maohao Shen and Chuang Gan},
  year={2025},
  eprint={2506.17218},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2506.17218},
}
We would like to thank the following works for their code and models:
We are extremely grateful to Haoyu Zhen, Bairu Hou, Guangtao Zeng, Yuncong Yang, Jiaben Chen, Ziwei Liu, Zonghan Yang, Sunli Chen, Lixing Fang, and many other friends in our Embodied AGI Lab for their helpful feedback and insightful discussions.