arXiv 2025
Zeyuan Yang*, Xueyang Yu*, Delin Chen, Maohao Shen, Chuang Gan
We propose Mirage, which interleaves latent visual tokens, compact visual features that serve as internal imagery, with explicit text tokens to solve diverse multimodal reasoning tasks, boosting reasoning performance without full pixel-level image generation.
Table of Contents
- [2025-07-23] We have released our test code, data, and model weights for the VSP spatial planning task.
- [2025-07-09] We have released our training data for the VSP spatial planning task.
- [2025-06-19] We have released the training code!
Create a conda environment and install the required packages:
conda create -n mirage python=3.10
conda activate mirage
git clone https://github.com/UMass-Embodied-AGI/Mirage.git
cd Mirage
pip install -r requirements.txt
pip install -e ./transformers/.
We provide a sample dataset of 100 examples for the VSP spatial reasoning task. Please format your data file as follows:
{
    "text_input": "Question",
    "text_output": "Answer",
    "image_input": "input1.jpg",
    "image_output": "helper_image.jpg"
}
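For reference, here is a minimal sketch of how such a JSONL data file could be assembled in Python. The field names follow the schema above; the example question, answer, image paths, and output file name are hypothetical placeholders, not files shipped with the repository.

```python
import json

# Hypothetical entries; replace the question, answer, and image paths with your own data.
examples = [
    {
        "text_input": "Is there a safe path from the start to the goal?",  # question text
        "text_output": "Yes",                                              # answer text
        "image_input": "input1.jpg",                                       # input image
        "image_output": "helper_image.jpg",                                # helper image used for latent supervision
    },
]

# Write one JSON object per line, matching the *.jsonl files consumed by src/main.py.
with open("my_train.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```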
We also provide the training and test data for the VSP spatial planning task. To extract the contents:
cd ./data/vsp_spatial_planning
tar -xzf vsp_spatial_planning.tar.gz
tar -xzf vsp_spatial_planning_test.tar.gz
We train our model in two stages (see the loss sketch after this list):
- Stage 1 jointly supervises text and latent visual tokens, grounding the latter in the visual subspace.
- Stage 2 drops the latent supervision, anchoring the grounded latent tokens for subsequent text generation.
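The sketch below makes the two stages concrete, assuming that stage 1 adds a regression-style loss that pulls the hidden states at the latent visual token positions toward compressed embeddings of the helper image, while stage 2 keeps only the text cross-entropy. The exact loss form, weighting, and tensor shapes are illustrative assumptions, not the repository's implementation.

```python
import torch.nn.functional as F

def stage_loss(text_logits, text_labels, latent_preds, latent_targets, stage):
    """Hedged sketch of the two-stage objective.

    text_logits:    (B, T, V) vocabulary logits at the text token positions
    text_labels:    (B, T)    ground-truth text token ids (-100 = ignored position)
    latent_preds:   (B, K, D) hidden states at the K latent visual token positions
    latent_targets: (B, K, D) compressed helper-image embeddings (assumed grounding target)
    """
    # Text cross-entropy is used in both stages.
    text_loss = F.cross_entropy(
        text_logits.flatten(0, 1), text_labels.flatten(), ignore_index=-100
    )

    if stage == "stage1":
        # Stage 1: additionally ground the latent tokens in the visual subspace.
        # A cosine-similarity regression with unit weight is one plausible choice;
        # both the loss form and the weighting are assumptions.
        latent_loss = 1.0 - F.cosine_similarity(latent_preds, latent_targets, dim=-1).mean()
        return text_loss + latent_loss

    # Stage 2: drop the latent supervision and train on text only.
    return text_loss
```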
Run the following commands to reproduce training. Make sure to configure `data_path` and `model_path` as needed. The base model (Qwen2.5-VL) will be automatically downloaded to `./cache`; specify `cache_dir` if you want to change the Hugging Face download folder.
Training Stage 1
python src/main.py \
--model Qwen/Qwen2.5-VL-7B-Instruct --epochs 15 \
--task vsp-spatial-planning \
--latent_size 4 \
--gradient_accumulation_steps 8 \
--stage stage1 \
--data_path ./data/vsp_spatial_planning/train_direct.jsonl \
--log_file ./log.txt \
--save_model_path ./checkpoints/model_stage1 \
--cache_dir PATH_TO_HF_CACHE
Training Stage 2
python src/main.py \
--model Qwen/Qwen2.5-VL-7B-Instruct --epochs 15 \
--task vsp-spatial-planning \
--latent_size 4 \
--gradient_accumulation_steps 1 \
--stage stage2 \
--data_path ./data/vsp_spatial_planning/train_direct.jsonl \
--log_file ./log.txt \
--load_model_path ./checkpoints/model_stage1 \
--save_model_path ./checkpoints/model_stage2 \
--cache_dir PATH_TO_HF_CACHE
You can run the test code using the command below. Currently, we provide model checkpoints for the VSP spatial planning task trained without CoT. We will continue updating the model weights and scaling the dataset to further improve performance.
python src/test.py \
--model Qwen/Qwen2.5-VL-7B-Instruct --epochs 15 \
--task vsp-spatial-planning \
--data_path ./data/vsp_spatial_planning/test_direct.jsonl \
--load_model_path Miiche/vsp_spatial_planning_direct_sft \
--cache_dir PATH_TO_HF_CACHE
If you find our work useful, please consider citing:
@article{yang2025machine,
  title={Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens},
  author={Zeyuan Yang and Xueyang Yu and Delin Chen and Maohao Shen and Chuang Gan},
  year={2025},
  eprint={2506.17218},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2506.17218},
}
We would like to thank the following works for their code and models:
We are extremely grateful to Haoyu Zhen, Bairu Hou, Guangtao Zeng, Yuncong Yang, Jiaben Chen, Ziwei Liu, Zonghan Yang, Sunli Chen, Lixing Fang, and many other friends in our Embodied AGI Lab for their helpful feedback and insightful discussions.