
Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens

arXiv 2025

Zeyuan Yang*, Xueyang Yu*, Delin Chen, Maohao Shen, Chuang Gan

Paper PDF Project Page

We propose Mirage, which interleaves latent visual tokens, compact imagined visual features, with explicit text tokens to solve diverse multimodal reasoning tasks, boosting reasoning performance without full pixel-level image generation.
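As a rough illustration only (not the actual implementation; names and sizes below are made up), the decoder's input can be pictured as ordinary text-token embeddings with a few compact latent visual embeddings spliced in before the answer is generated:

# Toy illustration of interleaving latent visual tokens with text tokens.
# Sizes and tensors are made up; this is not the authors' implementation.
import torch

hidden_size = 16
text_prefix = torch.randn(5, hidden_size)     # embedded question tokens
latent_visual = torch.randn(4, hidden_size)   # compact "imagined" visual tokens
# The model then continues generating text conditioned on both.
decoder_input = torch.cat([text_prefix, latent_visual], dim=0)
print(decoder_input.shape)  # torch.Size([9, 16])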



Table of Contents
  1. Installation
  2. Data Preparation
  3. Training
  4. Inference
  5. Citation
  6. Acknowledgement

News

  • [2025-07-23] We have released our test code, data, and model weights for the VSP spatial planning task.
  • [2025-07-09] We have released our training data for the VSP spatial planning task.
  • [2025-06-19] We have released the training code!

Installation

Create a conda environment and install the required packages:

conda create -n mirage python=3.10
conda activate mirage

git clone https://github.com/UMass-Embodied-AGI/Mirage.git
cd Mirage
pip install -r requirements.txt
pip install -e ./transformers/.

Data Preparation

We provide a sample dataset of 100 examples for the VSP spatial reasoning task. Please format your data file as follows (one JSON object per line in a .jsonl file):

{
    "text_input": "Question",
    "text_output": "Answer",
    "image_input": "input1.jpg",
    "image_output": "helper_image.jpg"
}
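A minimal script like the one below (our own sketch, not part of the repo; file names are placeholders) writes such records to a JSONL file, one JSON object per line:

# Hypothetical helper, not included in the repo: write training records
# in the expected format, one JSON object per line.
import json

records = [
    {
        "text_input": "Question",
        "text_output": "Answer",
        "image_input": "input1.jpg",
        "image_output": "helper_image.jpg",
    }
]

with open("my_train.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")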

We also provide the training and test data for the VSP spatial planning task. To extract the contents:

cd ./data/vsp_spatial_planning
tar -xzf vsp_spatial_planning.tar.gz
tar -xzf vsp_spatial_planning_test.tar.gz
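Assuming the archives yield the train_direct.jsonl file referenced in the training commands below, a quick sanity check (our own snippet, not part of the repo) is:

# Quick sanity check of the extracted training data (not part of the repo).
import json

path = "./data/vsp_spatial_planning/train_direct.jsonl"
with open(path) as f:
    records = [json.loads(line) for line in f]
print(len(records), "records; keys:", sorted(records[0].keys()))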

Training

We train our model in two stages:

  • Stage 1 jointly supervises text and latent visual tokens, grounding the latter in the visual subspace.
  • Stage 2 drops the latent supervision, anchoring the grounded latent tokens for subsequent text generation (a hedged loss sketch follows below).
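The sketch below reflects only our reading of this two-stage description, not the released training code in src/main.py: stage 1 adds a supervision term on the latent visual tokens (illustrated here as a regression toward helper-image features) on top of the usual text cross-entropy, and stage 2 keeps only the text loss.

# Hedged sketch of the two-stage objective as described above; the actual
# loss in src/main.py may differ.
import torch
import torch.nn.functional as F

def training_loss(text_logits, text_labels, latent_pred, latent_target, stage):
    # Text cross-entropy, used in both stages.
    loss = F.cross_entropy(text_logits.view(-1, text_logits.size(-1)),
                           text_labels.view(-1))
    if stage == "stage1":
        # Stage 1 only: ground the latent visual tokens, illustrated here as a
        # regression toward features of the helper image (image_output).
        loss = loss + F.mse_loss(latent_pred, latent_target)
    return loss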


Run the following commands to reproduce the training. Make sure to configure data_path and model_path as needed. The base model (Qwen2.5-VL) will be automatically downloaded to ./cache; specify cache_dir if you want to change the Hugging Face download folder.

Training Stage 1

python src/main.py \
    --model Qwen/Qwen2.5-VL-7B-Instruct --epochs 15 \
    --task vsp-spatial-planning \
    --latent_size 4 \
    --gradient_accumulation_steps 8 \
    --stage stage1 \
    --data_path ./data/vsp_spatial_planning/train_direct.jsonl \
    --log_file ./log.txt \
    --save_model_path ./checkpoints/model_stage1  \
    --cache_dir PATH_TO_HF_CACHE

Training Stage 2

python src/main.py \
    --model Qwen/Qwen2.5-VL-7B-Instruct --epochs 15 \
    --task vsp-spatial-planning \
    --latent_size 4 \
    --gradient_accumulation_steps 1 \
    --stage stage2 \
    --data_path ./data/vsp_spatial_planning/train_direct.jsonl \
    --log_file ./log.txt \
    --load_model_path ./checkpoints/model_stage1 \
    --save_model_path ./checkpoints/model_stage2 \
    --cache_dir PATH_TO_HF_CACHE

Inference

You can run the test code using the command below. Currently, we provide model checkpoints for the VSP spatial planning task trained without CoT. We will continue updating the model weights and scaling the dataset to further improve performance.

python src/test.py \
    --model Qwen/Qwen2.5-VL-7B-Instruct --epochs 15 \
    --task vsp-spatial-planning \
    --data_path ./data/vsp_spatial_planning/test_direct.jsonl \
    --load_model_path Miiche/vsp_spatial_planning_direct_sft  \
    --cache_dir PATH_TO_HF_CACHE

Citation

If you find our work useful, please consider citing:

@article{yang2025machine,
  title={Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens}, 
  author={Zeyuan Yang and Xueyang Yu and Delin Chen and Maohao Shen and Chuang Gan},
  year={2025},
  eprint={2506.17218},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2506.17218}, 
}

Acknowledgement

We would like to thank the following works for their code and models:

We are extremely grateful to Haoyu Zhen, Bairu Hou, Guangtao Zeng, Yuncong Yang, Jiaben Chen, Ziwei Liu, Zonghan Yang, Sunli Chen, Lixing Fang, and many other friends in our Embodied AGI Lab for their helpful feedback and insightful discussions.
