Bolin Lai, Felix Juefei-Xu, Miao Liu, Xiaoliang Dai, Nikhil Mehta, Chenguang Zhu, Zeyi Huang, James M. Rehg, Sangmin Lee, Ning Zhang, Tong Xiao
conda env create -f environment.yaml # The env name is "instamanip"
Download the dataset collected in the work InstructPix2Pix. Unzip all 30 zip files into `./data/ip2p/`.
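If you prefer to script the extraction, here is a minimal sketch; it assumes the 30 archives were saved to `./data/ip2p_zips/` (adjust that path to wherever you downloaded them):

```python
# Minimal sketch: extract every downloaded IP2P archive into ./data/ip2p/.
# The source folder ./data/ip2p_zips/ is an assumption -- change it to your download location.
import zipfile
from pathlib import Path

src_dir = Path("./data/ip2p_zips")   # where the 30 zip files were saved (assumed)
dst_dir = Path("./data/ip2p")        # target path expected by the training/eval code
dst_dir.mkdir(parents=True, exist_ok=True)

for zf in sorted(src_dir.glob("*.zip")):
    with zipfile.ZipFile(zf) as archive:
        archive.extractall(dst_dir)
    print(f"Extracted {zf.name}")
```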
Download the following pre-trained checkpoints and save them under `./pretrained`.
Move `cvlm_llama2_tokenizer_100img_and_224loc_addpatch`, `seed_detokenizer` and `seed_x` from SEED-X-17B to `./pretrained`.
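If you prefer to script this step, the sketch below uses `huggingface_hub` to pull the three SEED-X subfolders and the SDXL base model directly into `./pretrained`. The SEED-X repo id is an assumption here, so verify it against the official SEED-X release page.

```python
# Hedged sketch: fetch the required checkpoints with huggingface_hub.
# The SEED-X repo id below is an assumption -- double-check it on the official release page.
from huggingface_hub import snapshot_download

# Only the three SEED-X subfolders we need, placed directly under ./pretrained.
snapshot_download(
    repo_id="AILab-CVC/SEED-X-17B",          # assumed repo id
    local_dir="./pretrained",
    allow_patterns=[
        "cvlm_llama2_tokenizer_100img_and_224loc_addpatch/*",
        "seed_detokenizer/*",
        "seed_x/*",
    ],
)

# SDXL base weights.
snapshot_download(
    repo_id="stabilityai/stable-diffusion-xl-base-1.0",
    local_dir="./pretrained/stable-diffusion-xl-base-1.0",
)
```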
Replace the `added_tokens.json` under `cvlm_llama2_tokenizer_100img_and_224loc_addpatch` with our released json file in `./pretrained`.
mv ./pretrained/added_tokens.json ./pretrained/cvlm_llama2_tokenizer_100img_and_224loc_addpatch/
Please run the following script to save the weights of the visual encoder of Qwen-VL-Chat to `./pretrained/QwenViT`.
python src/tools/reload_qwen_vit.py
Finally, you should have the following directories under `./pretrained`. We don't need the other files.
./pretrained
|
|- QwenViT
|- cvlm_llama2_tokenizer_100img_and_224loc_addpatch
|- seed_detokenizer
|- seed_x
|- stable-diffusion-xl-base-1.0
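As a quick sanity check, you can verify the layout with a few lines of Python:

```python
# Sanity check: confirm the expected checkpoint folders exist under ./pretrained.
from pathlib import Path

required = [
    "QwenViT",
    "cvlm_llama2_tokenizer_100img_and_224loc_addpatch",
    "seed_detokenizer",
    "seed_x",
    "stable-diffusion-xl-base-1.0",
]
root = Path("./pretrained")
missing = [name for name in required if not (root / name).is_dir()]
print("All checkpoints in place." if not missing else f"Missing: {missing}")
```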
Our model weights are available on HuggingFace. There are four models released in this repo.
- InstaManip-17B-1shot: model trained specifically for 1-shot image manipulation.
- InstaManip-17B-2shot: model trained specifically for 2-shot image manipulation.
- InstaManip-17B-3shot: model trained specifically for 3-shot image manipulation.
- InstaManip-17B-dynamic: model trained for an arbitrary number of exemplar image pairs.
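As a reference, the sketch below downloads one of the released models with `huggingface_hub`; the repo id is a placeholder, so replace it with the actual id listed on our HuggingFace page.

```python
# Hedged sketch: download a released model with huggingface_hub.
# "your-org/InstaManip-17B-dynamic" is a placeholder -- use the actual repo id
# from the HuggingFace page and pass the downloaded weights to --ckpt at inference time.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="your-org/InstaManip-17B-dynamic",   # placeholder repo id
    local_dir="./pretrained/InstaManip-17B-dynamic",
)
```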
We provide a few examples in `./demo` for a quick start with our model. After setting up the environment and downloading all pre-trained checkpoints and our model weights, run the following command to edit a given image.
# 1-shot
python src/inference/run_model.py --ckpt ./train_output/your_path/checkpoint-xxxx/pytorch_model.bin
# multi-shot
python src/inference/run_model_multishot.py --ckpt ./train_output/your_path/checkpoint-xxxx/pytorch_model.bin
You can try different examples or use your own image by updating `source_image_path`, `exemplar_source_image_path`, `exemplar_target_image_path` and `instruction` in `src/inference/run_model.py` and `src/inference/run_model_multishot.py`.
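For example, a 1-shot edit in `src/inference/run_model.py` could be configured as follows (the file paths and instruction are purely illustrative; only the variable names come from the script):

```python
# Illustrative values only -- point these at your own images and prompt
# inside src/inference/run_model.py. All paths below are hypothetical.
exemplar_source_image_path = "./demo/exemplar_source.jpg"   # exemplar image before editing
exemplar_target_image_path = "./demo/exemplar_target.jpg"   # exemplar image after editing
source_image_path = "./demo/source.jpg"                     # the new image you want to edit
instruction = "turn the sky into a sunset"                  # the editing instruction
```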
Run the following command to train the model on 8 GPUs. You can change the number of GPUs by updating `--nproc_per_node` in `train.sh`.
bash scripts/train.sh
You can use different hyperparameters in `scripts/train.sh` (e.g., learning rate, iterations) and `configs/data/dataset.yaml` (e.g., batch size, number of exemplar images). We also enable `torch.multiprocessing.set_start_method("spawn")` in `scripts/train.sh` for training on H100 GPUs. If you run the code on A100 GPUs, this line can be commented out for faster training.
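For reference, the start-method call mentioned above is a single line; a minimal sketch of what to look for (and comment out on A100) is shown below, though its exact location in the launch scripts may differ.

```python
# The multiprocessing start-method line referenced above:
# keep it enabled on H100, comment it out on A100 for faster training.
import torch.multiprocessing as mp

mp.set_start_method("spawn")
```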
Go to the checkpoint directory that you want to evaluate. Convert the model weights.
python zero_to_fp32.py . ./pytorch_model.bin
Go back to the project root directory and run the following commands. The inference results will be saved in `checkpoint-xxxx/inference-xxxx-xx`.
Using one pair of exemplar images (1-shot):
# In distribution
python src/inference/eval_model.py --ckpt ./train_output/your_path/checkpoint-xxxx/pytorch_model.bin --setting in_dist
# Out of distribution
python src/inference/eval_model.py --ckpt ./train_output/your_path/checkpoint-xxxx/pytorch_model.bin --setting out_of_dist
Using multiple exemplar images (few-shot):
# In distribution
python src/inference/eval_model_multishot.py --ckpt ./train_output/your_path/checkpoint-xxxx/pytorch_model.bin --example_num 2 --setting in_dist
# Out of distribution
python src/inference/eval_model_multishot.py --ckpt ./train_output/your_path/checkpoint-xxxx/pytorch_model.bin --example_num 2 --setting out_of_dist
# Out of distribution (diverse)
python src/inference/eval_model_multishot.py --ckpt ./train_output/your_path/checkpoint-xxxx/pytorch_model.bin --example_num 2 --setting out_of_dist_diverse
Most instructions have 3-4 instances in the IP2P dataset. The model will use duplicate exemplar images if `example_num` is set above the number of available instances.
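If you want to run all three multi-shot settings in one go, a small convenience sketch (assuming the same checkpoint path used above) is:

```python
# Convenience sketch: run all three multi-shot evaluation settings in sequence.
# Adjust CKPT and --example_num to match your checkpoint and setup.
import subprocess

CKPT = "./train_output/your_path/checkpoint-xxxx/pytorch_model.bin"
for setting in ["in_dist", "out_of_dist", "out_of_dist_diverse"]:
    subprocess.run(
        ["python", "src/inference/eval_model_multishot.py",
         "--ckpt", CKPT, "--example_num", "2", "--setting", setting],
        check=True,
    )
```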
Run the following command.
python src/metrics/metrics.py --gen_path ./train_output/your_path/checkpoint-xxxx/inference-xxxx-xx
If you find our paper helpful to your work, please cite it with the following BibTeX.
@inproceedings{lai2025unleashing,
title={Unleashing in-context learning of autoregressive models for few-shot image manipulation},
author={Lai, Bolin and Juefei-Xu, Felix and Liu, Miao and Dai, Xiaoliang and Mehta, Nikhil and Zhu, Chenguang and Huang, Zeyi and Rehg, James M and Lee, Sangmin and Zhang, Ning and Xiao, Tong},
booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
pages={18346--18357},
year={2025}
}
Our work was developed based on SEED-X. We thank the contributors for their awesome codebase.