Bolin Lai, Felix Juefei-Xu, Miao Liu, Xiaoliang Dai, Nikhil Mehta, Chenguang Zhu, Zeyi Huang, James M. Rehg, Sangmin Lee, Ning Zhang, Tong Xiao
conda env create -f environment.yaml # The env name is "instamanip"
Download the dataset collected in the work InstructPix2Pix. Unzip all 30 zip files into `./data/ip2p/`.
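If you prefer to script the extraction, here is a minimal sketch; it assumes the 30 archives were saved to `./data/ip2p_zips/` (adjust that path to wherever you downloaded them):

```python
# Minimal sketch: extract every downloaded IP2P archive into ./data/ip2p/.
# The source folder ./data/ip2p_zips/ is an assumption -- change it to your download location.
import zipfile
from pathlib import Path

src_dir = Path("./data/ip2p_zips")   # where the 30 zip files were saved (assumed)
dst_dir = Path("./data/ip2p")        # target path expected by the training/eval code
dst_dir.mkdir(parents=True, exist_ok=True)

for zf in sorted(src_dir.glob("*.zip")):
    with zipfile.ZipFile(zf) as archive:
        archive.extractall(dst_dir)
    print(f"Extracted {zf.name}")
```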
Download the following pre-trained checkpoints and save them under `./pretrained`.
Move `cvlm_llama2_tokenizer_100img_and_224loc_addpatch`, `seed_detokenizer` and `seed_x` from SEED-X-17B to `./pretrained`.
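If you prefer to script this step, the sketch below uses `huggingface_hub` to pull the three SEED-X subfolders and the SDXL base model directly into `./pretrained`. The SEED-X repo id is an assumption here, so verify it against the official SEED-X release page.

```python
# Hedged sketch: fetch the required checkpoints with huggingface_hub.
# The SEED-X repo id below is an assumption -- double-check it on the official release page.
from huggingface_hub import snapshot_download

# Only the three SEED-X subfolders we need, placed directly under ./pretrained.
snapshot_download(
    repo_id="AILab-CVC/SEED-X-17B",          # assumed repo id
    local_dir="./pretrained",
    allow_patterns=[
        "cvlm_llama2_tokenizer_100img_and_224loc_addpatch/*",
        "seed_detokenizer/*",
        "seed_x/*",
    ],
)

# SDXL base weights.
snapshot_download(
    repo_id="stabilityai/stable-diffusion-xl-base-1.0",
    local_dir="./pretrained/stable-diffusion-xl-base-1.0",
)
```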
Replace the `added_tokens.json` under `cvlm_llama2_tokenizer_100img_and_224loc_addpatch` with our released json file in `./pretrained`.
mv ./pretrained/added_tokens.json ./pretrained/cvlm_llama2_tokenizer_100img_and_224loc_addpatch/
Please run the following script to save the weights of the visual encoder of Qwen-VL-Chat to `./pretrained/QwenViT`.
python src/tools/reload_qwen_vit.py
Finally, you should have the following directories under `./pretrained`. We don't need the other files.
./pretrained
|
|- QwenViT
|- cvlm_llama2_tokenizer_100img_and_224loc_addpatch
|- seed_detokenizer
|- seed_x
|- stable-diffusion-xl-base-1.0
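As a quick sanity check, you can verify the layout with a few lines of Python:

```python
# Sanity check: confirm the expected checkpoint folders exist under ./pretrained.
from pathlib import Path

required = [
    "QwenViT",
    "cvlm_llama2_tokenizer_100img_and_224loc_addpatch",
    "seed_detokenizer",
    "seed_x",
    "stable-diffusion-xl-base-1.0",
]
root = Path("./pretrained")
missing = [name for name in required if not (root / name).is_dir()]
print("All checkpoints in place." if not missing else f"Missing: {missing}")
```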
Our model weights are available on HuggingFace. There are four models released in this repo.
- InstaManip-17B-1shot: model trained specifically for 1-shot image manipulation.
- InstaManip-17B-2shot: model trained specifically for 2-shot image manipulation.
- InstaManip-17B-3shot: model trained specifically for 3-shot image manipulation.
- InstaManip-17B-dynamic: model trained for an arbitrary number of exemplar image pairs.
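As a reference, the sketch below downloads one of the released models with `huggingface_hub`; the repo id is a placeholder, so replace it with the actual id listed on our HuggingFace page.

```python
# Hedged sketch: download a released model with huggingface_hub.
# "your-org/InstaManip-17B-dynamic" is a placeholder -- use the actual repo id
# from the HuggingFace page and pass the downloaded weights to --ckpt at inference time.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="your-org/InstaManip-17B-dynamic",   # placeholder repo id
    local_dir="./pretrained/InstaManip-17B-dynamic",
)
```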
We provide a few examples in `./demo` for a quick start with our model. After setting up the environment and downloading all pre-trained checkpoints and our model weights, run the following command to edit a given image.
# 1-shot
python src/inference/run_model.py --ckpt ./train_output/your_path/checkpoint-xxxx/pytorch_model.bin
# multi-shot
python src/inference/run_model_multishot.py --ckpt ./train_output/your_path/checkpoint-xxxx/pytorch_model.bin
You can try different examples or use your own image by updating `source_image_path`, `exemplar_source_image_path`, `exemplar_target_image_path` and `instruction` in `src/inference/run_model.py` and `src/inference/run_model_multishot.py`.
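For example, a 1-shot edit in `src/inference/run_model.py` could be configured as follows (the file paths and instruction are purely illustrative; only the variable names come from the script):

```python
# Illustrative values only -- point these at your own images and prompt
# inside src/inference/run_model.py. All paths below are hypothetical.
exemplar_source_image_path = "./demo/exemplar_source.jpg"   # exemplar image before editing
exemplar_target_image_path = "./demo/exemplar_target.jpg"   # exemplar image after editing
source_image_path = "./demo/source.jpg"                     # the new image you want to edit
instruction = "turn the sky into a sunset"                  # the editing instruction
```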
Run the following command to train the model on 8 GPUs. You can change the number of GPUs by updating `--nproc_per_node` in `train.sh`.
bash scripts/train.sh
You can use different hyperparameters in `scripts/train.sh` (e.g., learning rate, iterations) and `configs/data/dataset.yaml` (e.g., batch size, number of exemplar images). We also enable `torch.multiprocessing.set_start_method("spawn")` in `scripts/train.sh` for training on H100 GPUs. If you run the code on A100 GPUs, this line can be commented out for faster training.
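For reference, the start-method call mentioned above is a single line; a minimal sketch of what to look for (and comment out on A100) is shown below, though its exact location in the launch scripts may differ.

```python
# The multiprocessing start-method line referenced above:
# keep it enabled on H100, comment it out on A100 for faster training.
import torch.multiprocessing as mp

mp.set_start_method("spawn")
```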
Go to the checkpoint directory that you want to evaluate. Convert the model weights.
python zero_to_fp32.py . ./pytorch_model.bin
Go back to the project root directory and run the following commands. The inference results will be saved in `checkpoint-xxxx/inference-xxxx-xx`.
Using one pair of exemplar images (1-shot):
# In distribution
python src/inference/eval_model.py --ckpt ./train_output/your_path/checkpoint-xxxx/pytorch_model.bin --setting in_dist
# Out of distribution
python src/inference/eval_model.py --ckpt ./train_output/your_path/checkpoint-xxxx/pytorch_model.bin --setting out_of_dist
Using multiple exemplar images (few-shot):
# In distribution
python src/inference/eval_model_multishot.py --ckpt ./train_output/your_path/checkpoint-xxxx/pytorch_model.bin --example_num 2 --setting in_dist
# Out of distribution
python src/inference/eval_model_multishot.py --ckpt ./train_output/your_path/checkpoint-xxxx/pytorch_model.bin --example_num 2 --setting out_of_dist
# Out of distribution (diverse)
python src/inference/eval_model_multishot.py --ckpt ./train_output/your_path/checkpoint-xxxx/pytorch_model.bin --example_num 2 --setting out_of_dist_diverse
Most instructions have 3-4 instances in the IP2P dataset. The model will use duplicate exemplar images if `example_num` is set above the number of available instances.
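If you want to run all three multi-shot settings in one go, a small convenience sketch (assuming the same checkpoint path used above) is:

```python
# Convenience sketch: run all three multi-shot evaluation settings in sequence.
# Adjust CKPT and --example_num to match your checkpoint and setup.
import subprocess

CKPT = "./train_output/your_path/checkpoint-xxxx/pytorch_model.bin"
for setting in ["in_dist", "out_of_dist", "out_of_dist_diverse"]:
    subprocess.run(
        ["python", "src/inference/eval_model_multishot.py",
         "--ckpt", CKPT, "--example_num", "2", "--setting", setting],
        check=True,
    )
```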
Run the following command.
python src/metrics/metrics.py --gen_path ./train_output/your_path/checkpoint-xxxx/inference-xxxx-xx
If you find our paper helpful to your work, please cite it with the following BibTeX.
@inproceedings{lai2025unleashing,
title={Unleashing in-context learning of autoregressive models for few-shot image manipulation},
author={Lai, Bolin and Juefei-Xu, Felix and Liu, Miao and Dai, Xiaoliang and Mehta, Nikhil and Zhu, Chenguang and Huang, Zeyi and Rehg, James M and Lee, Sangmin and Zhang, Ning and Xiao, Tong},
booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
pages={18346--18357},
year={2025}
}
Our work was developed based on SEED-X. We thank the contributors for their awesome codebase.