This repository includes the official PyTorch implementation of BIFRÖST, presented in our paper:
BIFRÖST: 3D-Aware Image Compositing with Language Instructions
Lingxiao Li, Kaixiong Gong, Weihong Li, Xili Dai, Tao Chen, Xiaojun Yuan, Xiangyu Yue
MMLab, CUHK & HKUST (GZ) & Fudan University & UESTC
- [2025.1] Code and weights are released!
- [2024.10] arXiv preprint is available.
This paper introduces Bifröst, a novel 3D-aware framework built upon diffusion models to perform instruction-based image composition. Previous methods concentrate on image compositing at the 2D level and fall short in handling complex spatial relationships (e.g., occlusion). Bifröst addresses these issues by training an MLLM as a 2.5D location predictor and integrating depth maps as an extra condition during the generation process to bridge the gap between 2D and 3D, which enhances spatial comprehension and supports sophisticated spatial interactions. Our method begins by fine-tuning an MLLM on a custom counterfactual dataset to predict 2.5D object locations in complex backgrounds from language instructions. The image-compositing model is then uniquely designed to process multiple types of input features, enabling it to perform high-fidelity image compositions that consider occlusion, depth blur, and image harmonization. Extensive qualitative and quantitative evaluations demonstrate that Bifröst significantly outperforms existing methods, providing a robust solution for generating realistically composited images in scenarios demanding intricate spatial understanding. This work not only pushes the boundaries of generative image compositing but also reduces reliance on expensive annotated datasets by effectively utilizing existing resources in innovative ways.
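For orientation, here is a minimal sketch of the two-stage flow described above. All function and argument names are illustrative placeholders, not the repository's API.

```python
# Illustrative two-stage flow; names are hypothetical, not the repo's API.
def bifrost_compose(background, object_image, instruction,
                    mllm, depth_model, compositor):
    # Stage 1: the fine-tuned MLLM predicts a 2.5D location
    # (2D bounding box + depth value) from the language instruction.
    bbox, depth = mllm(background, instruction)

    # Stage 2: the diffusion-based compositing model combines the background,
    # its depth map, the object's identity features, and the predicted 2.5D
    # location into a harmonized, occlusion-aware composite.
    depth_map = depth_model(background)          # e.g. a DPT-style predictor
    return compositor(background, depth_map, object_image, bbox, depth)
```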
Follow the original LLaVA installation instructions; see LLaVA_Bifrost/README.md.
Install with conda:
conda env create -f environment.yaml
conda activate bifrost
or with pip:
pip install -r requirements.txt
Additionally, for training, you need to install panopticapi, pycocotools, and lvis-api:
pip install git+https://github.com/cocodataset/panopticapi.git
pip install pycocotools
pip install lvis
Download Bifrost checkpoint:
- URL: BaiduNetDisk
- URL: HuggingFace
Download the DINOv2 checkpoint and set its path in /configs/anydoor.yaml (line 103).
Download Stable Diffusion V2.1 if you want to train from scratch. We also support training from the AnyDoor checkpoint.
The code is provided in the folder LLaVA_Bifrost
First, place the downloaded checkpoint folder in LLaVA_Bifrost/llava/checkpoints.
We provide inference code in LLaVA_Bifrost/llava/eva/run_llava.py (from line 154 onward). Modify the data path and run the following command. The generated results are saved to the path you set.
python llava/eva/run_llava.py
The outputs of the MLLM (the predicted bounding box and depth value) are then used as part of the input to the image-compositing model.
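For example, assuming the MLLM returns its prediction as plain text (the exact output format is defined by run_llava.py; the string below is only an illustration), the bounding box and depth can be pulled out like this:

```python
# Hypothetical parsing of the MLLM's textual prediction; the real output
# format is defined by run_llava.py and may differ from this example.
import re

mllm_output = "bbox: [120, 85, 360, 410], depth: 0.62"  # example string only

values = [float(v) for v in re.findall(r"[-+]?\d*\.?\d+", mllm_output)]
bbox, depth = [int(v) for v in values[:4]], values[4]
print(bbox, depth)  # -> [120, 85, 360, 410] 0.62
```

These two values are what the second-stage compositing model consumes as its placement condition (see the Main_Bifrost inference instructions below).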
The code is provided in the folder Main_Bifrost
We provide inference code in run_inference.py (from line 370 onward) for both single-image inference and dataset inference (the DreamBooth test set). Modify the data path, and replace the bounding-box location and the depth value with the values predicted by the MLLM in the first stage. Then run the following command. The generated results are saved to the path you set.
sh scripts/inference.sh
- Download the MS-COCO dataset.
- Download the corresponding code and weights of DPT (to predict depth) and Inpaint Anything (to fill the hole) first.
- Use the script provided in LLaVA_Bifrost/create_dataset.py to create the customized counterfactual dataset. Modify the data path and run the following command; a hedged sketch of the overall flow is given after the command.
python create_dataset.py
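A minimal sketch of how one such counterfactual sample might be assembled. All helper names, the depth convention, and the record layout are illustrative assumptions; the actual pipeline is LLaVA_Bifrost/create_dataset.py.

```python
# Illustrative construction of one counterfactual sample. predict_depth and
# inpaint stand in for DPT and Inpaint Anything; names and the record layout
# are assumptions -- see LLaVA_Bifrost/create_dataset.py for the real pipeline.
import numpy as np

def build_sample(image, ann, category_name, predict_depth, inpaint):
    """image: HxWx3 uint8 array; ann: one COCO annotation dict.
    predict_depth: image -> HxW depth map (e.g. DPT).
    inpaint: (image, mask) -> image with the masked region filled."""
    x, y, w, h = ann["bbox"]
    depth_map = predict_depth(image)
    # Take the depth at the box centre as the object's 2.5D depth (assumption).
    obj_depth = float(depth_map[int(y + h / 2), int(x + w / 2)])

    # Remove the object and fill the hole, so the MLLM has to predict where
    # (and at what depth) it belongs from the language instruction alone.
    mask = np.zeros(image.shape[:2], dtype=np.uint8)
    mask[int(y):int(y + h), int(x):int(x + w)] = 1
    background = inpaint(image, mask)

    return {
        "background": background,
        "instruction": f"Place the {category_name} in the scene.",
        "target_bbox": [x, y, w, h],
        "target_depth": obj_depth,
    }
```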
- Download the initial weights from Hugging Face.
- Modify the data path.
- Modify the training hyper-parameters in train.sh.
- Start training by executing:
sh train.sh
- Download the datasets listed in /configs/datasets.yaml and modify the corresponding paths.
- You can prepare your own datasets according to the format of the files in ./datasets.
- If you use the UVO dataset, you need to process its JSON following ./datasets/Preprocess/uvo_process.py.
- You can refer to run_dataset_debug.py to verify that your data is correct.
- If you would like to train from scratch, convert the downloaded SD weights to a control copy by running (a minimal sketch of what this conversion does is given after the steps below):
sh ./scripts/convert_weight.sh
- If you would like to train from the AnyDoor checkpoint, convert the downloaded AnyDoor checkpoint to a control copy by running:
sh ./scripts/convert_weight_anydoor.sh
- Modify the training hyper-parameters in run_train_bifrost.py (lines 29-38) according to your training resources; an illustrative settings sketch is given after the steps below. We verified that using 2 A100 GPUs with batch accumulation = 1 gives satisfactory results after 200,000 iterations. (You need at least one A100 GPU to train the model.)
- Start training by executing:
sh ./scripts/train.sh
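For reference, this is a minimal sketch of what a ControlNet-style "control copy" conversion conceptually does: the trainable control branch is initialized with the weights of the pretrained UNet it mirrors. Checkpoint names and key prefixes below are assumptions based on ControlNet's conversion script; the authoritative logic is in ./scripts/convert_weight.sh and ./scripts/convert_weight_anydoor.sh.

```python
# Minimal sketch of a ControlNet-style "control copy": duplicate the UNet
# weights into the trainable control branch. Checkpoint names and the key
# prefixes below are assumptions; use the repo's convert scripts in practice.
import torch

SRC_CKPT = "v2-1_512-ema-pruned.ckpt"   # downloaded SD 2.1 (or AnyDoor) weights
DST_CKPT = "bifrost_init.ckpt"          # initialization checkpoint to train from

state = torch.load(SRC_CKPT, map_location="cpu")
state = state.get("state_dict", state)

new_state = dict(state)  # keep the frozen UNet / VAE / text-encoder weights
prefix = "model.diffusion_model."
for key, tensor in state.items():
    if key.startswith(prefix):
        # Give the control branch the same starting point as the frozen UNet.
        new_state["control_model." + key[len(prefix):]] = tensor.clone()

# Layers that do not exist in the source checkpoint (zero convolutions, the
# depth/hint encoder, etc.) keep their fresh initialization when the training
# script loads this file non-strictly.
torch.save({"state_dict": new_state}, DST_CKPT)
print(f"wrote {len(new_state)} tensors to {DST_CKPT}")
```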
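As a rough illustration of the kind of hyper-parameters exposed around lines 29-38 of run_train_bifrost.py, the snippet below follows typical PyTorch Lightning 1.x training scripts (as used by ControlNet and AnyDoor); the exact identifiers and values in the repository may differ.

```python
# Illustrative training configuration; variable names and values are
# assumptions -- adapt the real ones in run_train_bifrost.py to your hardware.
import pytorch_lightning as pl

n_gpus = 2               # 2x A100 reported to give satisfactory results
accumulate_grad = 1      # batch accumulation used in the reference setup
max_steps = 200_000      # roughly the number of iterations reported

trainer = pl.Trainer(
    gpus=n_gpus,
    strategy="ddp",
    precision=16,                            # mixed precision (an assumption)
    accumulate_grad_batches=accumulate_grad,
    max_steps=max_steps,
)
```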
The code is built upon ControlNet and AnyDoor. Thanks for their great work.
@INPROCEEDINGS{Li24,
  title     = {BIFRÖST: 3D-Aware Image Compositing with Language Instructions},
  author    = {Lingxiao Li and Kaixiong Gong and Weihong Li and Xili Dai and Tao Chen and Xiaojun Yuan and Xiangyu Yue},
  booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
  year      = {2024}
}
You can contact me via email at lingxiaoli98@gmail.com if you have any questions.