
Official implementation of the paper: BIFRÖST: 3D-Aware Image Compositing with Language Instructions


QuanHoangDanh/Bifrost

 
 


BIFRÖST: 3D-Aware Image compositing with Language Instructions

arXiv  project page  Hugging Face  MIT License 

This repository contains the official PyTorch implementation of BIFRÖST, presented in our paper:

BIFRÖST: 3D-Aware Image compositing with Language Instructions

Lingxiao Li, Kaixiong Gong, Weihong Li, Xili Dai, Tao Chen, Xiaojun Yuan, Xiangyu Yue

MMLab, CUHK & HKUST (GZ) & Fudan University & UESTC

Update

  • [2025.1] Code and weights are released!
  • [2024.10] arXiv preprint is available.

Introduction

This paper introduces Bifröst, a novel 3D-aware framework built upon diffusion models to perform instruction-based image composition. Previous methods concentrate on image compositing at the 2D level and therefore fall short in handling complex spatial relationships (e.g., occlusion). Bifröst addresses these issues by training an MLLM as a 2.5D location predictor and integrating depth maps as an extra condition during the generation process to bridge the gap between 2D and 3D, which enhances spatial comprehension and supports sophisticated spatial interactions. Our method begins by fine-tuning the MLLM with a custom counterfactual dataset to predict 2.5D object locations in complex backgrounds from language instructions. Then, the image-compositing model is designed to process multiple types of input features, enabling it to perform high-fidelity image compositions that account for occlusion, depth blur, and image harmonization. Extensive qualitative and quantitative evaluations demonstrate that Bifröst significantly outperforms existing methods, providing a robust solution for generating realistically composited images in scenarios demanding intricate spatial understanding. This work not only pushes the boundaries of generative image compositing but also reduces reliance on expensive annotated datasets by making effective use of existing resources.

Installation

Installation with MLLM

Follow the original LLaVA installation instructions; see LLaVA_Bifrost/README.md.

Installation with Image Compositing

Install with conda:

conda env create -f environment.yaml
conda activate bifrost

or pip:

pip install -r requirements.txt

Additionally, for training, you need to install panopticapi, pycocotools, and lvis-api.

pip install git+https://github.com/cocodataset/panopticapi.git

pip install pycocotools

pip install lvis
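
After installing the training extras, a quick check that they are importable can save a failed training launch. The snippet below is a minimal sketch, not part of the repository: it assumes the import names match the package names above (panopticapi, pycocotools, lvis), and the helper name missing_modules is made up for illustration.

```python
import importlib.util

# Import names of the three training-only dependencies listed above.
TRAINING_MODULES = ["panopticapi", "pycocotools", "lvis"]


def missing_modules(names):
    """Return the names that cannot be found on the current Python path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(TRAINING_MODULES)
    if missing:
        print("Missing training dependencies:", ", ".join(missing))
    else:
        print("All training dependencies are importable.")
```

Run it inside the activated bifrost environment; any name it prints still needs to be installed.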

Download Checkpoints

Download Bifrost checkpoint:

Download the DINOv2 checkpoint and update its path in /configs/anydoor.yaml (line 103).

Download Stable Diffusion V2.1 if you want to train from scratch.

We also support training from the AnyDoor checkpoint.

Inference

Inference Bifrost MLLM

The code is provided in the folder LLaVA_Bifrost. First, place the downloaded checkpoint folder in LLaVA_Bifrost/llava/checkpoints. The inference code is in LLaVA_Bifrost/llava/eva/run_llava.py (from line 154 onward). Modify the data path and run the following command. The generated results are written to the path you set.

python llava/eva/run_llava.py

The outputs of the MLLM can then be used as part of the input to the image compositing model.

Inference Bifrost Image Compositing

The code is provided in the folder Main_Bifrost. The inference code is in run_inference.py (from line 370 onward) and covers both single-image inference and inference on a dataset (DreamBooth Test). Modify the data path, replace the bounding box location and the depth value with the values predicted by the MLLM in the first stage, and run the following command. The generated results are written to the path you set.

sh scripts/inference.sh
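
The hand-off from the first stage to the second can be sketched as follows. This is only an illustration under stated assumptions: the reply format (four normalized box coordinates followed by one depth value) is hypothetical, and the helper names parse_prediction and to_pixel_box do not come from the repository. Check the actual output of run_llava.py before relying on any of it.

```python
import re


def parse_prediction(text):
    """Pull four box coordinates and one depth value out of a reply string.

    Assumes (hypothetically) that the first five numbers in the reply are
    x0, y0, x1, y1 and depth, in that order.
    """
    numbers = [float(n) for n in re.findall(r"-?\d+(?:\.\d+)?", text)]
    if len(numbers) < 5:
        raise ValueError("expected four box coordinates and one depth value")
    x0, y0, x1, y1, depth = numbers[:5]
    return (x0, y0, x1, y1), depth


def to_pixel_box(box, width, height):
    """Scale a normalized (x0, y0, x1, y1) box to integer pixel coordinates."""
    x0, y0, x1, y1 = box
    return (round(x0 * width), round(y0 * height),
            round(x1 * width), round(y1 * height))


if __name__ == "__main__":
    # Made-up reply string, used only to exercise the helpers.
    box, depth = parse_prediction("[0.25, 0.5, 0.75, 1.0], depth 0.5")
    print(to_pixel_box(box, width=640, height=480), depth)
```

The pixel box and depth produced this way are the two values that would be substituted into run_inference.py.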

Train

Train MLLM

Prepare datasets for fine-tuning MLLM

  • Download MS-COCO dataset
  • Download the code and weights of DPT (to predict depth) and Inpaint Anything (to fill the hole left by the removed object).
  • Use the script provided in LLaVA_Bifrost/create_dataset.py to create the customized counterfactual dataset. Modify the data path and run the following command.
python create_dataset.py
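
To make the idea of a counterfactual sample concrete, here is a rough sketch of how one training record might be assembled once the depth map and the inpainted background exist. It is not the logic of create_dataset.py: the record fields, the instruction wording, the file naming, and the use of the mean depth inside the box are all assumptions made for illustration.

```python
def build_record(image_id, object_name, box, depth_map):
    """Assemble one hypothetical counterfactual training record.

    box is (x0, y0, x1, y1) in pixels, exclusive on the right and bottom.
    depth_map is a list of rows of depth values (e.g. predicted by DPT).
    """
    height = len(depth_map)
    width = len(depth_map[0])
    x0, y0, x1, y1 = box
    values = [depth_map[y][x] for y in range(y0, y1) for x in range(x0, x1)]
    if not values:
        raise ValueError("box does not cover any pixels")
    mean_depth = sum(values) / len(values)
    return {
        # Background with the object removed (e.g. by Inpaint Anything).
        "image": f"{image_id}_inpainted.jpg",
        "instruction": f"Place the {object_name} in the image.",
        # Target the MLLM should predict: normalized box plus depth.
        "target": {
            "box": [x0 / width, y0 / height, x1 / width, y1 / height],
            "depth": mean_depth,
        },
    }


if __name__ == "__main__":
    toy_depth = [
        [0.0, 0.0, 0.0, 0.0],
        [0.0, 0.5, 1.0, 0.0],
        [0.0, 0.5, 1.0, 0.0],
        [0.0, 0.0, 0.0, 0.0],
    ]
    print(build_record("000001", "dog", (1, 1, 3, 3), toy_depth))
```

The point of the sketch is the pairing: the model sees the background without the object and learns to predict where the object was, together with its depth.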

Prepare initial weight

  • Download the initial weights from Hugging Face.
  • Modify the data path.

Start training

  • Modify the training hyper-parameters in train.sh.

  • You should modify the data path.

  • Start training by executing:

sh train.sh  

Train Image Compositing model

Prepare datasets for Image Compositing

  • Download the datasets listed in /configs/datasets.yaml and modify the corresponding paths.
  • You can prepare your own datasets following the format of the files in ./datasets.
  • If you use the UVO dataset, you need to process the JSON following ./datasets/Preprocess/uvo_process.py.
  • You can use run_dataset_debug.py to verify that your data is correct.

Prepare initial weight

  • If you would like to train from scratch, convert the downloaded SD weights to a control copy by running:
sh ./scripts/convert_weight.sh  
  • If you would like to train from the AnyDoor checkpoint, convert the downloaded AnyDoor checkpoint to a control copy by running:
sh ./scripts/convert_weight_anydoor.sh 

Start training

  • Modify the training hyper-parameters in run_train_bifrost.py (lines 29-38) according to your training resources. We verified that training on 2 A100 GPUs with batch accumulation set to 1 gives satisfactory results after 200,000 iterations. (You need at least one A100 GPU to train the model.)

  • Start training by executing:

sh ./scripts/train.sh  
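
When adapting the hyper-parameters to different hardware, it can help to keep the effective batch size in view. The arithmetic below is a minimal sketch: the per-GPU batch size of 16 is a placeholder chosen for illustration, not a value taken from run_train_bifrost.py.

```python
def effective_batch_size(num_gpus, per_gpu_batch, accumulation_steps):
    """Samples contributing to one optimizer step across all GPUs."""
    return num_gpus * per_gpu_batch * accumulation_steps


def samples_seen(num_gpus, per_gpu_batch, accumulation_steps, iterations):
    """Total samples processed after the given number of optimizer steps."""
    return effective_batch_size(num_gpus, per_gpu_batch, accumulation_steps) * iterations


if __name__ == "__main__":
    # 2 GPUs and accumulation = 1 follow the setup described above;
    # per_gpu_batch = 16 is a hypothetical placeholder.
    print(effective_batch_size(2, 16, 1))
    print(samples_seen(2, 16, 1, 200_000))
```

With a single GPU, doubling the accumulation steps keeps the effective batch size the same as the two-GPU setup, at the cost of longer wall-clock time per step.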

Acknowledgements

The code is built upon ControlNet and AnyDoor. Thanks to the authors for their great work.

Citation

@INPROCEEDINGS{Li24,
  title = {BIFRÖST: 3D-Aware Image compositing with Language Instructions},
  author = {Lingxiao Li and Kaixiong Gong and Weihong Li and Xili Dai and Tao Chen and Xiaojun Yuan and Xiangyu Yue},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2024}
}

If you have any questions, you can contact me by email at lingxiaoli98@gmail.com.
