Bootstrap3D: Improving Multi-view Diffusion Model with Synthetic Data
Zeyi Sun,
Tong Wu,
Pan Zhang,
Yuhang Zang,
Xiaoyi Dong,
Yuanjun Xiong,
Dahua Lin,
Jiaqi Wang
🚀 [2025/6/23] Bootstrap3D is accepted by ICCV 2025!
🚀 [2024/6/4] The paper and project page are released!
- 🔥 A new multi-view diffusion model trained on high-quality synthetic data, capable of generating multi-view images that closely follow the text prompt.
- 🔥 A densely captioned Objaverse dataset produced by our finetuned 3D-aware MV-LLaVA, powered by GPT-4V.
- 🔥 A high-quality synthetic dataset for high-aesthetic 3D content creation.
- Training code of MV-Diffusion model based on PixArt.
- Release of MV-PixArt-alpha.
- BS-Objaverse dataset card launched on Hugging Face.
- MV-LLaVA model and web demo.
- Paper and project page.
Install diffusers with PixArt support.
import torch
from diffusers import PixArtAlphaPipeline, Transformer2DModel

# Base PixArt-alpha checkpoints by resolution
pip_dict = {512: "PixArt-alpha/PixArt-XL-2-512x512",
            1024: "PixArt-alpha/PixArt-XL-2-1024-MS"}
resolution = 512

# Load the multi-view transformer finetuned on SV3D-style data
transformer = Transformer2DModel.from_pretrained(
    "Zery/MVPixArt-XL-2-512x512_sv3d", torch_dtype=torch.float16)
pipe = PixArtAlphaPipeline.from_pretrained(
    pip_dict[resolution], torch_dtype=torch.float16, transformer=transformer)
pipe = pipe.to("cuda")

prompt = "a cute puppy."
prompt_cad = "[Four images from DIFFERENT views of a single-object with CAD style] " + prompt

# Generate the multi-view image for the prompt
image_cad = pipe(prompt=prompt_cad).images[0]

# Save the result (optional)
image_cad.save("puppy.jpg")
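The pipeline returns the four views as a single composite image; assuming they are tiled in a 2x2 grid of equal-sized quadrants (our assumption, based on the prompt template), a minimal sketch for splitting them into separate views:

# Minimal sketch: split the generated multi-view image into four views.
# Assumes a 2x2 tiling of equal size; adjust rows/cols if the layout differs.
def split_grid(image, rows=2, cols=2):
    w, h = image.size
    tile_w, tile_h = w // cols, h // rows
    views = []
    for r in range(rows):
        for c in range(cols):
            box = (c * tile_w, r * tile_h, (c + 1) * tile_w, (r + 1) * tile_h)
            views.append(image.crop(box))
    return views

for i, view in enumerate(split_grid(image_cad)):
    view.save(f"puppy_view_{i}.jpg")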
To reproduce our results:
- use prompt_gen/prompt_gen.py to ask GPT-4 to generate an arbitrary number of prompts
- use PixArt-alpha to generate images based on these prompts
- use prompt_gen/gpt_quality_check.py to run quality checks based on GPT-4V
- use prompt_gen/instructions.py to generate instructions for prompt-tuning LLaVA
- clone the ShareGPT4V code, prepare their training environment, and use the generated instructions to finetune MV-LLaVA (detailed in the next section)
- generate more data with MV-LLaVA and format the data into the PixArt-alpha format (see the sketch after this list)
- clone the PixArt-alpha repo and prepare their environments, put train/PixArt_xl2_img512_internal_for_3d_sample_training_long.py in the config folder, sup_file/train/train_tri.py in the train_script folder, and sup_file/train/train_mv_pixart_512.sh in the repo root, then launch the script on a Slurm-supported cluster.
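As a rough illustration of the data-formatting step above, the sketch below collects MV-LLaVA caption outputs into a flat image-path/caption json. The input file mv_llava_captions.json and the field names img_path / prompt are placeholders rather than the exact schema PixArt-alpha expects, so adapt them to your PixArt-alpha training code.

import json

# Hypothetical sketch: collect MV-LLaVA captions into a (path, caption) list
# for PixArt-alpha training. File names and field names are placeholders;
# match them to the data format your PixArt-alpha training code reads.
with open("mv_llava_captions.json") as f:
    records = json.load(f)  # assumed: [{"image": ..., "caption": ...}, ...]

formatted = [{"img_path": r["image"], "prompt": r["caption"]} for r in records]

with open("pixart_train_data.json", "w") as f:
    json.dump(formatted, f, indent=2)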
MV-LLaVA is trained on 30K GPT-4V-generated instructive conversation pairs, enabling LLaVA to process multi-view images rendered from 3D content, chat about them, and generate dense descriptive captions or quality estimations.
Its 7B model is available on Hugging Face.
We use this model to provide quality estimations on Objaverse and to rewrite dense descriptive captions. We call this caption dataset BS-Objaverse (BootStrap-Objaverse); it is now available on Hugging Face.
We also use this model to process synthetic multi-view images generated by SV3D and Zero123++.
Our MV-LLaVA is based on ShareGPT4V; thanks for their awesome work!
You can clone our repo and run cd MV_LLaVA && pip install -e . to install the share4v package.
- launch our demo through python app.py
- batch inference your multi-view images using the batch scripts in tools/ (a helper sketch for tiling views follows this list)
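If your rendered views are stored as separate files, a small helper like the one below (our own sketch, not part of the released tools/ scripts) tiles four of them into one 2x2 grid image; whether the batch scripts expect a composite grid or individual files is an assumption to verify against the scripts themselves.

from PIL import Image

# Hypothetical helper: tile four rendered views into a single 2x2 grid image.
# The paths below are placeholders; whether the tools/ scripts want a grid
# or separate view files should be checked before using this.
def make_grid(view_paths, tile_size=256):
    views = [Image.open(p).resize((tile_size, tile_size)) for p in view_paths]
    grid = Image.new("RGB", (2 * tile_size, 2 * tile_size))
    for i, v in enumerate(views):
        grid.paste(v, ((i % 2) * tile_size, (i // 2) * tile_size))
    return grid

grid = make_grid(["view_0.png", "view_1.png", "view_2.png", "view_3.png"])
grid.save("object_multiview.png")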
Training demo
Clone our repo and run cd MV_LLaVA && pip install -e . to install the share4v package.
First run bash scripts/slurm_pretrain_7b_mv.sh to align CLIP with LLaMA, then run bash scripts/slrum_finetune_7b_mv.sh to do instruction tuning.
We have uploaded demo Objaverse multi-view data (10 images only) in data/obj_demo; its jsons for pretraining and instruction tuning are available in data/demo_obj_pretrain.json and data/demo_obj_instruct.json. You can generate your own data following the same format. Note that the pretraining data only supports single-turn conversations.
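For orientation, the sketch below writes one entry in the ShareGPT4V-style conversation format that the demo jsons appear to follow; the field values are illustrative only, so consult data/demo_obj_pretrain.json and data/demo_obj_instruct.json for the authoritative structure.

import json

# Illustrative sketch of one ShareGPT4V-style entry; all values are made up.
# Check data/demo_obj_pretrain.json and data/demo_obj_instruct.json for the
# authoritative structure. Pretraining entries must be single-turn.
entry = {
    "id": "obj_000001",
    "image": "obj_demo/obj_000001.png",
    "conversations": [
        {"from": "human", "value": "<image>\nDescribe the object shown in these multi-view renderings."},
        {"from": "gpt", "value": "A small wooden chair with four legs and a curved backrest."},
    ],
}

with open("my_obj_pretrain.json", "w") as f:
    json.dump([entry], f, indent=2)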
You can look over the modifications MV-LLaVA makes on top of Share4V here; for your own use case, you only need to focus on these lines of code.
If you only need to change the training data, focus on the lines of code marked with the modify tag (search for this tag in your IDE).
Full data preparation (Objaverse)
- download the full Cap3D dataset of Objaverse rendered images.
- download the BS-Objaverse dataset's GPT-4V-generated annotations obj_descript_gpt_10k.json and convert them into the same format as the demo (see the sketch below).
- prepare the ShareGPT4V dataset (optional, to mitigate overfitting).
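A hedged sketch of the conversion step for the second item, assuming obj_descript_gpt_10k.json maps object ids to GPT-4V description strings (inspect the file to confirm the actual keys and adjust the rendering paths):

import json

# Hypothetical conversion: turn BS-Objaverse GPT-4V annotations into
# demo-style entries. The assumed structure of obj_descript_gpt_10k.json
# (object id -> description string) should be verified against the file.
with open("obj_descript_gpt_10k.json") as f:
    annotations = json.load(f)

entries = []
for obj_id, description in annotations.items():
    entries.append({
        "id": obj_id,
        "image": f"objaverse_renderings/{obj_id}.png",  # placeholder path
        "conversations": [
            {"from": "human", "value": "<image>\nDescribe this 3D object."},
            {"from": "gpt", "value": description},
        ],
    })

with open("objaverse_pretrain.json", "w") as f:
    json.dump(entries, f, indent=2)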
If you find our work helpful for your research, please consider giving us a star ⭐ and a citation 📝
@misc{sun2024bootstrap3dimprovingmultiviewdiffusion,
title={Bootstrap3D: Improving Multi-view Diffusion Model with Synthetic Data},
author={Zeyi Sun and Tong Wu and Pan Zhang and Yuhang Zang and Xiaoyi Dong and Yuanjun Xiong and Dahua Lin and Jiaqi Wang},
year={2024},
eprint={2406.00093},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2406.00093},
}