
Synthetic Visual Genome [SVG] 🎨

Synthetic Visual Genome: the first automatically generated large-scale scene graph dataset with diverse open-set categories, fine-grained regions, and densely annotated relationships.

Related Resources 🔗

  • Website: link
  • Paper: arxiv
  • Demo: coming soon

Updates 📌

[2025/06/10]: 🔥🚀 Initial release of Synthetic Visual Genome with inference code for ROBIN-3B and the SVG dataset.

Installation 🛠️

Clone this repository and navigate to the SyntheticVG folder.

git clone https://github.com/jamespark3922/SyntheticVG.git
cd SyntheticVG

Install packages

conda create -n svg python=3.10 -y
conda activate svg
pip install --upgrade pip  # enable PEP 660 support
pip install -e .

Install flash attention for training:

pip install flash-attn==2.6.3 --no-build-isolation

Dataset 🌟

You can download the Synthetic Visual Genome (SVG) dataset from Hugging Face Datasets:
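As a rough sketch, the dataset can be loaded with the datasets library. The exact Hugging Face dataset ID is not spelled out here, so the repository name below is a placeholder:

from datasets import load_dataset

# Placeholder dataset ID -- replace with the actual SVG dataset repository on Hugging Face.
svg = load_dataset("jamespark3922/SyntheticVG-dataset")
print(svg)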

Checkpoints 🤖

  • Robin-3b Stage 2 [Ours]: 🤗 hf-model (see the download sketch after this list)
  • Robin-3b Stage 1: TBD
  • Robin-3b Stage 0: TBD
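As a minimal sketch, the Stage 2 checkpoint can be placed into the demo/checkpoints layout used in the Quick Start below with huggingface_hub; the repository ID here is a placeholder for the hf-model link above:

from huggingface_hub import snapshot_download

# Placeholder repo ID -- replace with the actual hf-model repository linked above.
snapshot_download(
    repo_id="jamespark3922/robin-qwen2.5-3b-sg-stage2",
    local_dir="demo/checkpoints/robin-qwen2.5-3b-sg-stage2",
)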

Quick Start: Scene Graph Generation with SAM 🚀

Generate a scene graph for each image using Segment Anything masks and optional GroundingDINO object regions.

  1. First, install Segment Anything:

pip install git+https://github.com/facebookresearch/segment-anything.git

  2. Download all the checkpoints:

The default paths of the checkpoints:

├── demo
    ├── checkpoints
    │   ├── robin-qwen2.5-3b-sg-stage2
    │   └── sam_vit_h_4b8939.pth 
    └── open_clip_pytorch_model.bin

Note: You might need to change the "mm_vision_tower" field in the config.json of the Robin-3B model to the absolute path of open_clip_pytorch_model.bin.
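A minimal sketch of that edit, assuming the default paths shown above:

import json
from pathlib import Path

# Point "mm_vision_tower" at the absolute path of the OpenCLIP weights.
config_path = Path("demo/checkpoints/robin-qwen2.5-3b-sg-stage2/config.json")
config = json.loads(config_path.read_text())
config["mm_vision_tower"] = str(Path("demo/open_clip_pytorch_model.bin").resolve())
config_path.write_text(json.dumps(config, indent=2))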

Scene Graph Generation for a Single Image 🖼️

import json

import cv2
import numpy as np
import requests
import torch
from PIL import Image

from segment_anything import sam_model_registry

from svg.pipeline.region_proposal.region_generator import SamGroundingDinoRegionGenerator
from svg.pipeline.grounding.grounding_dino import GroundingDinoSAM
from svg.pipeline.captioning.gpt4o import GPT4Captioner
from svg.pipeline.robin import RobinPipeline
from svg.draw_utils import visualize_masks

# Load an example image.
image = Image.open(requests.get('http://farm4.staticflickr.com/3377/3573516590_a1f6cf2cbd_z.jpg', stream=True).raw)

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Load the SAM model used for mask proposals.
sam_ckpt = 'sam_vit_h_4b8939.pth'
sam_model = sam_model_registry["vit_h"](checkpoint=sam_ckpt).to(device)

# Optional: GroundingDINO + GPT-4o captioner for additional region grounding
print('Loading GroundingDINO model...')
grounding_model = GroundingDinoSAM(
    "IDEA-Research/grounding-dino-base",
    sam_model,
    device
)
captioner = GPT4Captioner()
region_generator = SamGroundingDinoRegionGenerator(
    sam_model=sam_model,
    grounding_model=grounding_model,  # None if not using
    captioner=captioner
)
regions: list[dict] = region_generator.generate_regions(image, region_mode='merged')

# Generate a scene graph from the proposed regions.
robin_path = 'robin-qwen2.5-3b-sg-stage2'  # path to the downloaded Robin checkpoint
model = RobinPipeline(robin_path, device=device)
sg, _ = model.generate_scene_graph(image, regions)
objects: list[str] = sg['objects']
relations: list[tuple[int, int, str]] = sg['relations']

# Visualize the scene graph and save the outputs.
image_rgb = np.array(image)
image_with_masks: np.ndarray = visualize_masks(
    image_rgb, regions,
    draw_bbox=True, draw_mask=True, draw_polygon=False,
    white_padding=50
)
cv2.imwrite('scene_graph.jpg', image_with_masks)
with open('scene_graph.json', 'w') as f:
    json.dump(sg, f, indent=4)

You can also run predict.py to generate a scene graph for a single image.

python predict.py --image assets/skateboard.png \
    --robin_ckpt path/to/robin-qwen2.5-3b-sg-stage2 \
    --sam_ckpt path/to/sam_vit_h_4b8939.pth

Training 🚀

Coming Soon

TODO List 📝

  • Release the checkpoints, inference code, and demo.
  • Release the code for scene graph generation pipeline.
  • Release the dataset and training scripts.
  • Release the evaluation code.
  • Support vLLM for fast inference.
  • Release the code for GPT-4-generated stage 1 data.
  • Release the code for GPT-4o scene graph refinement to generate stage 2 data.

Acknowledgement

  • Osprey: the codebase and model architecture we built upon.
  • LLaVA-v1.5: the base training codebase.
  • SAM: the code to generate segmentation masks.
  • GroundingDINO: the code to generate grounding masks.

BibTeX 🖊️

If you find this work useful, please consider citing:

@inproceedings{park2025svg,
  author    = {Park, Jae Sung and Ma, Zixian and Li, Linjie and Zheng, Chenhao and Hsieh, Cheng-Yu and Lu, Ximing and Chandu, Khyathi and Kong, Quan and Kobori, Norimasa and Farhadi, Ali and Choi, Yejin and Krishna, Ranjay},
  title     = {Synthetic Visual Genome: Dense Scene Graphs at Scale with Multimodal Language Models},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2025}
}
