Synthetic Visual Genome: the first automatically generated large-scale scene graph dataset with diverse open-set categories, fine-grained regions, and densely annotated relationships.
[2025/06/10]: 🔥🚀 Initial release of Synthetic Visual Genome, with inference code for ROBIN-3B and the SVG dataset.
Clone this repository and navigate to the SyntheticVG folder.
git clone https://github.com/jamespark3922/SyntheticVG.git
cd SyntheticVG
Install packages
conda create -n svg python=3.10 -y
conda activate svg
pip install --upgrade pip # enable PEP 660 support
pip install -e .
Install FlashAttention for training:
pip install flash-attn==2.6.3 --no-build-isolation
You can download the Synthetic Visual Genome (SVG) dataset and ROBIN checkpoints from Hugging Face (a download sketch follows the list):
- SVG Dataset: 🤗 hf-dataset
- Robin-3b Stage 2 [Ours]: 🤗 hf-model
- Robin-3b Stage 1: TBD
- Robin-3b Stage 0: TBD
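If it helps, here is a minimal sketch of pulling the dataset locally with `huggingface_hub`; the `repo_id` below is a placeholder, so substitute the id behind the 🤗 hf-dataset link above:

```python
# Minimal sketch: snapshot the SVG dataset locally.
# NOTE: the repo_id is a placeholder; use the id from the hf-dataset link above.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="path/to/svg-dataset",  # placeholder, not the real id
    repo_type="dataset",
)
print(f"SVG dataset downloaded to: {local_dir}")
```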
Generate a scene graph for each image using Segment Anything masks and, optionally, GroundingDINO object regions.
- First, install Segment Anything:
pip install git+https://github.com/facebookresearch/segment-anything.git
- Download all the checkpoints (see the download sketch below):
  - ViT-H SAM model
  - Robin-3b: run
    git clone https://huggingface.co/jamepark3922/robin-qwen2.5-3b-sg-stage2
  - CLIP-convnext
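To script these downloads, here is a hedged Python sketch. The SAM URL is the official segment-anything release; the CLIP-convnext `repo_id` is an assumption (verify it against the link above), chosen only because the expected filename is `open_clip_pytorch_model.bin`:

```python
# Sketch: fetch the SAM and CLIP-convnext checkpoints into the default layout.
import urllib.request
from pathlib import Path

from huggingface_hub import hf_hub_download

ckpt_dir = Path("checkpoints")
ckpt_dir.mkdir(exist_ok=True)

# Official ViT-H SAM checkpoint from the segment-anything release.
urllib.request.urlretrieve(
    "https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth",
    str(ckpt_dir / "sam_vit_h_4b8939.pth"),
)

# CLIP-convnext weights; this repo_id is an assumption, verify before use.
hf_hub_download(
    repo_id="laion/CLIP-convnext_large_d_320.laion2B-s29B-b131K-ft-soup",
    filename="open_clip_pytorch_model.bin",
    local_dir=".",
)
```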
The default layout of all the checkpoints:
├── demo
├── checkpoints
│ ├── robin-qwen2.5-3b-sg-stage2
│ └── sam_vit_h_4b8939.pth
└── open_clip_pytorch_model.bin
Note: You might need to change the "mm_vision_tower" field in the config.json of the Robin-3b model to the absolute path of open_clip_pytorch_model.bin.
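For example, a small sketch that patches the field in place, assuming the default checkpoint layout above:

```python
# Sketch: set "mm_vision_tower" to the absolute path of the CLIP weights.
import json
from pathlib import Path

config_path = Path("checkpoints/robin-qwen2.5-3b-sg-stage2/config.json")
config = json.loads(config_path.read_text())
config["mm_vision_tower"] = str(Path("open_clip_pytorch_model.bin").resolve())
config_path.write_text(json.dumps(config, indent=2))
```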
import json

import cv2
import numpy as np
import requests
import torch
from PIL import Image

from segment_anything import sam_model_registry
from svg.pipeline.region_proposal.region_generator import SamGroundingDinoRegionGenerator
from svg.pipeline.grounding.grounding_dino import GroundingDinoSAM
from svg.pipeline.captioning.gpt4o import GPT4Captioner
from svg.pipeline.robin import RobinPipeline
from svg.draw_utils import visualize_masks
# Load a test image from a URL
image = Image.open(requests.get('http://farm4.staticflickr.com/3377/3573516590_a1f6cf2cbd_z.jpg', stream=True).raw)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
# SAM ViT-H model for mask proposals (default checkpoint layout)
sam_ckpt = 'checkpoints/sam_vit_h_4b8939.pth'
sam_model = sam_model_registry["vit_h"](checkpoint=sam_ckpt).to(device)
# Optional: grounding_dino + gpt4o captioner for additional region grounding
print('Loading GroundingDino model...')
grounding_model = GroundingDinoSAM(
    "IDEA-Research/grounding-dino-base",
    sam_model,
    device
)
captioner = GPT4Captioner()
region_generator = SamGroundingDinoRegionGenerator(
    sam_model=sam_model,
    grounding_model=grounding_model,  # None if not using
    captioner=captioner
)
regions: list[dict] = region_generator.generate_regions(image, region_mode='merged')
# Generate scene graph from regions
robin_path = 'checkpoints/robin-qwen2.5-3b-sg-stage2'
model = RobinPipeline(robin_path, device=device)
sg, _ = model.generate_scene_graph(image, regions)
objects: list[str] = sg['objects']
relations: list[tuple[int, int, str]] = sg['relations']
# Visualize the scene graph
image_rgb = np.array(image)
image_with_masks: np.ndarray = visualize_masks(
    image_rgb, regions,
    draw_bbox=True, draw_mask=True, draw_polygon=False,
    white_padding=50
)
# Convert to BGR for cv2.imwrite (assuming visualize_masks returns RGB)
cv2.imwrite('scene_graph.jpg', cv2.cvtColor(image_with_masks, cv2.COLOR_RGB2BGR))
with open('scene_graph.json', 'w') as f:
    json.dump(sg, f, indent=4)
You can also run predict.py to generate a scene graph for a single image:
python predict.py --image assets/skateboard.png \
--robin_ckpt path/to/robin-qwen2.5-3b-sg-stage2 \
--sam_ckpt path/to/sam_vit_h_4b8939.pth
Coming Soon
- Release the checkpoints, inference code, and demo.
- Release the code for scene graph generation pipeline.
- Release the dataset and training scripts.
- Release the evaluation code.
- Support vLLM for fast inference.
- Release the code for GPT-4 generated stage 1 data.
- Release the code for GPT-4o scene graph refinement to generate stage 2 data.
- Osprey: the codebase and model architecture we built upon.
- LLaVA-v1.5: the base training codebase.
- SAM: the code to generate segmentation masks.
- GroundingDINO: the code to generate grounded object regions.
If you find this work useful, please consider citing:
@inproceedings{park2025svg,
author = {Park, Jae Sung and Ma, Zixian and Li, Linjie and Zheng, Chenhao and Hsieh, Cheng-Yu and Lu, Ximing and Chandu, Khyathi and Kong, Quan and Kobori, Norimasa and Farhadi, Ali and Choi, Yejin and Krishna, Ranjay},
title = {Synthetic Visual Genome: Dense Scene Graphs at Scale with Multimodal Language Models},
booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2025}
}