We're excited to introduce a new section: Hot Topics 🔥!
We'll regularly post interesting discussion topics in the Issues tab. If you're interested, feel free to jump in and share your thoughts! These discussions are purely for idea exchange and community engagement.
I'll also be collecting and sharing thought-provoking questions related to the future of scene graphs and scene understanding in general. Everyone is welcome to join the conversation!
-
"Are scene graphs still a good way to represent and understand scenes?"
Scene graphs are a form of explicit scene representation. But with the rise of implicit scene representations, is this approach still effective? Which representation is more promising moving forward?
Let us know what you think in the discussion thread!
A scene graph is a topological structure representing a scene described by text, an image, a video, etc. In this graph, the nodes correspond to object bounding boxes with their category labels and attributes, while the edges represent the pairwise relationships between objects.
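For concreteness, here is a minimal, dataset-agnostic sketch of this structure in Python; the class and field names are illustrative only and not tied to any particular dataset or codebase.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class SceneObject:
    """A node: an object with its bounding box, category label, and attributes."""
    obj_id: int
    category: str
    bbox: Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)
    attributes: List[str] = field(default_factory=list)

@dataclass
class Relationship:
    """An edge: a pairwise (subject, predicate, object) relationship."""
    subject_id: int
    predicate: str
    object_id: int

@dataclass
class SceneGraph:
    objects: List[SceneObject]
    relationships: List[Relationship]

# Example: "a person riding a horse on grass"
graph = SceneGraph(
    objects=[
        SceneObject(0, "person", (120, 40, 260, 300), ["standing"]),
        SceneObject(1, "horse", (80, 150, 400, 420), ["brown"]),
        SceneObject(2, "grass", (0, 380, 640, 480)),
    ],
    relationships=[
        Relationship(0, "riding", 1),
        Relationship(1, "standing on", 2),
    ],
)
```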
- 🌷 Scene Graph Datasets
- 🍕 Scene Graph Generation
- 🥝 Scene Graph Application
- 🤶 Evaluation Metrics
- 🐱🚀 Miscellaneous
- ⭐️ Star History
Dataset | Modality | Obj. Class | BBox | Rela. Class | Triplets | Instances |
---|---|---|---|---|---|---|
Visual Phrase | Image | 8 | 3,271 | 9 | 1,796 | 2,769 |
Scene Graph | Image | 266 | 69,009 | 68 | 109,535 | 5,000 |
VRD | Image | 100 | - | 70 | 37,993 | 5,000 |
Open Images v7 | Image | 600 | 3,290,070 | 31 | 374,768 | 9,178,275 |
Visual Genome | Image | 5,996 | 3,843,636 | 1,014 | 2,347,187 | 108,077 |
GQA | Image | 200 | - | 310 | - | 3,795,907 |
VrR-VG | Image | 1,600 | 282,460 | 117 | 203,375 | 58,983 |
UnRel | Image | - | - | 18 | 76 | 1,071 |
SpatialSense | Image | 3,679 | - | 9 | 13,229 | 11,569 |
SpatialVOC2K | Image | 20 | 5,775 | 34 | 9,804 | 2,026 |
OpenSG | Image (panoptic) | 133 | - | 56 | - | 49K |
AUG | Image (Overhead View) | 76 | - | 61 | - | - |
STAR | Satellite Imagery | 48 | 219,120 | 58 | 400,795 | 31,096 |
ReCon1M | Satellite Imagery | 60 | 859,751 | 64 | 1,149,342 | 21,392 |
SkySenseGPT | Satellite Imagery (Instruction) | - | - | - | - | - |
Traffic Scene Graph | Traffic Image | 2,266 | - | 4,272 | - | 451 |
ImageNet-VidVRD | Video | 35 | - | 132 | 3,219 | 100 |
VidOR | Video | 80 | - | 50 | - | 10,000 |
Action Genome | Video | 35 | 0.4M | 25 | 1.7M | 10,000 |
AeroEye | Video (Drone-View) | 56 | - | 384 | - | 2.2M |
PVSG | Video (panoptic) | 126 | - | 57 | 4,587 | 400 |
ASPIRe | Video (Interlacements) | - | - | 4.5K | - | 1.5K |
Ego-EASG | Video (Ego-view) | 407 | - | 235 | - | - |
3D Semantic Scene Graphs (3DSSG) | 3D | 528 | - | 39 | - | 48K |
PSG4D | 4D | 46 | - | 15 | - | - |
4D-OR | 4D (operating room) | 12 | - | 14 | - | - |
MM-OR | 4D (operating room) | - | - | - | - | - |
EgoExOR | 4D (operating room) | 36 | - | 22 | 568,235 | - |
FACTUAL | Image, Text | 4,042 | - | 1,607 | 40,149 | 40,369 |
TSG Bench | Text | - | - | - | 11,820 | 4,289 |
DiscoSG-DS | Image, Text | 4,018 | - | 2,033 | 68,478 | 8,830 |
There are three subtasks:

Predicate classification
: given ground-truth labels and bounding boxes of object pairs, predict the predicate label.

Scene graph classification
: jointly classify the predicate labels and the objects' categories, given the ground-truth bounding boxes.

Scene graph detection
: detect the objects and their categories, and predict the predicate between object pairs.
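The three settings differ only in how much ground truth the model receives at test time. Below is a hedged sketch of the interfaces; the `model` methods (`predict_predicates`, `classify_objects`, `detect_objects`) are hypothetical placeholders, not a real library API.

```python
def predicate_classification(image, gt_boxes, gt_labels, model):
    # PredCls: ground-truth boxes and object labels are given;
    # the model only predicts the predicate for each object pair.
    return model.predict_predicates(image, gt_boxes, gt_labels)

def scene_graph_classification(image, gt_boxes, model):
    # SGCls: only ground-truth boxes are given; the model predicts
    # both the object categories and the predicates between pairs.
    obj_labels = model.classify_objects(image, gt_boxes)
    return obj_labels, model.predict_predicates(image, gt_boxes, obj_labels)

def scene_graph_detection(image, model):
    # SGDet: nothing is given; the model detects boxes and categories,
    # then predicts the predicates between detected object pairs.
    boxes, obj_labels = model.detect_objects(image)
    return boxes, obj_labels, model.predict_predicates(image, boxes, obj_labels)
```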
-
Compile Scene Graphs with Reinforcement Learning
R1-based model
R1-SGG is a novel framework that leverages visual instruction tuning enhanced by reinforcement learning (RL). The visual instruction tuning stage follows a conventional supervised fine-tuning (SFT) paradigm, i.e., fine-tuning the model on prompt-response pairs with a cross-entropy loss. For the RL stage, we adopt GRPO, an online policy optimization algorithm, for which a node-level reward and an edge-level reward are designed.
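The exact reward design is specified in the paper; the snippet below is only a rough sketch of what a combined node-level and edge-level reward could look like, with the matching rule, IoU threshold, and weights all assumed for illustration.

```python
def iou(a, b):
    """Intersection-over-Union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def node_reward(pred_objs, gt_objs, iou_thr=0.5):
    """Fraction of ground-truth (label, box) objects matched by label and IoU (assumed rule)."""
    hits = 0
    for g_label, g_box in gt_objs:
        if any(p_label == g_label and iou(p_box, g_box) >= iou_thr
               for p_label, p_box in pred_objs):
            hits += 1
    return hits / max(len(gt_objs), 1)

def edge_reward(pred_triplets, gt_triplets):
    """Fraction of ground-truth (subject, predicate, object) triplets recovered."""
    return len(set(pred_triplets) & set(gt_triplets)) / max(len(gt_triplets), 1)

def total_reward(pred_objs, pred_triplets, gt_objs, gt_triplets, w_node=0.5, w_edge=0.5):
    # Assumed weighting between the two reward terms.
    return w_node * node_reward(pred_objs, gt_objs) + w_edge * edge_reward(pred_triplets, gt_triplets)
```
-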
Hallucinate, Ground, Repeat: A Framework for Generalized Visual Relationship Detection
-
PRISM-0: A Predicate-Rich Scene Graph Generation Framework for Zero-Shot Open-Vocabulary Tasks
-
From Data to Modeling: Fully Open-vocabulary Scene Graph Generation
-
Open World Scene Graph Generation using Vision Language Models
-
Conformal Prediction and MLLM aided Uncertainty Quantification in Scene Graph Generation
-
LLaVA-SpaceSGG: Visual Instruct Tuning for Open-vocabulary Scene Graph Generation with Enhanced Spatial Relations
-
Tri-modal Confluence with Temporal Dynamics for Scene Graph Generation in Operating Rooms
-
Scene Graph Generation Strategy with Co-occurrence Knowledge and Learnable Term Frequency
-
Scene Graph Generation with Role-Playing Large Language Models
-
SkySenseGPT: A Fine-Grained Instruction Tuning Dataset and Model for Remote Sensing Vision-Language Understanding
-
VLPrompt: Vision-Language Prompting for Panoptic Scene Graph Generation
-
SceneLLM: Implicit Language Reasoning in LLM for Dynamic Scene Graph Generation
-
From Pixels to Graphs: Open-Vocabulary Scene Graph Generation with Vision-Language Models
-
LLM4SGG: Large Language Models for Weakly Supervised Scene Graph Generation
-
Visually-Prompted Language Model for Fine-Grained Scene Graph Generation in an Open World
-
GPT4SGG: Synthesizing Scene Graphs from Holistic and Region-specific Narratives
-
Less is More: Toward Zero-Shot Local Scene Graph Generation via Foundation Models
-
Enhancing Scene Graph Generation with Hierarchical Relationships and Commonsense Knowledge
🙆♀️👈
-
Hybrid Reciprocal Transformer with Triplet Feature Alignment for Scene Graph Generation
-
Navigating the Unseen: Zero-shot Scene Graph Generation via Capsule-Based Equivariant Features
-
A Reverse Causal Framework to Mitigate Spurious Correlations for Debiasing Scene Graph Generation
-
CoPa-SG: Dense Scene Graphs with Parametric and Proto-Relations
Introduce the concept of parametric relations
To eliminate ambiguous predicate definitions, we introduce the concept of parametric relations. In addition to a traditional predicate label, we store a parameter (e.g. an angle or a distance) that enables a more fine-grained representation. We show how existing models can be adapted to the new parametric scene graph generation task. Additionally, we introduce proto-relations as a novel technique for representing hypothetical relations. Given an anchor object and a predicate, a proto-relation describes the volume or area that another object would need to intersect to fulfill the associated relation with the anchor object. Proto-relations can encode information such as "somewhere next to the TV" or "the area behind the sofa". This representation will arguably be useful for agents that use scene graphs as their intermediate knowledge state.
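As a small illustration of what storing a parameter alongside the predicate might look like (the names and units are assumptions, not the CoPa-SG format):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ParametricRelation:
    """A relation that carries a continuous parameter alongside its predicate label."""
    subject_id: int
    predicate: str                      # e.g. "left of", "near"
    object_id: int
    parameter: Optional[float] = None   # e.g. an angle in degrees or a distance in metres

# "the chair is 0.8 m to the left of the table", instead of just "left of"
rel = ParametricRelation(subject_id=3, predicate="left of", object_id=7, parameter=0.8)
```
-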
HOIverse: A Synthetic Scene Graph Dataset With Human Object Interactions
-
Robo-SGG: Exploiting Layout-Oriented Normalization and Restitution for Robust Scene Graph Generation
-
Taking A Closer Look at Interacting Objects: Interaction-Aware Open Vocabulary Scene Graph Generation
-
Relation-aware Hierarchical Prompt for Open-vocabulary Scene Graph Generation
-
RA-SGG: Retrieval-Augmented Scene Graph Generation Framework via Multi-Prototype Learning
-
Taking A Closer Look at Interacting Objects: Interaction-Aware Open Vocabulary Scene Graph Generation
-
UniQ: Unified Decoder with Task-specific Queries for Efficient Scene Graph Generation
-
Scene-Graph ViT: End-to-End Open-Vocabulary Visual Relationship Detection
-
Fine-Grained Scene Graph Generation via Sample-Level Bias Prediction
-
REACT: Real-time Efficiency and Accuracy Compromise for Tradeoffs in Scene Graph Generation
-
BCTR: Bidirectional Conditioning Transformer for Scene Graph Generation
-
Hydra-SGG: Hybrid Relation Assignment for One-stage Scene Graph Generation
-
Adaptive Self-training Framework for Fine-grained Scene Graph Generation
-
Scene Graph Generation Strategy with Co-occurrence Knowledge and Learnable Term Frequency
-
Expanding Scene Graph Boundaries: Fully Open-vocabulary Scene Graph Generation via Visual-Concept Alignment and Retention
-
Semantic Diversity-aware Prototype-based Learning for Unbiased Scene Graph Generation
-
Fine-Grained Scene Graph Generation via Sample-Level Bias Prediction
-
Multi-Granularity Sparse Relationship Matrix Prediction Network for End-to-End Scene Graph Generation
-
Groupwise Query Specialization and Quality-Aware Multi-Assignment for Transformer-based Visual Relationship Detection
-
Leveraging Predicate and Triplet Learning for Scene Graph Generation
-
DSGG: Dense Relation Transformer for an End-to-end Scene Graph Generation
-
HiKER-SGG: Hierarchical Knowledge Enhanced Robust Scene Graph Generation
-
EGTR: Extracting Graph from Transformer for Scene Graph Generation
-
STAR: A First-Ever Dataset and A Large-Scale Benchmark for Scene Graph Generation in Large-Size Satellite Imagery
-
Improving Scene Graph Generation with Relation Words’ Debiasing in Vision-Language Models
-
Adaptive Visual Scene Understanding: Incremental Scene Graph Generation
-
Ensemble Predicate Decoding for Unbiased Scene Graph Generation
-
ReCon1M:A Large-scale Benchmark Dataset for Relation Comprehension in Remote Sensing Imagery
-
RepSGG: Novel Representations of Entities and Relationships for Scene Graph Generation
-
Hierarchical Relationships: A New Perspective to Enhance Scene Graph Generation
-
Improving Scene Graph Generation with Superpixel-Based Interaction Learning
-
Unbiased Scene Graph Generation via Two-stage Causal Modeling
-
Zero-Shot Scene Graph Generation via Triplet Calibration and Reduction
-
Evidential Uncertainty and Diversity Guided Active Learning for Scene Graph Generation
-
Prototype-based Embedding Network for Scene Graph Generation
-
IS-GGT: Iterative Scene Graph Generation With Generative Transformers
-
Learning to Generate Language-supervised and Open-vocabulary Scene Graph using Pre-trained Visual-Semantic Space
-
Fast Contextual Scene Graph Generation with Unbiased Context Augmentation
-
Devil’s on the Edges: Selective Quad Attention for Scene Graph Generation
-
Fine-Grained is Too Coarse: A Novel Data-Centric Approach for Efficient Scene Graph Generation
-
Vision Relation Transformer for Unbiased Scene Graph Generation
-
Compositional Feature Augmentation for Unbiased Scene Graph Generation
-
The Devil Is in the Labels: Noisy Label Correction for Robust Scene Graph Generation
-
Unsupervised Vision-Language Parsing: Seamlessly Bridging Visual Scene Graphs with Language Structures via Dependency Relationships
-
Not All Relations are Equal: Mining Informative Labels for Scene Graph Generation
-
Towards Open-vocabulary Scene Graph Generation with Prompt-based Finetuning
-
Unbiased Heterogeneous Scene Graph Generation with Relation-Aware Message Passing Neural Network
-
VARSCENE: A Deep Generative Model for Realistic Scene Graph Synthesis
-
Linguistic Structures as Weak Supervision for Visual Scene Graph Generation
-
CogTree: Cognition Tree Loss for Unbiased Scene Graph Generation
-
Learning to Generate Scene Graph from Natural Language Supervision
-
Context-Aware Scene Graph Generation With Seq2Seq Transformers
-
Generative Compositional Augmentations for Scene Graph Prediction
-
GPS-Net: Graph Property Sensing Network for Scene Graph Generation
-
Learning to Compose Dynamic Tree Structures for Visual Contexts
-
Knowledge-Embedded Routing Network for Scene Graph Generation
-
Scene Graph Generation From Objects, Phrases and Region Captions
Compared with a traditional scene graph, in PSG each object is grounded by a panoptic segmentation mask, achieving a comprehensive structured scene representation.
-
Relation-R1: Cognitive Chain-of-Thought Guided Reinforcement Learning for Unified Relational Comprehension
R1-enhanced Visual Relation Reasoning
This work introduces an R1-based unified framework for joint binary and N-ary relation reasoning with grounded cues.
-
Pair then Relation: Pair-Net for Panoptic Scene Graph Generation
-
From Easy to Hard: Learning Curricular Shape-aware Features for Robust Panoptic Scene Graph Generation
-
A Fair Ranking and New Model for Panoptic Scene Graph Generation
-
OpenPSG: Open-set Panoptic Scene Graph Generation via Large Multimodal Models
-
Panoptic scene graph generation with semantics-prototype learning
-
TextPSG: Panoptic Scene Graph Generation from Textual Descriptions
-
HiLo: Exploiting high low frequency relations for unbiased panoptic scene graph generation
-
Haystack: A Panoptic Scene Graph Dataset to Evaluate Rare Predicate Classes
-
Deep Generative Probabilistic Graph Neural Networks for Scene Graph Generation
Spatio-Temporal (Video) Scene Graph Generation, a.k.a. dynamic scene graph generation, aims to provide a detailed and structured interpretation of the whole scene by parsing an event into a sequence of interactions between different visual entities. It usually involves two subtasks:

Scene graph detection
: aims to generate scene graphs for given videos, comprising detection results of subject-object pairs and the associated predicates. An object localization is considered accurate when the Intersection over Union (IoU) between the prediction and the ground truth is greater than 0.5.

Predicate classification
: classify predicates for given oracle detection results of subject-object pairs.
-
Note
: Evaluation is conducted under two settings: ***With Constraint*** and ***No Constraints***. In the former, the generated graphs are restricted to at most one edge per subject-object pair, i.e., each pair is allowed only one predicate; in the latter, the graphs can have multiple edges. More details can be found in Metrics.
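A hedged sketch of how the two settings change the candidate set before Recall@K is computed (the data layout and scoring are assumptions for illustration):

```python
def select_candidates(scored_triplets, with_constraint=True):
    """
    scored_triplets: list of (subject_id, object_id, predicate, score) tuples.
    With Constraint: keep only the single highest-scoring predicate per
    subject-object pair; No Constraints: keep every scored predicate.
    """
    if not with_constraint:
        return sorted(scored_triplets, key=lambda t: t[3], reverse=True)
    best = {}
    for s, o, p, score in scored_triplets:
        if (s, o) not in best or score > best[(s, o)][3]:
            best[(s, o)] = (s, o, p, score)
    return sorted(best.values(), key=lambda t: t[3], reverse=True)

def recall_at_k(scored_triplets, gt_triplets, k=50, with_constraint=True):
    """Fraction of ground-truth (subject, object, predicate) triplets in the top-k."""
    ranked = select_candidates(scored_triplets, with_constraint)[:k]
    top_k = {(s, o, p) for s, o, p, _ in ranked}
    return len(top_k & set(gt_triplets)) / max(len(gt_triplets), 1)
```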
-
What can Off-the-Shelves Large Multi-Modal Models do for Dynamic Scene Graph Generation?
-
Weakly Supervised Video Scene Graph Generation via Natural Language Supervision
-
Tri-modal Confluence with Temporal Dynamics for Scene Graph Generation in Operating Rooms
-
DIFFVSGG: Diffusion-Driven Online Video Scene Graph Generation
-
Towards Unbiased and Robust Spatio-Temporal Scene Graph Generation and Anticipation
-
HyperGLM: HyperGraph for Video Scene Graph Generation and Anticipation
-
SAMJAM: Zero-Shot Video Scene Graph Generation for Egocentric Kitchen Videos
-
Salient Temporal Encoding for Dynamic Scene Graph Generation
-
SAMJAM: Zero-Shot Video Scene Graph Generation for Egocentric Kitchen Videos
-
Motion-aware Contrastive Learning for Temporal Panoptic Scene Graph Generation
-
End-to-end Open-vocabulary Video Visual Relationship Detection using Multi-modal Prompting
-
CYCLO: Cyclic Graph Transformer Approach to Multi-Object Relationship Modeling in Aerial Videos
-
OED: Towards One-stage End-to-End Dynamic Scene Graph Generation
-
Action Scene Graphs for Long-Form Understanding of Egocentric Videos
-
HIG: Hierarchical Interlacement Graph Approach to Scene Graph Generation in Video Understanding
Summary
Introduces a new dataset that delves into interactivity understanding within visual content by deriving scene graph representations from dense interactivities among humans and objects.
-
Action Scene Graphs for Long-Form Understanding of Egocentric Videos
-
End-to-End Video Scene Graph Generation With Temporal Propagation Transformer
-
Cross-Modality Time-Variant Relation Learning for Generating Dynamic Scene Graphs
-
Video Scene Graph Generation from Single-Frame Weak Supervision
-
Prior Knowledge-driven Dynamic Scene Graph Generation with Causal Inference
-
Exploiting Long-Term Dependencies for Generating Dynamic Scene Graphs
-
VRDFormer: End-to-End Video Visual Relation Detection with Transformers
-
Dynamic Scene Graph Generation via Anticipatory Pre-training
-
Meta Spatio-Temporal Debiasing for Video Scene Graph Generation
-
Spatial-temporal transformer for dynamic scene graph generation
-
Target adaptive context aggregation for video scene graph generation
Given a 3D point cloud, 3D Scene Graph Generation aims to map the input point cloud to a reliable, semantically structured scene graph.
-
Open-Vocabulary Functional 3D Scene Graphs for Real-World Indoor Spaces
-
Belief Scene Graph
A utility-enhanced extension of a given incomplete scene graph
$G'$, by incorporating objects in $C$ (i.e., the object sets relevant for a robotic mission) into $G'$, using the learnt CECI (i.e., Computation of Expectation of finding objects in $C$ based on Correlation Information) information. Belief Scene Graphs enable high-level reasoning and optimized task planning involving the set $C$, which was impossible with the incomplete $G'$.

Explanation: Belief Scene Graphs (BSG) extend traditional 3D scene graphs and aim to exploit local information for efficient high-level task planning. The core of the paper is a graph-based learning method for computing "beliefs" (also called "expectations") over a 3D scene graph; these expectations are used to strategically add new nodes (called "blind nodes") that are relevant to the robotic task but have not yet been observed.
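A minimal sketch of the idea (hypothetical data structures, not the paper's implementation): an incomplete graph $G'$ is augmented with unobserved "blind" nodes, each carrying a belief/expectation score.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class BeliefNode:
    name: str
    observed: bool          # False -> a "blind node" added from learned expectations
    belief: float = 1.0     # expectation of finding this object (1.0 when observed)

@dataclass
class BeliefSceneGraph:
    nodes: Dict[str, BeliefNode] = field(default_factory=dict)
    edges: List[Tuple[str, str, str]] = field(default_factory=list)  # (src, relation, dst)

    def add_blind_node(self, name: str, anchor: str, relation: str, belief: float):
        """Add an unobserved, mission-relevant object together with its expectation score."""
        self.nodes[name] = BeliefNode(name, observed=False, belief=belief)
        self.edges.append((anchor, relation, name))

# Incomplete graph G': only a kitchen with a counter has been observed so far.
bsg = BeliefSceneGraph()
bsg.nodes["kitchen"] = BeliefNode("kitchen", observed=True)
bsg.nodes["counter"] = BeliefNode("counter", observed=True)
bsg.edges.append(("kitchen", "contains", "counter"))

# The mission needs a mug (in set C); a learned model expects one near the counter.
bsg.add_blind_node("mug", anchor="counter", relation="supports", belief=0.7)
```
-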
GaussianGraph: 3D Gaussian-based Scene Graph Generation for Open-world Scene Understanding
-
DynamicGSG: Dynamic 3D Gaussian Scene Graphs for Environment Adaptation
-
ConceptGraphs: Open-Vocabulary 3D Scene Graphs for Perception and Planning
-
Point2Graph: An End-to-end Point Cloud-based 3D Open-Vocabulary Scene Graph for Robot Navigation
-
Heterogeneous Graph Learning for Scene Graph Prediction in 3D Point Clouds
-
EchoScene: Indoor Scene Generation via Information Echo over Scene Graph Diffusion
-
Weakly-Supervised 3D Scene Graph Generation via Visual-Linguistic Assisted Pseudo-labeling
-
SGRec3D: Self-Supervised 3D Scene Graph Learning via Object-Level Scene Reconstruction
-
Open3DSG: Open-Vocabulary 3D Scene Graphs from Point Clouds with Queryable Objects and Open-Set Relationships
-
CLIP-Driven Open-Vocabulary 3D Scene Graph Generation via Cross-Modality Contrastive Learning
-
Incremental 3D Semantic Scene Graph Prediction from RGB Sequences
-
VL-SAT: Visual-Linguistic Semantics Assisted Training for 3D Semantic Scene Graph Prediction in Point Cloud
-
3D Spatial Multimodal Knowledge Accumulation for Scene Graph Prediction in Point Cloud
-
Aria Digital Twin: A New Benchmark Dataset for Egocentric 3D Machine Perception
-
Lang3DSG: Language-based contrastive pre-training for 3D Scene Graph prediction
-
SceneGraphFusion: Incremental 3D Scene Graph Prediction from RGB-D Sequences
-
Exploiting Edge-Oriented Reasoning for 3D Point-based Scene Graph Analysis
-
Learning 3D Semantic Scene Graphs from 3D Indoor Reconstructions
-
3D Scene Graph: A Structure for Unified Semantics, 3D Space, and Camera
-
EgoExOR: An Ego-Exo-Centric Operating Room Dataset for Surgical Activity Understanding
-
Learning 4D Panoptic Scene Graph Generation from Rich 2D Visual Scene
-
MM-OR: A Large Multimodal Operating Room Dataset for Semantic Understanding of High-Intensity Surgical Environments
-
RealGraph: A Multiview Dataset for 4D Real-world Context Graph Generation
-
DiscoSG: Towards Discourse-Level Text Scene Graph Parsing through Iterative Graph Refinement
-
LLM Meets Scene Graph: Can Large Language Models Understand and Generate Scene Graphs? A Benchmark and Empirical Study
-
FACTUAL: A Benchmark for Faithful and Consistent Textual Scene Graph Parsing
-
Scene Graph Parsing via Abstract Meaning Representation in Pre-trained Language Models
-
Generating Semantically Precise Scene Graphs from Textual Descriptions for Improved Image Retrieval
- Can Large Vision Language Models Read Maps like a Human?
Map Space Scene Graph (MSSG) as an indexing data structure for human-readable maps.
In this paper, we introduce MapBench, the first dataset specifically designed for human-readable, pixel-based map navigation outdoors, curated from complex path-finding scenarios. MapBench comprises over 1,600 pixel-space map path-finding problems from 100 diverse maps. In MapBench, LVLMs generate language-based navigation instructions given a map image and a query with beginning and end landmarks. For each map, MapBench provides a Map Space Scene Graph (MSSG) as an indexing data structure to convert between natural language and map space, and to evaluate VLM-generated results. We demonstrate that MapBench significantly challenges state-of-the-art LVLMs, both with zero-shot prompting and with a Chain-of-Thought (CoT) augmented reasoning framework that decomposes map navigation into sequential cognitive processes.
- Universal Scene Graph Generation
A novel representation capable of fully characterizing comprehensive semantic scenes from any given combination of modality inputs.
Scene graph (SG) representations can neatly and efficiently describe scene semantics, which has driven sustained intensive research in SG generation. In the real world, multiple modalities often coexist, with different types, such as images, text, video, and 3D data, expressing distinct characteristics. Unfortunately, current SG research is largely confined to single-modality scene modeling, preventing the full utilization of the complementary strengths of different modality SG representations in depicting holistic scene semantics. To this end, we introduce `Universal SG (USG)`, a novel representation capable of fully characterizing comprehensive semantic scenes from any given combination of modality inputs, encompassing modality-invariant and modality-specific scenes, as shown in Fig. 1. Further, we tailor a niche-targeting USG parser, USG-Par, which effectively addresses two key bottlenecks: cross-modal object alignment and out-of-domain challenges. We design USG-Par with a modular architecture for end-to-end USG generation, in which we devise an object associator to relieve the modality gap for cross-modal object alignment. Further, we propose a text-centric scene contrasting learning mechanism to mitigate domain imbalances by aligning multimodal objects and relations with textual SGs.
-
SCENIR: Visual Semantic Clarity through Unsupervised Scene Graph Retrieval
-
SceneGraphLoc: Cross-Modal Coarse Visual Localization on 3D Scene Graphs
SceneGraphLoc addresses the novel problem of localizing a query image in a database of 3D scenes represented as compact multi-modal 3D scene graphs
-
Composing Object Relations and Attributes for Image-Text Matching
-
Cross-modal Scene Graph Matching for Relationship-aware Image-Text Retrieval
-
Fine-Grained Video Captioning through Scene Graph Consolidation
-
The Devil is in the Distributions: Explicit Modeling of Scene Content is Key in Zero-Shot Video Captioning
-
Benchmarking Large Vision-Language Models via Directed Scene Graph for Comprehensive Image Captioning
-
Graph-Based Captioning: Enhancing Visual Descriptions by Interconnecting Region Captions
Introducing new dataset GBC10M
Humans describe complex scenes with compositionality, using simple text descriptions enriched with links and relationships. While vision-language research has aimed to develop models with compositional understanding capabilities, this is not yet reflected in existing datasets which, for the most part, still use plain text to describe images. In this work, we propose a new annotation strategy, graph-based captioning (GBC), that describes an image using a labelled graph structure with nodes of various types. We demonstrate that GBC can be produced automatically, using off-the-shelf multimodal LLMs and open-vocabulary detection models, by building a new dataset, GBC10M, which gathers GBC annotations for about 10M images of the CC12M dataset.
-
Cross2StrA: Unpaired Cross-lingual Image Captioning with Cross-lingual Cross-modal Structure-pivoted Alignment
-
Comprehensive Image Captioning via Scene Graph Decomposition
-
From Show to Tell: A Survey on Deep Learning-based Image Captioning
-
SurGrID: Controllable Surgical Simulation via Scene Graph to Image Diffusion
-
Neuro-Symbolic Scene Graph Conditioning for Synthetic Image Dataset Generation
-
SurGrID: Controllable Surgical Simulation via Scene Graph to Image Diffusion
-
LAION-SG: An Enhanced Large-Scale Dataset for Training Complex Image-Text Models with Structural Annotations
-
Generate Any Scene: Evaluating and Improving Text-to-Vision Generation with Scene Graph Programming
-
LAION-SG: An Enhanced Large-Scale Dataset for Training Complex Image-Text Models with Structural Annotations
-
SSGEdit: Bridging LLM with Text2Image Generative Model for Scene Graph-based Image Editing
-
SG-Adapter: Enhancing Text-to-Image Generation with Scene Graph Guidance
-
What Makes a Scene ? Scene Graph-based Evaluation and Feedback for Controllable Generation
-
Joint Generative Modeling of Scene Graphs and Images via Diffusion Models
-
Image Synthesis with Graph Conditioning: CLIP-Guided Diffusion Models for Scene Graphs
-
R3CD: Scene Graph to Image Generation with Relation-Aware Compositional Contrastive Control Diffusion
-
Scene Graph Disentanglement and Composition for Generalizable Complex Image Generation
-
Imagine that! abstract-to-intricate text-to-image synthesis with scene graph hallucination diffusion
-
SceneGenie: Scene Graph Guided Diffusion Models for Image Synthesis
-
Diffusion-Based Scene Graph to Image Generation with Masked Contrastive Pre-Training
-
OSCAR-Net: Object-centric Scene Graph Attention for Image Attribution
-
FreeQ-Graph: Free-form Querying with Semantic Consistent Scene Graph for 3D Scene Understanding
-
Seeing Beyond the Scene: Enhancing Vision-Language Models with Interactional Reasoning
-
GaussianGraph: 3D Gaussian-based Scene Graph Generation for Open-world Scene Understanding
-
A Schema-Guided Reason-while-Retrieve framework for Reasoning on Scene Graphs with Large-Language-Models (LLMs)
-
Generative Visual Commonsense Answering and Explaining with Generative Scene Graph Constructing
-
Fine-Grained Video Captioning through Scene Graph Consolidation
-
STEP: Enhancing Video-LLMs’ Compositional Reasoning by Spatio-Temporal Graph-guided Self-Training
-
SpatialRGPT: Grounded Spatial Reasoning in Vision-Language Models
-
Towards Flexible Visual Relationship Segmentation
A single model that seamlessly integrates HOI detection, scene graph generation, and referring relationships
Visual relationship understanding has been studied separately in human-object interaction (HOI) detection, scene graph generation (SGG), and referring relationships (RR) tasks. FleVRS is a single model that seamlessly integrates all three. It leverages the synergy between text and image modalities to ground various types of relationships from images, and uses textual features from vision-language models for visual conceptual understanding.
-
LLaVA-SG: Leveraging Scene Graphs as Visual Semantic Expression in Vision-Language Models
-
Multi-modal Situated Reasoning in 3D Scenes
Introducing a large-scale multimodal situated reasoning dataset, scalably collected leveraging 3D scene graphs and vision-language models (VLMs) across a diverse range of real-world 3D scenes
MSQA includes 251K situated question-answering pairs across 9 distinct question categories, covering complex scenarios and object modalities within 3D scenes. We introduce a novel interleaved multi-modal input setting in our benchmark, providing text, images, and point clouds for situation and question description, aiming to resolve the ambiguity of describing situations with single-modality inputs (e.g., text).
-
SOK-Bench: A Situated Video Reasoning Benchmark with Aligned Open-World Knowledge
-
VQA-GNN: Reasoning with Multimodal Knowledge via Graph Neural Networks for Visual Question Answering
-
Semantic Compositions Enhance Vision-Language Contrastive Learning
-
Compositional Chain-of-Thought Prompting for Large Multimodal Models
-
Dysen-VDM: Empowering Dynamics-aware Text-to-Video Diffusion with LLMs
-
The All-Seeing Project V2: Towards General Relation Comprehension of the Open World
New dataset and New Task (Relation Conversation)
We propose a novel task, termed Relation Conversation (ReC), which unifies the formulation of text generation, object localization, and relation comprehension. Based on the unified formulation, we construct the AS-V2 dataset, which consists of 127K high-quality relation conversation samples, to unlock the ReC capability for Multi-modal Large Language Models (MLLMs).
-
The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World
New dataset and a unified vision-language model for open-world panoptic visual recognition and understanding
We propose a new large-scale dataset (AS-1B) for open-world panoptic visual recognition and understanding, using an economical semi-automatic data engine that combines the power of off-the-shelf vision/language models and human feedback. Moreover, we develop a unified vision-language foundation model (ASM) for open-world panoptic visual recognition and understanding. Aligning with LLMs, our ASM supports versatile image-text retrieval and generation tasks, demonstrating impressive zero-shot capability.
-
Cross-modal Attention Congruence Regularization for Vision-Language Relation Alignment
-
Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs
-
Fine-Grained Semantically Aligned Vision-Language Pre-Training
-
ERNIE-ViL: Knowledge Enhanced Vision-Language Representations through Scene Graphs
-
M3S: Scene Graph Driven Multi-Granularity Multi-Task Learning for Multi-Modal NER
-
Information screening whilst exploiting! multimodal relation extraction with feature denoising and multimodal topic modeling
-
Multimodal Relation Extraction with Efficient Graph Alignment
-
LLM-driven Indoor Scene Layout Generation via Scaled Human-aligned Data Synthesis and Multi-Stage Preference Optimization
-
Towards Terrain-Aware Task-Driven 3D Scene Graph Generation in Outdoor Environments
-
MMGDreamer: Mixed-Modality Graph for Geometry-Controllable 3D Indoor Scene Generation
-
Toward Scene Graph and Layout Guided Complex 3D Scene Generation
-
LAYOUTDREAMER: Physics-guided Layout for Text-to-3D Compositional Scene Generation
-
PhiP-G: Physics-Guided Text-to-3D Compositional Scene Generation
-
CAST: Component-Aligned 3D Scene Reconstruction from an RGB Image
-
EchoScene: Indoor Scene Generation via Information Echo over Scene Graph Diffusion
-
INSTRUCTLAYOUT: Instruction-Driven 2D and 3D Layout Synthesis with Semantic Graph Prior
-
Compositional 3D Scene Synthesis with Scene Graph Guided Layout-Shape Generation
-
GraphDreamer: Compositional 3D Scene Synthesis from Scene Graphs
-
CommonScenes: Generating Commonsense 3D Indoor Scenes with Scene Graph Diffusion
-
Graph-to-3D: End-to-End Generation and Manipulation of 3D Scenes Using Scene Graphs
-
Reefknot: A Comprehensive Benchmark for Relation Hallucination Evaluation, Analysis and Mitigation in Multimodal Large Language Models
Introducing a benchmark based on a scene graph dataset
Specifically, we first provide a systematic definition of relation hallucinations, integrating perspectives from perceptive and cognitive domains. Furthermore, we construct the relation-based corpus utilizing the representative scene graph dataset Visual Genome (VG), from which semantic triplets follow real-world distributions.
-
BACON: Supercharge Your VLM with Bag-of-Concept Graph to Mitigate Hallucinations
-
Mitigating Hallucination in Visual Language Models with Visual Supervision
-
Imagine, Verify, Execute: Memory-Guided Agentic Exploration with Vision-Language Models
-
Visual Environment-Interactive Planning for Embodied Complex-Question Answering
-
FunGraph: Functionality Aware 3D Scene Graphs for Language-Prompted Scene Interaction
-
Domain-Conditioned Scene Graphs for State-Grounded Task Planning
-
SG-Nav: Online 3D Scene Graph Prompting for LLM-based Zero-shot Object Navigation
-
Open Vocabulary 3D Scene Understanding via Geometry Guided Self-Distillation
-
Semantic Abstraction: Open-World 3D Scene Understanding from 2D Vision-Language Models
-
Hierarchical Open-Vocabulary 3D Scene Graphs for Language-Grounded Robot Navigation
-
VeriGraph: Scene Graphs for Execution Verifiable Robot Planning
-
LLM-enhanced Scene Graph Learning for Household Rearrangement
household rearrangement
The household rearrangement task involves spotting misplaced objects in a scene and accommodating them in proper places.
-
Situational Instructions Database: Task Guidance in Dynamic Environments
Situational Instructions Database (SID)
Situational Instructions Database (SID) is a dataset for dynamic task guidance. It contains situationally-aware instructions for performing a wide range of everyday tasks or completing scenarios in 3D environments. The dataset provides step-by-step instructions for these scenarios which are grounded in the context of the situation. This context is defined through a scenario-specific scene graph that captures the objects, their attributes, and their relations in the environment. The dataset is designed to enable research in the areas of grounded language learning, instruction following, and situated dialogue.
-
RoboHop: Segment-based Topological Map Representation for Open-World Visual Navigation
-
LLM-Personalize: Aligning LLM Planners with Human Preferences via Reinforced Self-Training for Housekeeping Robots
- Zero-shot Referring Expression Comprehension via Structural Similarity Between Images and Captions
A triplet-matching objective to fine-tune vision-language alignment (VLA) models.
To mitigate this gap, we leverage large foundation models to disentangle both images and texts into triplets in the format of (subject, predicate, object). After that, grounding is accomplished by calculating the structural similarity matrix between visual and textual triplets with a VLA model, and subsequently propagating it to an instance-level similarity matrix. Furthermore, to equip VLA models with the ability of relationship understanding, we design a triplet-matching objective to fine-tune the VLA models on a collection of curated datasets containing abundant entity relationships.
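A rough sketch of the structural-matching idea (the `embed` function stands in for any text encoder and is an assumption, as is the mean-pooling of triplet parts; this is not the paper's implementation):

```python
import numpy as np

def triplet_similarity_matrix(visual_triplets, text_triplets, embed):
    """
    visual_triplets / text_triplets: lists of (subject, predicate, object) strings.
    embed: any function mapping a phrase to a 1-D numpy vector (assumed placeholder).
    Returns an (n_visual, n_text) structural similarity matrix.
    """
    def triplet_vec(t):
        # Pool the embeddings of subject, predicate, and object (assumed aggregation).
        return np.mean([embed(part) for part in t], axis=0)

    v = np.stack([triplet_vec(t) for t in visual_triplets])
    t = np.stack([triplet_vec(t) for t in text_triplets])
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    t /= np.linalg.norm(t, axis=1, keepdims=True)
    return v @ t.T  # cosine similarity between every visual/textual triplet pair
```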
- A traffic scene graph is constructed through spatial location rules. For example, BEV rules can be used to derive triplets such as `ego-vehicle, isin, lane-3`, `ego-vehicle, to right of, pedestrian-1`, and `ego-vehicle, very near, pedestrian-2`.
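A toy sketch of how such spatial-location rules might derive triplets from bird's-eye-view coordinates; the axis convention and the 5 m threshold are purely illustrative assumptions, not any dataset's actual rules.

```python
import math

def bev_triplets(a_name, a_xy, b_name, b_xy, near_thr=5.0):
    """
    Derive simple (subject, predicate, object) triplets for two agents from
    bird's-eye-view (BEV) map coordinates. The lateral axis is assumed to be x,
    and the 'very near' threshold of 5 m is illustrative only.
    """
    dx = a_xy[0] - b_xy[0]                      # lateral offset of a relative to b
    dist = math.hypot(dx, a_xy[1] - b_xy[1])    # Euclidean distance in the BEV plane
    triplets = [(a_name, "to right of" if dx > 0 else "to left of", b_name)]
    if dist < near_thr:
        triplets.append((a_name, "very near", b_name))
    return triplets

print(bev_triplets("ego-vehicle", (2.0, 0.0), "pedestrian-1", (0.0, 1.0)))
# [('ego-vehicle', 'to right of', 'pedestrian-1'), ('ego-vehicle', 'very near', 'pedestrian-1')]
```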
-
HKTSG: A Hierarchical Knowledge-Guided Traffic Scene Graph Representation Learning Framework for Intelligent Vehicles
-
Edge Feature-Enhanced Network for Collision Risk Assessment Using Traffic Scene Graphs
-
Learning from interaction-enhanced scene graph for pedestrian collision risk assessment
-
Toward driving scene understanding: A paradigm and benchmark dataset for ego-centric traffic scene graph representation
-
roadscene2vec: A tool for extracting and embedding road scene-graphs
-
Scene-Graph Augmented Data-Driven Risk Assessment of Autonomous Vehicle Decisions
-
A Review and Efficient Implementation of Scene Graph Generation Metrics
-
Semantic Similarity Score for Measuring Visual Similarity at Semantic Level
Here, we provide some toolkits for parsing scene graphs and other useful tools for reference.
-
A new benchmark for the task of Scene Graph Generation
This new codebase provides an up-to-date and easy-to-run implementation of common approaches in the field of Scene Graph Generation. You are welcome to try it out and contribute to this codebase.
-
2nd Workshop on Scene Graphs and Graph Representation Learning
-
First ICCV Workshop on Scene Graphs and Graph Representation Learning
[paper_list]