Awesome-Scene-Graph-Generation

📣 News

We're excited to introduce a new section: Hot Topics 🔥!

We'll regularly post interesting discussion topics in the Issues tab. If you're interested, feel free to jump in and share your thoughts! These discussions are purely for idea exchange and community engagement.

I'll also be collecting and sharing thought-provoking questions related to the future of scene graphs and scene understanding in general. Everyone is welcome to join the conversation!

🔍 First Topic:

  • "Are scene graphs still a good way to represent and understand scenes?"

    Scene graphs are a form of explicit scene representation. But with the rise of implicit scene representations, is this approach still effective? Which representation is more promising moving forward?

Let us know what you think in the discussion thread!

🎨 Introduction

A scene graph is a topological structure representing a scene described by text, an image, a video, or another modality. In this graph, nodes correspond to object bounding boxes with their category labels and attributes, while edges represent the pairwise relationships between objects.
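The definition above can be sketched as a minimal data structure. This is an illustrative sketch only; the field names and classes are hypothetical, not from any specific codebase:

```python
from dataclasses import dataclass, field

@dataclass
class ObjectNode:
    """A node: a bounding box with a category label and optional attributes."""
    category: str
    bbox: tuple                 # (x1, y1, x2, y2) in pixel coordinates
    attributes: list = field(default_factory=list)

@dataclass
class SceneGraph:
    """Nodes are objects; edges are (subject_idx, predicate, object_idx) triplets."""
    objects: list
    relations: list

# "A man riding a horse": two nodes, one directed edge.
man = ObjectNode("man", (10, 20, 110, 220))
horse = ObjectNode("horse", (40, 60, 300, 260), ["brown"])
sg = SceneGraph(objects=[man, horse], relations=[(0, "riding", 1)])
```

The same triplet-centric structure underlies most of the datasets and tasks listed below, regardless of the input modality.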


📕 Table of Contents


🌷 Scene Graph Datasets

| Dataset | Modality | Obj. Class | BBox | Rela. Class | Triplets | Instances |
|---|---|---|---|---|---|---|
| Visual Phrase | Image | 8 | 3,271 | 9 | 1,796 | 2,769 |
| Scene Graph | Image | 266 | 69,009 | 68 | 109,535 | 5,000 |
| VRD | Image | 100 | - | 70 | 37,993 | 5,000 |
| Open Images v7 | Image | 600 | 3,290,070 | 31 | 374,768 | 9,178,275 |
| Visual Genome | Image | 5,996 | 3,843,636 | 1,014 | 2,347,187 | 108,077 |
| GQA | Image | 200 | - | 310 | - | 3,795,907 |
| VrR-VG | Image | 1,600 | 282,460 | 117 | 203,375 | 58,983 |
| UnRel | Image | - | - | 18 | 76 | 1,071 |
| SpatialSense | Image | 3,679 | - | 9 | 13,229 | 11,569 |
| SpatialVOC2K | Image | 20 | 5,775 | 34 | 9,804 | 2,026 |
| OpenSG | Image (panoptic) | 133 | - | 56 | - | 49K |
| AUG | Image (overhead view) | 76 | - | 61 | - | - |
| STAR | Satellite imagery | 48 | 219,120 | 58 | 400,795 | 31,096 |
| ReCon1M | Satellite imagery | 60 | 859,751 | 64 | 1,149,342 | 21,392 |
| SkySenseGPT | Satellite imagery (instruction) | - | - | - | - | - |
| Traffic Scene Graph | Traffic image | 2,266 | - | 4,272 | - | 451 |
| ImageNet-VidVRD | Video | 35 | - | 132 | 3,219 | 100 |
| VidOR | Video | 80 | - | 50 | - | 10,000 |
| Action Genome | Video | 35 | 0.4M | 25 | 1.7M | 10,000 |
| AeroEye | Video (drone-view) | 56 | - | 384 | - | 2.2M |
| PVSG | Video (panoptic) | 126 | - | 57 | 4,587 | 400 |
| ASPIRe | Video (interlacements) | - | - | 4.5K | - | 1.5K |
| Ego-EASG | Video (ego-view) | 407 | - | 235 | - | - |
| 3D Semantic Scene Graphs (3DSSG) | 3D | 528 | - | 39 | - | 48K |
| PSG4D | 4D | 46 | - | 15 | - | - |
| 4D-OR | 4D (operating room) | 12 | - | 14 | - | - |
| MM-OR | 4D (operating room) | - | - | - | - | - |
| EgoExOR | 4D (operating room) | 36 | - | 22 | 568,235 | - |
| FACTUAL | Image, text | 4,042 | - | 1,607 | 40,149 | 40,369 |
| TSG Bench | Text | - | - | - | 11,820 | 4,289 |
| DiscoSG-DS | Image, text | 4,018 | - | 2,033 | 68,478 | 8,830 |


🍕 Scene Graph Generation

2D (Image) Scene Graph Generation

There are three subtasks:

  • Predicate classification: given the ground-truth labels and bounding boxes of object pairs, predict the predicate labels.
  • Scene graph classification: given the ground-truth bounding boxes, jointly classify the object categories and predicate labels.
  • Scene graph detection: detect the objects with their categories, and predict the predicates between object pairs.

LLM-based

Non-LLM-based

Panoptic Scene Graph Generation

Compared with traditional scene graphs, each object in PSG is grounded by a panoptic segmentation mask, yielding a comprehensive structured scene representation.

Spatio-Temporal (Video) Scene Graph Generation

Spatio-Temporal (Video) Scene Graph Generation, a.k.a. dynamic scene graph generation, aims to provide a detailed and structured interpretation of the whole scene by parsing an event into a sequence of interactions between different visual entities. It usually involves two subtasks:

  • Scene graph detection: generate scene graphs for given videos, comprising detection results for subject-object pairs and the associated predicates. An object localization is considered accurate when the Intersection over Union (IoU) between the prediction and the ground truth exceeds 0.5.
  • Predicate classification: classify predicates given oracle detection results for subject-object pairs.
  • **Note**: Evaluation is conducted under two settings: **With Constraint** and **No Constraints**. In the former, each subject-object pair is allowed at most one predicate (i.e., at most one edge); in the latter, graphs may have multiple edges per pair. See the Metrics section for details.
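A minimal sketch of how the two settings change Recall@K: the only difference is whether each subject-object pair is filtered down to its top-scoring predicate before the top-K cutoff. Function and variable names are illustrative, not from any benchmark codebase:

```python
def recall_at_k(gt_triplets, pred_triplets, k, with_constraint=True):
    """gt_triplets: set of (subj, pred, obj) ground-truth triplets.
    pred_triplets: list of (subj, pred, obj, score).
    Under "With Constraint", keep only the top-scoring predicate
    per subject-object pair before taking the top-k."""
    preds = sorted(pred_triplets, key=lambda t: -t[3])
    if with_constraint:
        seen_pairs, filtered = set(), []
        for s, p, o, score in preds:
            if (s, o) not in seen_pairs:
                seen_pairs.add((s, o))
                filtered.append((s, p, o, score))
        preds = filtered
    topk = {(s, p, o) for s, p, o, _ in preds[:k]}
    return len(topk & gt_triplets) / max(len(gt_triplets), 1)
```

With Constraint, a correct but second-ranked predicate for a pair is discarded, so the same predictions can score lower than under No Constraints.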

LLM-based

Non-LLM-based

Audio Scene Graph Generation

3D Scene Graph Generation

Given a 3D point cloud $P \in \mathbb{R}^{N \times 3}$ consisting of $N$ points, and a set of class-agnostic instance masks $M = \{M_1, \dots, M_K\}$ corresponding to $K$ entities in $P$, 3D scene graph generation aims to map the point cloud to a reliable, semantically structured scene graph $G = \{O, R\}$. Compared with 2D scene graph generation, the input is a point cloud rather than an image.
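A toy illustration of the input and output shapes in this formulation, assuming NumPy arrays; the object labels and relations are hypothetical:

```python
import numpy as np

# Input: point cloud P with N points and K class-agnostic instance masks.
N, K = 1000, 3
P = np.random.rand(N, 3)            # one (x, y, z) row per point
M = np.zeros((K, N), dtype=bool)    # boolean instance masks over the points
M[0, :300] = M[1, 300:700] = M[2, 700:] = True

# Output: a graph G = (O, R) with per-instance object labels and
# (subject, predicate, object) relations between instance indices.
O = ["chair", "table", "floor"]                       # hypothetical labels
R = [(0, "attached to", 2), (1, "standing on", 2)]    # hypothetical relations
```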

4D Scene Graph Generation

Textual Scene Graph Generation

Map Space Scene Graph

  • Can Large Vision Language Models Read Maps like a Human? Paper Star
    A Map Space Scene Graph (MSSG) serves as an indexing data structure for human-readable maps. This paper introduces MapBench, the first dataset specifically designed for human-readable, pixel-based map outdoor navigation, curated from complex path-finding scenarios. MapBench comprises over 1,600 pixel-space map path-finding problems drawn from 100 diverse maps. Given a map image and a query with beginning and end landmarks, LVLMs generate language-based navigation instructions. For each map, MapBench provides an MSSG as an indexing data structure for converting between natural language and map space and for evaluating VLM-generated results. We demonstrate that MapBench significantly challenges state-of-the-art LVLMs, under both zero-shot prompting and a Chain-of-Thought (CoT) augmented reasoning framework that decomposes map navigation into sequential cognitive processes.

Universal Scene Graph Generation

  • Universal Scene Graph Generation Paper Project_Page
    A novel representation capable of fully characterizing comprehensive semantic scenes from any given combination of modality inputs. Scene graph (SG) representations can neatly and efficiently describe scene semantics, which has driven sustained intensive research in SG generation. In the real world, multiple modalities often coexist, with different types, such as images, text, video, and 3D data, expressing distinct characteristics. Unfortunately, current SG research is largely confined to single-modality scene modeling, preventing the full utilization of the complementary strengths of different modality SG representations in depicting holistic scene semantics. To this end, we introduce `Universal SG (USG)`, a novel representation capable of fully characterizing comprehensive semantic scenes from any given combination of modality inputs, encompassing modality-invariant and modality-specific scenes, as shown in Fig. 1. Further, we tailor a niche-targeting USG parser, USG-Par, which effectively addresses two key bottlenecks: cross-modal object alignment and out-of-domain challenges. We design USG-Par with a modular architecture for end-to-end USG generation, in which an object associator relieves the modality gap for cross-modal object alignment. Further, we propose a text-centric scene contrastive learning mechanism to mitigate domain imbalances by aligning multimodal objects and relations with textual SGs.

🥝 Scene Graph Application

Image Retrieval

Image/Video Caption

2D Image Generation

Visual Reasoning

Enhanced VLM/MLLM

Information Extraction

3D Scene Generation

Mitigate Hallucination

Dynamic Environment Guidance

Privacy-sensitive Object Identification

Referring Expression Comprehension

  • Zero-shot Referring Expression Comprehension via Structural Similarity Between Images and Captions Paper Star
    A triplet-matching objective to fine-tune vision-language alignment (VLA) models. To mitigate this gap, we leverage large foundation models to disentangle both images and texts into triplets of the form (subject, predicate, object). Grounding is then accomplished by computing a structural similarity matrix between visual and textual triplets with a VLA model, and propagating it to an instance-level similarity matrix. Furthermore, to equip VLA models with relationship understanding, we design a triplet-matching objective to fine-tune the VLA models on a curated dataset containing abundant entity relationships.
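The structural-similarity step can be sketched as a plain cosine-similarity matrix between triplet embeddings. This is a hedged sketch only: the paper obtains the embeddings from a VLA model, whereas here they are assumed to be given as arrays:

```python
import numpy as np

def triplet_similarity_matrix(visual_embs, text_embs):
    """Cosine-similarity matrix between visual and textual triplet embeddings.
    Both inputs are assumed to have shape (num_triplets, dim)."""
    v = visual_embs / np.linalg.norm(visual_embs, axis=1, keepdims=True)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return v @ t.T  # shape: (num_visual, num_textual)
```

Propagating such a matrix to instance-level scores (the second step the abstract describes) would aggregate over all triplets that mention a given instance.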

Video Retrieval

Video Generation

Automated Driving & Intelligent Transport Systems

  • The traffic scene graph is constructed through spatial-location rules. For example, BEV rules derive triplets like (ego-vehicle, is in, lane-3), (ego-vehicle, to right of, pedestrian-1), and (ego-vehicle, very near, pedestrian-2).
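Such BEV rules can be sketched as simple coordinate thresholds. The distance thresholds and predicate names below are illustrative, not taken from any particular dataset:

```python
import math

def bev_relations(ego_xy, other_xy):
    """Derive coarse spatial predicates from bird's-eye-view coordinates.
    Assumes x grows to the right and y grows forward, with the ego
    vehicle at the origin; thresholds are in metres and illustrative."""
    dx = other_xy[0] - ego_xy[0]
    dy = other_xy[1] - ego_xy[1]
    dist = math.hypot(dx, dy)
    side = "to right of" if dx > 0 else "to left of"
    proximity = "very near" if dist < 5 else "near" if dist < 15 else "far from"
    return side, proximity
```

For a pedestrian 3 m to the right and 2 m ahead of the ego vehicle, these rules would emit the triplets (ego-vehicle, to right of, pedestrian) and (ego-vehicle, very near, pedestrian).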

🤶 Evaluation Metrics


🐱‍🚀 Miscellaneous

Toolkit

Here, we provide some toolkits for parsing scene graphs, along with other useful tools for reference.

Workshop

Survey

Interesting Works

⭐️ Star History

Star History Chart

About

This is a repository for listing papers on scene graph generation and application.
