We're excited to introduce a new section: Hot Topics 🔥!
We'll regularly post interesting discussion topics in the Issues tab. If you're interested, feel free to jump in and share your thoughts! These discussions are purely for idea exchange and community engagement.
I'll also be collecting and sharing thought-provoking questions related to the future of scene graphs and scene understanding in general. Everyone is welcome to join the conversation!
-
"Are scene graphs still a good way to represent and understand scenes?"
Scene graphs are a form of explicit scene representation. But with the rise of implicit scene representations, is this approach still effective? Which representation is more promising moving forward?
Let us know what you think in the discussion thread!
A scene graph is a topological structure representing a scene described by text, an image, a video, etc. In this graph, the nodes correspond to object bounding boxes with their category labels and attributes, while the edges represent the pairwise relationships between objects.
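For concreteness, here is a minimal, dataset-agnostic sketch of this structure in Python; the class and field names are illustrative only and not tied to any particular dataset or codebase.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class SceneObject:
    """A node: an object with its bounding box, category label, and attributes."""
    obj_id: int
    category: str
    bbox: Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)
    attributes: List[str] = field(default_factory=list)

@dataclass
class Relationship:
    """An edge: a pairwise (subject, predicate, object) relationship."""
    subject_id: int
    predicate: str
    object_id: int

@dataclass
class SceneGraph:
    objects: List[SceneObject]
    relationships: List[Relationship]

# Example: "a person riding a horse on grass"
graph = SceneGraph(
    objects=[
        SceneObject(0, "person", (120, 40, 260, 300), ["standing"]),
        SceneObject(1, "horse", (80, 150, 400, 420), ["brown"]),
        SceneObject(2, "grass", (0, 380, 640, 480)),
    ],
    relationships=[
        Relationship(0, "riding", 1),
        Relationship(1, "standing on", 2),
    ],
)
```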
- 🌷 Scene Graph Datasets
- 🍕 Scene Graph Generation
- 🥝 Scene Graph Application
- 🤶 Evaluation Metrics
- 🐱🚀 Miscellaneous
- ⭐️ Star History
Dataset | Modality | Obj. Class | BBox | Rela. Class | Triplets | Instances |
---|---|---|---|---|---|---|
Visual Phrase | Image | 8 | 3,271 | 9 | 1,796 | 2,769 |
Scene Graph | Image | 266 | 69,009 | 68 | 109,535 | 5,000 |
VRD | Image | 100 | - | 70 | 37,993 | 5,000 |
Open Images v7 | Image | 600 | 3,290,070 | 31 | 374,768 | 9,178,275 |
Visual Genome | Image | 5,996 | 3,843,636 | 1,014 | 2,347,187 | 108,077 |
GQA | Image | 200 | - | 310 | - | 3,795,907 |
VrR-VG | Image | 1,600 | 282,460 | 117 | 203,375 | 58,983 |
UnRel | Image | - | - | 18 | 76 | 1,071 |
SpatialSense | Image | 3,679 | - | 9 | 13,229 | 11,569 |
SpatialVOC2K | Image | 20 | 5,775 | 34 | 9,804 | 2,026 |
OpenSG | Image (panoptic) | 133 | - | 56 | - | 49K |
AUG | Image (Overhead View) | 76 | - | 61 | - | - |
STAR | Satellite Imagery | 48 | 219,120 | 58 | 400,795 | 31,096 |
ReCon1M | Satellite Imagery | 60 | 859,751 | 64 | 1,149,342 | 21,392 |
SkySenseGPT | Satellite Imagery (Instruction) | - | - | - | - | - |
Traffic Scene Graph | Traffic Image | 2,266 | - | 4,272 | - | 451 |
ImageNet-VidVRD | Video | 35 | - | 132 | 3,219 | 100 |
VidOR | Video | 80 | - | 50 | - | 10,000 |
Action Genome | Video | 35 | 0.4M | 25 | 1.7M | 10,000 |
AeroEye | Video (Drone-View) | 56 | - | 384 | - | 2.2M |
PVSG | Video (panoptic) | 126 | - | 57 | 4,587 | 400 |
ASPIRe | Video (Interlacements) | - | - | 4.5K | - | 1.5K |
Ego-EASG | Video (Ego-view) | 407 | - | 235 | - | - |
3D Semantic Scene Graphs (3DSSG) | 3D | 528 | - | 39 | - | 48K |
PSG4D | 4D | 46 | - | 15 | - | - |
4D-OR | 4D (operating room) | 12 | - | 14 | - | - |
MM-OR | 4D (operating room) | - | - | - | - | - |
EgoExOR | 4D (operating room) | 36 | - | 22 | 568,235 | - |
FACTUAL | Image, Text | 4,042 | - | 1,607 | 40,149 | 40,369 |
TSG Bench | Text | - | - | - | 11,820 | 4,289 |
DiscoSG-DS | Image, Text | 4,018 | - | 2,033 | 68,478 | 8,830 |
There are three subtasks:

Predicate classification
: given ground-truth labels and bounding boxes of object pairs, predict the predicate label.

Scene graph classification
: jointly classify the predicate labels and the objects' categories, given the ground-truth bounding boxes.

Scene graph detection
: detect the objects and their categories, and predict the predicate between object pairs.
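The three settings differ only in how much ground truth the model receives at test time. Below is a hedged sketch of the interfaces; the `model` methods (`predict_predicates`, `classify_objects`, `detect_objects`) are hypothetical placeholders, not a real library API.

```python
def predicate_classification(image, gt_boxes, gt_labels, model):
    # PredCls: ground-truth boxes and object labels are given;
    # the model only predicts the predicate for each object pair.
    return model.predict_predicates(image, gt_boxes, gt_labels)

def scene_graph_classification(image, gt_boxes, model):
    # SGCls: only ground-truth boxes are given; the model predicts
    # both the object categories and the predicates between pairs.
    obj_labels = model.classify_objects(image, gt_boxes)
    return obj_labels, model.predict_predicates(image, gt_boxes, obj_labels)

def scene_graph_detection(image, model):
    # SGDet: nothing is given; the model detects boxes and categories,
    # then predicts the predicates between detected object pairs.
    boxes, obj_labels = model.detect_objects(image)
    return boxes, obj_labels, model.predict_predicates(image, boxes, obj_labels)
```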
-
Compile Scene Graphs with Reinforcement Learning
R1-based model
R1-SGG is a novel framework that leverages visual instruction tuning enhanced by reinforcement learning (RL). The visual instruction tuning stage follows a conventional supervised fine-tuning (SFT) paradigm, i.e., fine-tuning the model on prompt-response pairs with a cross-entropy loss. For the RL stage, we adopt GRPO, an online policy optimization algorithm, for which a node-level reward and an edge-level reward are designed.
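The exact reward design is specified in the paper; the snippet below is only a rough sketch of what a combined node-level and edge-level reward could look like, with the matching rule, IoU threshold, and weights all assumed for illustration.

```python
def iou(a, b):
    """Intersection-over-Union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def node_reward(pred_objs, gt_objs, iou_thr=0.5):
    """Fraction of ground-truth (label, box) objects matched by label and IoU (assumed rule)."""
    hits = 0
    for g_label, g_box in gt_objs:
        if any(p_label == g_label and iou(p_box, g_box) >= iou_thr
               for p_label, p_box in pred_objs):
            hits += 1
    return hits / max(len(gt_objs), 1)

def edge_reward(pred_triplets, gt_triplets):
    """Fraction of ground-truth (subject, predicate, object) triplets recovered."""
    return len(set(pred_triplets) & set(gt_triplets)) / max(len(gt_triplets), 1)

def total_reward(pred_objs, pred_triplets, gt_objs, gt_triplets, w_node=0.5, w_edge=0.5):
    # Assumed weighting between the two reward terms.
    return w_node * node_reward(pred_objs, gt_objs) + w_edge * edge_reward(pred_triplets, gt_triplets)
```
-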
Hallucinate, Ground, Repeat: A Framework for Generalized Visual Relationship Detection
-
PRISM-0: A Predicate-Rich Scene Graph Generation Framework for Zero-Shot Open-Vocabulary Tasks
-
From Data to Modeling: Fully Open-vocabulary Scene Graph Generation
-
Open World Scene Graph Generation using Vision Language Models
-
Conformal Prediction and MLLM aided Uncertainty Quantification in Scene Graph Generation
-
LLaVA-SpaceSGG: Visual Instruct Tuning for Open-vocabulary Scene Graph Generation with Enhanced Spatial Relations
-
Tri-modal Confluence with Temporal Dynamics for Scene Graph Generation in Operating Rooms
-
Scene Graph Generation Strategy with Co-occurrence Knowledge and Learnable Term Frequency
-
Scene Graph Generation with Role-Playing Large Language Models
-
SkySenseGPT: A Fine-Grained Instruction Tuning Dataset and Model for Remote Sensing Vision-Language Understanding
-
VLPrompt: Vision-Language Prompting for Panoptic Scene Graph Generation
-
SceneLLM: Implicit Language Reasoning in LLM for Dynamic Scene Graph Generation
-
From Pixels to Graphs: Open-Vocabulary Scene Graph Generation with Vision-Language Models
-
LLM4SGG: Large Language Models for Weakly Supervised Scene Graph Generation
-
Visually-Prompted Language Model for Fine-Grained Scene Graph Generation in an Open World
-
GPT4SGG: Synthesizing Scene Graphs from Holistic and Region-specific Narratives
-
Less is More: Toward Zero-Shot Local Scene Graph Generation via Foundation Models
-
Enhancing Scene Graph Generation with Hierarchical Relationships and Commonsense Knowledge
🙆♀️👈
-
Hybrid Reciprocal Transformer with Triplet Feature Alignment for Scene Graph Generation
-
Navigating the Unseen: Zero-shot Scene Graph Generation via Capsule-Based Equivariant Features
-
A Reverse Causal Framework to Mitigate Spurious Correlations for Debiasing Scene Graph Generation
-
CoPa-SG: Dense Scene Graphs with Parametric and Proto-Relations
Introduce the concept of parametric relations
To eliminate ambiguous predicate definitions, we introduce the concept of parametric relations. In addition to a traditional predicate label, we store a parameter (e.g. an angle or a distance) that enables a more fine-grained representation. We show how existing models can be adapted to the new parametric scene graph generation task. Additionally, we introduce proto-relations as a novel technique for representing hypothetical relations. Given an anchor object and a predicate, a proto-relation describes the volume or area that another object would need to intersect to fulfill the associated relation with the anchor object. Proto-relations can encode information such as "somewhere next to the TV" or "the area behind the sofa". This representation will arguably be useful for agents that use scene graphs as their intermediate knowledge state.
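As a small illustration of what storing a parameter alongside the predicate might look like (the names and units are assumptions, not the CoPa-SG format):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ParametricRelation:
    """A relation that carries a continuous parameter alongside its predicate label."""
    subject_id: int
    predicate: str                      # e.g. "left of", "near"
    object_id: int
    parameter: Optional[float] = None   # e.g. an angle in degrees or a distance in metres

# "the chair is 0.8 m to the left of the table", instead of just "left of"
rel = ParametricRelation(subject_id=3, predicate="left of", object_id=7, parameter=0.8)
```
-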
HOIverse: A Synthetic Scene Graph Dataset With Human Object Interactions
-
Robo-SGG: Exploiting Layout-Oriented Normalization and Restitution for Robust Scene Graph Generation
-
Taking A Closer Look at Interacting Objects: Interaction-Aware Open Vocabulary Scene Graph Generation
-
Relation-aware Hierarchical Prompt for Open-vocabulary Scene Graph Generation
-
RA-SGG: Retrieval-Augmented Scene Graph Generation Framework via Multi-Prototype Learning
-
Taking A Closer Look at Interacting Objects: Interaction-Aware Open Vocabulary Scene Graph Generation
-
UniQ: Unified Decoder with Task-specific Queries for Efficient Scene Graph Generation
-
Scene-Graph ViT: End-to-End Open-Vocabulary Visual Relationship Detection
-
Fine-Grained Scene Graph Generation via Sample-Level Bias Prediction
-
REACT: Real-time Efficiency and Accuracy Compromise for Tradeoffs in Scene Graph Generation
-
BCTR: Bidirectional Conditioning Transformer for Scene Graph Generation
-
Hydra-SGG: Hybrid Relation Assignment for One-stage Scene Graph Generation
-
Adaptive Self-training Framework for Fine-grained Scene Graph Generation
-
Scene Graph Generation Strategy with Co-occurrence Knowledge and Learnable Term Frequency
-
Expanding Scene Graph Boundaries: Fully Open-vocabulary Scene Graph Generation via Visual-Concept Alignment and Retention
-
Semantic Diversity-aware Prototype-based Learning for Unbiased Scene Graph Generation
-
Fine-Grained Scene Graph Generation via Sample-Level Bias Prediction
-
Multi-Granularity Sparse Relationship Matrix Prediction Network for End-to-End Scene Graph Generation
-
Groupwise Query Specialization and Quality-Aware Multi-Assignment for Transformer-based Visual Relationship Detection
-
Leveraging Predicate and Triplet Learning for Scene Graph Generation
-
DSGG: Dense Relation Transformer for an End-to-end Scene Graph Generation
-
HiKER-SGG: Hierarchical Knowledge Enhanced Robust Scene Graph Generation
-
EGTR: Extracting Graph from Transformer for Scene Graph Generation
-
STAR: A First-Ever Dataset and A Large-Scale Benchmark for Scene Graph Generation in Large-Size Satellite Imagery
-
Improving Scene Graph Generation with Relation Words’ Debiasing in Vision-Language Models
-
Adaptive Visual Scene Understanding: Incremental Scene Graph Generation
-
Ensemble Predicate Decoding for Unbiased Scene Graph Generation
-
ReCon1M:A Large-scale Benchmark Dataset for Relation Comprehension in Remote Sensing Imagery
-
RepSGG: Novel Representations of Entities and Relationships for Scene Graph Generation
-
Hierarchical Relationships: A New Perspective to Enhance Scene Graph Generation
-
Improving Scene Graph Generation with Superpixel-Based Interaction Learning
-
Unbiased Scene Graph Generation via Two-stage Causal Modeling
-
Zero-Shot Scene Graph Generation via Triplet Calibration and Reduction
-
Evidential Uncertainty and Diversity Guided Active Learning for Scene Graph Generation
-
Prototype-based Embedding Network for Scene Graph Generation
-
IS-GGT: Iterative Scene Graph Generation With Generative Transformers
-
Learning to Generate Language-supervised and Open-vocabulary Scene Graph using Pre-trained Visual-Semantic Space
-
Fast Contextual Scene Graph Generation with Unbiased Context Augmentation
-
Devil’s on the Edges: Selective Quad Attention for Scene Graph Generation
-
Fine-Grained is Too Coarse: A Novel Data-Centric Approach for Efficient Scene Graph Generation
-
Vision Relation Transformer for Unbiased Scene Graph Generation
-
Compositional Feature Augmentation for Unbiased Scene Graph Generation
-
The Devil Is in the Labels: Noisy Label Correction for Robust Scene Graph Generation
-
Unsupervised Vision-Language Parsing: Seamlessly Bridging Visual Scene Graphs with Language Structures via Dependency Relationships
-
Not All Relations are Equal: Mining Informative Labels for Scene Graph Generation
-
Towards Open-vocabulary Scene Graph Generation with Prompt-based Finetuning
-
Unbiased Heterogeneous Scene Graph Generation with Relation-Aware Message Passing Neural Network
-
VARSCENE: A Deep Generative Model for Realistic Scene Graph Synthesis
-
Linguistic Structures as Weak Supervision for Visual Scene Graph Generation
-
CogTree: Cognition Tree Loss for Unbiased Scene Graph Generation
-
Learning to Generate Scene Graph from Natural Language Supervision
-
Context-Aware Scene Graph Generation With Seq2Seq Transformers
-
Generative Compositional Augmentations for Scene Graph Prediction
-
GPS-Net: Graph Property Sensing Network for Scene Graph Generation
-
Learning to Compose Dynamic Tree Structures for Visual Contexts
-
Knowledge-Embedded Routing Network for Scene Graph Generation
-
Scene Graph Generation From Objects, Phrases and Region Captions
Compared with a traditional scene graph, in PSG each object is grounded by a panoptic segmentation mask, achieving a comprehensive structured scene representation.
-
Relation-R1: Cognitive Chain-of-Thought Guided Reinforcement Learning for Unified Relational Comprehension
R1-enhanced Visual Relation Reasoning
This work introduces an R1-based unified framework for joint binary and N-ary relation reasoning with grounded cues.
-
Pair then Relation: Pair-Net for Panoptic Scene Graph Generation
-
From Easy to Hard: Learning Curricular Shape-aware Features for Robust Panoptic Scene Graph Generation
-
A Fair Ranking and New Model for Panoptic Scene Graph Generation
-
OpenPSG: Open-set Panoptic Scene Graph Generation via Large Multimodal Models
-
Panoptic scene graph generation with semantics-prototype learning
-
TextPSG: Panoptic Scene Graph Generation from Textual Descriptions
-
HiLo: Exploiting high low frequency relations for unbiased panoptic scene graph generation
-
Haystack: A Panoptic Scene Graph Dataset to Evaluate Rare Predicate Classes
-
Deep Generative Probabilistic Graph Neural Networks for Scene Graph Generation
Spatio-Temporal (Video) Scene Graph Generation, a.k.a. dynamic scene graph generation, aims to provide a detailed and structured interpretation of the whole scene by parsing an event into a sequence of interactions between different visual entities. It usually involves two subtasks:

Scene graph detection
: aims to generate scene graphs for given videos, comprising detection results of subject-object pairs and the associated predicates. An object localization is considered accurate when the Intersection over Union (IoU) between the prediction and the ground truth is greater than 0.5.

Predicate classification
: classify predicates for given oracle detection results of subject-object pairs.
-
Note
: Evaluation is conducted under two settings: ***With Constraint*** and ***No Constraints***. In the former, the generated graphs are restricted to at most one edge per subject-object pair, i.e., each pair is allowed only one predicate; in the latter, the graphs can have multiple edges. More details can be found in Metrics.
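A hedged sketch of how the two settings change the candidate set before Recall@K is computed (the data layout and scoring are assumptions for illustration):

```python
def select_candidates(scored_triplets, with_constraint=True):
    """
    scored_triplets: list of (subject_id, object_id, predicate, score) tuples.
    With Constraint: keep only the single highest-scoring predicate per
    subject-object pair; No Constraints: keep every scored predicate.
    """
    if not with_constraint:
        return sorted(scored_triplets, key=lambda t: t[3], reverse=True)
    best = {}
    for s, o, p, score in scored_triplets:
        if (s, o) not in best or score > best[(s, o)][3]:
            best[(s, o)] = (s, o, p, score)
    return sorted(best.values(), key=lambda t: t[3], reverse=True)

def recall_at_k(scored_triplets, gt_triplets, k=50, with_constraint=True):
    """Fraction of ground-truth (subject, object, predicate) triplets in the top-k."""
    ranked = select_candidates(scored_triplets, with_constraint)[:k]
    top_k = {(s, o, p) for s, o, p, _ in ranked}
    return len(top_k & set(gt_triplets)) / max(len(gt_triplets), 1)
```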
-
What can Off-the-Shelves Large Multi-Modal Models do for Dynamic Scene Graph Generation?
-
Weakly Supervised Video Scene Graph Generation via Natural Language Supervision
-
Tri-modal Confluence with Temporal Dynamics for Scene Graph Generation in Operating Rooms
-
DIFFVSGG: Diffusion-Driven Online Video Scene Graph Generation
-
Towards Unbiased and Robust Spatio-Temporal Scene Graph Generation and Anticipation
-
HyperGLM: HyperGraph for Video Scene Graph Generation and Anticipation
-
SAMJAM: Zero-Shot Video Scene Graph Generation for Egocentric Kitchen Videos
-
Salient Temporal Encoding for Dynamic Scene Graph Generation
-
SAMJAM: Zero-Shot Video Scene Graph Generation for Egocentric Kitchen Videos
-
Motion-aware Contrastive Learning for Temporal Panoptic Scene Graph Generation
-
End-to-end Open-vocabulary Video Visual Relationship Detection using Multi-modal Prompting
-
CYCLO: Cyclic Graph Transformer Approach to Multi-Object Relationship Modeling in Aerial Videos
-
OED: Towards One-stage End-to-End Dynamic Scene Graph Generation
-
Action Scene Graphs for Long-Form Understanding of Egocentric Videos
-
HIG: Hierarchical Interlacement Graph Approach to Scene Graph Generation in Video Understanding
Summary
Introduces a new dataset that delves into interactivity understanding within visual content by deriving scene graph representations from dense interactivities among humans and objects.
-
Action Scene Graphs for Long-Form Understanding of Egocentric Videos
-
End-to-End Video Scene Graph Generation With Temporal Propagation Transformer
-
Cross-Modality Time-Variant Relation Learning for Generating Dynamic Scene Graphs
-
Video Scene Graph Generation from Single-Frame Weak Supervision
-
Prior Knowledge-driven Dynamic Scene Graph Generation with Causal Inference
-
Exploiting Long-Term Dependencies for Generating Dynamic Scene Graphs
-
VRDFormer: End-to-End Video Visual Relation Detection with Transformers
-
Dynamic Scene Graph Generation via Anticipatory Pre-training
-
Meta Spatio-Temporal Debiasing for Video Scene Graph Generation
-
Spatial-temporal transformer for dynamic scene graph generation
-
Target adaptive context aggregation for video scene graph generation
Given a 3D point cloud, 3D Scene Graph Generation aims to map the input point cloud to a reliable, semantically structured scene graph.
-
Open-Vocabulary Functional 3D Scene Graphs for Real-World Indoor Spaces
-
Belief Scene Graph
A utility-enhanced extension of a given incomplete scene graph
$G'$, by incorporating objects in $C$ (i.e., the object sets relevant for a robotic mission) into $G'$, using the learnt CECI (i.e., Computation of Expectation of finding objects in $C$ based on Correlation Information) information. Belief Scene Graphs enable high-level reasoning and optimized task planning involving the set $C$, which was impossible with the incomplete $G'$.

Explanation: Belief Scene Graphs (BSG) extend traditional 3D scene graphs and aim to exploit local information for efficient high-level task planning. The core of the paper is a graph-based learning method for computing "beliefs" (also called "expectations") over a 3D scene graph; these expectations are used to strategically add new nodes (called "blind nodes") that are relevant to the robotic task but have not yet been observed.
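A minimal sketch of the idea (hypothetical data structures, not the paper's implementation): an incomplete graph $G'$ is augmented with unobserved "blind" nodes, each carrying a belief/expectation score.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class BeliefNode:
    name: str
    observed: bool          # False -> a "blind node" added from learned expectations
    belief: float = 1.0     # expectation of finding this object (1.0 when observed)

@dataclass
class BeliefSceneGraph:
    nodes: Dict[str, BeliefNode] = field(default_factory=dict)
    edges: List[Tuple[str, str, str]] = field(default_factory=list)  # (src, relation, dst)

    def add_blind_node(self, name: str, anchor: str, relation: str, belief: float):
        """Add an unobserved, mission-relevant object together with its expectation score."""
        self.nodes[name] = BeliefNode(name, observed=False, belief=belief)
        self.edges.append((anchor, relation, name))

# Incomplete graph G': only a kitchen with a counter has been observed so far.
bsg = BeliefSceneGraph()
bsg.nodes["kitchen"] = BeliefNode("kitchen", observed=True)
bsg.nodes["counter"] = BeliefNode("counter", observed=True)
bsg.edges.append(("kitchen", "contains", "counter"))

# The mission needs a mug (in set C); a learned model expects one near the counter.
bsg.add_blind_node("mug", anchor="counter", relation="supports", belief=0.7)
```
-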
GaussianGraph: 3D Gaussian-based Scene Graph Generation for Open-world Scene Understanding
-
DynamicGSG: Dynamic 3D Gaussian Scene Graphs for Environment Adaptation
-
ConceptGraphs: Open-Vocabulary 3D Scene Graphs for Perception and Planning
-
Point2Graph: An End-to-end Point Cloud-based 3D Open-Vocabulary Scene Graph for Robot Navigation
-
Heterogeneous Graph Learning for Scene Graph Prediction in 3D Point Clouds
-
EchoScene: Indoor Scene Generation via Information Echo over Scene Graph Diffusion
-
Weakly-Supervised 3D Scene Graph Generation via Visual-Linguistic Assisted Pseudo-labeling
-
SGRec3D: Self-Supervised 3D Scene Graph Learning via Object-Level Scene Reconstruction
-
Open3DSG: Open-Vocabulary 3D Scene Graphs from Point Clouds with Queryable Objects and Open-Set Relationships
-
CLIP-Driven Open-Vocabulary 3D Scene Graph Generation via Cross-Modality Contrastive Learning
-
Incremental 3D Semantic Scene Graph Prediction from RGB Sequences
-
VL-SAT: Visual-Linguistic Semantics Assisted Training for 3D Semantic Scene Graph Prediction in Point Cloud
-
3D Spatial Multimodal Knowledge Accumulation for Scene Graph Prediction in Point Cloud
-
Aria Digital Twin: A New Benchmark Dataset for Egocentric 3D Machine Perception
-
Lang3DSG: Language-based contrastive pre-training for 3D Scene Graph prediction
-
SceneGraphFusion: Incremental 3D Scene Graph Prediction from RGB-D Sequences
-
Exploiting Edge-Oriented Reasoning for 3D Point-based Scene Graph Analysis
-
Learning 3D Semantic Scene Graphs from 3D Indoor Reconstructions
-
3D Scene Graph: A Structure for Unified Semantics, 3D Space, and Camera
-
EgoExOR: An Ego-Exo-Centric Operating Room Dataset for Surgical Activity Understanding
-
Learning 4D Panoptic Scene Graph Generation from Rich 2D Visual Scene
-
MM-OR: A Large Multimodal Operating Room Dataset for Semantic Understanding of High-Intensity Surgical Environments
-
RealGraph: A Multiview Dataset for 4D Real-world Context Graph Generation
-
DiscoSG: Towards Discourse-Level Text Scene Graph Parsing through Iterative Graph Refinement
-
LLM Meets Scene Graph: Can Large Language Models Understand and Generate Scene Graphs? A Benchmark and Empirical Study
-
FACTUAL: A Benchmark for Faithful and Consistent Textual Scene Graph Parsing
-
Scene Graph Parsing via Abstract Meaning Representation in Pre-trained Language Models
-
Generating Semantically Precise Scene Graphs from Textual Descriptions for Improved Image Retrieval
- Can Large Vision Language Models Read Maps like a Human?
Map Space Scene Graph (MSSG) as an indexing data structure for human-readable maps.
In this paper, we introduce MapBench, the first dataset specifically designed for human-readable, pixel-based map navigation outdoors, curated from complex path-finding scenarios. MapBench comprises over 1,600 pixel-space map path-finding problems from 100 diverse maps. In MapBench, LVLMs generate language-based navigation instructions given a map image and a query with beginning and end landmarks. For each map, MapBench provides a Map Space Scene Graph (MSSG) as an indexing data structure to convert between natural language and map space, and to evaluate VLM-generated results. We demonstrate that MapBench significantly challenges state-of-the-art LVLMs, both with zero-shot prompting and with a Chain-of-Thought (CoT) augmented reasoning framework that decomposes map navigation into sequential cognitive processes.
- Universal Scene Graph Generation
A novel representation capable of fully characterizing comprehensive semantic scenes from any given combination of modality inputs.
Scene graph (SG) representations can neatly and efficiently describe scene semantics, which has driven sustained intensive research in SG generation. In the real world, multiple modalities often coexist, with different types, such as images, text, video, and 3D data, expressing distinct characteristics. Unfortunately, current SG research is largely confined to single-modality scene modeling, preventing the full utilization of the complementary strengths of different modality SG representations in depicting holistic scene semantics. To this end, we introduce `Universal SG (USG)`, a novel representation capable of fully characterizing comprehensive semantic scenes from any given combination of modality inputs, encompassing modality-invariant and modality-specific scenes, as shown in Fig. 1. Further, we tailor a niche-targeting USG parser, USG-Par, which effectively addresses two key bottlenecks: cross-modal object alignment and out-of-domain challenges. We design USG-Par with a modular architecture for end-to-end USG generation, in which we devise an object associator to relieve the modality gap for cross-modal object alignment. Further, we propose a text-centric scene contrasting learning mechanism to mitigate domain imbalances by aligning multimodal objects and relations with textual SGs.
-
SCENIR: Visual Semantic Clarity through Unsupervised Scene Graph Retrieval
-
SceneGraphLoc: Cross-Modal Coarse Visual Localization on 3D Scene Graphs
SceneGraphLoc addresses the novel problem of localizing a query image in a database of 3D scenes represented as compact multi-modal 3D scene graphs
-
Composing Object Relations and Attributes for Image-Text Matching
-
Cross-modal Scene Graph Matching for Relationship-aware Image-Text Retrieval
-
Fine-Grained Video Captioning through Scene Graph Consolidation
-
The Devil is in the Distributions: Explicit Modeling of Scene Content is Key in Zero-Shot Video Captioning
-
Benchmarking Large Vision-Language Models via Directed Scene Graph for Comprehensive Image Captioning
-
Graph-Based Captioning: Enhancing Visual Descriptions by Interconnecting Region Captions
Introducing new dataset GBC10M
Humans describe complex scenes with compositionality, using simple text descriptions enriched with links and relationships. While vision-language research has aimed to develop models with compositional understanding capabilities, this is not yet reflected in existing datasets which, for the most part, still use plain text to describe images. In this work, we propose a new annotation strategy, graph-based captioning (GBC), that describes an image using a labelled graph structure with nodes of various types. We demonstrate that GBC can be produced automatically, using off-the-shelf multimodal LLMs and open-vocabulary detection models, by building a new dataset, GBC10M, which gathers GBC annotations for about 10M images of the CC12M dataset.
-
Cross2StrA: Unpaired Cross-lingual Image Captioning with Cross-lingual Cross-modal Structure-pivoted Alignment
-
Comprehensive Image Captioning via Scene Graph Decomposition
-
From Show to Tell: A Survey on Deep Learning-based Image Captioning
-
SurGrID: Controllable Surgical Simulation via Scene Graph to Image Diffusion
-
Neuro-Symbolic Scene Graph Conditioning for Synthetic Image Dataset Generation
-
SurGrID: Controllable Surgical Simulation via Scene Graph to Image Diffusion
-
LAION-SG: An Enhanced Large-Scale Dataset for Training Complex Image-Text Models with Structural Annotations
-
Generate Any Scene: Evaluating and Improving Text-to-Vision Generation with Scene Graph Programming
-
LAION-SG: An Enhanced Large-Scale Dataset for Training Complex Image-Text Models with Structural Annotations
-
SSGEdit: Bridging LLM with Text2Image Generative Model for Scene Graph-based Image Editing
-
SG-Adapter: Enhancing Text-to-Image Generation with Scene Graph Guidance
-
What Makes a Scene ? Scene Graph-based Evaluation and Feedback for Controllable Generation
-
Joint Generative Modeling of Scene Graphs and Images via Diffusion Models
-
Image Synthesis with Graph Conditioning: CLIP-Guided Diffusion Models for Scene Graphs
-
R3CD: Scene Graph to Image Generation with Relation-Aware Compositional Contrastive Control Diffusion
-
Scene Graph Disentanglement and Composition for Generalizable Complex Image Generation
-
Imagine that! abstract-to-intricate text-to-image synthesis with scene graph hallucination diffusion
-
SceneGenie: Scene Graph Guided Diffusion Models for Image Synthesis
-
Diffusion-Based Scene Graph to Image Generation with Masked Contrastive Pre-Training
-
OSCAR-Net: Object-centric Scene Graph Attention for Image Attribution
-
FreeQ-Graph: Free-form Querying with Semantic Consistent Scene Graph for 3D Scene Understanding
-
Seeing Beyond the Scene: Enhancing Vision-Language Models with Interactional Reasoning
-
GaussianGraph: 3D Gaussian-based Scene Graph Generation for Open-world Scene Understanding
-
A Schema-Guided Reason-while-Retrieve framework for Reasoning on Scene Graphs with Large-Language-Models (LLMs)
-
Generative Visual Commonsense Answering and Explaining with Generative Scene Graph Constructing
-
Fine-Grained Video Captioning through Scene Graph Consolidation
-
STEP: Enhancing Video-LLMs’ Compositional Reasoning by Spatio-Temporal Graph-guided Self-Training
-
SpatialRGPT: Grounded Spatial Reasoning in Vision-Language Models
-
Towards Flexible Visual Relationship Segmentation
A single model that seamlessly integrates HOI detection, scene graph generation, and referring relationships
Visual relationship understanding has been studied separately in human-object interaction (HOI) detection, scene graph generation (SGG), and referring relationships (RR) tasks. FleVRS is a single model that seamlessly integrates all three. It leverages the synergy between text and image modalities to ground various types of relationships from images, and uses textual features from vision-language models for visual conceptual understanding.
-
LLaVA-SG: Leveraging Scene Graphs as Visual Semantic Expression in Vision-Language Models
-
Multi-modal Situated Reasoning in 3D Scenes
Introducing a large-scale multimodal situated reasoning dataset, scalably collected leveraging 3D scene graphs and vision-language models (VLMs) across a diverse range of real-world 3D scenes
MSQA includes 251K situated question-answering pairs across 9 distinct question categories, covering complex scenarios and object modalities within 3D scenes. We introduce a novel interleaved multi-modal input setting in our benchmark, providing text, images, and point clouds for situation and question description, aiming to resolve the ambiguity of describing situations with single-modality inputs (e.g., text).
-
SOK-Bench: A Situated Video Reasoning Benchmark with Aligned Open-World Knowledge
-
VQA-GNN: Reasoning with Multimodal Knowledge via Graph Neural Networks for Visual Question Answering
-
Semantic Compositions Enhance Vision-Language Contrastive Learning
-
Compositional Chain-of-Thought Prompting for Large Multimodal Models
-
Dysen-VDM: Empowering Dynamics-aware Text-to-Video Diffusion with LLMs
-
The All-Seeing Project V2: Towards General Relation Comprehension of the Open World
New dataset and New Task (Relation Conversation)
We propose a novel task, termed Relation Conversation (ReC), which unifies the formulation of text generation, object localization, and relation comprehension. Based on the unified formulation, we construct the AS-V2 dataset, which consists of 127K high-quality relation conversation samples, to unlock the ReC capability for Multi-modal Large Language Models (MLLMs).
-
The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World
New dataset and a unified vision-language model for open-world panoptic visual recognition and understanding
We propose a new large-scale dataset (AS-1B) for open-world panoptic visual recognition and understanding, using an economical semi-automatic data engine that combines the power of off-the-shelf vision/language models and human feedback. Moreover, we develop a unified vision-language foundation model (ASM) for open-world panoptic visual recognition and understanding. Aligning with LLMs, our ASM supports versatile image-text retrieval and generation tasks, demonstrating impressive zero-shot capability.
-
Cross-modal Attention Congruence Regularization for Vision-Language Relation Alignment
-
Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs
-
Fine-Grained Semantically Aligned Vision-Language Pre-Training
-
ERNIE-ViL: Knowledge Enhanced Vision-Language Representations through Scene Graphs
-
M3S: Scene Graph Driven Multi-Granularity Multi-Task Learning for Multi-Modal NER
-
Information screening whilst exploiting! multimodal relation extraction with feature denoising and multimodal topic modeling
-
Multimodal Relation Extraction with Efficient Graph Alignment
-
LLM-driven Indoor Scene Layout Generation via Scaled Human-aligned Data Synthesis and Multi-Stage Preference Optimization
-
Towards Terrain-Aware Task-Driven 3D Scene Graph Generation in Outdoor Environments
-
MMGDreamer: Mixed-Modality Graph for Geometry-Controllable 3D Indoor Scene Generation
-
Toward Scene Graph and Layout Guided Complex 3D Scene Generation
-
LAYOUTDREAMER: Physics-guided Layout for Text-to-3D Compositional Scene Generation
-
PhiP-G: Physics-Guided Text-to-3D Compositional Scene Generation
-
CAST: Component-Aligned 3D Scene Reconstruction from an RGB Image
-
EchoScene: Indoor Scene Generation via Information Echo over Scene Graph Diffusion
-
INSTRUCTLAYOUT: Instruction-Driven 2D and 3D Layout Synthesis with Semantic Graph Prior
-
Compositional 3D Scene Synthesis with Scene Graph Guided Layout-Shape Generation
-
GraphDreamer: Compositional 3D Scene Synthesis from Scene Graphs
-
CommonScenes: Generating Commonsense 3D Indoor Scenes with Scene Graph Diffusion
-
Graph-to-3D: End-to-End Generation and Manipulation of 3D Scenes Using Scene Graphs
-
Reefknot: A Comprehensive Benchmark for Relation Hallucination Evaluation, Analysis and Mitigation in Multimodal Large Language Models
Introducing a benchmark based on a scene graph dataset
Specifically, we first provide a systematic definition of relation hallucinations, integrating perspectives from perceptive and cognitive domains. Furthermore, we construct the relation-based corpus utilizing the representative scene graph dataset Visual Genome (VG), from which semantic triplets follow real-world distributions.
-
BACON: Supercharge Your VLM with Bag-of-Concept Graph to Mitigate Hallucinations
-
Mitigating Hallucination in Visual Language Models with Visual Supervision
-
Imagine, Verify, Execute: Memory-Guided Agentic Exploration with Vision-Language Models
-
Visual Environment-Interactive Planning for Embodied Complex-Question Answering
-
FunGraph: Functionality Aware 3D Scene Graphs for Language-Prompted Scene Interaction
-
Domain-Conditioned Scene Graphs for State-Grounded Task Planning
-
SG-Nav: Online 3D Scene Graph Prompting for LLM-based Zero-shot Object Navigation
-
Open Vocabulary 3D Scene Understanding via Geometry Guided Self-Distillation
-
Semantic Abstraction: Open-World 3D Scene Understanding from 2D Vision-Language Models
-
Hierarchical Open-Vocabulary 3D Scene Graphs for Language-Grounded Robot Navigation
-
VeriGraph: Scene Graphs for Execution Verifiable Robot Planning
-
LLM-enhanced Scene Graph Learning for Household Rearrangement
household rearrangement
The household rearrangement task involves spotting misplaced objects in a scene and accommodating them in proper places.
-
Situational Instructions Database: Task Guidance in Dynamic Environments
Situational Instructions Database (SID)
Situational Instructions Database (SID) is a dataset for dynamic task guidance. It contains situationally-aware instructions for performing a wide range of everyday tasks or completing scenarios in 3D environments. The dataset provides step-by-step instructions for these scenarios which are grounded in the context of the situation. This context is defined through a scenario-specific scene graph that captures the objects, their attributes, and their relations in the environment. The dataset is designed to enable research in the areas of grounded language learning, instruction following, and situated dialogue.
-
RoboHop: Segment-based Topological Map Representation for Open-World Visual Navigation
-
LLM-Personalize: Aligning LLM Planners with Human Preferences via Reinforced Self-Training for Housekeeping Robots
- Zero-shot Referring Expression Comprehension via Structural Similarity Between Images and Captions
A triplet-matching objective to fine-tune vision-language alignment (VLA) models.
To mitigate this gap, we leverage large foundation models to disentangle both images and texts into triplets in the format of (subject, predicate, object). After that, grounding is accomplished by calculating the structural similarity matrix between visual and textual triplets with a VLA model, and subsequently propagating it to an instance-level similarity matrix. Furthermore, to equip VLA models with the ability of relationship understanding, we design a triplet-matching objective to fine-tune the VLA models on a collection of curated datasets containing abundant entity relationships.
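A rough sketch of the structural-matching idea (the `embed` function stands in for any text encoder and is an assumption, as is the mean-pooling of triplet parts; this is not the paper's implementation):

```python
import numpy as np

def triplet_similarity_matrix(visual_triplets, text_triplets, embed):
    """
    visual_triplets / text_triplets: lists of (subject, predicate, object) strings.
    embed: any function mapping a phrase to a 1-D numpy vector (assumed placeholder).
    Returns an (n_visual, n_text) structural similarity matrix.
    """
    def triplet_vec(t):
        # Pool the embeddings of subject, predicate, and object (assumed aggregation).
        return np.mean([embed(part) for part in t], axis=0)

    v = np.stack([triplet_vec(t) for t in visual_triplets])
    t = np.stack([triplet_vec(t) for t in text_triplets])
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    t /= np.linalg.norm(t, axis=1, keepdims=True)
    return v @ t.T  # cosine similarity between every visual/textual triplet pair
```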
- A traffic scene graph is constructed through spatial location rules. For example, BEV rules can be used to derive triplets such as `ego-vehicle, isin, lane-3`, `ego-vehicle, to right of, pedestrian-1`, and `ego-vehicle, very near, pedestrian-2`.
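A toy sketch of how such spatial-location rules might derive triplets from bird's-eye-view coordinates; the axis convention and the 5 m threshold are purely illustrative assumptions, not any dataset's actual rules.

```python
import math

def bev_triplets(a_name, a_xy, b_name, b_xy, near_thr=5.0):
    """
    Derive simple (subject, predicate, object) triplets for two agents from
    bird's-eye-view (BEV) map coordinates. The lateral axis is assumed to be x,
    and the 'very near' threshold of 5 m is illustrative only.
    """
    dx = a_xy[0] - b_xy[0]                      # lateral offset of a relative to b
    dist = math.hypot(dx, a_xy[1] - b_xy[1])    # Euclidean distance in the BEV plane
    triplets = [(a_name, "to right of" if dx > 0 else "to left of", b_name)]
    if dist < near_thr:
        triplets.append((a_name, "very near", b_name))
    return triplets

print(bev_triplets("ego-vehicle", (2.0, 0.0), "pedestrian-1", (0.0, 1.0)))
# [('ego-vehicle', 'to right of', 'pedestrian-1'), ('ego-vehicle', 'very near', 'pedestrian-1')]
```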
-
HKTSG: A Hierarchical Knowledge-Guided Traffic Scene Graph Representation Learning Framework for Intelligent Vehicles
-
Edge Feature-Enhanced Network for Collision Risk Assessment Using Traffic Scene Graphs
-
Learning from interaction-enhanced scene graph for pedestrian collision risk assessment
-
Toward driving scene understanding: A paradigm and benchmark dataset for ego-centric traffic scene graph representation
-
roadscene2vec: A tool for extracting and embedding road scene-graphs
-
Scene-Graph Augmented Data-Driven Risk Assessment of Autonomous Vehicle Decisions
-
A Review and Efficient Implementation of Scene Graph Generation Metrics
-
Semantic Similarity Score for Measuring Visual Similarity at Semantic Level
Here, we provide some toolkits for parsing scene graphs and other useful tools for reference.
-
A new benchmark for the task of Scene Graph Generation
This new codebase provides an up-to-date and easy-to-run implementation of common approaches in the field of Scene Graph Generation. You are welcome to try it out and contribute to this codebase.
-
2nd Workshop on Scene Graphs and Graph Representation Learning
-
First ICCV Workshop on Scene Graphs and Graph Representation Learning
[paper_list]