
RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics

🌐 Homepage | 📖 arXiv | 📂 Benchmark | 📊 Evaluation

✨ CVPR 2025 (Oral) ✨

Authors: Chan Hee Song¹, Valts Blukis², Jonathan Tremblay², Stephen Tyree², Yu Su¹, Stan Birchfield²

¹The Ohio State University   ²NVIDIA


🔔 News

  • 🔥[2025-04-24]: Released the RoboSpatial data generation pipeline, RoboSpatial-Home dataset, and evaluation script!

Project Components:

This repository contains the code for generating the spatial annotations used in the RoboSpatial dataset. Additional components are coming up!


RoboSpatial Annotation Generation

This codebase generates rich spatial annotations for 3D scan datasets. While initially built using the EmbodiedScan conventions, it is designed to be extensible to other data formats through custom data loaders (see Data Loader Documentation). It extracts various spatial relationships from image data and associated 3D information, including:

  • Object Grounding: Locating objects mentioned in text within the image.
  • Spatial Context: Identifying points in empty space relative to objects (e.g., "in front of the chair").
  • Spatial Configuration: Describing the relative arrangement of multiple objects (e.g., "the chair is next to the table").
  • Spatial Compatibility: Determining if an object could fit in a specific location.

The generated annotations are saved in JSON format, one file per image.
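To get a quick look at what one of these files contains, a snippet like the following loads a single generated file and lists its top-level keys (the path here is a placeholder; substitute any file produced under data_generation.output_dir):

import json

# Placeholder path; point this at any generated <image_name>.annotations.json file.
ann_path = "output/example_dataset/example_scene/example_image.jpg.annotations.json"
with open(ann_path) as f:
    annotations = json.load(f)

# List the top-level annotation categories stored for this image.
for key, value in annotations.items():
    count = len(value) if hasattr(value, "__len__") else value
    print(f"{key}: {count}")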

Prerequisites

  1. Python Environment: Ensure you have a Python (3.8+) environment set up (e.g., using conda or venv). Required packages can be installed via pip install -r requirements.txt.
  2. Datasets: You need access to the 3D scan datasets you intend to process.
  • Note: For specific instructions on downloading and setting up the EmbodiedScan dataset, please refer to the guide in data/README.md.
  3. Configuration: The main configuration file (e.g., robospatial/configs/embodiedscan.yaml) needs to be updated with paths relevant to your chosen data loader and dataset:
    • data_loading.loader_class: Specifies the Python class for your data loader (e.g., data_loader.embodiedscan_loader.EmbodiedScanLoader).
    • Dataset-specific paths (e.g., image_root, format-specific annotation files like embodiedscan_ann). Consult the configuration file and your data loader's requirements. See Data Loader Documentation for more details on adding custom formats.
    • data_generation.output_dir: The directory where the generated .annotations.json files will be saved.
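As a rough illustration of how these settings fit together, the sketch below loads a config with PyYAML and resolves the data_loading.loader_class string into a Python class via importlib. It mirrors the configuration keys described above but is not the repository's actual loading code:

import importlib
import yaml

# Load the YAML configuration (path shown for the example config, relative to robospatial/).
with open("configs/example_dataset.yaml") as f:
    cfg = yaml.safe_load(f)

# e.g. "data_loader.embodiedscan_loader.EmbodiedScanLoader"
loader_path = cfg["data_loading"]["loader_class"]
module_name, class_name = loader_path.rsplit(".", 1)

# Import the module and fetch the loader class by name.
LoaderClass = getattr(importlib.import_module(module_name), class_name)
print("Resolved loader:", LoaderClass)
print("Output directory:", cfg["data_generation"]["output_dir"])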

Running Annotation Generation

The core script for generating annotations is robospatial/run_generation.py.

Running with Provided Example Data (Recommended First Step):

We provide a small example scene with input annotations and images in the example_data/ directory. This allows you to test the generation pipeline without downloading large datasets.

  1. Navigate to the robospatial directory:
    cd robospatial
  2. Run the generation script:
    python run_generation.py --config configs/example_dataset.yaml
    This will process only the example scene defined in example_dataset.yaml and generate the annotations in the example_data/example_qa folder.

Running on Full Datasets:

Once you have confirmed the example works and have downloaded your target datasets:

  1. Configure your data loader: Ensure the data_loading section in your chosen configuration file (e.g., configs/example_dataset.yaml) correctly points to your dataset paths and uses the appropriate loader_class.
  2. Run the script:
    cd robospatial
    python run_generation.py --config configs/your_chosen_config.yaml

This command will process all scenes found by the data loader using the settings defined in your_chosen_config.yaml.

Command-Line Options:

  • --config <path>: (Required) Specifies the path to the YAML configuration file.
  • --scene <dataset/scene_id>: Process only a single specific scene.
    python run_generation.py --config configs/embodiedscan.yaml --scene "scannet/scene0191_00"
  • --image <image_basename>: Process only a single specific image within the specified scene (requires --scene). Useful for debugging.
    python run_generation.py --config configs/embodiedscan.yaml --scene "scannet/scene0191_00" --image "00090.jpg"
  • --range <start_idx> <end_idx>: Process a specific range of scenes based on their index in the loaded list (inclusive start, inclusive end).
    python run_generation.py --config configs/embodiedscan.yaml --range 0 10 # Process first 11 scenes
  • --num_workers <int>: Specify the number of parallel worker threads to use for processing scenes. Overrides the num_workers setting in the config file. Defaults to min(os.cpu_count(), 4) if neither is provided.
    python run_generation.py --config configs/embodiedscan.yaml --num_workers 8
  • --dry-run: Process only the first 5 images of each scene. Useful for quickly testing the pipeline.
    python run_generation.py --config configs/embodiedscan.yaml --dry-run
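For reference, the options above map onto a standard argparse parser roughly as sketched below; this is illustrative only, not the actual parser in run_generation.py:

import argparse
import os

parser = argparse.ArgumentParser(description="RoboSpatial annotation generation (sketch)")
parser.add_argument("--config", required=True, help="Path to the YAML configuration file")
parser.add_argument("--scene", help="Single scene to process, e.g. 'scannet/scene0191_00'")
parser.add_argument("--image", help="Single image within --scene (debugging)")
parser.add_argument("--range", nargs=2, type=int, metavar=("START_IDX", "END_IDX"),
                    help="Inclusive range of scene indices to process")
parser.add_argument("--num_workers", type=int, help="Parallel workers; overrides the config value")
parser.add_argument("--dry-run", action="store_true", help="Process only the first 5 images per scene")
args = parser.parse_args()

# The flag overrides any config value (not modeled here); the documented
# fallback when neither is provided is min(os.cpu_count(), 4).
num_workers = args.num_workers if args.num_workers is not None else min(os.cpu_count() or 1, 4)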

Visualizing Input/Outputs

Two scripts are provided in the scripts/ directory for visualizing inputs/outputs:

1. Visualizing Input Data (scripts/visualize_input.py)

Use this script to check if your input annotations (e.g., 3D bounding boxes from your dataset's original format, after conversion by your data loader) are being loaded and interpreted correctly. It reads the intermediate JSON format produced by the data loader for a single image and overlays the 3D bounding boxes onto the image.

Usage:

python scripts/visualize_input.py \
    --image_path <path_to_specific_image.jpg> \
    --annotation_file <path_to_intermediate_json_for_image>
  • Replace <path_to_specific_image.jpg> with the direct path to the image file.
  • Replace <path_to_intermediate_json_for_image> with the path to the JSON file representing the input annotations for that image (this file's location and naming depend on your data loader implementation).

Example using the provided example data:

python scripts/visualize_input.py \
    --image_path example_data/images/example_dataset/example_scene/example_image.jpg \
    --annotation_file example_data/annotations/example_input.json
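Under the hood, this kind of visualization boils down to projecting 3D box corners through the camera intrinsics and drawing the box edges. The helper below is a minimal, generic sketch of that step (function and argument names are illustrative, not the script's actual API):

import cv2
import numpy as np

def draw_box_3d(image, corners_cam, K, color=(0, 255, 0)):
    # corners_cam: (8, 3) box corners in camera coordinates, bottom face first (0-3), then top (4-7).
    # K: (3, 3) pinhole intrinsics.
    uv = (K @ corners_cam.T).T              # homogeneous pixel coordinates [u*Z, v*Z, Z]
    uv = uv[:, :2] / uv[:, 2:3]             # perspective divide -> (u, v)
    edges = [(0, 1), (1, 2), (2, 3), (3, 0),    # bottom face
             (4, 5), (5, 6), (6, 7), (7, 4),    # top face
             (0, 4), (1, 5), (2, 6), (3, 7)]    # vertical edges
    for i, j in edges:
        cv2.line(image, tuple(map(int, uv[i])), tuple(map(int, uv[j])), color, 2)
    return image

# Example: a typical pinhole intrinsics matrix (values are placeholders).
K = np.array([[525.0, 0.0, 319.5],
              [0.0, 525.0, 239.5],
              [0.0, 0.0, 1.0]])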

2. Visualizing Generated Output (scripts/visualize_output.py)

Use this script to debug and inspect the spatial relationships generated by run_generation.py. It reads the final .annotations.json file for a specific image and allows you to visualize different types of generated annotations, including object grounding and spatial relationships (context, configuration, compatibility).

Usage:

python scripts/visualize_output.py \
    --image_path <path_to_specific_image.jpg> \
    --annotation_file <path_to_output_dir>/<dataset>/<scene_id>/<image_name>.annotations.json \
    --object_3d_grounding \
    --context
  • Replace <path_to_specific_image.jpg> with the direct path to the image file.
  • Replace <path_to_output_dir> with the path used in your configuration's data_generation.output_dir.
  • Adjust <dataset>, <scene_id>, and <image_name> to match the specific output file you want to visualize.
  • Include flags like --object_2d_grounding, --object_3d_grounding, --context, --configuration, or --compatibility to select what to visualize. Use the --verbose or -v flag for more detailed output. Refer to the script's internal documentation (--help) for detailed controls and options.

Example using the provided example data (run the generation first):

python scripts/visualize_output.py \
    --image_path example_data/images/example_dataset/example_scene/example_image.jpg \
    --annotation_file example_data/example_qa/example_scene/example_image.jpg.annotations.json \
    --object_3d_grounding \
    --context

Data Loader Documentation

This project supports adding custom data loaders to handle different 3D dataset formats. The configuration file (data_loading.loader_class) specifies which loader to use.

For detailed instructions on the expected interface for a data loader and how to implement your own, please refer to the README within the data loader directory: robospatial/data_loader/README.md
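For orientation before reading that README: a custom loader is simply a Python class referenced by data_loading.loader_class. The skeleton below is purely hypothetical; the method names and return formats are placeholders, and the real required interface is defined in robospatial/data_loader/README.md:

class MyDatasetLoader:
    # Hypothetical loader skeleton; see robospatial/data_loader/README.md for the real interface.

    def __init__(self, config):
        # Read dataset-specific paths (e.g. image_root) from the config.
        self.image_root = config["image_root"]

    def list_scenes(self):
        # Placeholder: yield identifiers for every scene the pipeline should process.
        raise NotImplementedError

    def load_scene(self, scene_id):
        # Placeholder: return per-image data (image paths, camera parameters,
        # and 3D bounding boxes) converted into the pipeline's expected format.
        raise NotImplementedError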

Project Structure

For a detailed explanation of the annotation generation logic and hyperparameters within the spatial_analysis modules, please refer to the robospatial/README.md.

  • robospatial/: Main source code directory.
    • configs/: Contains YAML configuration files (e.g., example_dataset.yaml, embodiedscan.yaml).
    • data_loader/: Contains modules for loading and interfacing with different 3D datasets. Includes examples like embodiedscan_loader.py and can be extended with custom loaders. See the README in this directory for details.
    • spatial_analysis/: Modules performing the core spatial reasoning and annotation generation logic.
    • annotation_generator.py: Orchestrates the generation process for a single scene by calling functions from spatial_analysis.
    • run_generation.py: Main script to run the annotation generation across datasets/scenes based on configuration.

Output Files

  • <output_dir>/<dataset>/<scene_id>/<image_name>.annotations.json: The primary output. Contains the generated spatial annotations for a single image.
  • generation_progress.json: Stores a list of scenes that have been successfully processed. This allows the script to resume if interrupted. Located in the directory where run_generation.py is executed.
  • generation_stats.json: Contains aggregated statistics about the generated annotations (e.g., counts of each annotation type) overall and per-dataset. Located in the directory where run_generation.py is executed.
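For example, assuming generation_progress.json is stored as a plain JSON list of scene identifiers (as the description above suggests), you can check how far a previous run got with a few lines:

import json

# Assumes the progress file is a JSON list of completed scene identifiers.
with open("generation_progress.json") as f:
    completed = json.load(f)
print(f"{len(completed)} scenes completed; the next run will skip these and resume.")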

Acknowledgements

We thank the authors of EmbodiedScan for providing their unified annotations for various 3D scan datasets, which served as the foundation for this project's data loading capabilities.

Contact

Citation

BibTeX:

@inproceedings{song2025robospatial,
  author    = {Song, Chan Hee and Blukis, Valts and Tremblay, Jonathan and Tyree, Stephen and Su, Yu and Birchfield, Stan},
  title     = {{RoboSpatial}: Teaching Spatial Understanding to {2D} and {3D} Vision-Language Models for Robotics},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2025},
  note      = {Oral Presentation},
}
