We introduce PixCuboid, an optimization-based approach to cuboid-shaped room layout estimation built on multi-view alignment of dense deep features.
This repository contains the official implementation of the paper PixCuboid: Room Layout Estimation from Multi-view Featuremetric Alignment, to be presented at the ICCV 2025 Workshop on Large Scale Cross Device Localization.
Project page: https://ghanning.github.io/PixCuboid/
PixCuboid is built upon the excellent PixLoc code base. The PixLoc master branch is available in this repository under the name pixloc.
Install PixCuboid in editable mode as follows:
git clone https://github.com/ghanning/PixCuboid.git
cd PixCuboid/
virtualenv venv
source venv/bin/activate
pip install -e .
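To quickly verify the install, import the package from inside the virtual environment (a minimal sanity check, not part of the official setup):

```python
# Check that the editable install of the pixloc package is picked up.
import pixloc
print(pixloc.__file__)  # should point into the cloned PixCuboid/ directory
```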
Running the demo notebooks requires some extra dependencies that can be installed with:
pip install -e .[extra]
Download the ScanNet++ and 2D-3D-Semantics datasets from their respective websites and unpack them into a subdirectory named datasets. The expected directory structure is shown below.
.
└── datasets
    ├── 2d3ds
    │   ├── area_1
    │   ├── area_2
    │   ├── area_3
    │   ├── area_4
    │   ├── area_5a
    │   ├── area_5b
    │   └── area_6
    └── scannetpp
        ├── data
        ├── metadata
        └── splits
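As a quick check that everything landed in the right place, you can verify the layout from Python (a minimal sketch based on the tree above):

```python
# Verify the expected dataset layout before any preprocessing.
from pathlib import Path

for sub in ['scannetpp/data', 'scannetpp/metadata', 'scannetpp/splits',
            '2d3ds/area_1', '2d3ds/area_5a', '2d3ds/area_6']:
    path = Path('datasets') / sub
    print(f'{path}: {"OK" if path.is_dir() else "MISSING"}')
```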
Note: We only use ScanNet++ to train PixCuboid, but also provide code to run room layout estimation on 2D-3D-Semantics.
Use the ScanNet++ Toolbox to undistort the DSLR fisheye images by following the instructions here.
Note: As of April 30, 2025, undistorted DSLR images are included in the ScanNet++ dataset and this step can thus be skipped.
Render depth maps for the undistorted DSLR images using the render-undistorted branch in my fork of the ScanNet++ Toolbox as described here, but set render_undistorted to True.
Run our preprocessing script to find the 2D-3D point correspondences used in training:
python -m pixloc.pixlib.preprocess_scannetpp
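For context, each 2D-3D correspondence pairs a 3D scene point with its pixel projection in a posed image. The sketch below illustrates the standard pinhole projection; it is purely illustrative and not the preprocessing code itself:

```python
# Illustrative pinhole projection of a 3D point into a posed image.
import numpy as np

def project(X, R, t, K):
    """Project world point X with extrinsics (R, t) and intrinsics K."""
    Xc = R @ X + t         # world -> camera coordinates
    uv = K @ (Xc / Xc[2])  # perspective division, then intrinsics
    return uv[:2]          # 2D pixel coordinates

K = np.array([[600., 0., 320.],
              [0., 600., 240.],
              [0., 0., 1.]])
print(project(np.array([1.0, 0.5, 3.0]), np.eye(3), np.zeros(3), K))
```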
Split the panorama images into perspective views as detailed here.
While line segments are not required to train PixCuboid, they improve its performance at inference time. To extract line segments with DeepLSD, first install it with
pip install -e .[deeplsd]
then download the pre-trained weights
mkdir weights
wget https://cvg-data.inf.ethz.ch/DeepLSD/deeplsd_md.tar -O weights/deeplsd_md.tar
and run the extraction for ScanNet++ and 2D-3D-Semantics:
./scripts/line_segments_scannetpp.sh
./scripts/line_segments_2d3ds.sh
Alternatively, you can download the line segments for ScanNet++ from here (665 MiB) and unpack them with the command
unzip line_segments_scannetpp.zip -d datasets/scannetpp
Similarly, the line segments for 2D-3D-Semantics are available here (8 MiB). Unzip with
unzip line_segments_2d3ds.zip -d datasets/2d3ds
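To sanity-check the DeepLSD setup above on a single image, the snippet below follows DeepLSD's quickstart; the image path is a placeholder and the detection parameters are defaults to adapt:

```python
# Single-image line segment extraction with DeepLSD.
import cv2
import torch
from deeplsd.models.deeplsd_inference import DeepLSD

device = 'cuda' if torch.cuda.is_available() else 'cpu'
conf = {
    'detect_lines': True,
    'line_detection_params': {'merge': False, 'filtering': True,
                              'grad_thresh': 3, 'grad_nfa': True},
}
ckpt = torch.load('weights/deeplsd_md.tar', map_location='cpu')
net = DeepLSD(conf)
net.load_state_dict(ckpt['model'])
net = net.to(device).eval()

gray = cv2.imread('example.jpg', cv2.IMREAD_GRAYSCALE)  # placeholder path
inputs = {'image': torch.tensor(gray, dtype=torch.float,
                                device=device)[None, None] / 255.}
with torch.no_grad():
    lines = net(inputs)['lines'][0]  # (N, 2, 2) array of segment endpoints
print(f'Detected {len(lines)} line segments')
```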
Training is done in two stages. First, the edge detector is pre-trained by running:
python -m pixloc.pixlib.train --conf pixloc/pixlib/configs/pretrain_pixcuboid_scannetpp.yaml pixcuboid_scannetpp_pretrain
Next, the full network is trained, with weights initialized from the previous stage:
python -m pixloc.pixlib.train --conf pixloc/pixlib/configs/train_pixcuboid_scannetpp.yaml pixcuboid_scannetpp train.load_experiment=pixcuboid_scannetpp_pretrain
Tip: Pass the --wandb_project <PROJECT> argument to the training script to log the results to Weights & Biases.
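For example, to log the second training stage (replace <PROJECT> with your W&B project name):
python -m pixloc.pixlib.train --conf pixloc/pixlib/configs/train_pixcuboid_scannetpp.yaml --wandb_project <PROJECT> pixcuboid_scannetpp train.load_experiment=pixcuboid_scannetpp_pretrain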
We supply a script to run PixCuboid on each image tuple (ScanNet++) or space (2D-3D-Semantics) and output the room layout predictions to a JSON file.
python -m pixloc.run_PixCuboid --experiment pixcuboid_scannetpp --conf pixloc/pixlib/configs/eval_pixcuboid_scannetpp.yaml --split {train,val,test} --output OUTPUT
python -m pixloc.run_PixCuboid --experiment pixcuboid_scannetpp --conf pixloc/pixlib/configs/eval_pixcuboid_2d3ds.yaml --split test --output OUTPUT
The resulting predictions can be evaluated using the code in the MultiViewCuboid repository.
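For a quick look at the predictions before evaluating them, load the JSON file in Python; the exact schema is not documented here, so just inspect the top-level structure:

```python
# Inspect the room layout predictions written by run_PixCuboid.
import json

with open('OUTPUT') as f:  # the path passed via --output
    predictions = json.load(f)
print(type(predictions).__name__, len(predictions))
```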
Pre-trained weights for a model trained on ScanNet++ as outlined above can be found here (317 MiB). Extract the checkpoint with
mkdir -p outputs/training && unzip pixcuboid_scannetpp.zip -d outputs/training
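The checkpoint can then be loaded with PyTorch for inspection. The filename below follows PixLoc's checkpoint naming convention and is an assumption, so check the unpacked directory:

```python
# Peek inside the pre-trained checkpoint (filename is an assumption).
import torch

ckpt = torch.load('outputs/training/pixcuboid_scannetpp/checkpoint_best.tar',
                  map_location='cpu')
print(list(ckpt.keys()))
```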
Try out PixCuboid on ScanNet++ and 2D-3D-Semantics with the Jupyter notebook demo_PixCuboid.ipynb.
We show how the method can be applied to your own data (e.g. a set of images from a COLMAP reconstruction) in the notebook PixCuboid_COLMAP.ipynb.
Use the BibTeX reference below to cite our work.
@inproceedings{hanning2025pixcuboid,
  title={{PixCuboid: Room Layout Estimation from Multi-view Featuremetric Alignment}},
  author={Hanning, Gustav and Åström, Kalle and Larsson, Viktor},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
  year={2025},
}
In addition, please consider citing the PixLoc paper:
@inproceedings{sarlin21pixloc,
  title={{Back to the Feature: Learning Robust Camera Localization from Pixels to Pose}},
  author={Paul-Edouard Sarlin and Ajaykumar Unagar and Måns Larsson and Hugo Germain and Carl Toft and Viktor Larsson and Marc Pollefeys and Vincent Lepetit and Lars Hammarstrand and Fredrik Kahl and Torsten Sattler},
  booktitle={CVPR},
  year={2021},
}