Yang You, Kai Xiong, Zhening Yang, Zhengxiang Huang, Junwei Zhou, Ruoxi Shi, Zhou Fang, Adam W Harley, Leonidas Guibas, Cewu Lu
PACE (Pose Annotations in Cluttered Environments) is a large-scale benchmark designed to advance pose estimation in challenging, cluttered scenarios. PACE provides comprehensive real-world and simulated datasets for both instance-level and category-level tasks, featuring:
- 55K frames with 258K annotations across 300 videos
- 238 objects from 43 categories (rigid and articulated)
- An innovative annotation system using a calibrated 3-camera setup
- PACESim: 100K photo-realistic simulated frames with 2.4M annotations across 931 objects
We evaluate state-of-the-art algorithms on PACE for both pose estimation and object pose tracking, highlighting the benchmark's challenges and research opportunities.
- PACE rigorously tests the generalization of state-of-the-art methods in complex, real-world environments, enabling exploration and quantification of the 'simulation-to-reality' gap for practical applications.
- Try our latest pose estimator CPPF++ (TPAMI), which achieves state-of-the-art performance on PACE.
- 2024/07/22: PACE v1.1 uploaded to HuggingFace. Benchmark evaluation code released.
- 2024/03/01: PACE v1.0 released.
- Dataset Download
- Dataset Format
- Dataset Visualization
- Benchmark Evaluation
- Annotation Tools
- License
- Citation
Download the dataset from HuggingFace. Unzip all `tar.gz` files and place them under `dataset/pace` for evaluation. Large files are split into chunks; merge them first with, e.g., `cat test_chunk_* > test.tar.gz`.
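For reference, the merge-and-extract step can also be done in Python. This is a minimal sketch assuming the `test_chunk_*` naming used above; adjust the file names and output directory to your setup:

```python
import glob
import shutil
import tarfile

# Merge split archives (e.g., test_chunk_aa, test_chunk_ab, ...) back into a
# single tar.gz, then extract it into dataset/pace.
with open("test.tar.gz", "wb") as merged:
    for chunk in sorted(glob.glob("test_chunk_*")):
        with open(chunk, "rb") as f:
            shutil.copyfileobj(f, merged)

with tarfile.open("test.tar.gz", "r:gz") as tar:
    tar.extractall("dataset/pace")
```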
PACE follows the BOP format with the following structure (regex syntax):
```
camera_pbr.json
models(_eval|_nocs)?
├─ models_info.json
├─ (artic_info.json)?
├─ obj_${OBJ_ID}.ply
model_splits
├─ category
│  ├─ ${category}_(train|val|test).txt
│  ├─ (train|val|test).txt
├─ instance
│  ├─ (train|val|test).txt
(train(_pbr_cat|_pbr_inst)|val(_inst|_pbr_cat)|test)
├─ ${SCENE_ID}
│  ├─ scene_camera.json
│  ├─ scene_gt.json
│  ├─ scene_gt_info.json
│  ├─ scene_gt_coco_det_modal(_partcat|_inst)?.json
│  ├─ depth
│  ├─ mask
│  ├─ mask_visib
│  ├─ rgb
│  ├─ (rgb_nocs)?
```
Key components:

- `camera_pbr.json`: Camera parameters for PBR rendering; real camera parameters are stored in each scene's `scene_camera.json`.
- `models(_eval|_nocs)?`: 3D object models. `models` contains the original scanned meshes; `models_eval` contains uniformly sampled point clouds for evaluation (e.g., Chamfer distance); all models (except articulated parts, IDs 545–692) are recentered and normalized to a unit bounding box; `models_nocs` recolors vertices by their NOCS coordinates.
- `models_info.json`: Mesh metadata (diameter, bounds, scales in mm) and the mapping from `obj_id` to object `identifier`. Articulated objects have multiple parts, each with its own `obj_id`; their associations are stored in `artic_info.json`.
- `artic_info.json`: Part information for articulated objects, keyed by `identifier`.
- `obj_${OBJ_ID}.ply`: Mesh file for object `${OBJ_ID}`.
- `model_splits`: Model IDs for the train/val/test splits. Instance-level splits share IDs; category-level splits differ per category.
- `train(_pbr_cat|_pbr_inst)|val(_inst|_pbr_cat)|test`: Synthetic and real data for category- and instance-level training and validation, plus real-world test data for both.
- `${SCENE_ID}`: Each scene is stored in a separate folder (e.g., `000011`).
- `scene_camera.json`: Camera parameters.
- `scene_gt.json`: Ground-truth pose annotations (BOP format); see the loading sketch after this list.
- `scene_gt_info.json`: Meta information about ground-truth poses (BOP format).
- `scene_gt_coco_det_modal(_partcat|_inst)?.json`: 2D bounding boxes and instance segmentation in COCO format. `scene_gt_coco_det_modal_partcat.json` treats articulated parts as separate categories (for category-level evaluation); `scene_gt_coco_det_modal_inst.json` treats each object instance as a separate category (for instance-level evaluation). Note: there may be more categories than reported in the paper, as some objects appear only in synthetic data.
- `rgb`: Color images.
- `rgb_nocs`: Normalized object coordinates stored as RGB (mapped from `[-1, 1]` to `[0, 1]`), normalized w.r.t. the object bounding box (see this paper for the disambiguation method). Example normalization:

  ```python
  import numpy as np
  import trimesh

  # Load the object mesh and recenter it at its bounding-box center.
  mesh = trimesh.load_mesh(ply_fn)  # ply_fn: path to an obj_${OBJ_ID}.ply file
  bbox = mesh.bounds
  center = (bbox[0] + bbox[1]) / 2
  mesh.apply_translation(-center)

  # Scale by the largest extent and shift from [-0.5, 0.5] to [0, 1].
  extent = bbox[1] - bbox[0]
  colors = np.array(mesh.vertices) / extent.max()
  colors = np.clip(colors + 0.5, 0., 1.)
  ```

- `depth`: 16-bit depth images. Convert to meters by dividing by 10,000 (PBR) or 1,000 (real).
- `mask`: Object masks.
- `mask_visib`: Visible-part masks.
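Since the per-scene annotations follow the BOP conventions, a single frame can be loaded roughly as sketched below. This is a minimal sketch, not the official loader: the scene folder, frame key, 6-digit file naming, and the `cam_K`/`cam_R_m2c`/`cam_t_m2c` field names assume the standard BOP schema, and `imageio` is just one of several libraries that can read the 16-bit depth PNGs.

```python
import json
import numpy as np
import imageio.v2 as imageio

scene_dir = "dataset/pace/test/000011"  # example scene folder
im_id = "0"                             # frame key as used in the JSON files

# Camera intrinsics for this frame ("cam_K" is a row-major 3x3 matrix in BOP).
with open(f"{scene_dir}/scene_camera.json") as f:
    cam = json.load(f)[im_id]
K = np.array(cam["cam_K"]).reshape(3, 3)

# Ground-truth poses: rotation "cam_R_m2c" (3x3) and translation "cam_t_m2c" (mm).
with open(f"{scene_dir}/scene_gt.json") as f:
    gts = json.load(f)[im_id]
for gt in gts:
    R = np.array(gt["cam_R_m2c"]).reshape(3, 3)
    t = np.array(gt["cam_t_m2c"]) / 1000.0  # mm -> m
    print(gt["obj_id"], R, t)

# Depth in meters: divide by 10,000 for PBR renders or 1,000 for real captures.
depth = imageio.imread(f"{scene_dir}/depth/{int(im_id):06d}.png") / 1000.0
```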
A visualization script is provided to display ground-truth pose annotations and rendered 3D models. Run `visualizer.ipynb` to generate visualizations like the following:
Unzip all `tar.gz` files from HuggingFace and place them under `dataset/pace` for evaluation.
- Ensure the `bop_toolkit` submodule is cloned: after `git clone`, run `git submodule update --init`, or clone with `git clone --recurse-submodules git@github.com:qq456cvb/PACE.git`.
- Place prediction results at `prediction/instance/${METHOD_NAME}_pace-test.csv` (baseline results available here).
- Run:

  ```
  cd eval/instance
  sh eval.sh ${METHOD_NAME}
  ```
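Predictions for the instance-level track are expected in the BOP results CSV format that `bop_toolkit` consumes. Below is a minimal sketch of writing such a file; the dummy estimate, the `METHOD_NAME` placeholder, and the column layout follow the standard BOP convention rather than anything PACE-specific, so double-check against the baseline files linked above.

```python
import numpy as np

# A single dummy estimate; in practice, fill this list with your method's outputs.
estimates = [{
    "scene_id": 11, "im_id": 0, "obj_id": 1, "score": 0.9,
    "R": np.eye(3), "t": np.array([0.0, 0.0, 500.0]),  # rotation (3x3), translation in mm
    "time": 0.05,
}]

# Standard BOP results CSV layout:
# scene_id, im_id, obj_id, score, R (9 values, row-major), t (3 values, mm), time (s).
rows = ["scene_id,im_id,obj_id,score,R,t,time"]
for est in estimates:
    R = " ".join(f"{v:.6f}" for v in np.asarray(est["R"]).flatten())
    t = " ".join(f"{v:.6f}" for v in np.asarray(est["t"]).flatten())
    rows.append(f"{est['scene_id']},{est['im_id']},{est['obj_id']},"
                f"{est['score']:.4f},{R},{t},{est['time']:.3f}")

# Replace METHOD_NAME with your method's name, matching the ${METHOD_NAME} passed to eval.sh.
with open("prediction/instance/METHOD_NAME_pace-test.csv", "w") as f:
    f.write("\n".join(rows) + "\n")
```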
For category-level pose estimation:

- Place prediction results at `prediction/category/${METHOD_NAME}_pred.pkl` (baseline results available here).
- Download the ground-truth labels in the compatible `pkl` format from here and place them at `eval/category/catpose_gts_test.pkl`.
- Run:

  ```
  cd eval/category
  sh eval.sh ${METHOD_NAME}
  ```

Note: `category_names.txt` lists more categories (55) than reported in the paper because some categories lack real-world test images. The categories actually evaluated (47) are listed in `category_names_test.txt` (parts are counted separately). Ground-truth class IDs in `catpose_gts_test.pkl` use indices 1–55, matching `category_names.txt`.
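For reference, ground-truth class IDs can be mapped back to category names with a snippet like the following. This is a sketch that assumes `category_names.txt` lists one name per line and sits under `eval/category`; adjust the path to wherever the file lives in your checkout.

```python
# Map 1-indexed class IDs (as used in catpose_gts_test.pkl) to category names.
with open("eval/category/category_names.txt") as f:
    names = [line.strip() for line in f if line.strip()]

def class_name(class_id: int) -> str:
    return names[class_id - 1]  # class IDs are 1-based
```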
The source code for our annotation tools is organized as follows:
```
annotation_tool/
├─ inpainting
├─ obj_align
├─ obj_sym
├─ pose_annotate
├─ postprocessing
├─ TFT_vs_Fund
├─ utils
```
- `inpainting`: Inpaints markers to produce more realistic images.
- `obj_align`: Aligns objects to a consistent orientation within each category.
- `obj_sym`: Annotates object symmetry information.
- `pose_annotate`: Main pose annotation program.
- `postprocessing`: Post-processing steps (e.g., marker removal, extrinsics refinement/alignment).
- `TFT_vs_Fund`: Refines the 3-camera extrinsics.
- `utils`: Miscellaneous helper functions.
Detailed documentation is coming soon. We are working to make the annotation tools as user-friendly as possible for accurate 3D pose annotation.
MIT license for all contents except:
- Models with IDs 693–1260 are from SketchFab under CC BY. The original posts can be found at `https://sketchfab.com/3d-models/${OBJ_IDENTIFIER}` (find each identifier in `models_info.json`).
- Models 1165 and 1166 are from GrabCAD (identical geometry, different colors); see the GrabCAD license.
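To locate the original SketchFab post for a given model, the identifier can be read from `models_info.json`. A small sketch follows; the JSON path and the `identifier` field name are assumptions based on the format description above, so check the file for the exact key.

```python
import json

# Build the SketchFab URL for a SketchFab-sourced model (IDs 693-1260).
with open("dataset/pace/models/models_info.json") as f:
    models_info = json.load(f)

obj_id = "700"  # example object ID
identifier = models_info[obj_id]["identifier"]
print(f"https://sketchfab.com/3d-models/{identifier}")
```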
@inproceedings{you2023pace,
title={PACE: Pose Annotations in Cluttered Environments},
author={You, Yang and Xiong, Kai and Yang, Zhening and Huang, Zhengxiang and Zhou, Junwei and Shi, Ruoxi and Fang, Zhou and Harley, Adam W. and Guibas, Leonidas and Lu, Cewu},
booktitle={European Conference on Computer Vision},
year={2024},
organization={Springer}
}