
FoundationPose++: Simple Tricks Boost FoundationPose Performance in High-Dynamic Scenes

FoundationPose++ is a real-time 6D pose tracker for highly dynamic scenes. It builds on FoundationPose and consists of four main modules: FoundationPose + 2D Tracker + Kalman Filter + Amodal Completion.

Here's the Introduction Video:

FoundationPose++.mp4

For a clearer video and to join the discussion community, see RedNote


Motivation

I am interning at a robotics company, and my first task was to develop an automatic annotation tool for a robot manipulation dataset. I needed to solve for the 6D pose of target objects, and the first tool that came to mind was FoundationPose. However, FoundationPose performs poorly when tracking the 6D pose of objects in high-dynamic scenes.

I believe the reason is that FoundationPose's tracking is 'pseudo-tracking': for each new frame, it simply uses the 6D pose solved for the previous frame as the initial solution for optimization. So why not provide it with a real tracking method?

[Video: FoundationPose tracking result on the Lego sequence (Lego_FoundationPose)]


Method

For the six degrees of freedom of a 6D pose (x, y, z, roll, pitch, yaw): I track (x, y) with common 2D trackers such as Cutie, Samurai, or OSTrack; for z, I directly take the depth at (x, y); and for (roll, pitch, yaw), I use a Kalman Filter. This is a very simple, engineering-oriented trick, but the final results are outstanding. I named it FoundationPose++.
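To make the decomposition concrete, here is a minimal sketch of how the translation part of a measurement can be assembled from the 2D tracker and the depth map. All names (compose_pose_measurement, bbox_xywh, etc.) are illustrative, not this repo's API; the rotation part is left to the Kalman filter:

import numpy as np

def compose_pose_measurement(bbox_xywh, depth_img, K):
    """Sketch: build the (x, y, z) part of a pose measurement from a
    2D tracker's bounding box and the depth image (hypothetical helper)."""
    x, y, w, h = bbox_xywh
    u, v = x + w / 2.0, y + h / 2.0          # bbox center in pixels
    z = float(depth_img[int(v), int(u)])     # depth taken directly at (u, v)
    # Back-project the pixel to camera coordinates using intrinsics K.
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    return np.array([(u - cx) * z / fx, (v - cy) * z / fy, z])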

[Figure: FoundationPose++ method overview (FoundationPose++_Method)]


Others

Real-Time

The additional modules in FoundationPose++ do not significantly impact the real-time performance of the original FoundationPose. According to the original paper, FoundationPose runs in two stages: Initialization and Tracking. Initialization solves only the first frame, using a Refinement network and a Ranking network: a few hundred initial solutions are randomized, each is run through Refinement, and Ranking then selects the highest-ranked one as the solution for the first frame. The subsequent tracking process does not need to initialize hundreds of solutions; it only takes the final solution of the previous frame as the initial solution for the next frame, which means running Refinement once with no Ranking. This allows FoundationPose to exceed 30 FPS on a 3090 (running in Python).

In our FoundationPose++, the tracking process also requires only one Refinement run per frame. The additional time cost comes mainly from the 2D tracker, but you can choose a fast one such as OSTrack, which runs at over 100 FPS on a 3090 and is very accurate.
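Put together, the per-frame control flow looks roughly like the sketch below. The callables are hypothetical stand-ins, not the NVlabs FoundationPose API:

def track_sequence(frames, sample_init_poses, refine, rank, update_with_2d_tracker):
    """Sketch of the two-stage control flow under the assumptions above."""
    # Initialization: hundreds of randomized hypotheses, Refinement on each,
    # then a single Ranking pass picks the best solution for frame 0.
    hypotheses = sample_init_poses()
    pose = rank(frames[0], [refine(frames[0], h) for h in hypotheses])
    yield pose
    # Tracking: one Refinement per frame, no Ranking. FoundationPose++ only
    # changes the initial guess, via the 2D tracker + Kalman filter step.
    for frame in frames[1:]:
        init = update_with_2d_tracker(frame, pose)
        pose = refine(frame, init)
        yield pose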

[Figure: real-time performance (Real_Time)]

Amodal Completion

To improve occlusion resistance, we considered introducing Amodal Completion. This module may run slowly and has some compatibility issues; we are still working on optimizing it.

News

  • 2025/03/12 πŸŽ‰: We officially release our project, containing the 2D tracker and Kalman filter, for public preview. The current code has been tested on both an NVIDIA RTX 4090 (Ubuntu 20.04) and an NVIDIA H800 (Ubuntu 22.04). If you have any problems using this project, feel free to submit an issue.

Environment Setup

Check install.md to install all the dependencies.


Prepare your testcase data

Your testcase data should be formatted like:

$PROJECT_ROOT/$TESTCASE
β”œβ”€β”€ color
β”‚   β”œβ”€β”€ 0.png
β”‚   β”œβ”€β”€ 1.png
β”‚   └── ...
β”œβ”€β”€ depth
β”‚   β”œβ”€β”€ 0.png
β”‚   β”œβ”€β”€ 1.png
β”‚   └── ...
└── mesh
    └── mesh.obj / mesh.stl / etc.

There should be an RGB image and a corresponding depth image for each frame, as well as a mesh file of the object, following the FoundationPose data format. Check out FoundationPose_manual if you are not familiar with FoundationPose.
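Before running the tracker, you can sanity-check a testcase with a few lines. This sketch assumes depth PNGs are 16-bit with values in millimeters, the usual FoundationPose convention; adjust the scale if your sensor exports a different unit:

import os
import cv2

testcase = os.path.join(os.environ["PROJECT_ROOT"], os.environ["TESTCASE"])
rgb = cv2.imread(os.path.join(testcase, "color", "0.png"))
depth = cv2.imread(os.path.join(testcase, "depth", "0.png"), cv2.IMREAD_UNCHANGED)
assert rgb is not None and depth is not None, "missing frame 0"
assert rgb.shape[:2] == depth.shape[:2], "RGB/depth resolution mismatch"
# Assumption: uint16 depth in millimeters.
print("depth dtype:", depth.dtype, "| range (m):", depth.min() / 1e3, "-", depth.max() / 1e3)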


Try our Demo

We provide our demo of lego_20fps in Google Drive: https://drive.google.com/file/d/1oN5IZHKlb06hEol6akwx1ibCiVcJBuuI/view?usp=sharing

The mask of the first frame is included in the link. You can run the following script to check the results.

export TESTCASE="lego_20fps"
cd $PROJECT_ROOT
python src/obj_pose_track.py \
--rgb_seq_path $PROJECT_ROOT/$TESTCASE/color \
--depth_seq_path $PROJECT_ROOT/$TESTCASE/depth \
--mesh_path $PROJECT_ROOT/$TESTCASE/mesh/1x4.stl \
--init_mask_path $PROJECT_ROOT/$TESTCASE/0_mask.png \
--pose_output_path $PROJECT_ROOT/$TESTCASE/pose.npy \
--mask_visualization_path $PROJECT_ROOT/$TESTCASE/mask_visualization \
--bbox_visualization_path $PROJECT_ROOT/$TESTCASE/bbox_visualization \
--pose_visualization_path $PROJECT_ROOT/$TESTCASE/pose_visualization \
--cam_K "[[426.8704833984375, 0.0, 423.89471435546875], [0.0, 426.4277648925781, 243.5056915283203], [0.0, 0.0, 1.0]]" \
--activate_2d_tracker \
--apply_scale 0.01 \
--force_apply_color \
--apply_color "[0, 159, 237]" \
--est_refine_iter 10 \
--track_refine_iter 3
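
Once the run finishes, you can inspect the saved trajectory. A minimal sketch, assuming pose.npy stores one 4x4 object-to-camera transform per frame (verify the shape on your own output):

import numpy as np

poses = np.load("lego_20fps/pose.npy")   # relative to $PROJECT_ROOT
print(poses.shape)                       # expected: (num_frames, 4, 4)
translations = poses[:, :3, 3]           # per-frame (x, y, z)
print("frame 0 translation:", translations[0])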

Then you will see the demo, which is the same as the one shown in the Introduction Video above (RedNote):

3.14.mp4

The other video in the Introduction Video cannot be released as a demo because the meshes used in it are the private property of our company, PsiBot (η΅εˆζ™Ίθƒ½).


Inference with your own data

Get the object mask of the first frame to initialize the 2D tracker

This step obtains the mask of the first frame, which helps FoundationPose locate the object during tracking. We use a two-stage method (Qwen-VL + SAM-HQ) as an example to extract the mask; you can use any other tool to get it.

Use Qwen-VL to extract the bounding box [OPTIONAL]

We use Qwen-VL to extract the bounding box. You can use any other tool to get it, or directly provide the box area without running Qwen-VL, e.g. BOUNDING_BOX_POSITION="[640, 419, 190, 37]" (in x, y, w, h order).

# start Qwen-VL webapi
cd $PROJECT_ROOT
python src/WebAPI/qwen2_vl_api.py --weight_path $PROJECT_ROOT/Qwen2-VL/weights &

# use Qwen-VL to get the bbox of the object
cd $PROJECT_ROOT
BOUNDING_BOX_POSITION=$(python src/utils/obj_bbox.py \
    --frame_path $PROJECT_ROOT/$TESTCASE/color/0.png \
    --visualize_path $PROJECT_ROOT/$TESTCASE/0_bbox.png \
    --object_name "$DESCRIPTION_OF_THE_OBJECT" \
    --reference_img_path "$PATH_OF_REFERENCE_IMAGE")

Use SAM-HQ to extract the mask [OPTIONAL]

We use SAM-HQ to extract the mask. You can use any other tool to get it, or directly provide the mask at $PROJECT_ROOT/$TESTCASE/0_mask.png.

# start SAM webapi
python src/WebAPI/hq_sam_api.py --checkpoint_path $PROJECT_ROOT/sam-hq/pretrained_checkpoints/sam_hq_vit_h.pth &

# get the mask of object in the first frame
python src/utils/obj_mask.py  \
    --frame_path $PROJECT_ROOT/$TESTCASE/color/0.png \
    --bbox_xywh "$BOUNDING_BOX_POSITION" \
    --output_mask_path $PROJECT_ROOT/$TESTCASE/0_mask.png

$DESCRIPTION_OF_THE_OBJECT: a description of the object that helps Qwen-VL anchor the box position; Chinese descriptions tend to work better.

$PATH_OF_REFERENCE_IMAGE: a reference image showing what the object looks like, to help Qwen-VL anchor the box position more precisely.
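
Before starting the tracker, it is worth checking that the extracted mask actually covers the object. A small sketch (paths follow the testcase layout above):

import cv2
import numpy as np

frame = cv2.imread("color/0.png")
mask = cv2.imread("0_mask.png", cv2.IMREAD_GRAYSCALE)
assert frame is not None and mask is not None, "missing frame or mask"
assert np.count_nonzero(mask) > 0, "mask is empty"
# Blend a red tint over the masked pixels and save for visual inspection.
overlay = frame.copy()
overlay[mask > 0] = (0.5 * overlay[mask > 0] + 0.5 * np.array([0, 0, 255])).astype(np.uint8)
cv2.imwrite("0_mask_overlay.png", overlay)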

6D Pose Track Inference

Run the following script to track the 6D pose; the results will be visualized in $PROJECT_ROOT/$TESTCASE/pose_visualization.

cd $PROJECT_ROOT
python src/obj_pose_track.py \
--rgb_seq_path $PROJECT_ROOT/$TESTCASE/color \
--depth_seq_path $PROJECT_ROOT/$TESTCASE/depth \
--mesh_path $PROJECT_ROOT/$TESTCASE/mesh/1x4.stl \
--init_mask_path $PROJECT_ROOT/$TESTCASE/0_mask.png \
--pose_output_path $PROJECT_ROOT/$TESTCASE/pose.npy \
--mask_visualization_path $PROJECT_ROOT/$TESTCASE/mask_visualization \
--bbox_visualization_path $PROJECT_ROOT/$TESTCASE/bbox_visualization \
--pose_visualization_path $PROJECT_ROOT/$TESTCASE/pose_visualization \
--activate_2d_tracker \
--activate_kalman_filter \
--kf_measurement_noise_scale 0.05 \
--apply_scale 0.01

Use -h to see the usage of each parameter.

For finer-grained Kalman filter settings, see kalman_filter_6d.py.
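For intuition about what kf_measurement_noise_scale controls, below is a minimal constant-velocity Kalman filter over (roll, pitch, yaw). This is an illustrative sketch, not the code in kalman_filter_6d.py; a smaller measurement-noise scale trusts the per-frame refined pose more, while a larger one smooths more aggressively:

import numpy as np

class EulerKalmanFilter:
    """Sketch: constant-velocity Kalman filter over (roll, pitch, yaw)."""

    def __init__(self, dt=1 / 30, process_noise=1e-3, measurement_noise_scale=0.05):
        self.x = np.zeros(6)                          # [angles, angular velocities]
        self.P = np.eye(6)
        self.F = np.eye(6)
        self.F[:3, 3:] = dt * np.eye(3)               # angle += velocity * dt
        self.H = np.hstack([np.eye(3), np.zeros((3, 3))])
        self.Q = process_noise * np.eye(6)
        self.R = measurement_noise_scale * np.eye(3)  # what the scale flag would tune

    def step(self, measured_rpy):
        # Predict with the constant-velocity motion model.
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        # Update with the Euler angles of the refined pose. A real
        # implementation must normalize the innovation y to [-pi, pi]
        # to handle angle wrap-around.
        y = measured_rpy - self.H @ self.x
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ y
        self.P = (np.eye(6) - K @ self.H) @ self.P
        return self.x[:3]                             # smoothed (roll, pitch, yaw)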

Use force_apply_color and apply_color to select a color for the mesh. For other original FoundationPose parameters, check out NVlabs/FoundationPose#44 (comment) if you run into problems or get unexpected results.

Citation

Currently we don't have a paper, so you don't need to formally cite us. Still, you can use GitHub's default 'Cite this repository' feature to get a BibTeX citation.
