VSTFusion-VO: Monocular Visual Odometry with Video Swin Transformer Multimodal Fusion

License: MIT

This project is based on aofrancani/TSformer-VO with significant architectural modifications. We replace TimeSformer with a Video Swin Transformer (stages 1–3) and introduce early fusion of RGB and pseudo-depth inputs.

⚠️ This repository currently provides inference and evaluation code only.
Training code (e.g., training script, optimizer, and hyperparameter configs) will be released after paper acceptance.


Overview

This project presents VSTFusion-VO, a monocular visual odometry framework that integrates early-stage RGB and pseudo-depth fusion with a video-native transformer backbone.
Our method applies 3D patch embedding to jointly encode multimodal inputs into spatiotemporal tokens, enabling geometric reasoning without relying on external depth sensors.
We adopt the first three stages of the Video Swin Transformer to perform hierarchical 3D spatiotemporal attention, allowing the model to capture both motion dynamics and scene structure.
Together, this design enables robust 6-DoF pose estimation from monocular video and improves resilience to scale ambiguity and texture sparsity.
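
To make the early-fusion idea concrete, the sketch below shows one way RGB and pseudo-depth clips can be concatenated channel-wise and passed through a Conv3d-based 3D patch embedding. The class name, patch size, embedding dimension, and input resolution are illustrative placeholders, not the exact values used in this repository.

import torch
import torch.nn as nn

class EarlyFusionPatchEmbed3D(nn.Module):
    """Illustrative 3D patch embedding over channel-concatenated RGB + pseudo-depth.

    Assumes 4 input channels (3 RGB + 1 pseudo-depth); patch size and embedding
    dimension are placeholders rather than this repository's exact configuration.
    """

    def __init__(self, patch_size=(2, 4, 4), in_chans=4, embed_dim=96):
        super().__init__()
        self.proj = nn.Conv3d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, rgb, depth):
        # rgb: (B, 3, T, H, W), depth: (B, 1, T, H, W)
        x = torch.cat([rgb, depth], dim=1)      # early fusion -> (B, 4, T, H, W)
        x = self.proj(x)                        # spatiotemporal tokens (B, C, T', H', W')
        return x.flatten(2).transpose(1, 2)     # (B, N, C) token sequence


# Example: a two-frame clip at an illustrative 192x640 resolution
tokens = EarlyFusionPatchEmbed3D()(torch.randn(1, 3, 2, 192, 640),
                                   torch.randn(1, 1, 2, 192, 640))
print(tokens.shape)  # torch.Size([1, 7680, 96])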

Key Features

  • A transformer-based backbone (Video Swin Transformer, stages 1–3) performing hierarchical 3D spatiotemporal attention to model both spatial structure and temporal dynamics.
  • 3D patch embedding of RGB and pseudo-depth sequences to preserve temporal continuity and encode space-time tokens from the input stage.
  • Fusion of RGB and pseudo-depth embeddings at an early stage to enhance geometric consistency, without relying on ground-truth depth measurements or additional sensor input.
  • End-to-end pose regression pipeline that directly predicts 6-DoF camera poses (a minimal sketch follows this list).
  • Evaluated on the KITTI Odometry benchmark, with consistent improvements over state-of-the-art monocular VO methods; compared specifically to SWFormer-VO, the method achieves:
    • ↓3.59% Translational Error (relative improvement)
    • ↓8.76% Absolute Trajectory Error (ATE)
    • ↓2.54% Relative Pose Error (RPE)
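
As referenced in the feature list, here is a minimal, hypothetical sketch of a pose regression head: it pools the backbone's token sequence and maps it to one 6-DoF pose (3 translation + 3 rotation parameters) per consecutive frame pair. The dimensions and the single-pair assumption are illustrative, not this repository's exact design.

import torch
import torch.nn as nn

class PoseRegressionHead(nn.Module):
    """Illustrative regression head from backbone tokens to 6-DoF poses.

    Assumes the backbone outputs a (B, N, C) token sequence and that one
    6-DoF pose is predicted per consecutive frame pair; all sizes are placeholders.
    """

    def __init__(self, embed_dim=384, num_frame_pairs=1):
        super().__init__()
        self.norm = nn.LayerNorm(embed_dim)
        self.fc = nn.Linear(embed_dim, 6 * num_frame_pairs)
        self.num_frame_pairs = num_frame_pairs

    def forward(self, tokens):
        # tokens: (B, N, C) spatiotemporal tokens from the transformer backbone
        pooled = self.norm(tokens).mean(dim=1)                     # global average pooling
        return self.fc(pooled).view(-1, self.num_frame_pairs, 6)   # (B, pairs, 6)


poses = PoseRegressionHead()(torch.randn(2, 7680, 384))  # -> (2, 1, 6)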

KITTI Odometry Evaluation

Quantitative results (7-DoF alignment) on selected KITTI sequences:

Sequence   Trans. Error (%)   Rot. Error (°/100m)   ATE (m)   RPE (m)   RPE (°)
01         25.11              5.77                  76.28     0.703     0.260
03         14.75              9.20                  20.34     0.101     0.221
04          4.84              2.55                   3.29     0.085     0.129
05          9.39              4.09                  42.31     0.104     0.201
06         10.40              3.69                  25.97     0.133     0.179
07          8.20              6.44                  19.53     0.102     0.214
10          8.65              3.45                  14.33     0.117     0.241

Evaluation was performed using kitti-odom-eval with 7-DoF alignment.
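
For reference, the ATE values above follow the 7-DoF (Sim(3)) alignment convention: the estimated trajectory is aligned to the ground truth in scale, rotation, and translation before the RMSE is computed. The NumPy sketch below illustrates that computation with a standard Umeyama alignment; it is an illustration of the metric, not the toolbox's exact code.

import numpy as np

def umeyama_alignment(est, gt):
    """Sim(3) (7-DoF) alignment of estimated positions to ground truth.

    est, gt: (N, 3) camera positions. Returns scale s, rotation R, translation t
    such that s * R @ est[i] + t approximates gt[i].
    """
    mu_e, mu_g = est.mean(0), gt.mean(0)
    e, g = est - mu_e, gt - mu_g
    cov = g.T @ e / len(est)                      # cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:  # guard against reflections
        S[2, 2] = -1
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / e.var(0).sum()
    t = mu_g - s * R @ mu_e
    return s, R, t

def ate_rmse(est, gt):
    """Absolute trajectory error (RMSE, metres) after 7-DoF alignment."""
    s, R, t = umeyama_alignment(est, gt)
    aligned = (s * (R @ est.T)).T + t
    return np.sqrt(((aligned - gt) ** 2).sum(axis=1).mean())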


Abstract

VSTFusion-VO is a Swin-based monocular visual odometry framework that integrates RGB and depth information through early-stage fusion. The model leverages a Video Swin Transformer as its temporal backbone, enabling hierarchical spatiotemporal representation learning for accurate 6-DoF pose estimation. By embedding geometric information from pseudo-depth at the input level, VSTFusion-VO achieves improved results on the KITTI benchmark, demonstrating the effectiveness of multimodal fusion and video-native transformer design in visual motion estimation.

Contents

  1. Dataset
  2. Pre-trained models
  3. Setup
  4. Usage
  5. Evaluation

1. Dataset

Download the KITTI Odometry dataset (grayscale) for training and evaluation.

RGB images are stored in .jpg format.
Use png_to_jpg.py to convert the original .png files.

The depth maps are pseudo-depths generated by Monodepth2, predicted from the grayscale KITTI frames and saved as .jpeg images.

The data structure should be as follows:

TSformer-VO/
└── data/
    ├── sequences_jpg/
    │   ├── 00/
    │   │   └── image_0/
    │   │       ├── 000000.jpg           # RGB image
    │   │       ├── 000000_disp.jpeg     # Depth map (Monodepth2)
    │   │       ├── 000001.jpg
    │   │       ├── 000001_disp.jpeg
    │   │       └── ...
    │   ├── 01/
    │   └── ...
    └── poses/
        ├── 00.txt
        ├── 01.txt
        └── ...
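
Given this layout, a frame and its pseudo-depth can be loaded and stacked into a single 4-channel array, as in the sketch below. The function name and normalisation are illustrative; only the file naming follows the structure above.

from pathlib import Path

import numpy as np
from PIL import Image

def load_fused_frame(seq_dir, frame_id):
    """Load one frame and its Monodepth2 disparity map from the layout above.

    Returns a (4, H, W) float32 array: 3 image channels + 1 pseudo-depth channel.
    """
    seq_dir = Path(seq_dir)
    rgb = np.asarray(Image.open(seq_dir / f"{frame_id:06d}.jpg").convert("RGB"),
                     dtype=np.float32) / 255.0
    disp = np.asarray(Image.open(seq_dir / f"{frame_id:06d}_disp.jpeg").convert("L"),
                      dtype=np.float32) / 255.0
    return np.concatenate([rgb.transpose(2, 0, 1), disp[None]], axis=0)


frame = load_fused_frame("data/sequences_jpg/00/image_0", 0)  # -> (4, H, W)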

2. Pre-trained models

Here you can find the checkpoints of our trained models.

Google Drive folder: link to checkpoints in GDrive

3. Setup

  • Create a virtual environment using Anaconda and activate it:
conda create -n tsformer-vo python==3.8.0
conda activate tsformer-vo
  • Install dependencies (with environment activated):
pip install -r requirements.txt

4. Usage

4.1. Training

⚠️ Training instructions will be made available after paper acceptance.

4.2. Inference

In predict_poses.py:

  • Manually set the variables that specify the checkpoint and sequences.
Variable          Info
checkpoint_path   String with the path to the trained model you want to use for inference. Ex: checkpoint_path = "checkpoints/Model1"
checkpoint_name   String with the name of the desired checkpoint (name of the .pth file). Ex: checkpoint_name = "checkpoint_model2_exp19"
sequences         List of strings representing the KITTI sequences. Ex: sequences = ["03", "04", "10"]
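
Putting the three variables together, a configuration using the example values from the table might look like this inside predict_poses.py:

# Example settings (values taken from the table above)
checkpoint_path = "checkpoints/Model1"        # directory containing the trained model
checkpoint_name = "checkpoint_model2_exp19"   # name of the .pth checkpoint file
sequences = ["03", "04", "10"]                # KITTI sequences to run inference on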

4.3. Visualize Trajectories

In plot_results.py:

  • Manually set the variables for the checkpoint and desired sequences, as in the Inference step (a standalone plotting sketch follows).
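
For a quick look at a trajectory outside of plot_results.py, the sketch below plots the top-down (x-z) path from a KITTI-format pose file, where each line holds the 12 values of a flattened 3x4 pose matrix. The file path is illustrative.

import numpy as np
import matplotlib.pyplot as plt

def plot_xz_trajectory(pose_file, label):
    """Plot the top-down (x-z) trajectory from a KITTI-format pose file."""
    poses = np.loadtxt(pose_file).reshape(-1, 3, 4)         # one 3x4 pose per line
    plt.plot(poses[:, 0, 3], poses[:, 2, 3], label=label)   # camera position = last column

plot_xz_trajectory("data/poses/03.txt", "ground truth")     # illustrative path
plt.axis("equal")
plt.xlabel("x [m]")
plt.ylabel("z [m]")
plt.legend()
plt.show()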

5. Evaluation

Evaluation is performed with the KITTI odometry evaluation toolbox (kitti-odom-eval). See that repository for more details on the evaluation metrics and on how to run the toolbox.

Citation

If you find this implementation helpful in your work, please consider referencing this repository.
A citation entry will be provided if a related publication becomes available.

References

Code adapted from TimeSformer.

Check out our previous work on monocular visual odometry: DPT-VO
