Zhen Xu1,2,* Zhengqin Li1 Zhao Dong1 Xiaowei Zhou2 Richard Newcombe1 Zhaoyang Lv1
1Reality Labs Research, Meta 2Zhejiang University
*Work done during internship at Meta.
Please cite this paper if you find this repository useful.
@article{xu20254dgt,
title = {4DGT: Learning a 4D Gaussian Transformer Using Real-World Monocular Videos},
author = {Xu, Zhen and Li, Zhengqin and Dong, Zhao and Zhou, Xiaowei and Newcombe, Richard and Lv, Zhaoyang},
journal = {arXiv preprint arXiv:2506.08015},
year = {2025}
}

Use the automated installation script:
bash tlod/scripts/install.sh

This script will interactively guide you through setting up the conda environment and installing all dependencies, including PyTorch, flash-attention, and apex.
For detailed installation instructions and troubleshooting, see docs/install.md.
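After the script finishes, a quick sanity check can confirm that the key dependencies are importable. The snippet below is only a convenience sketch and assumes nothing beyond the packages install.sh sets up (PyTorch, flash-attention, apex):

```python
# Quick environment sanity check (illustrative only).
import importlib

import torch

print(f"PyTorch {torch.__version__}, CUDA available: {torch.cuda.is_available()}")

# flash-attention and apex are installed by install.sh; report whether they
# import cleanly instead of failing hard.
for name in ("flash_attn", "apex"):
    try:
        importlib.import_module(name)
        print(f"{name}: OK")
    except ImportError as exc:
        print(f"{name}: not importable ({exc})")
```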
The pretrained model is hosted on Hugging Face; you can download it manually via:
# By default, the downloaded model will be saved to checkpoints/4dgt_full.pth
python tlod/download_model.py

You can also skip this step; the checkpoint will be downloaded automatically when you run the commands below.
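If you prefer to fetch the checkpoint with the huggingface_hub client rather than the helper script, something like the sketch below works. The repo_id and filename here are hypothetical placeholders; check tlod/download_model.py for the actual model location.

```python
# Manual checkpoint download via huggingface_hub (placeholders, not the real repo id).
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="your-org/4dgt",    # hypothetical placeholder; see tlod/download_model.py
    filename="4dgt_full.pth",   # hypothetical placeholder filename
    local_dir="checkpoints",    # keeps the default checkpoints/ layout
)
print(f"Checkpoint saved to {path}")
```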
We provide two examples of converting a typical Aria recording (.vrs) into the format recognized by 4DGT. For details on the processed data format, see docs/data.md.
We use a sequence from the Aria Everyday Activities (AEA) dataset, downloaded through Aria Explorer, as an example. The same steps apply to any sequence downloaded from Aria Explorer.
mkdir -p data/aea
cd data/aea
# Put the download URL JSON file into data/aea folder
# The download URL file will be different if you choose a different sequence.
aria_dataset_downloader -c loc3_script3_seq1_rec1_download_urls.json -o . -l all

Run the following to process the sequence:
# Process the sequence "loc3_script3_seq1_rec1"
DATA_INPUT_DIR="data/aea/loc3_script3_seq1_rec1" \
DATA_PROCESSED_DIR="data/aea/loc3_script3_seq1_rec1" \
VRS_FILE="recording.vrs" \
bash tlod/scripts/run_vrs_preprocessing.sh

The Aria Digital Twin (ADT) dataset uses synthetic rendering as an evaluation mechanism. Follow these steps to download an example sequence:
# Create a directory for datasets
mkdir -p data/adt-raw
cd data/adt-raw
# Put the download link file "ADT_download_urls.json" in the current directory
# Get the file from: https://facebookresearch.github.io/projectaria_tools/docs/open_datasets/aria_digital_twin_dataset/dataset_download
# Download the sample data
aria_dataset_downloader --cdn_file ADT_download_urls.json --output_folder . --data_types 0 1 2 3 4 5 6 7 8 9 --sequence_names Apartment_release_multiuser_cook_seq141_M1292 Apartment_release_multiskeleton_party_seq114_M1292 Apartment_release_meal_skeleton_seq135_M1292 Apartment_release_work_skeleton_seq137_M1292

# Process the synthetic sequence "Apartment_release_multiuser_cook_seq141_M1292"
DATA_INPUT_DIR="data/adt-raw/Apartment_release_multiuser_cook_seq141_M1292" \
DATA_PROCESSED_DIR="data/adt/Apartment_release_multiuser_cook_seq141_M1292" \
VRS_FILE="synthetic_video.vrs" \
bash tlod/scripts/run_vrs_preprocessing.sh

After this, you should have a data/adt/Apartment_release_multiuser_cook_seq141_M1292/synthetic_video/camera-rgb-rectified-600-h1000 folder corresponding to the format discussed above.
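To verify that preprocessing produced the expected layout, you can list the rectified RGB folder. This is only a convenience check and assumes nothing beyond the output path above:

```python
# Sanity-check the preprocessed ADT output folder (illustrative only).
from pathlib import Path

out = Path(
    "data/adt/Apartment_release_multiuser_cook_seq141_M1292/"
    "synthetic_video/camera-rgb-rectified-600-h1000"
)
assert out.is_dir(), f"Expected preprocessed folder at {out}"

# Count the files written by the preprocessing script; the exact contents and
# naming convention are documented in docs/data.md.
num_files = sum(1 for p in out.rglob("*") if p.is_file())
print(f"{out} contains {num_files} files")
```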
We provide a simplified Python interface with a cleaner configuration. Execution is fully sequential: for each batch, the model encoder predicts Gaussians, the renderer processes them, and the results are saved to disk before the next batch starts (sketched below). Rendering all 128 frames therefore takes some time.
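The per-batch flow can be pictured roughly as follows. This is a schematic of the control flow only, not the actual tlod.run implementation; encode, render, and save are stand-ins for the real components.

```python
# Schematic of the sequential per-batch inference loop (illustrative only).
def run_batches(batches, encode, render, save):
    for i, batch in enumerate(batches):
        gaussians = encode(batch)   # encoder predicts 4D Gaussians for this batch
        images = render(gaussians)  # renderer processes the Gaussians into images
        save(i, images)             # written to disk before the next batch starts

# Toy usage with no-op stand-ins, just to show that each step blocks the next.
run_batches(
    batches=[None] * 3,
    encode=lambda batch: "gaussians",
    render=lambda gaussians: ["image"],
    save=lambda i, images: print(f"batch {i}: saved {len(images)} image(s)"),
)
```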
# Run inference on the above AEA sequence
# It renders novel-view spirals at the chosen timestamps from the full set of frames.
python -m tlod.run \
data_path=data/aea \
seq_list=loc3_script3_seq1_rec1 \
seq_data_root=recording/camera-rgb-rectified-600-h1000 \
novel_view_timestamps="[1.42222, 2.84444]"
# Run inference on the above ADT sequence
# frame_sample restricts inference to the last 128 frames (see the short illustration after these commands).
python -m tlod.run \
data_path=data/adt \
seq_list=Apartment_release_multiuser_cook_seq141_M1292 \
seq_data_root=synthetic_video/camera-rgb-rectified-600-h1000 \
frame_sample="[-128, null, 1]" \
novel_view_timestamps="[1.42222, 2.84444]"To run the inference, make sure the GPU has at least 16GB of available VRAM.
Full evaluation and inference on other datasets are still a work in progress; we will release them in the near future.
We also provide the first-stage model, trained only on EgoExo4D data as described in our paper. It does not use the level-of-detail design and predicts much denser Gaussians. You can use it as a reference with the following command:
python -m tlod.run \
checkpoint=checkpoints/4dgt_1st_stage.pth \
exp_name=exp_4dgt_1st_stage \
config=configs/models/tlod.py \
data_path=data/aea \
seq_list=loc3_script3_seq1_rec1 \
seq_data_root=recording/camera-rgb-rectified-600-h1000 \
novel_view_timestamps="[1.42222, 2.84444]"We provide a simple interactive web-based viewer that renders Gaussians with asynchronous Gaussian generation:
python -m tlod.run_viewer \
data_path=data/aea \
seq_list=loc3_script3_seq1_rec1 \
seq_data_root=recording/camera-rgb-rectified-600-h1000

The viewer provides sliders to control space (frame) and time. Currently, asynchronous model prediction may slow down interactive rendering, depending on your GPU. We may improve this in the future.
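The asynchronous behavior can be understood as a background worker producing Gaussians while the render loop keeps drawing whatever is already available. The sketch below shows that general producer/consumer pattern; it is not the tlod.run_viewer implementation.

```python
# Producer/consumer pattern behind asynchronous Gaussian generation (illustrative only).
import queue
import threading
import time

gaussian_queue: queue.Queue = queue.Queue(maxsize=4)

def predict_gaussians(num_batches):
    """Background worker: stand-in for per-batch model prediction."""
    for i in range(num_batches):
        time.sleep(0.5)                      # stand-in for encoder inference
        gaussian_queue.put(f"gaussians_{i}")

def render_loop(num_batches):
    """Foreground loop: render the most recent Gaussians without blocking."""
    latest, received = None, 0
    while received < num_batches:
        try:
            latest = gaussian_queue.get_nowait()
            received += 1
        except queue.Empty:
            pass                             # keep showing the previous result
        print(f"render frame using {latest}")
        time.sleep(0.05)                     # stand-in for rasterization/display

threading.Thread(target=predict_gaussians, args=(3,), daemon=True).start()
render_loop(3)
```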
This implementation is Creative Commons licensed, as found in the LICENSE file.
This repository builds on and benefits from the following great open-source projects:
- Project Aria tool: Apache 2.0
- Egocentric Splats: CC-NC (Meta)
- EasyVolcap: MIT
- Nerfview: Apache 2.0
