QUEEN (QUantized Efficient ENcoding) is a novel framework for efficient, streamable free-viewpoint video (FVV) representation using dynamic 3D Gaussians. QUEEN enables high-quality dynamic scene capture, drastically reduces model size (to just 0.7 MB per frame), trains in under 5 seconds per frame, and achieves real-time rendering at ~350 FPS.
This repository contains the official implementation for QUEEN, as introduced in the NeurIPS 2024 paper:
QUEEN: QUantized Efficient ENcoding for Streaming Free-viewpoint Videos
Sharath Girish, Tianye Li*, Amrita Mazumdar*, Abhinav Shrivastava, David Luebke, Shalini De Mello
NeurIPS 2024
For business inquiries, please visit our website and submit the form: NVIDIA Research Licensing.
- 2025/06: Initial code release!
- 2025/03: We showed a live demo of QUEEN in VR and light field displays at GTC San Jose 2025!
- 2024/12: QUEEN was published at NeurIPS 2024.
- 🔥 News
- Contents
- 🔧 Environment Setup
- Data Preparation
- 💻 Training
- 🎥 Rendering and Evaluation
- 📋 Pre-trained Models
- 🎓 Citations
- 🙏 Acknowledgements
We have only tested on Linux with CUDA 11.8+ compatible systems.
Our software uses git submodules, so please clone the repo recursively.
git clone --recurse-submodules git@github.com:NVlabs/queen.git queen
cd queen
# set up some relevant directories
mkdir data
mkdir logs
mkdir output
We suggest using the provided Dockerfile to reproduce the environment and recommend it for training within a working container. We also provide a conda environment as an alternative.
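For reference, a minimal sketch of building and entering the container is shown below; the image tag and the /workspace/queen mount point are placeholders (not defined by this repo), and GPU access assumes the NVIDIA Container Toolkit is installed.

# build the image from the Dockerfile at the repo root (tag name is arbitrary)
docker build -t queen:dev .
# start an interactive container with GPU access, mounting the repo at an example path
docker run --gpus all -it --rm -v "$(pwd)":/workspace/queen -w /workspace/queen queen:dev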
# create the conda environment
conda env create -f environment.yml
conda activate queen
# install submodules
pip install ./submodules/simple-knn
pip install ./submodules/diff-gaussian-rasterization
pip install ./submodules/gaussian-rasterization-grad
# manually correct bug in timm package
# as in https://github.com/huggingface/pytorch-image-models/issues/1530#issuecomment-2084575852
cp ./maxxvit.py /root/miniconda3/lib/python3.11/site-packages/timm/models/maxxvit.py
This repo uses an older version of timm, which requires a patched version of maxxvit.py, following this issue. We include the patched file in the repo (maxxvit.py); you’ll need to overwrite the existing file in your environment’s timm installation. This step is already included in the Dockerfile and conda install instructions, but is explained in further detail here.
Python version and Conda environment paths can vary, so you must locate the correct destination and copy the file to that location:
python -c "import timm; print(timm.__file__)"
cp maxxvit.py /path/to/timm/models/maxxvit.py
For example, if you’re using Python 3.12 in a conda env named queen:
cp maxxvit.py ~/miniconda3/envs/queen/lib/python3.12/site-packages/timm/models/maxxvit.py
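As an optional sanity check (an extra step, not part of the official instructions), you can confirm that the patched module imports cleanly in your environment:

# should print the timm version without errors if the patched maxxvit.py is in place
python -c "import timm; from timm.models import maxxvit; print('timm', timm.__version__, 'maxxvit OK')"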
For MiDaS, please download the pretrained weights dpt_beit_large_512.pt from their official repo. We tested with the v3.1 release.
wget -P MiDaS/weights https://github.com/isl-org/MiDaS/releases/download/v3_1/dpt_beit_large_512.pt
We assume datasets are organized as follows:
data
└── [dataset_directory]
    └── [scene_name]
        ├── cam01
        │   └── images
        │       ├── 0000.png
        │       ├── 0001.png
        │       └── ...
        ├── cam02
        │   └── images
        │       ├── 0000.png
        │       ├── 0001.png
        │       └── ...
        ├── ...
        ├── sparse_
        │   ├── cameras.bin
        │   ├── images.bin
        │   └── ...
        ├── points3D_downsample2.ply
        └── poses_bounds.npy
To generate points3D_downsample2.ply, please use the multipleviewprogress.sh script from 4DGaussians.
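Before training, it can help to confirm that a scene folder matches this layout. A minimal sketch, using the example coffee_martini path from the training commands below (adjust the path to your scene):

# check that the files from the layout above exist (example scene path)
SCENE=data/dynerf/coffee_martini
for f in poses_bounds.npy points3D_downsample2.ply sparse_/cameras.bin sparse_/images.bin; do
    [ -e "$SCENE/$f" ] && echo "found: $f" || echo "MISSING: $f"
done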
We also support datasets organized as follows:

data
└── [dataset_directory]
    └── [scene_name]
        ├── camera_0001
        │   └── images_scaled_2
        │       ├── 0000.png
        │       ├── 0001.png
        │       └── ...
        ├── camera_0002
        │   └── images_scaled_2
        │       ├── 0000.png
        │       ├── 0001.png
        │       └── ...
        ├── ...
        └── colmap
            └── sparse
                └── 0
                    ├── points3d.bin
                    ├── cameras.bin
                    └── images.bin
If you have a dataset of multi-view videos, please follow these steps to produce the required data:
- organize the videos into folders of images for each camera view (see the sketch below)
- follow the instructions in the gaussian-splatting codebase to create the COLMAP camera data in the sparse_ directory
For more information on how datasets are loaded, please see scene/dataset_readers.py.
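As an illustration of the first step above, here is a hedged sketch that splits per-camera videos into image folders with ffmpeg; the raw_videos/ input path, camera names, and data/my_dataset/my_scene output path are placeholders, so adapt them to your capture (and add any downscaling your setup requires).

# extract frames from each camera's video into the layout expected above (example names/paths)
for cam in cam01 cam02; do   # extend with your remaining camera views
    mkdir -p data/my_dataset/my_scene/$cam/images
    ffmpeg -i raw_videos/$cam.mp4 -start_number 0 data/my_dataset/my_scene/$cam/images/%04d.png
done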
You can train a scene by running:
python train.py --config [config_path] -s [source_path] -m [output_name]
For example:
python train.py --config configs/dynerf.yaml -s data/dynerf/coffee_martini -m ./output/coffee_martini_trained
The training script builds on the original 3DGS training script, and, as such, shares many of the same command line arguments. We add new arguments to control compression hyperparameters and 3DGS training hyperparameters for the initial frame and residual frames.
Please see the specific configuration files in configs for examples, and arguments/__init__.py for the full list of arguments.
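If you want to inspect the available options locally, the training script should also respond to the standard argparse help flag (assuming the argument parsing follows the original 3DGS scripts):

# list all command-line options recognized by the training script
python train.py --help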
Useful Command Line Arguments for train.py
- --source_path / -s: Path to the source directory containing a COLMAP or Synthetic NeRF data set.
- --model_path / -m: Path where the trained model should be stored (output/<random> by default).
- --white_background / -w: Add this flag to use a white background instead of black (default), e.g., for evaluation of the NeRF Synthetic dataset.
- --sh_degree: Order of spherical harmonics to be used (no larger than 3). 3 by default.
- Maximum number of frames to process, 300 by default (see arguments/__init__.py for the flag name).
- --log_images: Flag to save rendered images during training.
- --log_ply: Flag to save the point cloud in PLY format during training.
- --log_compressed: Flag to save the compressed model during training.
- Format to save the model in, ply by default (see arguments/__init__.py for the flag name).
- --use_xyz_legacy: Flag to use the legacy _xyz implementation to reproduce the paper results. Note that using the legacy implementation with --log_compressed is unsupported.
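For example, to also save the compressed per-frame representation during training (needed for the compressed rendering workflow described below), a command along the following lines can be used; the output directory name here is just a placeholder:

# train while saving the compressed model (example output path)
python train.py --config configs/dynerf.yaml --log_compressed -s data/dynerf/coffee_martini -m ./output/coffee_martini_compressed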
python render.py -s <path to scene> -m <path to trained model> # Generate renderings
python metrics.py -m <path to trained model> # Compute error metrics on renderings
We also provide a script to render spiral viewpoints.
python render_fvv.py -s <path to scene> -m <path to trained model> # Render spiral viewpoints
Example usage:
# Train coffee-martini scene
python train.py --config configs/dynerf.yaml --log_images --log_ply -s data/dynerf/coffee_martini -m ./output/coffee_martini_trained
# Compute metrics
python metrics_video.py -m ./output/coffee_martini_trained
# Render static camera viewpoints and spiral
python render.py -s data/dynerf/coffee_martini -m ./output/coffee_martini_trained
python render_fvv.py --config configs/dynerf.yaml -s data/dynerf/coffee_martini -m ./output/coffee_martini_trained
After training with --log_compressed to save compressed scenes, we can decode and render frames.
python render_fvv_compressed.py --config configs/dynerf.yaml [any additional config flags] -s <path to scene> -m <path to trained model> # Generate renderings
If you provided additional command-line arguments during training to adjust the training configuration, you may need to pass the same arguments here to generate the renderings correctly.
We provide pre-trained models for the N3DV dataset. The provided models are subject to the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license terms. The download links for these models are provided in the tables below, or on the Models documentation. Please use the "render from compressed" instructions to render from the compressed .pkl files.
Note: we provide two versions of our models, "NeurIPS24" and "Compressed". The "NeurIPS24" version includes only the dense .ply files corresponding to the training runs reported in the paper, while the "Compressed" version contains both dense .ply files and compressed .pkl representations. We identified and corrected a bug before the code release that affected rendering from the compressed .pkl files. This fix results in a slight change in the results (we include a training flag, --use_xyz_legacy, to train with or without the fix).
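For instance, after downloading and extracting one of the compressed models, rendering could look like the following; the ./pretrained/coffee_martini path is a placeholder for wherever you extracted the model, and the corresponding scene data must still be available locally:

# render from a downloaded compressed model (example model path)
python render_fvv_compressed.py --config configs/dynerf.yaml -s data/dynerf/coffee_martini -m ./pretrained/coffee_martini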
NeurIPS24 -- Paper Models (download)
Scene | PSNR (dB)↑ | SSIM ↑ | LPIPS ↓ |
---|---|---|---|
coffee-martini | 28.31 | 0.916 | 0.155 |
cook-spinach | 33.31 | 0.955 | 0.134 |
cut-roasted-beef | 33.64 | 0.958 | 0.132 |
sear-steak | 33.95 | 0.962 | 0.125 |
flame-steak | 34.16 | 0.962 | 0.125 |
flame-salmon | 29.17 | 0.923 | 0.144 |
Compressed -- Corrected Compressed Models (download)
Scene | PSNR (dB)↑ | SSIM ↑ | LPIPS ↓ |
---|---|---|---|
coffee-martini | 28.22 | 0.915 | 0.156 |
cook-spinach | 33.33 | 0.956 | 0.134 |
cut-roasted-beef | 33.49 | 0.958 | 0.133 |
sear-steak | 33.94 | 0.962 | 0.126 |
flame-steak | 34.17 | 0.962 | 0.126 |
flame-salmon | 28.93 | 0.922 | 0.145 |
If you find QUEEN useful in your research, please cite:
@inproceedings{
girish2024queen,
title={{QUEEN}: {QU}antized Efficient {EN}coding for Streaming Free-viewpoint Videos},
author={Sharath Girish and Tianye Li and Amrita Mazumdar and Abhinav Shrivastava and David Luebke and Shalini De Mello},
booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
year={2024},
url={https://openreview.net/forum?id=7xhwE7VH4S}
}
QUEEN builds upon the original gaussian-splatting codebase, and uses the pretrained MiDaS model for depth estimation. We thank the authors for their contributions.