Yifan Bian, Chuanbo Tang, Li Li, Dong Liu
Our Spatially Embedded Video Codec (SEVC) significantly advances the performance of Neural Video Codecs (NVCs). Furthermore, SEVC possesses enhanced robustness on challenging video sequences while offering additional functionality.
- Large Motions: SEVC can better handle sequences with large motions through progressive motion augmentation.
- Emerging Objects: Equipped with spatial references, SEVC can better handle sequences with emerging objects in low-delay scenes.
- Fast Decoding: SEVC provides a fast decoding mode to reconstruct a low-resolution video.
- [2025/04/05]: Our paper has been selected as a CVPR 2025 highlight paper (top 13.5%).
Results comparison (BD-rate and RD curves) for PSNR. The intra period is -1 with 96 frames, and the anchor is VTM-13.2 under the low-delay B (LDB) configuration. Negative BD-rate numbers indicate bitrate savings over the anchor.
| BD-Rate (%) | HEVC_B | MCL-JCV | UVG | USTC-TD |
| --- | --- | --- | --- | --- |
| DCVC-HEM | 10.0 | 4.9 | 1.2 | 27.2 |
| DCVC-DC | -10.8 | -13.0 | -21.2 | 11.9 |
| DCVC-FM | -11.7 | -12.5 | -24.3 | 23.9 |
| SEVC (ours) | -17.5 | -27.7 | -33.2 | -12.5 |

- Our SEVC reconstructs better motion vectors (MVs) on the decoder side for large-motion sequences. Here, we use RAFT optical flow as the pseudo motion label.
- Spatial references augment the context for frame coding. For emerging objects that do not appear in previous frames, SEVC produces a better description in its deep contexts.
This implementation of SEVC is based on DCVC-DC and CompressAI. Please refer to them for more information.
1. Install the dependencies
```bash
conda create -n $YOUR_PY38_ENV_NAME python=3.8
conda activate $YOUR_PY38_ENV_NAME
conda install pytorch==1.10.0 torchvision==0.11.0 cudatoolkit=11.3 -c pytorch
pip install pytorch_ssim scipy matplotlib tqdm bd-metric pillow pybind11
```
2. Prepare test datasets
For testing the RGB sequences, we use FFmpeg to convert the original YUV 420 data to RGB data.
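A minimal helper for this conversion is sketched below; it just shells out to FFmpeg, and the sequence name and resolution are placeholders, so running the equivalent ffmpeg command directly works just as well.

```python
# Illustrative sketch: convert a raw YUV420 sequence into the per-frame
# PNGs expected by the test scripts. Path and resolution are placeholders.
import subprocess
from pathlib import Path

def yuv420_to_png(yuv_path: str, width: int, height: int, out_dir: str) -> None:
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(
        [
            "ffmpeg", "-y",
            "-f", "rawvideo",
            "-pix_fmt", "yuv420p",
            "-s", f"{width}x{height}",
            "-i", yuv_path,
            str(Path(out_dir) / "im%05d.png"),  # im00001.png, im00002.png, ...
        ],
        check=True,
    )

yuv420_to_png("BQTerrace_1920x1080_60.yuv", 1920, 1080,
              "test_datasets/HEVC_B/BQTerrace_1920x1080_60")
```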
A recommended structure of the test dataset is like:
```
test_datasets/
├── HEVC_B/
│   ├── BQTerrace_1920x1080_60/
│   │   ├── im00001.png
│   │   ├── im00002.png
│   │   ├── im00003.png
│   │   └── ...
│   ├── BasketballDrive_1920x1080_50/
│   │   ├── im00001.png
│   │   ├── im00002.png
│   │   ├── im00003.png
│   │   └── ...
│   └── ...
├── HEVC_C/
│   └── ... (like HEVC_B)
└── HEVC_D/
    └── ... (like HEVC_C)
```
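After conversion, a small script like this (illustrative, assuming the layout above) can verify that every sequence directory contains the expected frames:

```python
# Illustrative sanity check: list every <class>/<sequence> directory under
# the dataset root and count its im*.png frames.
from pathlib import Path

def check_dataset(root: str = "test_datasets") -> None:
    for dataset in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        for seq in sorted(p for p in dataset.iterdir() if p.is_dir()):
            num_frames = len(list(seq.glob("im*.png")))
            print(f"{dataset.name}/{seq.name}: {num_frames} frames")

check_dataset()
```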
3. Compile the arithmetic coder
If you need real bitstream writing, please compile the arithmetic coder using the following commands.
On Windows:

```bash
cd src
mkdir build
cd build
conda activate $YOUR_PY38_ENV_NAME
cmake ../cpp -G "Visual Studio 16 2019" -A x64
cmake --build . --config Release
```
On Linux:

```bash
sudo apt-get install cmake g++
cd src
mkdir build
cd build
conda activate $YOUR_PY38_ENV_NAME
cmake ../cpp -DCMAKE_BUILD_TYPE=Release
make -j
```
1. Evaluation
Run the following command to evaluate the model and generate a JSON file that contains test results.
```bash
python test.py --rate_num 4 --test_config ./config_F96-IP-1.json --cuda 1 --worker 1 --output_path output.json --i_frame_model_path ./ckpt/cvpr2023_i_frame.pth.tar --p_frame_model_path ./ckpt/cvpr2025_p_frame.pth.tar
```
- We use the same intra model as DCVC-DC. `cvpr2023_i_frame.pth.tar` can be downloaded from DCVC-DC.
- Our `cvpr2025_p_frame.pth.tar` can be downloaded from CVPR2025-SEVC. `cvpr2023_i_frame.pth.tar` is also available here.

Put the model weights into the `./ckpt` directory and run the above command.

Our model supports variable bitrate: set different `i_frame_q_indexes` and `p_frame_q_indexes` to evaluate different bitrates.
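For background, variable-bitrate codecs in the DCVC line usually derive quantization scales from the integer quality index by interpolating between trained endpoints. The sketch below only illustrates that idea; the function name and endpoint values are made up and are not this repo's actual code.

```python
# Conceptual sketch (not the repo's code): map q_index in [0, q_num - 1]
# to a quantization scale by log-linear interpolation between trained
# endpoint scales; scale_coarse and scale_fine are placeholder values.
import math

def q_index_to_scale(q_index: int, scale_coarse: float = 8.0,
                     scale_fine: float = 0.5, q_num: int = 64) -> float:
    t = q_index / (q_num - 1)
    # q_index = 0 yields the coarsest scale (lowest quality), matching the
    # convention that a smaller index means lower quality.
    return math.exp((1 - t) * math.log(scale_coarse) + t * math.log(scale_fine))

print(q_index_to_scale(0), q_index_to_scale(63))  # 8.0 ... 0.5
```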
2. Real Encoding/Decoding
If you want real encoding/decoding, please use the encoder/decoder script as follows:
Encoding
```bash
python encoder.py -i $video_path -q $q_index --height $video_height --width $video_width --frames $frame_to_encode --ip -1 --fast $fast_mode -b $bin_path --i_frame_model_path ./ckpt/cvpr2023_i_frame.pth.tar --p_frame_model_path ./ckpt/cvpr2025_p_frame.pth.tar
```
- `$video_path`: input video path. For PNG files, it should be a directory.
- `$q_index`: 0-63. A smaller value indicates lower quality.
- `$frames`: number of frames to encode. The default is -1 (all frames).
- `$fast`: 0/1. 1 enables the fast encoding mode; if `--fast 1` is used, only a 4x downsampled video will be encoded (see the sketch below).
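To make the fast mode concrete: only a 4x spatially downsampled version of the input is encoded. A rough sketch of that preprocessing (illustrative; the codec's actual resampling filter may differ):

```python
# Illustrative sketch of what fast mode encodes: each (N, 3, H, W) frame
# batch reduced to a quarter of the resolution in each dimension.
import torch
import torch.nn.functional as F

def downsample_4x(frames: torch.Tensor) -> torch.Tensor:
    """(N, 3, H, W) in [0, 1] -> (N, 3, H/4, W/4)."""
    return F.interpolate(frames, scale_factor=0.25, mode="bicubic",
                         align_corners=False).clamp(0, 1)

x = torch.rand(1, 3, 1080, 1920)
print(downsample_4x(x).shape)  # torch.Size([1, 3, 270, 480])
```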
Decoding
```bash
python decoder.py -b $bin_path -o $rec_path --i_frame_model_path ./ckpt/cvpr2023_i_frame.pth.tar --p_frame_model_path ./ckpt/cvpr2025_p_frame.pth.tar
```
- In fast mode, you will only get a 4x downsampled video.
- Otherwise, you will get two videos: the 4x downsampled one and the full-resolution one.
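To sanity-check a decoded frame against its source, a simple RGB PSNR helper can be used (illustrative; both file paths are placeholders):

```python
# Illustrative RGB PSNR between an original frame and its reconstruction.
import numpy as np
from PIL import Image

def psnr(ref_path: str, rec_path: str) -> float:
    ref = np.asarray(Image.open(ref_path).convert("RGB"), dtype=np.float64)
    rec = np.asarray(Image.open(rec_path).convert("RGB"), dtype=np.float64)
    mse = np.mean((ref - rec) ** 2)
    return float(10 * np.log10(255.0 ** 2 / mse))

print(psnr("test_datasets/HEVC_B/BQTerrace_1920x1080_60/im00001.png",
           "rec/im00001.png"))  # placeholder reconstruction path
```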
3. Temporal Stability
To intuitively verify the temporal stability of the videos at both resolutions, we provide two reconstruction examples at four bitrates:
- BasketballDrive_1920x1080_50: q1, q2, q3, q4
- RaceHorses_832x480_30: q1, q2, q3, q4
You can find them in examples.
They are stored in raw rgb24 format. You can use a YUV player to display them and observe the temporal stability.
Note: when displaying the fast-mode reconstruction, do not forget to set the correct resolution, which is a quarter of the full resolution.
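If you prefer scripting over a YUV player, raw rgb24 is simply height x width x 3 bytes per frame, stored back to back; a minimal reader (illustrative, the file name is a placeholder):

```python
# Illustrative reader for the raw rgb24 examples.
import numpy as np

def read_rgb24(path: str, width: int, height: int) -> np.ndarray:
    """Return an array of shape (num_frames, height, width, 3)."""
    data = np.fromfile(path, dtype=np.uint8)
    frame_size = width * height * 3
    num_frames = data.size // frame_size
    return data[: num_frames * frame_size].reshape(num_frames, height, width, 3)

# Remember: fast-mode reconstructions are a quarter of the full resolution.
frames = read_rgb24("BasketballDrive_1920x1080_50_q1.rgb", 1920, 1080)
print(frames.shape)
```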
If this repo helped you, a ⭐ star or citation would make my day!
```bibtex
@InProceedings{Bian_2025_CVPR,
    author    = {Bian, Yifan and Tang, Chuanbo and Li, Li and Liu, Dong},
    title     = {Augmented Deep Contexts for Spatially Embedded Video Coding},
    booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
    month     = {June},
    year      = {2025},
    pages     = {2094-2104}
}
```
If you have any questions, please contact me:
- togelbian@gmail.com (main)
- esakak@mail.ustc.edu.cn (alternative)
This work is licensed under the MIT License.
Our work is implemented based on DCVC-DC and CompressAI.