Huihan Wang1* Zhiwen Yang1* Hui Zhang2 Dan Zhao3 Bingzheng Wei4 Yan Xu1 ✉
1BUAA 2THU 3PUMC 4ByteDance
* Equal Contributions. ✉ Corresponding Author.
(Demo video: example.mp4)
git clone https://github.com/Yaziwel/FEAT.git
cd FEAT
conda create -n FEAT python=3.10
conda activate FEAT
pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
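After installation, a quick sanity check (a minimal sketch; the exact CUDA build depends on your local driver) confirms that the pinned PyTorch version can see the GPU:

```python
# Verify the installed PyTorch build and GPU visibility.
import torch
import torchvision

print("torch:", torch.__version__)              # expected: 2.1.2
print("torchvision:", torchvision.__version__)  # expected: 0.16.2
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
```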
Colonoscopic: The dataset provided by the paper can be found here. You can directly use the video data processed by Endo-FM without further preprocessing.
Kvasir-Capsule: The dataset provided by the paper can be found here. You can directly use the video data processed by Endo-FM without further preprocessing.
Please first run process_data.py and process_list.py to split the videos into frames and generate the corresponding list:
CUDA_VISIBLE_DEVICES=gpu_id python process_data.py -s ./data/Colonoscopic -t ./data/Colonoscopic_frames
CUDA_VISIBLE_DEVICES=gpu_id python process_list.py -f ./data/Colonoscopic_frames -t ./data/Colonoscopic_frames/train_128_list.txt
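Under the hood, the frame-splitting step simply decodes each video into a folder of JPEG frames. Below is a hypothetical minimal sketch of that idea using OpenCV; the authoritative logic lives in process_data.py and may differ in details such as resizing or frame naming:

```python
# Hypothetical illustration of frame splitting; process_data.py is the script actually used.
import os
import cv2

def split_video_to_frames(video_path: str, out_dir: str) -> None:
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        cv2.imwrite(os.path.join(out_dir, f"{idx:05d}.jpg"), frame)
        idx += 1
    cap.release()

split_video_to_frames("./data/Colonoscopic/00001.mp4",
                      "./data/Colonoscopic_frames/00001")
```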
The resulting file structure is as follows.
├── data
│   ├── Colonoscopic
│   │   ├── 00001.mp4
│   │   ├── ...
│   ├── Kvasir-Capsule
│   │   ├── 00001.mp4
│   │   ├── ...
│   ├── Colonoscopic_frames
│   │   ├── train_128_list.txt
│   │   ├── 00001
│   │   │   ├── 00000.jpg
│   │   │   ├── ...
│   │   ├── ...
│   ├── Kvasir-Capsule_frames
│   │   ├── train_128_list.txt
│   │   ├── 00001
│   │   │   ├── 00000.jpg
│   │   │   ├── ...
│   │   ├── ...
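A quick sanity check of the prepared data (a minimal sketch using only the paths shown in the tree above):

```python
# Count the extracted frame folders and the entries in the generated list file.
import os

frames_root = "./data/Colonoscopic_frames"
list_file = os.path.join(frames_root, "train_128_list.txt")

folders = [d for d in os.listdir(frames_root)
           if os.path.isdir(os.path.join(frames_root, d))]
with open(list_file) as f:
    entries = [line for line in f if line.strip()]

print(f"{len(folders)} frame folders, {len(entries)} entries in train_128_list.txt")
```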
You can follow the steps below to train FEAT:
bash train_scripts/col/train_col.sh
bash train_scripts/kva/train_kva.sh
You can directly sample medical videos from a checkpoint. Here is a quick-start example using our pre-trained models:
- Download the pre-trained weights from here and put them in the specific paths defined in the configs. You can also use huggingface_hub to download the weights; for example, a checkpoint can be downloaded like so (see the inspection sketch after this list):
from huggingface_hub import hf_hub_download
# 4 models supported: FEAT_L_col.pt, FEAT_L_kva.pt, FEAT_S_col.pt and FEAT_S_kva.pt
filepath = hf_hub_download(repo_id="WTHH031230/FEAT", filename="FEAT_L_col.pt")
- Run sample.py with the following scripts; you can customize various arguments, such as the number of sampling steps.
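Once a checkpoint is downloaded, you can inspect it before sampling. This is a minimal sketch that assumes a standard PyTorch checkpoint file; sample.py handles the actual model construction and loading:

```python
# Inspect a downloaded checkpoint; the sampling scripts perform the real loading.
import torch

filepath = "FEAT_L_col.pt"  # or the path returned by hf_hub_download above
state = torch.load(filepath, map_location="cpu")
if isinstance(state, dict):
    # Depending on how the checkpoint was saved, weights may sit under keys such as "model" or "ema".
    print("top-level keys:", list(state.keys())[:10])
```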
You can follow the steps below to sample a video by using FEAT:
bash sample/col.sh
bash sample/kva.sh
DDP sampling:
bash sample/col_ddp.sh
bash sample/kva_ddp.sh
After DDP sampling, more than 3125 videos will be generated for calculating the metrics.
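Before running the evaluation, you can verify that enough samples exist; a minimal sketch, assuming the DDP scripts write .mp4 files into a single output directory (adjust the path to match your sampling config):

```python
# Count generated videos before metric calculation; the output path below is a placeholder.
from pathlib import Path

out_dir = Path("/path/to/generated/video")
num_videos = len(list(out_dir.glob("*.mp4")))
print(f"{num_videos} generated videos found (need more than 3125 for the metrics)")
```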
The metrics we calculated on the Colonoscopic dataset are listed below:
Method | FVD↓ | CD-FVD↓ | FID↓ | IS↑ |
---|---|---|---|---|
StyleGAN-V | 2110.7 | 1032.8 | 226.14 | 2.12 |
LVDM | 1036.7 | 792.9 | 96.85 | 1.93 |
MoStGAN-V | 468.5 | 592.0 | 53.17 | 3.37 |
Endora | 460.7 | 545.3 | 13.41 | 3.90 |
FEAT-S (Ours) | 415.4 | 444.0 | 13.34 | 3.96 |
FEAT-L (Ours) | 351.1 | 397.0 | 12.31 | 4.01 |
Before calculating the metrics in our code, you may need the weights for several models, which can be downloaded from the following links:
- Inception v3 for calculating FID and IS.
- I3D for calculating FVD.
- VideoMAE for calculating CD-FVD.
You can also simply follow this part of the code in Endora to automatically download models from the internet for metric calculation.
To calculate the metrics, follow the steps below.
## FVD, FID and IS
CUDA_VISIBLE_DEVICES=gpu_id python process_data.py -s /path/to/generated/video -t /path/to/video/frames
cd /path/to/stylegan-v
CUDA_VISIBLE_DEVICES=gpu_id python ./src/scripts/calc_metrics_for_dataset.py \
--fake_data_path /path/to/video/frames \
--real_data_path /path/to/dataset/frames
## CD-FVD
CUDA_VISIBLE_DEVICES=gpu_id python calculate_cdfvd.py
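For reference, FVD, CD-FVD, and FID all reduce to a Fréchet distance between Gaussians fitted to real and generated features (I3D features for FVD, VideoMAE features for CD-FVD, Inception v3 features for FID). A minimal sketch of that final step, assuming the feature matrices have already been extracted:

```python
# Fréchet distance between Gaussian fits of two feature sets (the core of FID/FVD/CD-FVD).
import numpy as np
from scipy import linalg

def frechet_distance(feats_real: np.ndarray, feats_fake: np.ndarray) -> float:
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_f, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_f) - 2.0 * np.trace(covmean))
```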
Since our setup follows Endora, you can run the other methods in the same way as described in Endora.
Since our setup follows Endora, you can run the downstream task in the same way as described in Endora.
Method | Colonoscopic |
---|---|
Supervised-only | 74.5 |
LVDM | 76.2 |
Endora | 87.0 |
FEAT-S (ours) | 89.9 |
FEAT-L (ours) | 91.3 |
We greatly appreciate the tremendous efforts behind the following projects!
If you find FEAT useful in your research, please consider citing:
@article{wang2025feat,
  author  = {Huihan Wang and Zhiwen Yang and Hui Zhang and Dan Zhao and Bingzheng Wei and Yan Xu},
  title   = {FEAT: Full-Dimensional Efficient Attention Transformer for Medical Video Generation},
  journal = {arXiv preprint arXiv:2506.04956},
  year    = {2025}
}