
RecSal-Net

This repository provides the official implementation of RecSal-Net, introduced in our paper:

ChaeEun Woo, SuMin Lee, Soo Min Park, and Byung Hyung Kim, “RecSal-Net: Recursive Saliency Network for Video Saliency Prediction,” Neurocomputing, 2025. [pdf] [link]

RecSal-Net is a recursive transformer architecture for Video Saliency Prediction (VSP) that couples a transformer-based encoder with a recursive feature-integration mechanism.

Network structure of RecSal-Net

Fig. 1. RecSal-Net structure

The overall architecture of RecSal-Net. (a) The RecSal-Net model, including a transformer-based encoder, recursive blocks, and a decoder. (b) The recursive block, which iteratively refines multi-scale spatiotemporal features.
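For readers who prefer code, the sketch below illustrates the recursive-refinement idea of Fig. 1(b) in PyTorch. It is a minimal illustration, not the implementation in model.py: the class name, layer choices, and number of recursion steps are assumptions.

```python
import torch
import torch.nn as nn


class RecursiveBlockSketch(nn.Module):
    """Illustrative recursive refinement block (hypothetical; see model.py for the real one).

    The same refinement weights are applied repeatedly, so each iteration
    takes the previous output as input and further sharpens the feature map.
    """

    def __init__(self, channels: int, num_steps: int = 3):
        super().__init__()
        self.num_steps = num_steps
        # Shared weights: recursion reuses the same parameters at every step.
        self.refine = nn.Sequential(
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm3d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: spatiotemporal encoder features of shape (B, C, T, H, W).
        out = x
        for _ in range(self.num_steps):
            # Residual recursion: refine and fold the result back in.
            out = out + self.refine(out)
        return out


if __name__ == "__main__":
    feats = torch.randn(1, 96, 4, 28, 28)    # dummy encoder features
    block = RecursiveBlockSketch(channels=96)
    print(block(feats).shape)                # torch.Size([1, 96, 4, 28, 28])
```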

Prepare the Python virtual environment

Please create an Anaconda virtual environment by:

$ conda create -n RS python=3.8 -y

Activate the virtual environment by:

$ conda activate RS

Install the requirements by:

$ pip3 install -r requirements.txt

Run the code

Please download the pre-trained VST (Video Swin Transformer) weights here and the DHF1K dataset here, then arrange the project as follows:

Project/
│
├── saved_models/
│   └── RecSalNet.pth
│
├── data/
│   └── DHF1K/
│       ├── train/
│       └── val/
│
├── dataloader.py
├── loss.py
├── model.py
├── swin_transformer.py
├── test.py
├── train.py
├── utils.py
├── requirements.txt
└── swin_small_patch244_window877_kinetics400_1k.pth

You can start training by:

$ python3 train.py

The trained model checkpoints will be saved in the saved_models folder.
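If you want to reload a saved checkpoint yourself (for fine-tuning or inspection), something along the following lines should work. The class name RecSalNet imported from model.py and the assumption that the .pth file stores a plain state_dict are guesses based on the project layout above.

```python
import torch

from model import RecSalNet  # class name assumed; check model.py

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = RecSalNet().to(device)

# Load the weights written by train.py (assumed to be a plain state_dict).
state_dict = torch.load("saved_models/RecSalNet.pth", map_location=device)
model.load_state_dict(state_dict)
model.eval()
```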

After training is complete, you can use test.py to generate the predicted saliency maps and compute all evaluation metrics by:

$ python3 test.py
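The tables below report AUC_J, SIM, s-AUC, CC, and NSS, which are the standard video saliency metrics. As a reference for how three of them are commonly computed (not necessarily the exact code in utils.py), here is a NumPy sketch:

```python
import numpy as np


def cc(pred: np.ndarray, gt: np.ndarray) -> float:
    """Linear Correlation Coefficient between predicted and ground-truth saliency maps."""
    p = (pred - pred.mean()) / (pred.std() + 1e-8)
    g = (gt - gt.mean()) / (gt.std() + 1e-8)
    return float((p * g).mean())


def nss(pred: np.ndarray, fixations: np.ndarray) -> float:
    """Normalized Scanpath Saliency: mean normalized saliency at fixated pixels."""
    p = (pred - pred.mean()) / (pred.std() + 1e-8)
    return float(p[fixations > 0].mean())


def sim(pred: np.ndarray, gt: np.ndarray) -> float:
    """Similarity: histogram intersection of the two maps, each normalized to sum to 1."""
    p = pred / (pred.sum() + 1e-8)
    g = gt / (gt.sum() + 1e-8)
    return float(np.minimum(p, g).sum())
```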

Results

Table 1. Quantitative comparison on the DHF1K dataset. The best result is marked in bold.

Model        AUC_J↑  SIM↑   s-AUC↑  CC↑    NSS↑
DeepVS       0.856   0.256  0.583   0.344  1.911
ACLNet       0.890   0.315  0.601   0.434  2.354
SalEMA       0.890   0.466  0.667   0.449  2.574
STRA-Net     0.895   0.355  0.663   0.458  2.558
TASED-Net    0.895   0.361  0.712   0.470  2.667
Chen et al.  0.900   0.353  0.680   0.476  2.685
SalSAC       0.896   0.357  0.697   0.479  2.673
UNISAL       0.901   0.390  0.691   0.490  2.776
HD2S         0.908   0.406  0.700   0.503  2.812
ViNet        0.908   0.381  0.729   0.511  2.872
ECANet       0.903   0.385  0.717   0.500  2.814
TSFP-Net     0.912   0.392  0.723   0.517  2.967
STSANet      0.913   0.383  0.723   0.529  3.010
GFNet        0.913   0.379  0.723   0.529  2.995
Ours         0.913   0.414  0.728   0.547  3.135

Table 2. Quantitative comparison on the Hollywood-2 dataset. The best result is marked in bold.

Model        AUC_J↑  SIM↑   CC↑    NSS↑
DeepVS       0.887   0.356  0.446  2.313
ACLNet       0.890   0.542  0.623  3.086
SalEMA       0.919   0.487  0.613  3.186
STRA-Net     0.923   0.487  0.662  3.478
TASED-Net    0.918   0.507  0.646  3.302
Chen et al.  0.928   0.537  0.661  3.804
SalSAC       0.931   0.529  0.670  3.356
UNISAL       0.934   0.543  0.673  3.901
HD2S         0.936   0.551  0.670  3.352
ViNet        0.930   0.550  0.693  3.730
ECANet       0.929   0.526  0.673  3.380
TSFP-Net     0.936   0.571  0.711  3.910
STSANet      0.938   0.579  0.721  3.927
GFNet        0.938   0.585  0.719  3.952
Ours         0.938   0.606  0.737  4.061

Table 3. Quantitative comparison on the UCF Sports dataset. The best result is marked in bold.

Model        AUC_J↑  SIM↑   CC↑    NSS↑
DeepVS       0.870   0.321  0.405  2.089
ACLNet       0.897   0.406  0.510  2.567
SalEMA       0.906   0.431  0.544  2.638
STRA-Net     0.910   0.479  0.593  3.018
TASED-Net    0.899   0.469  0.582  2.920
Chen et al.  0.917   0.494  0.599  3.406
SalSAC       0.926   0.534  0.671  3.523
UNISAL       0.918   0.523  0.644  3.381
HD2S         0.904   0.507  0.604  3.114
ViNet        0.924   0.522  0.673  3.620
ECANet       0.917   0.498  0.636  3.189
TSFP-Net     0.923   0.561  0.685  3.698
STSANet      0.936   0.560  0.705  3.908
GFNet        0.933   0.544  0.694  3.723
Ours         0.933   0.557  0.698  3.769

Cite

Please cite our paper if you use our code in your own work:

@article{woo2025recsal,
  title={RecSal-Net: Recursive Saliency Network for video saliency prediction},
  author={Woo, ChaeEun and Lee, SuMin and Park, Soo Min and Kim, Byung Hyung},
  journal={Neurocomputing},
  volume={650},
  pages={130822},
  year={2025}
}
