UAM-FV-VS: Unified Attention Modeling for Efficient Free-Viewing and Visual Search via Shared Representations
This repository provides the official implementation for Unified Attention Modeling for Efficient Free-Viewing and Visual Search via Shared Representations, ICDL 2025.
This work extends the HAT model by introducing a unified attention modeling framework with shared representations for free-viewing and target-present visual search tasks.
Follow the HAT installation guide:
- Create a Conda Environment:

  ```bash
  conda create -n uam python=3.10 -y
  conda activate uam
  ```

- Install PyTorch with CUDA 11.8:

  ```bash
  python -m pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu118
  ```

- Install Additional Dependencies:

  ```bash
  python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'
  python -m pip install wget timm pytz
  ```

- Build MSDeformableAttention:

  ```bash
  cd ./hat/pixel_decoder/ops
  sh make.sh
  ```

- Download Pretrained Weights & HAT Checkpoints (used to train the TP branch utilizing some layers pretrained on FV):

  ```bash
  cd -
  python download.py
  ```
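As an optional sanity check (an illustrative snippet, not part of the repository), you can confirm the expected PyTorch/CUDA build and that detectron2 imports cleanly:

```python
# Illustrative sanity check, not part of the repo.
import torch
import detectron2  # noqa: F401 -- import check only

print(torch.__version__)          # expected: 2.0.1+cu118
print(torch.cuda.is_available())  # True on a machine with a CUDA-capable GPU and driver
```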
- For COCO-Search18 and COCO-FreeView:
  - Download the datasets and place them in the `/Datasets/` directory.
  - Download the Dataloaders folder containing the `.pkl` files for training the target-present (TP) branch or the free-viewing (FV) branch, and place it in the main project directory.
- For our additional collected data to test the model:
  - Download the dataset and place it in the `/Datasets/` directory.
- The final folder structure should look like this:
  ```
  /YourProject/
  ├── Dataloaders/
  │   ├── data_FV_loader/
  │   └── data_TP_loader/
  └── Datasets/
      ├── COCO-Search18 and COCO-Freeview/
      │   ├── images/
      │   ├── images_with_fixs/
      │   ├── semantic_seq_full/
      │   ├── bbox_annos.npy
      │   ├── clusters.npy
      │   ├── coco_freeview_fixations_512x320.json
      │   ├── coco_search_fixations_512x320_on_target_allvalid.json
      │   ├── M2F_R50_MSDeformAttnPixelDecoder.pkl
      │   ├── M2F_R50.pkl
      │   ├── resnet50.yaml
      │   └── scene_label_dict.npy
      └── extra_dataset/
          ├── annotations/
          ├── images/
          └── README.md
  ```
- Download the pretrained weights for the different shared configurations of the unified model, and place them in the `/checkpoints/` directory.
- The final folder structure should look like this:
  ```
  /YourProject/
  └── checkpoints/
      ├── Final_checkpoints/
      │   ├── ES_1_5.pt
      │   ├── ES_2_4.pt
      │   ├── ES_3_3.pt
      │   ├── ES_4_2.pt
      │   ├── ES_5_1.pt
      │   └── LS.pt
      └── HAT_checkpoints/
          ├── HAT_FV.pt
          └── HAT_TP.pt
  ```
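Optionally, the sketch below (illustrative, not part of the repository) checks that a few of the expected paths from the folder structures above are in place and peeks inside one HAT checkpoint. It assumes the project root as the working directory and that the `.pt` files are standard PyTorch checkpoints (a `state_dict` or a dict wrapping one):

```python
from pathlib import Path
import torch

# Illustrative check: confirm a few of the expected paths from the trees above.
expected = [
    "Dataloaders/data_FV_loader",
    "Dataloaders/data_TP_loader",
    "Datasets/COCO-Search18 and COCO-Freeview/images",
    "Datasets/COCO-Search18 and COCO-Freeview/coco_freeview_fixations_512x320.json",
    "checkpoints/Final_checkpoints/LS.pt",
    "checkpoints/HAT_checkpoints/HAT_FV.pt",
]
missing = [p for p in expected if not Path(p).exists()]
print("missing:", missing or "none")

# Peek inside a checkpoint on CPU, assuming a state_dict or a dict wrapping one.
ckpt = torch.load("checkpoints/HAT_checkpoints/HAT_FV.pt", map_location="cpu")
state = ckpt.get("model", ckpt) if isinstance(ckpt, dict) else ckpt
if isinstance(state, dict):
    print(list(state.keys())[:5])  # first few parameter names
```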
The following configuration variables are added to the config files in `./configs/` to support the different shared-representation training setups:

| Variable Name | Type | Description |
|---|---|---|
| `branch` | String | Determines which branch to train. Options: `TP` (target-present) or `FV` (free-viewing). |
| `use_HAT_FV_weights` | Boolean | Used to train the TP branch utilizing some layers pretrained on FV. Set to `true` to initialize the shared layers from the HAT FV pretrained weights (and set `checkpoint` to `./checkpoints/HAT_checkpoints/HAT_FV.pt`). Set to `false` when resuming training from a saved checkpoint, and set `checkpoint` to the saved checkpoint path. |
| `shared_config` | String | Controls the shared-layer configuration. Options: `None` (no shared representation; train the whole pixel decoder), `LS` (the whole pixel decoder is fixed), `ES_5_1` (only the last layer is task-specific), `ES_4_2` (last two layers task-specific), `ES_3_3` (last three layers task-specific), `ES_2_4` (last four layers task-specific), `ES_1_5` (last five layers task-specific). |
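For example, the sketch below shows how these variables could be set to train the TP branch with the `ES_3_3` sharing scheme. It is illustrative only: the filenames are hypothetical, and the variables are assumed to sit at the top level of the JSON config, so check the actual files in `./configs/` for the exact structure.

```python
import json

# Illustrative sketch; filenames are hypothetical and the flat-key layout is assumed.
with open("./configs/my_base_config.json") as f:       # hypothetical base config
    hparams = json.load(f)

hparams["branch"] = "TP"               # train the target-present branch
hparams["use_HAT_FV_weights"] = True   # initialize shared layers from HAT FV weights
hparams["checkpoint"] = "./checkpoints/HAT_checkpoints/HAT_FV.pt"
hparams["shared_config"] = "ES_3_3"    # last three pixel-decoder layers task-specific

with open("./configs/my_TP_ES_3_3.json", "w") as f:    # hypothetical output config
    json.dump(hparams, f, indent=2)
```

The edited config can then be passed to `train.py` through `--hparams`, as in the commands below.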
Run the demo code on your test image.

- To train, run this command:

  ```bash
  python train.py --hparams ./configs/coco_freeview_dense_SSL_train.json --dataset-root <COCO_dataset_root>
  ```

- To evaluate, run this command:

  ```bash
  python train.py --hparams ./configs/coco_freeview_dense_SSL_eval.json --dataset-root <COCO_dataset_root> --eval-only
  ```

- To evaluate on the additional collected data, run this command:

  ```bash
  python train.py --hparams ./configs/extradata_config_eval.json --dataset-root <extra_dataset_root> --eval-only
  ```
- For training on your own dataset, please follow the detailed instructions provided in the HAT repo.
If you use this repository in your work, please cite the following paper:
```bibtex
@article{mohammed2025unified,
  title={Unified Attention Modeling for Efficient Free-Viewing and Visual Search via Shared Representations},
  author={Mohammed, Fatma Youssef and Alexis, Kostas},
  journal={arXiv preprint arXiv:2506.02764},
  year={2025}
}
```
For questions or support, please open an issue on GitHub or contact the authors directly: