A Parameter-Efficient Adaptation of Pre-Trained Models for Action Recognition ...
Wele Gedara Chaminda Bandara, Vishal M Patel
Johns Hopkins University
Accepted at FG'24
Comparison of our Attention Prompt Tuning (APT) for videos action classification with other existing tuning methods: linear probing, adapter tuning, visual prompt tuning (VPT), and full fine-tuning.
Attention Prompt Tuning (APT) injects learnable prompts directly into the MHA unlike VPT.
Setup the virtual conda environment using the environment.yml:
conda env create -f environment.yml
Then activate the conda environment:
conda activate apt
We use VideoMAE pretrianed on Kinetics-400 dataset for our experiments.
The pre-trained models for ViT-Small and ViT-Base backbones can be downloaded from below links:
| Method | Extra Data | Backbone | Epoch | #Frame | Pre-train | 
|---|---|---|---|---|---|
| VideoMAE | no | ViT-S | 1600 | 16x5x3 | checkpoint | 
| VideoMAE | no | ViT-B | 1600 | 16x5x3 | checkpoint | 
If you need other pre-trained models please refer MODEL_ZOO.md.
We conduct experiments on three action recognition datasets: 1) UCF101 2) HMDB51 3) Something-Something-V2.
Please refer DATASETS.md for access to those links and pre-processing steps.
We provide example scripts to run the attention prompt tuning on UCF101, HMDB51, and SSv2 datasets in scripts/ folder.
Inside scripts/ you can find two folders which corresponds to APT finetuning with ViT-Small and ViT-Base architectures.
To fine-tune with APT you just need to execute finetune.sh file -- which will launch the job with distributed training by
For example, to fine-tune ViT-Base on SSv2 with APT, you may run:
sh scripts/ssv2/vit_base/finetune.sh
The finetune.sh looks like this:
# APT on SSv2
OUTPUT_DIR='experiments/APT/SSV2/ssv2_videomae_pretrain_base_patch16_224_frame_16x2_tube_mask_ratio_0.9_e2400/adam_mome9e-1_wd1e-5_lr5se-2_pl2_ps0_pe11_drop10'
DATA_PATH='datasets/ss2/list_ssv2/'
MODEL_PATH='experiments/pretrain/ssv2_videomae_pretrain_base_patch16_224_frame_16x2_tube_mask_ratio_0.9_e2400/checkpoint.pth'
NCCL_P2P_DISABLE=1 OMP_NUM_THREADS=1 CUDA_VISIBLE_DEVICES=0,1,3,4,5,6,7,8 python -m torch.distributed.launch --nproc_per_node=8 \
    run_class_apt.py \
    --model vit_base_patch16_224 \
    --transfer_type prompt \
    --prompt_start 0 \
    --prompt_end 11 \
    --prompt_num_tokens 2 \
    --prompt_dropout 0.1 \
    --data_set SSV2 \
    --nb_classes 174 \
    --data_path ${DATA_PATH} \
    --finetune ${MODEL_PATH} \
    --log_dir ${OUTPUT_DIR} \
    --output_dir ${OUTPUT_DIR} \
    --batch_size 8 \
    --batch_size_val 8 \
    --num_sample 2 \
    --input_size 224 \
    --short_side_size 224 \
    --save_ckpt_freq 10 \
    --num_frames 16 \
    --opt adamw \
    --lr 0.05 \
    --weight_decay 0.00001 \
    --epochs 100 \
    --warmup_epochs 10 \
    --test_num_segment 2 \
    --test_num_crop 3 \
    --dist_eval \
    --pin_mem \
    --enable_deepspeed \
    --prompt_reparam \
    --is_aa \
    --aa rand-m4-n2-mstd0.2-inc1
Here,
- OUTPUT_DIR: place where you wants to save the results (i.e., logs and checkpoints)
- DATA_PATH: path to where the dataset is stored
- MODEL_PATH: path to the downloaded videomae pre-trained model
- specifiy thich gpus (gpu ids) you wants to use for finetuning in CUDA_VISIBLE_DEVICES=...
- nproc_per_nodeis the number of gpus using for fine-tuning
- modelis the vit-base (vit_base_patch16_224) or vit-small (vit_small_patch16_224)
- transfer_typespecifies which finetuning method to use. 'random' means random initialization, 'end2end' means full end-to-end fine tuning, 'prompt' means APT (ours), 'linear' means linear probing
- prompt_startrefers to starting trasnformer block where you add attention prompts. 0 means you start adding learninable prompts from 1st transformer block in vit
- prompt_endrefers to ending trasformer block where you stop adding attention prompts. vit-base / vit-small has 12 transformer blocks. hence 11 here means you add prompts until last trasnformer block
- data_setspecifies the dataset
- 
- all the other parameters are hyperparamters related to apt fine-tuning.
 
If you think this project is helpful, please feel free to leave a star and cite our paper:
@misc{bandara2024attention,
      title={Attention Prompt Tuning: Parameter-efficient Adaptation of Pre-trained Models for Spatiotemporal Modeling}, 
      author={Wele Gedara Chaminda Bandara and Vishal M. Patel},
      year={2024},
      eprint={2403.06978},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}This repocitory is built on top of VideoMAE: https://github.com/MCG-NJU/VideoMAE codebase and we approcite the authors of VideoMAE for making their codebase publically available.

