
Enhancing Action Recognition with Advanced Frame Extraction Techniques

Presentation Slides

Final Report

Project Overview

Video action recognition is a fundamental problem in computer vision with diverse applications such as surveillance, healthcare, and sports analytics. This project tackles the challenges of high computational demands and reliance on extensive annotated datasets by proposing a resource-efficient framework for video classification. Our streamlined approach optimizes data processing while preserving the critical features needed for accurate action recognition, making it suitable for real-world, resource-constrained scenarios.

Approach and Methodology

Frame Extraction

To reduce video complexity and size, we employ a hybrid frame selection technique (a minimal sketch follows this list):

  • ORB (Oriented FAST and Rotated BRIEF): Efficiently captures key points and motion dynamics.


  • SIFT (Scale-Invariant Feature Transform): Preserves spatial features across frames.

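The following is a minimal sketch of the general idea: score every frame by the number of detected keypoints and keep only the highest-scoring fraction. The function names and the keep_ratio parameter are illustrative, not the repository's actual API.

    import cv2

    def score_frames(video_path: str, use_sift: bool = False) -> list[tuple[int, int]]:
        """Score every frame by the number of detected keypoints (ORB by default, SIFT optionally)."""
        detector = cv2.SIFT_create() if use_sift else cv2.ORB_create(nfeatures=500)
        cap = cv2.VideoCapture(video_path)
        scores, idx = [], 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            scores.append((idx, len(detector.detect(gray, None))))
            idx += 1
        cap.release()
        return scores

    def select_keyframes(video_path: str, keep_ratio: float = 0.5) -> list[int]:
        """Keep the keep_ratio fraction of frames with the most keypoints, in temporal order."""
        scores = score_frames(video_path)
        n_keep = max(1, int(len(scores) * keep_ratio))
        top = sorted(scores, key=lambda s: s[1], reverse=True)[:n_keep]
        return sorted(idx for idx, _ in top)

With keep_ratio=0.5, only half of the frames survive, which is the source of the roughly 50% dataset compression reported below.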

Deep Learning Models

We leverage the following advanced architectures:

  • VideoMAE: A transformer-based video model pre-trained via self-supervised masked autoencoding, enabling data-efficient action recognition (see the inference sketch after this list).


  • (2+1)D Convolutions: Factorize a 3D convolution into a 2D spatial convolution followed by a 1D temporal convolution for robust video representation (see the sketch after this list).

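As an illustration, a pre-trained VideoMAE classifier can be applied to a clip with the Hugging Face transformers API roughly as follows; the checkpoint name and the 16-frame clip length are assumptions, not necessarily the configuration used in this project.

    import numpy as np
    import torch
    from transformers import VideoMAEForVideoClassification, VideoMAEImageProcessor

    # Assumed public checkpoint; this project may instead use weights fine-tuned on UCF50.
    checkpoint = "MCG-NJU/videomae-base-finetuned-kinetics"
    processor = VideoMAEImageProcessor.from_pretrained(checkpoint)
    model = VideoMAEForVideoClassification.from_pretrained(checkpoint)

    # 16 RGB frames, e.g. the keyframes selected above (placeholder random frames here).
    frames = [np.random.randint(0, 255, (224, 224, 3), dtype=np.uint8) for _ in range(16)]

    inputs = processor(frames, return_tensors="pt")  # pixel_values: (1, 16, 3, 224, 224)
    with torch.no_grad():
        logits = model(**inputs).logits              # (1, num_classes)
    print(model.config.id2label[logits.argmax(-1).item()])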

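And a minimal PyTorch sketch of the (2+1)D factorization described by Tran et al. [7]; layer sizes are illustrative and the project's own implementation may differ.

    import torch
    import torch.nn as nn

    class Conv2Plus1D(nn.Module):
        """A 3D convolution factorized into a 2D spatial conv followed by a 1D temporal conv."""
        def __init__(self, in_channels: int, out_channels: int, mid_channels: int):
            super().__init__()
            self.block = nn.Sequential(
                # Spatial convolution: kernel 1x3x3 over (time, height, width)
                nn.Conv3d(in_channels, mid_channels, kernel_size=(1, 3, 3), padding=(0, 1, 1), bias=False),
                nn.BatchNorm3d(mid_channels),
                nn.ReLU(inplace=True),
                # Temporal convolution: kernel 3x1x1
                nn.Conv3d(mid_channels, out_channels, kernel_size=(3, 1, 1), padding=(1, 0, 0), bias=False),
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, channels, time, height, width)
            return self.block(x)

    x = torch.randn(2, 3, 16, 112, 112)                   # 2 clips, 3 channels, 16 frames, 112x112
    print(Conv2Plus1D(3, 64, mid_channels=45)(x).shape)   # torch.Size([2, 64, 16, 112, 112])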

Dataset

We use the UCF50 dataset, a video action recognition benchmark consisting of 6,618 video clips covering 50 human action categories. It was introduced by the University of Central Florida in 2012 to facilitate research in video understanding and human action recognition.

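UCF50 ships as one directory per action class, so indexing the clips and assigning labels can be done roughly as below; the data/raw/UCF50 path is an assumption about where the archive is unpacked.

    from pathlib import Path

    def index_ucf50(root: str = "data/raw/UCF50") -> list[tuple[Path, int]]:
        """Return (clip_path, class_index) pairs, with indices assigned alphabetically by class folder."""
        root_path = Path(root)
        classes = sorted(d.name for d in root_path.iterdir() if d.is_dir())
        class_to_idx = {name: i for i, name in enumerate(classes)}
        return [
            (clip, class_to_idx[clip.parent.name])
            for clip in sorted(root_path.glob("*/*.avi"))
        ]

    samples = index_ucf50()
    print(f"{len(samples)} clips across {len(set(label for _, label in samples))} classes")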


Performance Highlights

  • Achieved 50% dataset compression while preserving action recognition accuracy.


  • Reduced training time and computational cost with the hybrid approach.

[Figures: VideoMAE evaluation accuracy by iterations and by time]


Project Structure

Project Root
├── .dvc/
│   ├── .gitignore          # DVC configuration files and cache
│   ├── cache/              # DVC cache directory
│   ├── config              # DVC configuration file
│   └── tmp/                # Temporary files for DVC
├── .dvcignore             # Patterns for files DVC should ignore
├── .gitignore             # Git ignore file
├── data/
│   ├── processed/         # Processed data files
│   ├── raw/               # Raw data files
│   └── README.md          # Documentation for data folder
├── deployment/
│   ├── demo.py            # Script for running the demo
│   ├── README.md          # Documentation for deployment
│   └── videos/            # Sample videos for testing
├── models/                # Models storage folder
│
├── notebooks/             # Jupyter notebooks for experimentation and analysis
├── papers/                # Research papers and related documents
├── poetry.lock            # Poetry lock file for dependencies
├── pyproject.toml         # Project configuration file for Poetry
├── README.md              # Main project documentation
├── reports/               # Generated reports and analysis results
├── requirements.txt       # Python dependencies
├── scripts/               # Utility scripts for various tasks
└── src/                   # Source code for the project

Dependencies Setup

To set up and run the project locally, follow these steps:

Note: This project requires Python 3.11 or newer.

  1. Clone the repository:

    git clone git@github.com:IVproger/CV_VideoClassification_project.git
    cd CV_VideoClassification_project
  2. Create a virtual environment (optional but recommended):

    python -m venv .venv
    source .venv/bin/activate  # On Windows use `.venv\Scripts\activate`
  3. Install the required dependencies:

    pip install -Ur requirements.txt
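Alternatively, since the repository includes a pyproject.toml and poetry.lock, the dependencies can presumably also be installed with Poetry from the project root:

    poetry install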

Project Artifacts

The main project artifacts are located on Google Drive.

References

  1. E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, “ORB: An efficient alternative to SIFT or SURF,” Proceedings of the IEEE International Conference on Computer Vision, pp. 2564–2571, Nov. 2011. doi: 10.1109/ICCV.2011.6126544.

  2. E. Karami, S. Prasad, and M. Shehata, “Image Matching Using SIFT, SURF, BRIEF and ORB: Performance Comparison for Distorted Images,” arXiv preprint, 2017. Available: https://arxiv.org/abs/1710.02726.

  3. A. Hussain, T. Hussain, W. Ullah, and S. W. Baik, “Vision Transformer and Deep Sequence Learning for Human Activity Recognition in Surveillance Videos,” Computational Intelligence and Neuroscience, vol. 2022, no. 1, p. 3454167, 2022. doi: 10.1155/2022/3454167.

  4. S. N. Gowda, M. Rohrbach, and L. Sevilla-Lara, “SMART Frame Selection for Action Recognition,” arXiv preprint, 2020. Available: https://arxiv.org/abs/2012.10671.

  5. A. Makandar, D. Mulimani, and M. Jevoor, “Preprocessing Step-Review of Key Frame Extraction Techniques for Object Detection in Video,” 2015. Available: https://api.semanticscholar.org/CorpusID:40113593.

  6. M. Mao, A. Lee, and M. Hong, “Deep Learning Innovations in Video Classification: A Survey on Techniques and Dataset Evaluations,” Electronics, vol. 13, p. 2732, Jul. 2024. doi: 10.3390/electronics13142732.

  7. D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri, “A Closer Look at Spatiotemporal Convolutions for Action Recognition,” arXiv preprint, 2018. Available: https://arxiv.org/abs/1711.11248.

  8. Z. Tong, Y. Song, J. Wang, and L. Wang, “VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training,” arXiv preprint, 2022. Available: https://arxiv.org/abs/2203.12602.

Team Members
