Human Action Recognition with Vision Transformer (ViT) on HMDB Dataset

This project aims to perform human action recognition using a Vision Transformer (ViT) model fine-tuned on the HMDB (Human Motion Database) dataset. The HMDB dataset includes over 6,800 video clips spanning 51 action categories, such as running, eating, and waving, making it a comprehensive benchmark for human activity recognition. By extracting frames from videos, preprocessing them, and fine-tuning a ViT model, we aim to classify actions accurately with a target accuracy of 90%.

Dataset Preprocessing
 Extracted frames from each video in the HMDB dataset.
 Resized the frames to the input size expected by Vision Transformer (ViT).
 Applied data augmentation techniques to improve generalization, such as cropping, flipping, and normalization.
Loading the Vision Transformer Model
 Loaded a pre-trained ViT model suitable for image-based tasks.
 Modified the model’s final layers to match the number of classes in the HMDB dataset.
 Used high-level libraries such as hugging face etc.
Setting Up Training Configurations
 Chose an appropriate batch size and number of epochs for effective training.
 Set a suitable learning rate for fine-tuning the model on the dataset.
Checkpointing and Early Stopping
 Used checkpointing to save the best-performing model during training.
 Implemented early stopping based on validation performance to avoid overfitting.
Model Evaluation
 Evaluated the model’s accuracy on the test set and achieved 95% plus accuracy.

Prerequisites

Ensure the following are installed in your environment (these are available in Kaggle notebooks by default):

Python 3.8+
PyTorch 1.7+
Hugging Face’s transformers library
torchvision
pandas
opencv-python
matplotlib

pip install torch torchvision transformers pandas opencv-python matplotlib

Setup Instructions

Dataset Preparation:
- Upload the HMDB dataset to your Kaggle notebook workspace or local environment. Ensure that the dataset is structured with action class folders containing video frames (as per the project requirements).
Code Structure:
- Organize your code cells in the following order to run them sequentially:
  - Data Preprocessing: Preprocesses frames for model input.
  - Model Loading: Loads and customizes the ViT model.
  - Training Configuration: Sets batch size, epochs, learning rate, and optimizers.
  - Training with Checkpointing and Early Stopping: Saves the best model.
  - Evaluation and Visualization: Evaluates model accuracy and displays sample predictions.
Directory Setup:
- Ensure that your dataset is organized under /kaggle/working/hmdb_frames/ (or adjust train_data_path and val_data_path in the code to match your setup).
Running the Code:
- Execute each section in sequence in a Jupyter notebook environment such as Kaggle notebooks.

Usage

Training the Model:
- Run the training script to start model training. Training progress, including losses and accuracies, will be displayed for each epoch.
- The model will save the best checkpoint based on validation accuracy.
Model Evaluation:
- After training, evaluate the model’s performance using the test set.
- Run the visualization code to plot training/validation losses and visualize predictions for selected test samples.
Model Saving:
- The trained model will be saved as hmdb_vit_model.pth after training completes.

Known Issues

Dataset Structure:
- The HMDB dataset needs to be organized in a specific structure. If the dataset is not structured correctly, the HMDBDataset class may throw errors. Ensure frames are extracted into a parent folder with each action class as a sub-folder.
Memory Usage:
- Fine-tuning the ViT model is memory-intensive, especially on larger datasets. If running out of memory, consider reducing the batch size, image resolution, or model complexity (e.g., using a smaller ViT variant).
Frame Extraction Requirements:
- The cv2.VideoCapture function may fail on unsupported video formats. If you encounter issues during frame extraction, check that OpenCV is installed correctly and videos are in a compatible format (e.g., .mp4).
Validation and Test Set Accuracy:
- Achieving high validation accuracy (90% or above) on the HMDB dataset can be challenging due to its diversity. Fine-tuning parameters like learning rate, batch size, or data augmentation can help improve results.
Training Speed:
- Fine-tuning a ViT model on a large dataset like HMDB may take significant time. Consider using a GPU-based environment for faster training, as provided by Kaggle or Google Colab.

Acknowledgments

The HMDB dataset creators for the dataset used in this project.
Hugging Face for providing the Vision Transformer (ViT) model used in this implementation.

This README provides all necessary details for running and troubleshooting the code. Feel free to reach out with any questions or issues related to the code and dataset structure!

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
LICENSE		LICENSE
README.md		README.md
Vision_Transformer_HMDB_Dataset.ipynb		Vision_Transformer_HMDB_Dataset.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Human Action Recognition with Vision Transformer (ViT) on HMDB Dataset

Table of Contents

Project Overview

Prerequisites

Setup Instructions

Usage

Known Issues

Acknowledgments

About

Uh oh!

Releases

Packages

Languages

License

hisanusman/Vision-Transformer-for-Human-Activity-Recognition

Folders and files

Latest commit

History

Repository files navigation

Human Action Recognition with Vision Transformer (ViT) on HMDB Dataset

Table of Contents

Project Overview

Prerequisites

Setup Instructions

Usage

Known Issues

Acknowledgments

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages