This project performs human action recognition using a Vision Transformer (ViT) model fine-tuned on the HMDB (Human Motion Database) dataset. HMDB includes over 6,800 video clips spanning 51 action categories, such as running, eating, and waving, making it a comprehensive benchmark for human activity recognition. By extracting frames from videos, preprocessing them, and fine-tuning a ViT model, we aim to classify actions with a target accuracy of 90%.
The code for this project is divided into the following steps:
- **Dataset Preprocessing**
  - Extracted frames from each video in the HMDB dataset.
  - Resized the frames to the input size expected by the Vision Transformer (ViT).
  - Applied data augmentation techniques, such as cropping, flipping, and normalization, to improve generalization (see the preprocessing sketch after this list).
- **Loading the Vision Transformer Model**
  - Loaded a pre-trained ViT model suitable for image-based tasks.
  - Modified the model's final layers to match the number of classes in the HMDB dataset.
  - Used high-level libraries such as Hugging Face's `transformers` (see the fine-tuning sketch after this list).
- **Setting Up Training Configurations**
  - Chose an appropriate batch size and number of epochs for effective training.
  - Set a suitable learning rate for fine-tuning the model on the dataset.
- **Checkpointing and Early Stopping**
  - Used checkpointing to save the best-performing model during training.
  - Implemented early stopping based on validation performance to avoid overfitting.
- **Model Evaluation**
  - Evaluated the model's accuracy on the test set, achieving over 95% accuracy.
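
To make the preprocessing step concrete, here is a minimal sketch of frame extraction and the ViT input transforms. The sampling rate (`every_n`), the output file naming, and the 0.5 mean/std normalization constants (commonly paired with ViT checkpoints) are illustrative assumptions, not values taken from the project code:

```python
import os

import cv2
from torchvision import transforms

def extract_frames(video_path, out_dir, every_n=10):
    """Save every n-th frame of one video as a JPEG (sampling rate is illustrative)."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:  # end of video, or an unreadable/unsupported format
            break
        if idx % every_n == 0:
            cv2.imwrite(os.path.join(out_dir, f"frame_{saved:05d}.jpg"), frame)
            saved += 1
        idx += 1
    cap.release()
    return saved

# ViT-Base checkpoints expect 224x224 inputs.
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),   # cropping
    transforms.RandomHorizontalFlip(),   # flipping
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),  # normalization
])
```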
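And a sketch of the model loading, training configuration, checkpointing, and early stopping steps, assuming PyTorch `DataLoader`s named `train_loader` and `val_loader` built from the preprocessed frames. The checkpoint name and hyperparameters below are illustrative, not the project's actual settings:

```python
import torch
from transformers import ViTForImageClassification

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load a pre-trained ViT and swap its classification head for a 51-way head
# (one class per HMDB action). The checkpoint name is illustrative.
model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224",
    num_labels=51,
    ignore_mismatched_sizes=True,  # the original head has 1000 classes
).to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # illustrative hyperparameters
criterion = torch.nn.CrossEntropyLoss()
num_epochs, patience = 10, 3

best_val_acc = 0.0
epochs_without_improvement = 0
for epoch in range(num_epochs):
    model.train()
    for images, labels in train_loader:  # DataLoader assumed built earlier
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(pixel_values=images).logits, labels)
        loss.backward()
        optimizer.step()

    # Validation pass drives checkpointing and early stopping.
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for images, labels in val_loader:
            images, labels = images.to(device), labels.to(device)
            preds = model(pixel_values=images).logits.argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.size(0)
    val_acc = correct / total

    if val_acc > best_val_acc:
        best_val_acc = val_acc
        epochs_without_improvement = 0
        torch.save(model.state_dict(), "hmdb_vit_model.pth")  # keep the best checkpoint
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:  # early stopping
            break
```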
Ensure the following are installed in your environment (these are available in Kaggle notebooks by default):
- Python 3.8+
- PyTorch 1.7+
- Hugging Face's `transformers` library
- torchvision
- pandas
- opencv-python
- matplotlib

```bash
pip install torch torchvision transformers pandas opencv-python matplotlib
```
- **Dataset Preparation**
  - Upload the HMDB dataset to your Kaggle notebook workspace or local environment. Ensure that the dataset is structured with action class folders containing video frames (as per the project requirements).
- **Code Structure**
  - Organize your code cells in the following order to run them sequentially:
    - Data Preprocessing: preprocesses frames for model input.
    - Model Loading: loads and customizes the ViT model.
    - Training Configuration: sets batch size, epochs, learning rate, and optimizers.
    - Training with Checkpointing and Early Stopping: saves the best model.
    - Evaluation and Visualization: evaluates model accuracy and displays sample predictions.
- **Directory Setup**
  - Ensure that your dataset is organized under `/kaggle/working/hmdb_frames/` (or adjust `train_data_path` and `val_data_path` in the code to match your setup).
- **Running the Code**
  - Execute each section in sequence in a Jupyter notebook environment such as Kaggle notebooks.
- **Training the Model**
  - Run the training script to start model training. Training progress, including losses and accuracies, will be displayed for each epoch.
  - The model will save the best checkpoint based on validation accuracy.
- **Model Evaluation**
  - After training, evaluate the model's performance using the test set.
  - Run the visualization code to plot training/validation losses and visualize predictions for selected test samples (a minimal sketch follows this list).
- **Model Saving**
  - The trained model will be saved as `hmdb_vit_model.pth` after training completes.
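
For the evaluation and visualization step, a minimal sketch, assuming per-epoch `train_losses`/`val_losses` lists were collected during training and that a `test_loader`, the `model`, and the `device` from the training code are in scope:

```python
import matplotlib.pyplot as plt
import torch

# Loss curves collected during training (assumed per-epoch lists).
plt.plot(train_losses, label="training loss")
plt.plot(val_losses, label="validation loss")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.legend()
plt.show()

# Reload the best checkpoint and measure test-set accuracy.
model.load_state_dict(torch.load("hmdb_vit_model.pth"))
model.eval()
correct = total = 0
with torch.no_grad():
    for images, labels in test_loader:
        images, labels = images.to(device), labels.to(device)
        preds = model(pixel_values=images).logits.argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.size(0)
print(f"Test accuracy: {correct / total:.2%}")
```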
Notes and troubleshooting:
- **Dataset Structure**: The HMDB dataset must be organized in a specific structure; if it is not, the `HMDBDataset` class may throw errors. Ensure frames are extracted into a parent folder with each action class as a sub-folder (a minimal sketch of such a dataset class follows this list).
- **Memory Usage**: Fine-tuning the ViT model is memory-intensive, especially on larger datasets. If you run out of memory, consider reducing the batch size, image resolution, or model complexity (e.g., using a smaller ViT variant).
- **Frame Extraction Requirements**: The `cv2.VideoCapture` function may fail on unsupported video formats. If you encounter issues during frame extraction, check that OpenCV is installed correctly and that videos are in a compatible format (e.g., .mp4).
- **Validation and Test Set Accuracy**: Achieving high validation accuracy (90% or above) on the HMDB dataset can be challenging due to its diversity. Tuning the learning rate, batch size, or data augmentation can help improve results.
- **Training Speed**: Fine-tuning a ViT model on a large dataset like HMDB may take significant time. Consider using a GPU-backed environment, such as Kaggle or Google Colab, for faster training.
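
The project's `HMDBDataset` implementation is not reproduced here, but a minimal sketch of a frame-level dataset class matching the layout described above (a parent folder with one sub-folder per action class) might look like this:

```python
import os

from PIL import Image
from torch.utils.data import Dataset

class HMDBDataset(Dataset):
    """Frame-level dataset: a parent folder containing one sub-folder per
    action class, with extracted frames inside each sub-folder (a sketch,
    not the project's actual implementation)."""

    def __init__(self, root_dir, transform=None):
        self.transform = transform
        self.classes = sorted(os.listdir(root_dir))  # class = sub-folder name
        self.samples = []
        for label, cls in enumerate(self.classes):
            cls_dir = os.path.join(root_dir, cls)
            for fname in sorted(os.listdir(cls_dir)):
                self.samples.append((os.path.join(cls_dir, fname), label))

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        path, label = self.samples[idx]
        image = Image.open(path).convert("RGB")  # frames saved as JPEGs
        if self.transform:
            image = self.transform(image)
        return image, label
```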
Acknowledgments:
- The HMDB dataset creators for the dataset used in this project.
- Hugging Face for providing the Vision Transformer (ViT) model used in this implementation.
This README provides all necessary details for running and troubleshooting the code. Feel free to reach out with any questions or issues related to the code and dataset structure!