Welcome to the AUSLAN Fingerspelling Real-Time Gesture Recognition project! This repository contains all the code and resources for building a real-time gesture recognition system for AUSLAN (Australian Sign Language) fingerspelling using machine learning and computer vision techniques.
- Project Overview
- Dataset Preparation
- Model Development
- Real-Time Gesture Recognition
- Results
- Installation and Usage
- Project Structure
- Contributing
- License
- Acknowledgements
The goal of this project is to develop a robust real-time gesture recognition system capable of recognizing AUSLAN fingerspelling gestures (letters A-Z). The system leverages machine learning techniques and computer vision to interpret hand gestures captured through a webcam and translate them into corresponding letters.
- Real-time gesture recognition using a webcam.
- Robust to variations in distance from the camera and hand sizes.
- High accuracy achieved through data augmentation and normalization techniques.
- Visual feedback displaying detected gestures and confidence levels.
We used MediaPipe's Holistic model to collect keypoint data from videos of individuals performing AUSLAN fingerspelling gestures. The keypoints include 3D coordinates of the hand landmarks.
- Recording Gestures:
- Captured multiple sequences (videos) for each letter (A-Z), with each sequence consisting of 30 frames.
- Ensured diversity by recording gestures from different individuals and varying conditions.
- Keypoint Extraction (see the sketch after this list):
- Used MediaPipe to extract left and right hand landmarks from each frame.
- Stored the extracted keypoints as NumPy arrays (.npy files) for efficient storage and processing.
- Efficiency: Keypoints require significantly less storage space than raw images.
- Focus on Relevant Data: Keypoints capture essential information for gesture recognition, removing background noise.
- Faster Processing: Working with numerical data speeds up training and inference.
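The snippet below is a minimal sketch of how this keypoint-extraction step can be implemented with MediaPipe Holistic and OpenCV. The helper name `extract_keypoints`, the capture loop, and the output path are illustrative assumptions rather than the exact contents of data_collection.py.

```python
import os

import cv2
import mediapipe as mp
import numpy as np

mp_holistic = mp.solutions.holistic

def extract_keypoints(results):
    """Flatten left/right hand landmarks (21 x 3 each), zero-filling a missing hand."""
    lh = (np.array([[lm.x, lm.y, lm.z] for lm in results.left_hand_landmarks.landmark]).flatten()
          if results.left_hand_landmarks else np.zeros(21 * 3))
    rh = (np.array([[lm.x, lm.y, lm.z] for lm in results.right_hand_landmarks.landmark]).flatten()
          if results.right_hand_landmarks else np.zeros(21 * 3))
    return np.concatenate([lh, rh])  # shape: (126,)

# Capture one 30-frame sequence for letter "A", sequence 0 (paths are illustrative).
os.makedirs("AUSLAN_Data/A/0", exist_ok=True)
cap = cv2.VideoCapture(0)
with mp_holistic.Holistic(min_detection_confidence=0.5,
                          min_tracking_confidence=0.5) as holistic:
    for frame_num in range(30):
        ok, frame = cap.read()
        if not ok:
            break
        results = holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        np.save(f"AUSLAN_Data/A/0/{frame_num}.npy", extract_keypoints(results))
cap.release()
```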
To make the model robust to variations in hand size and distance from the camera, we applied several preprocessing steps:
- Centering: Centered hand keypoints around the wrist landmark.
- Scaling: Scaled the keypoints based on the distance from the wrist to the middle finger MCP joint.
- Purpose: Ensures that the model focuses on the relative positions of landmarks rather than absolute positions.
- Handled missing landmarks by filling in zeros to maintain consistent input sizes.
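A minimal sketch of this normalization, assuming MediaPipe's hand-landmark indexing (0 = wrist, 9 = middle-finger MCP) and the 126-value layout used above; the helper names `normalize_hand` and `normalize_keypoints` are hypothetical:

```python
import numpy as np

def normalize_hand(hand_flat, eps=1e-6):
    """Center 21 hand landmarks on the wrist and scale by the wrist-to-middle-MCP distance."""
    if not hand_flat.any():                  # missing hand -> keep the zero fill
        return hand_flat
    pts = hand_flat.reshape(21, 3)
    centered = pts - pts[0]                  # landmark 0 is the wrist
    scale = np.linalg.norm(centered[9])      # landmark 9 is the middle-finger MCP
    return (centered / (scale + eps)).flatten()

def normalize_keypoints(keypoints):
    """Normalize the left hand (first 63 values) and right hand (last 63 values) independently."""
    return np.concatenate([normalize_hand(keypoints[:63]),
                           normalize_hand(keypoints[63:])])
```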
To enhance the model's ability to generalize, we applied data augmentation techniques:
- Noise Injection: Added Gaussian noise to simulate sensor inaccuracies.
- Scaling: Randomly scaled sequences to simulate varying distances from the camera.
- Time Warping: Altered the temporal dynamics to simulate variations in gesture speed.
- Augmentations were applied randomly with a 50% chance.
- The augmented data was combined with the original dataset, effectively doubling the size.
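The sketch below illustrates one way to implement these augmentations; the noise level, scaling range, and resampling-based time warping are illustrative assumptions, not the exact parameters used.

```python
import numpy as np

def augment_sequence(seq, rng=np.random):
    """Apply noise injection, scaling, and time warping, each with a 50% chance."""
    seq = seq.copy()                                    # shape: (30, 126)
    if rng.rand() < 0.5:                                # noise injection
        seq += rng.normal(0.0, 0.01, seq.shape)
    if rng.rand() < 0.5:                                # scaling (simulates camera distance)
        seq *= rng.uniform(0.9, 1.1)
    if rng.rand() < 0.5:                                # time warping via frame resampling
        idx = np.sort(rng.uniform(0, seq.shape[0] - 1, seq.shape[0]))
        seq = np.array([seq[int(round(i))] for i in idx])
    return seq

# Doubling the dataset: keep the originals and append one augmented copy of each sequence.
# X_augmented = np.concatenate([X, np.array([augment_sequence(s) for s in X])])
```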
We built a deep learning model using TensorFlow and Keras, focusing on capturing temporal patterns in the gesture sequences.
- Input Layer: Sequences of shape (30, 126), where 126 is the flattened size of the hand keypoints for both hands.
- LSTM Layers:
- Three LSTM layers with 64, 128, and 64 units respectively.
- Capture temporal dependencies in the data.
- Batch Normalization and Dropout:
- Applied after each LSTM layer to prevent overfitting and improve generalization.
- Dense Layers:
- A Dense layer with 64 units and ReLU activation.
- Output layer with softmax activation for classification into 26 classes.
- Used categorical cross-entropy loss and the Adam optimizer with a learning rate of 1e-4.
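A sketch of this architecture in Keras is shown below; the dropout rates are assumptions, while the layer sizes, activations, loss, and optimizer follow the configuration listed above.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_model(num_classes=26):
    """Three stacked LSTMs (64-128-64) with BatchNorm + Dropout, then Dense layers."""
    model = models.Sequential([
        layers.Input(shape=(30, 126)),           # 30 frames x 126 hand-keypoint values
        layers.LSTM(64, return_sequences=True),
        layers.BatchNormalization(),
        layers.Dropout(0.3),                     # dropout rate assumed
        layers.LSTM(128, return_sequences=True),
        layers.BatchNormalization(),
        layers.Dropout(0.3),
        layers.LSTM(64),                         # last LSTM returns only the final state
        layers.BatchNormalization(),
        layers.Dropout(0.3),
        layers.Dense(64, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```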
- Class weights: computed to handle class imbalance.
- ModelCheckpoint: Saves the best model based on validation loss.
- ReduceLROnPlateau: Reduces the learning rate when the validation loss plateaus.
- EarlyStopping: Stops training when the validation loss doesn't improve.
- Trained for 50 epochs with a batch size of 32.
- Used 80% of the data for training and 20% for validation.
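The training setup sketched below combines the class weights, callbacks, and hyperparameters listed above; variable names such as `X_train`/`y_val`, the checkpoint path, and the `factor`/`patience` values are assumptions.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint, ReduceLROnPlateau

def train(model, X_train, y_train, X_val, y_val):
    """Train with balanced class weights and the callbacks described above."""
    labels = y_train.argmax(axis=1)              # assumes one-hot encoded labels
    weights = compute_class_weight("balanced", classes=np.unique(labels), y=labels)
    class_weights = dict(zip(np.unique(labels), weights))

    callbacks = [
        ModelCheckpoint("models/best_model.keras", monitor="val_loss", save_best_only=True),
        ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=5),
        EarlyStopping(monitor="val_loss", patience=10, restore_best_weights=True),
    ]
    return model.fit(X_train, y_train,
                     validation_data=(X_val, y_val),
                     epochs=50, batch_size=32,
                     class_weight=class_weights,
                     callbacks=callbacks)
```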
- Variations in Distance from the Camera:
- Issue: Model performance degraded when the subject was far from the camera due to changes in keypoint scales.
- Solution: Normalized hand keypoints independently and applied scaling augmentation to simulate different distances.
- Varying Hand Sizes:
- Issue: Different hand sizes (e.g., children vs. adults) affected model performance.
- Solution: Normalized hand keypoints to make the model invariant to hand size and included diverse hand sizes in the dataset.
The real-time gesture recognition system consists of:
- Video Capture: Captures frames from the webcam using OpenCV.
- Hand Landmark Detection: Uses MediaPipe's Holistic model to detect hand landmarks.
- Preprocessing: Extracts and normalizes keypoints using the same methods as during training.
- Gesture Prediction: Feeds sequences of keypoints into the trained model to predict gestures.
- Visualization: Displays the webcam feed with overlayed landmarks, predicted gestures, and confidence bars.
- Video Recording: Records the real-time detection session for demonstration purposes.
- Sequence Management: Maintains a sliding window of the last 30 frames for prediction.
- Prediction Consistency: Uses a threshold and checks for consistent predictions over multiple frames to reduce noise.
- Visualization: Probability bars show the confidence for each gesture, and detected gestures are displayed on the screen.
- Recording: Utilizes OpenCV's VideoWriter to record the session.
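A condensed sketch of such a real-time loop is shown below. It reuses the `extract_keypoints` and `normalize_keypoints` helpers from the earlier sketches, and the confidence threshold, consistency window, codec, and display settings are illustrative assumptions rather than the exact values in real_time_detection.py.

```python
from collections import deque

import cv2
import mediapipe as mp
import numpy as np
import tensorflow as tf

model = tf.keras.models.load_model("models/best_model.keras")
actions = [chr(c) for c in range(ord("A"), ord("Z") + 1)]   # 26 letter classes

window = deque(maxlen=30)   # sliding window of the last 30 keypoint frames
recent = deque(maxlen=10)   # recent class predictions, used for the consistency check

cap = cv2.VideoCapture(0)
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
writer = cv2.VideoWriter("output.avi", cv2.VideoWriter_fourcc(*"XVID"), 20.0, (width, height))

with mp.solutions.holistic.Holistic(min_detection_confidence=0.5,
                                    min_tracking_confidence=0.5) as holistic:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        results = holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        # extract_keypoints / normalize_keypoints are the helpers from the sketches above
        window.append(normalize_keypoints(extract_keypoints(results)))

        if len(window) == window.maxlen:
            probs = model.predict(np.expand_dims(np.array(window), axis=0), verbose=0)[0]
            recent.append(int(np.argmax(probs)))
            # Display a letter only when it is confident and stable over the last 10 frames
            if probs.max() > 0.8 and len(recent) == recent.maxlen and len(set(recent)) == 1:
                cv2.putText(frame, actions[recent[-1]], (10, 40),
                            cv2.FONT_HERSHEY_SIMPLEX, 1.5, (0, 255, 0), 2)

        writer.write(frame)
        cv2.imshow("AUSLAN fingerspelling", frame)
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break

cap.release()
writer.release()
cv2.destroyAllWindows()
```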
- Test Accuracy: Achieved a test accuracy of 99%.
- Real-Time Performance: The model performs robustly in real time, accurately recognizing gestures despite variations in distance and hand size.
- Classification Report: precision, recall, and F1-score for each class.
- Confusion Matrix.
- Training and Validation Accuracy.
- Python 3.7 or higher
- Required Python packages:
- OpenCV (opencv-python)
- MediaPipe
- TensorFlow
- NumPy
- Matplotlib
- Scikit-learn
- A webcam for real-time detection
- Clone the Repository:
git clone https://github.com/yourusername/auslan-gesture-recognition.git
cd auslan-gesture-recognition
- Create a Virtual Environment (Optional but Recommended):
python -m venv venv
source venv/bin/activate   # On Windows: venv\Scripts\activate
- Install Dependencies:
pip install -r requirements.txt
- Download or Collect Data:
- If using pre-collected data, ensure it's placed in the AUSLAN_Data directory with the correct structure.
- To collect your own data, use the data_collection.py script.
- Train the Model and Run Real-Time Detection: open and run the auslan_sign_final_model.ipynb notebook.
- View the Recorded Video (Optional): The real-time detection script saves a video named output.avi. You can view this video to see the recorded session.
auslan-gesture-recognition/
├── AUSLAN_Data/ # Directory containing gesture data
│ ├── A/
│ │ ├── 0/
│ │ │ ├── 0.npy
│ │ │ ├── ...
│ │ ├── ...
│ ├── B/
│ ├── ...
├── models/
│ ├── best_model.keras # Saved trained model
├── data_collection.py # Script for data collection
├── train_model.py # Script for training the model
├── real_time_detection.py # Script for real-time gesture recognition
├── requirements.txt # List of required Python packages
├── README.md # Project documentation
└── LICENSE # License information
Thank you for your interest in this project! If you have any questions or suggestions, feel free to open an issue or contact me directly.
Contact:
- Email: khobragade.vaibhav8@gmail.com
- GitHub: Vaibhav Khobragade