This repository contains the code for the Bionic Robot Hand Demo, a hand gesture recognition system using a bionic robot hand. The system uses a YOLO‐Pose model for hand landmark detection and controls the servomotors of the bionic hand based on the detected gestures.
The demo is designed to run on the Astra SL1680, part of the Synaptics Astra family of SoCs. The model is quantized to a mixed-precision format to optimize performance on the NPU (Neural Processing Unit) accelerator.
demo_setup.mp4
The project consists of two main components: a backend application that runs the inference and controls the robot hand, and a frontend application that displays the results in a kiosk window. The backend and frontend communicate over a WebSocket using a defined JSON API.
The frontend application is a simple HTML page that displays the results of the inference in real time. It uses WebSocket to receive the JSON payload from the backend and updates the UI accordingly. The UI shows the detected hand landmarks, the bounding box, and game-related information such as the countdown timer and the result of the game.
demo_ui.mp4
The backend application is written in Python, and uses the SynapRT framework for running the YOLO‐Pose model inference. The main loop of the application is designed to continuously capture images from the camera, perform inference using the YOLO‐Pose model, and control the servomotors of the bionic hand based on the detected gestures.
Under the hood, the main function first spins up both an HTTP server and a WebSocket server, then launches the SynapRT inference pipeline on its own thread, and finally enters a tight infinite loop. On each cycle it polls the pipeline for new detection results, computes running averages of inference time, frame rate, and power usage, feeds the detected hand flexions into the rock-paper-scissors state machine, packages everything into a JSON payload, and broadcasts it over the WebSocket.
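A minimal asyncio-based sketch of that loop is shown below. The `pipeline` and `game` objects, their methods, and the payload field names are illustrative assumptions rather than the actual backend API, and the real application drives the pipeline on its own thread rather than in this simplified coroutine.

```python
# Illustrative sketch only: pipeline, game, and payload fields are hypothetical.
import asyncio
import json
import time
from collections import deque

import websockets

CLIENTS = set()                   # connected WebSocket clients
INFER_TIMES = deque(maxlen=30)    # moving window for averaged statistics


async def ws_handler(websocket):
    """Register a client and keep the connection open until it closes."""
    CLIENTS.add(websocket)
    try:
        await websocket.wait_closed()
    finally:
        CLIENTS.discard(websocket)


async def main_loop(pipeline, game):
    async with websockets.serve(ws_handler, "0.0.0.0", 8765):
        while True:
            result = pipeline.poll()              # latest detection, or None
            if result is not None:
                INFER_TIMES.append(result.inference_ms)
                game.update(result.flexions)      # feed flexions to the RPS state machine
                payload = {
                    "landmarks": result.landmarks,
                    "bbox": result.bbox,
                    "flexions": result.flexions,
                    "inference_ms": sum(INFER_TIMES) / len(INFER_TIMES),
                    "game": game.state(),         # e.g. countdown, result
                    "timestamp": time.time(),
                }
                websockets.broadcast(CLIENTS, json.dumps(payload))
            await asyncio.sleep(0.01)             # yield to the event loop
```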
When a hand is detected for more than 10 seconds, the system switches to the rock, paper, scissors game mode. The robot hand stops mirroring the user's hand and instead performs a random gesture. The outcome of the game is determined by comparing the user's gesture with the robot's gesture. After the result is displayed, the system returns to the normal operation mode, where the robot hand mimics the user's hand.
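The outcome comparison itself reduces to a small lookup. The sketch below uses hypothetical gesture names and a helper function that are not taken from the actual code base.

```python
# Hedged sketch of the rock-paper-scissors outcome logic.
import random

GESTURES = ("rock", "paper", "scissors")
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}


def play_round(player_gesture: str) -> tuple[str, str]:
    """Pick a random gesture for the robot hand and score the round."""
    robot_gesture = random.choice(GESTURES)
    if player_gesture == robot_gesture:
        outcome = "draw"
    elif BEATS[player_gesture] == robot_gesture:
        outcome = "player_wins"
    else:
        outcome = "robot_wins"
    return robot_gesture, outcome
```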
The bionic robot hand is driven by five servomotors, one for each finger. The servomotors are controlled using a PCA9685 expansion board, which is connected to the Astra SL1680 via an I2C interface.
The hardware control has been abstracted into separate modules with increasing levels of abstraction. The lowest level is the `servo.py` module, which interacts with the kernel driver of the PCA9685 servomotor controller via the sysfs interface.
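For illustration, a PCA9685 channel exposed through the standard Linux PWM sysfs interface could be driven roughly as follows. The chip path, pulse-width range, and function name are assumptions, not the actual `servo.py` API.

```python
# Minimal sketch of driving one PCA9685 channel via the Linux PWM sysfs interface.
# The chip path and the 1-2 ms pulse range are assumptions.
from pathlib import Path

PWM_CHIP = Path("/sys/class/pwm/pwmchip0")   # assumed pwm-pca9685 chip
PERIOD_NS = 20_000_000                       # 50 Hz servo refresh period
MIN_PULSE_NS = 1_000_000                     # 1 ms pulse -> 0 % flexion (assumed)
MAX_PULSE_NS = 2_000_000                     # 2 ms pulse -> 100 % flexion (assumed)


def set_flexion(channel: int, flexion_pct: float) -> None:
    """Map a 0-100 % flexion value to a pulse width and write it via sysfs."""
    pwm = PWM_CHIP / f"pwm{channel}"
    if not pwm.exists():
        (PWM_CHIP / "export").write_text(str(channel))
    duty = int(MIN_PULSE_NS + (MAX_PULSE_NS - MIN_PULSE_NS) * flexion_pct / 100)
    (pwm / "period").write_text(str(PERIOD_NS))
    (pwm / "duty_cycle").write_text(str(duty))
    (pwm / "enable").write_text("1")
```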
The next level is the `hand.py` module, which contains the high-level controllers for the servomotors. It takes raw 3D hand landmarks, filters out low-confidence points, computes inter-joint angles via vector math, smooths those angles over a moving window, and normalizes them into 0-100% flexion values. Based on these flexions, it either drives the servos to mirror the user's hand in real time, or snaps to predefined rock-paper-scissors poses and predicts the player's gesture by matching the flexions against pose templates.
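A minimal sketch of the angle and flexion computation might look like this; the straight/bent angle bounds used for the normalization are assumed values, not the ones used in the project.

```python
# Illustrative sketch of turning three joint landmarks into a flexion value.
import numpy as np


def joint_angle(a: np.ndarray, b: np.ndarray, c: np.ndarray) -> float:
    """Angle at joint b (in degrees) formed by the points a-b-c."""
    v1, v2 = a - b, c - b
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-9)
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))


def flexion_pct(angle_deg: float, straight: float = 180.0, bent: float = 60.0) -> float:
    """Normalize a joint angle to a 0-100 % flexion value (bounds assumed)."""
    pct = (straight - angle_deg) / (straight - bent) * 100.0
    return float(np.clip(pct, 0.0, 100.0))
```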
The overall control flow of each finger is shown in the diagram below.
The HAnd Gesture Recognition Image Dataset (HaGRID) is a 716 GB hand-gesture recognition image dataset of 552,992 Full HD (1920×1080) RGB frames spanning 18 gesture classes plus an extra no_gesture class (123,589 samples). It is split by user ID into 92% train (509,323 images) and 8% test (43,669 images) sets, covering 34,730 unique subjects (ages 18-65) captured indoors under varied lighting (including backlighting) at distances of 0.5-4 m.
All images are annotated in COCO format with per‐hand bounding boxes [x,y,width,height], 21 hand landmarks [x,y], gesture labels, leading_hand and leading_conf fields, and user_id for bespoke splits.
The train set is provided as 18 per‐gesture archives (about 38-41 GB each), the test split as a 60.4 GB image archive plus 27.3 MB annotation file, and there's also a 100‐sample‐per‐gesture subsample (2.5 GB images, 1.2 MB annotations).
The dataset includes joint landmarks for hand pose detection. Each hand has a total of 21 keypoints, annotated as follows (a possible index layout is sketched after the list):
- Wrist
- Thumb (4 points)
- Index finger (4 points)
- Middle finger (4 points)
- Ring finger (4 points)
- Little finger (4 points)
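A possible index layout for these landmarks is shown below, with the wrist first and four points per finger; the exact ordering used here is an assumption and should be checked against the HaGRID annotation files.

```python
# Assumed landmark index layout (wrist first, then four points per finger).
HAND_LANDMARKS = {
    "wrist": 0,
    "thumb": [1, 2, 3, 4],
    "index": [5, 6, 7, 8],
    "middle": [9, 10, 11, 12],
    "ring": [13, 14, 15, 16],
    "little": [17, 18, 19, 20],
}
```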
YOLO-Pose is a heatmap-free, single-stage extension of the YOLO object detector that performs end-to-end joint detection and 2D multi-person pose estimation in one network. It outputs person bounding boxes along with their associated keypoint coordinates in a single forward pass, eliminating the separate heatmap computation and post-processing grouping steps common to two-stage approaches.
While YOLO-Pose was originally developed for 2D multi-person human pose estimation, here it has been adapted via transfer learning and fine-tuned for hand landmark detection using the HaGRID dataset.
The YOLO family spans multiple model sizes, each differing in network depth, width and parameter count to balance accuracy against speed. In embedded or resource-constrained environments, the nano and small variants are preferred to ensure real-time inference with minimal compute and memory overhead.
Model | Depth | Width | Parameters | Use Case |
---|---|---|---|---|
YOLO11n-pose (Nano) | Shallowest network, fewer layers. | Narrower network, fewer channels per layer. | Least number of parameters, making it lightweight and fast. | Suitable for real-time applications on devices with limited compute. |
YOLO11s-pose (Small) | Deeper than YOLO11n-pose, more layers. | Wider network, more channels per layer. | More parameters than YOLO11n-pose, balancing speed and accuracy. | Ideal for applications requiring a balance between performance and efficiency. |
In conventional heatmap-based pose estimators, heatmaps represent the probability distribution of keypoint locations by indicating how likely each pixel is to contain a keypoint. During training, such a model learns to generate these heatmaps by converting the ground-truth keypoints into target heatmaps and optimizing its parameters to minimize the discrepancy between its predictions and the ground-truth heatmaps.
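As a small illustration of this conventional approach, a ground-truth keypoint can be converted into a Gaussian target heatmap roughly as follows (the grid size and sigma are arbitrary values chosen for the example):

```python
# Sketch of building a Gaussian target heatmap for one ground-truth keypoint.
import numpy as np


def keypoint_to_heatmap(x: float, y: float, size: int = 64, sigma: float = 2.0) -> np.ndarray:
    """Return a (size, size) heatmap with a Gaussian peak centred at (x, y)."""
    xs, ys = np.meshgrid(np.arange(size), np.arange(size))
    return np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))
```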
The `cls_loss` in YOLO11-pose models does not classify keypoints but rather handles the classification of detected objects, similar to detection models. The `pose_loss` is specifically designed for keypoint localization, ensuring accurate placement of keypoints on the detected objects. The `kobj_loss` (keypoint objectness loss) balances the confidence of keypoint predictions, helping the model distinguish true keypoints from background noise.
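Conceptually, the overall training objective combines these terms with tunable gains. The sketch below is a simplification for illustration; the gain names mirror the Ultralytics hyperparameters, but the actual weighting and reduction details differ.

```python
# Conceptual sketch of the weighted training objective (not the actual implementation).
def total_loss(box_loss, cls_loss, pose_loss, kobj_loss, dfl_loss,
               box=7.5, cls=0.5, pose=12.0, kobj=1.0, dfl=1.5):
    return (box * box_loss + cls * cls_loss + pose * pose_loss
            + kobj * kobj_loss + dfl * dfl_loss)
```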
For the project, both nano and small models were trained. However, the small version was selected for deployment due to its superior performance in terms of accuracy, while still maintaining a reasonable inference speed.
The models were benchmarked with CPU-based inference on the target hardware to establish baseline performance across various input sizes.
Tensor Size | Load (ms) | Init (ms) | Min (ms) | Median (ms) | Max (ms) | Stddev (ms) | Mean (ms) |
---|---|---|---|---|---|---|---|
640×384 | 162.10 | 609.14 | 558.52 | 574.68 | 590.52 | 9.61 | 574.65 |
320×192 | 174.65 | 198.18 | 145.84 | 146.88 | 166.53 | 5.60 | 149.04 |
Lowering the input size to 320×192 resulted in a significant speedup, with the model processing images in about 150 ms, while the 640×384 input size resulted in an average inference time of around 574 ms.
The model has been exported in ONNX format and then quantized to a mixed-precision format using the `synap` tool. This framework also handles the preprocessing of the input image, which is necessary for the model to work correctly. The preprocessing steps include resizing the input image to a specific size and normalizing the pixel values.
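Conceptually, that preprocessing amounts to something like the sketch below. The exact resize strategy and normalization constants depend on the parameters baked into the converted model, so treat this as an assumption-laden illustration rather than the framework's actual code path.

```python
# Conceptual preprocessing sketch: resize to the model input size and normalize.
import cv2
import numpy as np


def preprocess(frame_bgr: np.ndarray, width: int = 320, height: int = 192) -> np.ndarray:
    resized = cv2.resize(frame_bgr, (width, height))          # match model input size
    rgb = cv2.cvtColor(resized, cv2.COLOR_BGR2RGB)             # camera frames are BGR in OpenCV
    return rgb.astype(np.float32) / 255.0                      # normalize pixel values to [0, 1]
```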
Different combinations of input sizes and quantization formats were tested to find the best performance. The aspect ratio of the input image was chosen to match the camera resolution.
The following configurations were tested using the `synap_cli` tool:
Tensor Size | Quantization | Load (ms) | Init (ms) | Min (ms) | Median (ms) | Max (ms) | Stddev (ms) | Mean (ms) |
---|---|---|---|---|---|---|---|---|
640×384 | int16 | 165.21 | 46.11 | 92.73 | 92.81 | 100.20 | 1.04 | 92.95 |
640×384 | mixed uint8 | 86.93 | 23.86 | 51.99 | 52.42 | 59.05 | 0.94 | 52.51 |
320×192 | int16 | 144.49 | 42.38 | 20.85 | 20.86 | 27.18 | 0.88 | 21.00 |
320×192 | mixed uint8 | 83.73 | 21.41 | 10.08 | 10.22 | 17.47 | 1.01 | 10.41 |
Reducing the input size to 320×192 significantly improved the inference speed. The model was able to process images in about 10 ms, while the 640×384 input size resulted in an average inference time of around 52 ms. The mixed uint8 quantization format also provided a significant speedup compared to the int16 format.
As a result, the 320×192 input size with mixed uint8 quantization was selected for deployment. This configuration provided the best balance between speed and accuracy, making it suitable for real-time applications.
The model first detects objects within the image using the YOLO11 architecture, identifying the bounding boxes around the objects of interest. For each detected object, the model then predicts the keypoints within the bounding box; these keypoints represent specific parts of the object, such as the joints of a hand. Because YOLO-Pose is heatmap-free, the keypoint coordinates and their confidences are regressed directly alongside each bounding box during inference, rather than being extracted as peaks from per-keypoint heatmaps.
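As a rough illustration, a single prediction vector from a YOLO11-pose style head (box, one class score, then 21 keypoint triplets) could be decoded as follows; the actual output layout of the converted model may differ, so this is a sketch under that assumption.

```python
# Hedged sketch of decoding one (4 + 1 + 21*3)-element prediction vector:
# box (cx, cy, w, h), one class score, then (x, y, confidence) per keypoint.
import numpy as np


def decode_candidate(pred: np.ndarray, num_kpts: int = 21):
    cx, cy, w, h = pred[:4]
    score = float(pred[4])
    kpts = pred[5:5 + num_kpts * 3].reshape(num_kpts, 3)   # (x, y, conf) rows
    bbox = (cx - w / 2, cy - h / 2, w, h)                  # convert to top-left x, y, w, h
    return bbox, score, kpts
```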
To set up the environment, run the following commands:

```sh
python3 -m venv .venv --system-site-packages
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
```
Copy the service files:

```sh
cp systemd/demo.service /etc/systemd/system/
cp systemd/kiosk.service /etc/systemd/system/
```
Reload the systemd daemon:

```sh
systemctl daemon-reload
```
To run the backend application, run:

```sh
systemctl enable --now demo.service
```
To start the kiosk window, run:

```sh
systemctl enable --now kiosk.service
```
To test the servomotors, run `python -m tests.drivers`. This will set the servomotors to their initial position, move them to the 50% position, then to the 100% position, and finally return them to the initial position.
To perform a test inference, run `python -m tests.inference <dir>`. This will run the inference on the images in the specified directory and save the results next to the original images.