Bionic Robot Hand Demo

This repository contains the code for the Bionic Robot Hand Demo, a hand gesture recognition system using a bionic robot hand. The system uses a YOLO‐Pose model for hand landmark detection and controls the servomotors of the bionic hand based on the detected gestures.

The demo is designed to run on the Astra SL1680, part of the Synaptics Astra family of SoCs. The model itself is quantized to a mixed-precision format to optimize performance on the NPU (Neural Processing Unit) accelerator.

demo_setup.mp4

Overview

The project consists of two main components: a backend application that runs the inference and controls the robot hand, and a frontend application that displays the results in a kiosk window. The backend and frontend communicate over a WebSocket connection using a JSON API.

Frontend

The frontend application is a simple HTML page that displays the results of the inference in real time. It uses WebSocket to receive the JSON payload from the backend and updates the UI accordingly. The UI shows the detected hand landmarks, the bounding box, and game-related information such as the countdown timer and the result of the game.

demo_ui.mp4
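For illustration, the JSON payload received by the frontend could be shaped roughly like the Python dictionary below; the field names and structure are assumptions made for the example, not the repository's actual schema.

import json

# Hypothetical shape of the payload broadcast to the frontend; the real
# field names and nesting in the repository may differ.
payload = {
    "landmarks": [[0.42, 0.61], [0.45, 0.58]],   # normalized [x, y] keypoints (truncated)
    "bbox": [0.35, 0.40, 0.30, 0.35],            # [x, y, width, height]
    "flexion": [0, 20, 85, 90, 80],              # per-finger flexion in percent
    "game": {"state": "countdown", "timer": 3, "result": None},
    "stats": {"inference_ms": 10.4, "fps": 28.5, "power_w": 3.1},
}
message = json.dumps(payload)                    # sent over the WebSocket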

Backend

The backend application is written in Python, and uses the SynapRT framework for running the YOLO‐Pose model inference. The main loop of the application is designed to continuously capture images from the camera, perform inference using the YOLO‐Pose model, and control the servomotors of the bionic hand based on the detected gestures.

Under the hood, the main function first spins up both an HTTP server and a WebSocket server, then it launches the SynapRT inference pipeline on its own thread, and then enters a tight infinite loop: each cycle it polls the pipeline for new detection results, computes averages of inference time, frame rate, and power usage, feeds the detected hand flexions into the RPS state machine, packages everything into a JSON payload, and broadcasts it over the WebSocket.
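The structure of that loop can be sketched roughly as follows; names such as pipeline.poll(), rps.update(), and the result fields are illustrative stand-ins rather than the actual functions in the repository.

import asyncio
import json

async def main_loop(pipeline, rps, clients):
    # Rough sketch of the backend main loop described above; all names
    # (pipeline.poll, rps.update, result attributes) are illustrative.
    while True:
        result = pipeline.poll()                      # latest detection results, if any
        if result is not None:
            stats = {"inference_ms": result.mean_inference_ms,
                     "fps": result.mean_fps,
                     "power_w": result.mean_power_w}
            rps.update(result.flexions)               # feed flexions into the game state machine
            payload = json.dumps({"landmarks": result.landmarks,
                                  "bbox": result.bbox,
                                  "game": rps.state(),
                                  "stats": stats})
            await asyncio.gather(*(client.send(payload) for client in clients))
        await asyncio.sleep(0.01)                     # brief yield before the next cycle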

When a hand is detected for more than 10 seconds, the system switches to the rock, paper, scissors game mode. The robot hand stops mirroring the user's hand and instead performs a random gesture. The outcome of the game is determined by comparing the user's gesture with the robot's gesture. After the result is displayed, the system returns to the normal operation mode, where the robot hand mimics the user's hand.
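The outcome comparison itself reduces to a small lookup; a minimal sketch with illustrative gesture labels is shown below.

# Minimal rock-paper-scissors outcome check (labels are illustrative).
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}

def outcome(player: str, robot: str) -> str:
    if player == robot:
        return "draw"
    return "win" if BEATS[player] == robot else "lose"

print(outcome("rock", "scissors"))  # -> "win"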

Hardware Control

The bionic robot hand is driven by 5 servomotors, one for each finger. The servomotors are controlled by a PCA9685 expansion board, which is connected to the Astra SL1680 via an I2C interface.

The hardware control has been abstracted into separate modules with increasing levels of abstraction. The lowest level is the servo.py module, which interacts with the kernel driver of the PCA9685 servomotor controller via the sysfs interface. The next level is the hand.py module, which contains the high-level controllers for the servomotors. It takes raw 3D hand landmarks, filters out low-confidence points, computes inter-joint angles via vector math, smooths those angles over a moving window, and normalizes them into 0-100% flexion values. It then either drives the servos to mirror the user's hand in real time, or snaps to predefined rock-paper-scissors poses and predicts the player's gesture by matching flexions against pose templates. A sketch of the angle-to-flexion computation is given below.
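The angle-to-flexion step can be sketched as follows; the calibration bounds and the specific joint triplets are illustrative assumptions rather than the values used in hand.py.

# Illustrative sketch of landmark-to-flexion conversion (not the actual hand.py code).
import numpy as np

def joint_angle(a, b, c):
    """Angle at joint b (degrees) formed by landmarks a-b-c."""
    v1, v2 = np.asarray(a) - np.asarray(b), np.asarray(c) - np.asarray(b)
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-9)
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def flexion_percent(angle_deg, straight=180.0, bent=60.0):
    """Normalize a joint angle into a 0-100% flexion value (bounds are assumed)."""
    t = (straight - angle_deg) / (straight - bent)
    return float(np.clip(t, 0.0, 1.0) * 100.0)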

The overall control flow of each finger is shown in the diagram below.

Dataset

The HAnd Gesture Recognition Image Dataset, or HaGRID, is a 716 GB hand-gesture recognition image dataset of 552,992 Full HD (1920×1080) RGB frames spanning 18 gesture classes plus an extra no_gesture class (123,589 samples). It is split by user ID into 92% train (509,323 images) and 8% test (43,669 images) sets over 34,730 unique subjects (ages 18-65), captured indoors under varied lighting (including backlighting) at distances of 0.5-4 m.

All images are annotated in COCO format with per‐hand bounding boxes [x,y,width,height], 21 hand landmarks [x,y], gesture labels, leading_hand and leading_conf fields, and user_id for bespoke splits.
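As a rough illustration of that layout, a single annotation entry might look like the following; the exact JSON nesting shown here is an assumption rather than a verbatim HaGRID record.

# Hypothetical shape of a single HaGRID-style annotation entry (illustrative only).
annotation = {
    "bboxes": [[0.32, 0.41, 0.18, 0.25]],         # per-hand [x, y, width, height]
    "landmarks": [[[0.35, 0.44], [0.37, 0.47]]],  # 21 [x, y] points per hand (truncated here)
    "labels": ["fist"],                           # gesture class per hand
    "leading_hand": "right",
    "leading_conf": 0.98,
    "user_id": "abc123",
}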

The train set is provided as 18 per‐gesture archives (about 38-41 GB each), the test split as a 60.4 GB image archive plus 27.3 MB annotation file, and there's also a 100‐sample‐per‐gesture subsample (2.5 GB images, 1.2 MB annotations).

The dataset includes joint landmarks for hand pose detection. Each hand has a total of 21 keypoints. They are annotated as follows:

  1. Wrist
  2. Thumb (4 points)
  3. Index finger (4 points)
  4. Middle finger (4 points)
  5. Ring finger (4 points)
  6. Little finger (4 points)
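Assuming the conventional ordering (wrist first, then four points per finger from thumb to little finger), the keypoint indices can be grouped as in the sketch below; the exact index layout is an assumption.

# Assumed index layout for the 21 hand keypoints (wrist, then 4 points per finger).
KEYPOINT_GROUPS = {
    "wrist": [0],
    "thumb": [1, 2, 3, 4],
    "index": [5, 6, 7, 8],
    "middle": [9, 10, 11, 12],
    "ring": [13, 14, 15, 16],
    "little": [17, 18, 19, 20],
}
assert sum(len(v) for v in KEYPOINT_GROUPS.values()) == 21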

Model

YOLO-Pose is a heatmap-free, single-stage extension of the YOLO object detector that performs end-to-end joint detection and 2D multi-person pose estimation in one network. It outputs person bounding boxes along with their associated keypoint coordinates in a single forward pass, eliminating the separate heatmap computation and post-processing grouping steps common to two-stage approaches.

While YOLO-Pose was originally developed for 2D multi-person human pose estimation, here it has been adapted via transfer learning and fine-tuned for hand landmark detection using the HaGRID dataset.

The YOLO family spans multiple model sizes, each differing in network depth, width and parameter count to balance accuracy against speed. In embedded or resource-constrained environments, the nano and small variants are preferred to ensure real-time inference with minimal compute and memory overhead.

| Model | Depth | Width | Parameters | Use Case |
|---|---|---|---|---|
| YOLO11n-pose (Nano) | Shallowest network, fewer layers. | Narrower network, fewer channels per layer. | Least number of parameters, making it lightweight and fast. | Suitable for real-time applications on devices with limited compute. |
| YOLO11s-pose (Small) | Deeper than YOLO11n-pose, more layers. | Wider network, more channels per layer. | More parameters than YOLO11n-pose, balancing speed and accuracy. | Ideal for applications requiring a balance between performance and efficiency. |

Training

Heatmap-based pose estimators represent the probability distribution of keypoint locations by indicating how likely each pixel is to contain a keypoint; during training, the ground-truth keypoints are converted into target heatmaps and the network learns to reproduce them. YOLO-Pose instead regresses keypoint coordinates directly for each detected object, so training optimizes the predicted keypoint locations and confidences against the ground-truth annotations.

The cls_loss in YOLO11-pose models does not classify keypoints but rather handles the classification of detected objects, similar to detection models. The pose_loss is specifically designed for keypoint localization, ensuring accurate placement of keypoints on the detected objects. The kobj_loss (keypoint objectness loss) balances the confidence of keypoint predictions, helping the model to distinguish between true keypoints and background noise.

For the project, both nano and small models were trained. However, the small version was selected for deployment due to its superior performance in terms of accuracy, while still maintaining a reasonable inference speed.
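Fine-tuning a YOLO11 pose model with the Ultralytics tooling typically looks like the sketch below; the dataset YAML name and the hyperparameters are placeholders, not the settings actually used for this project.

# Hedged sketch of fine-tuning a YOLO11 pose model with Ultralytics.
# "hagrid-hands.yaml" and the hyperparameters are placeholders.
from ultralytics import YOLO

model = YOLO("yolo11s-pose.pt")          # start from the pretrained pose checkpoint
model.train(
    data="hagrid-hands.yaml",            # dataset config describing 21 keypoints per hand
    imgsz=640,
    epochs=100,
    batch=16,
)
metrics = model.val()                    # evaluate on the validation split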

The models were benchmarked using CPU-based inference on the target hardware to establish baseline performance across various input sizes.

| Tensor Size | Load (ms) | Init (ms) | Min (ms) | Median (ms) | Max (ms) | Stddev (ms) | Mean (ms) |
|---|---|---|---|---|---|---|---|
| 640×384 | 162.10 | 609.14 | 558.52 | 574.68 | 590.52 | 9.61 | 574.65 |
| 320×192 | 174.65 | 198.18 | 145.84 | 146.88 | 166.53 | 5.60 | 149.04 |

Lowering the input size to 320×192 resulted in a significant speedup, with the model processing images in about 150 ms, whereas the 640×384 input size resulted in an average inference time of around 574 ms.

Quantization

The model has been exported in ONNX format, and then quantized to a mixed precision format using the synap tool. This framework also handles the preprocessing of the input image, which is necessary for the model to work correctly. The preprocessing steps include resizing the input image to a specific size and normalizing the pixel values.
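The ONNX export step can be performed with the Ultralytics export API, as sketched below; the weights path and image size are placeholders, and the subsequent mixed-precision conversion with the synap tool is not shown since its options are tool-specific.

# Hedged sketch of the ONNX export step (the synap conversion itself is not shown).
from ultralytics import YOLO

model = YOLO("runs/pose/train/weights/best.pt")  # placeholder path to the trained weights
model.export(format="onnx", imgsz=(192, 320))    # (height, width) matching the 320×192 tensor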

Different combinations of input sizes and quantization formats were tested to find the best performance. The aspect ratio of the input image was chosen to match the camera resolution.

The following configurations were tested using the synap_cli tool:

| Tensor Size | Quantization | Load (ms) | Init (ms) | Min (ms) | Median (ms) | Max (ms) | Stddev (ms) | Mean (ms) |
|---|---|---|---|---|---|---|---|---|
| 640×384 | int16 | 165.21 | 46.11 | 92.73 | 92.81 | 100.20 | 1.04 | 92.95 |
| 640×384 | mixed uint8 | 86.93 | 23.86 | 51.99 | 52.42 | 59.05 | 0.94 | 52.51 |
| 320×192 | int16 | 144.49 | 42.38 | 20.85 | 20.86 | 27.18 | 0.88 | 21.00 |
| 320×192 | mixed uint8 | 83.73 | 21.41 | 10.08 | 10.22 | 17.47 | 1.01 | 10.41 |

Reducing the input size to 320×192 significantly improved inference speed: with mixed uint8 quantization the model processed images in about 10 ms, while the 640×384 input averaged around 52 ms. The mixed uint8 quantization format also provided a significant speedup compared to int16.

As a result, the 320×192 input size with mixed uint8 quantization was selected for deployment. This configuration provided the best balance between speed and accuracy, making it suitable for real-time applications.

Detection

The model first detects objects within the image using the YOLO11 architecture, identifying the bounding boxes around the objects of interest. For each detected object, the model then predicts the keypoints associated with that bounding box; these keypoints represent specific parts of the object, such as the joints of a hand. Because YOLO-Pose regresses keypoint coordinates directly, a single forward pass already yields each box, its confidence, and the keypoint locations with per-keypoint confidences, without a separate heatmap decoding step.
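Under the assumed flat output layout of [x, y, w, h, confidence] followed by 21 (x, y, confidence) triplets per detection, the result can be unpacked roughly as follows; the actual tensor layout depends on the export settings.

# Hedged sketch of unpacking one YOLO-Pose detection into box + 21 keypoints.
# The flat layout [x, y, w, h, conf, (kx, ky, kconf) * 21] is an assumption.
import numpy as np

def unpack_detection(row: np.ndarray):
    box = row[:4]                            # [x, y, w, h]
    score = float(row[4])                    # object confidence
    kpts = row[5:5 + 21 * 3].reshape(21, 3)  # per-keypoint [x, y, conf]
    return box, score, kpts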

Setup

To set up the environment, run the following commands:

python3 -m venv .venv --system-site-packages
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt

Copy the service files:

cp systemd/demo.service /etc/systemd/system/
cp systemd/kiosk.service /etc/systemd/system/

Reload the systemd daemon:

systemctl daemon-reload

Running the Demo

To run the backend application, run:

systemctl enable --now demo.service

To start the kiosk window, run:

systemctl enable --now kiosk.service

Tests

To test the servomotors, run: python -m tests.drivers. This will set the servomotors to their initial position, move them to the 50% position, then to the 100% position, and finally return them to their initial position.

To perform a test inference, run: python -m tests.inference <dir>. This will run the inference on the images in the specified directory and save the results next to the original images.
