Slip Detection with Franka Emika and GelSight Sensors
Author - Amit Parag
Instructor - Ekrem Misimi
The aim of the experiments is to learn the difference between slip and wriggle from videos by training a Video Vision Transformer model.
Video Vision Transformers were initially proposed in this paper.
We use the first variant - a spatial transformer followed by a temporal one - in our experiments.
The training dataset was collected by performing the wriggling motion.
We define "wriggle" as a sequence of motions that involves
lifting an object,
rotationally shaking it,
followed by a tangential shake, a vertical shake, and a perpendicular shake.
The object is then put back on the table.
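Schematically, the routine looks like the sketch below. The `robot` object and its motion helpers are hypothetical stand-ins for the actual libfranka-based control code, which is not reproduced here.

```python
def wriggle(robot, grasp_pose):
    # Hypothetical outline of the wriggle routine; `robot` and its
    # motion helpers stand in for the actual Franka control code.
    robot.grasp(grasp_pose)             # grip the object
    robot.lift(height=0.15)             # lift it off the table
    robot.shake(axis="rotational")      # rotational shake
    robot.shake(axis="tangential")      # tangential shake
    robot.shake(axis="vertical")       # vertical shake
    robot.shake(axis="perpendicular")  # perpendicular shake
    robot.place(grasp_pose)             # put the object back on the table
```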
The objects used in the experiments are listed in object_info.txt.
Two examples are shown below:
Rubick.Cube.mp4
coffee_cup.mp4
The occurrence of slip is usually characterized by the properties of the object in question, such as its weight, elasticity, and the orientation of the grip.
One example of slip is shown below.
Coil.of.WIres.mp4
This motion is repeated for 30 objects.
The resulting slip video from the sensor attached to the gripper, taken from one of the experiments, is shown below.
slip.mp4
An example of wriggle is shown below.
wriggle.mp4
After the data has been collected, we augment it by adding noise and swapping channels in each video.
A transformed video of 5 frames would look like:
aug_3.mp4
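The augmentation itself is simple. A minimal sketch, assuming videos are loaded as uint8 numpy arrays of shape (frames, H, W, C); the function name and parameters are illustrative, not the repository's exact code:

```python
import numpy as np

def augment(video, noise_std=5.0, seed=None):
    """Add Gaussian pixel noise and randomly permute the colour channels
    of a video given as a (frames, H, W, C) uint8 array."""
    rng = np.random.default_rng(seed)
    noisy = video.astype(np.float32) + rng.normal(0.0, noise_std, video.shape)
    noisy = np.clip(noisy, 0, 255).astype(np.uint8)
    order = rng.permutation(video.shape[-1])   # e.g. BGR -> RBG, GBR, ...
    return noisy[..., order]
```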
Data from 25 objects was set aside for training. After augmentation, the new dataset contained 110 slip cases and 408 wriggle cases.
For training, the data folder needs to be arranged like so -
root_dir/
├── train/
│   ├── slip/
│   │   ├── video1.avi
│   │   ├── video2.avi
│   │   └── ...
│   └── wriggle/
│       ├── video1.avi
│       ├── video2.avi
│       └── ...
├── test/
│   ├── slip/
│   │   ├── video1.avi
│   │   ├── video2.avi
│   │   └── ...
│   └── wriggle/
│       ├── video1.avi
│       ├── video2.avi
│       └── ...
└── validation/
    ├── slip/
    │   ├── video1.avi
    │   ├── video2.avi
    │   └── ...
    └── wriggle/
        ├── video1.avi
        ├── video2.avi
        └── ...
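A minimal sketch of a loader for this layout, assuming PyTorch and OpenCV; the class name and details are illustrative, not the repository's actual loader:

```python
import os
import cv2
import numpy as np
import torch
from torch.utils.data import Dataset

class SlipWriggleDataset(Dataset):
    """Yields (video, label) pairs from root_dir/<split>/{slip,wriggle}/."""

    def __init__(self, root_dir, split="train"):
        self.samples = []
        for label, cls in enumerate(("slip", "wriggle")):
            cls_dir = os.path.join(root_dir, split, cls)
            for name in sorted(os.listdir(cls_dir)):
                if name.endswith(".avi"):
                    self.samples.append((os.path.join(cls_dir, name), label))

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        path, label = self.samples[idx]
        cap = cv2.VideoCapture(path)
        frames = []
        ok, frame = cap.read()
        while ok:
            frames.append(frame)
            ok, frame = cap.read()
        cap.release()
        # (frames, H, W, C) uint8 -> (C, frames, H, W) float in [0, 1]
        video = torch.from_numpy(np.stack(frames)).permute(3, 0, 1, 2).float() / 255.0
        return video, label
```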
The model hyperparameters are:
• image_size = (240,320), # image size
• frames = 450, # number of frames
• image_patch_size = (80,80), # image patch size
• frame_patch_size = 45, # frame patch size
• num_classes = 2,
• dim = 64,
• spatial_depth = 3, # depth of the spatial transformer
• temporal_depth = 3, # depth of the temporal transformer
• heads = 4,
• mlp_dim = 128
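These parameter names match the factorized-encoder ViViT from the vit-pytorch package; assuming that implementation, instantiation looks like so:

```python
import torch
from vit_pytorch.vivit import ViT

model = ViT(
    image_size = (240, 320),       # sensor frame resolution
    frames = 450,                  # frames per clip
    image_patch_size = (80, 80),   # spatial patch size
    frame_patch_size = 45,         # temporal patch size
    num_classes = 2,               # slip vs. wriggle
    dim = 64,
    spatial_depth = 3,             # depth of the spatial transformer
    temporal_depth = 3,            # depth of the temporal transformer
    heads = 4,
    mlp_dim = 128
)

video = torch.randn(1, 3, 450, 240, 320)   # (batch, channels, frames, height, width)
logits = model(video)                      # shape (1, 2)
```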
Training a bigger model on 16 or 32 GB of RAM leads to the script getting killed by the OS. So, if you want to try one, make sure you have access to a compute cluster and adapt the code for GPU; it should be fairly straightforward. This architecture took 17.35 hours to train for 250 epochs.
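Adapting the code for GPU amounts to moving the model and each batch onto the device. A sketch, where the optimizer and `train_loader` are assumptions rather than the script's exact code:

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)                       # the ViViT model from above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for videos, labels in train_loader:            # any DataLoader over the dataset
    videos, labels = videos.to(device), labels.to(device)
    loss = torch.nn.functional.cross_entropy(model(videos), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```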
Note that controlling the Franka Emika arm requires a real-time kernel; see the requirements below. Beyond that, a few practical issues came up during the experiments:
1. Marker Tracking
Marker tracking algorithms may fail to converge or end up computing absurd vector fields. We experimented with marker tracking but ended up not using it.
2. Sensors
The GelSight sensors are susceptible to damage. After a few experiments, the gel pad on one of the sensors started to leak gel, while the second one somehow got scraped off.
We initially started with two sensors, but then discarded the data from one of them.
The resulting data is unusable:
camera.mp4
3. Grippers and Cables
Note that regular 3D-printed grippers can develop cracks and break.
We initially used a normal 3D printer and eventually a more "fancy" one; in the "Coil of Wires" video, for instance, different grippers are used.
It should also be noted that the USB-C cable connected to the GelSight sensors gets disconnected a lot in the middle of experiments, so you will have to redo the same experiment multiple times - frustrating, but c'est la vie.
The pins of the mini sensor are a bit dodgy.
4. Low Batch Size
The training script uses a batch size of 4. While it is generally preferable to have a higher batch size, restrictions due to compute capabilities still apply.
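If memory is the binding constraint, gradient accumulation can emulate a larger batch at the same footprint. This is a sketch (reusing the model, optimizer, and loader from the earlier snippets), not something the training script does:

```python
accum_steps = 8                             # effective batch size = 4 * 8 = 32

optimizer.zero_grad()
for step, (videos, labels) in enumerate(train_loader):   # DataLoader with batch_size=4
    loss = torch.nn.functional.cross_entropy(model(videos), labels)
    (loss / accum_steps).backward()         # average gradients over the effective batch
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```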
5. Minor Convergence Issues in the Initial Epochs
Sometimes the network gets stuck in a local minimum. Either restart the experiment with a different learning rate or let it run for a few more epochs.
For example, in one of the experiments, the network was trapped in a local minimum: the validation accuracy remained unchanged for 100 epochs at a learning rate of 1e-3.
The usual irritating local-minima routine applies: change some parameter slightly.
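One low-effort way to automate the "change some parameter" step is a plateau-based learning-rate schedule. Again an assumption, not the script's code; the two epoch helpers are hypothetical:

```python
from torch.optim.lr_scheduler import ReduceLROnPlateau

scheduler = ReduceLROnPlateau(optimizer, mode="max", factor=0.5, patience=10)

for epoch in range(250):
    train_one_epoch(model, train_loader, optimizer)   # hypothetical helper
    val_acc = evaluate(model, val_loader)             # hypothetical helper
    scheduler.step(val_acc)   # halve the LR if validation accuracy stalls for 10 epochs
```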
6. OpenCV Issues
There are a few encoding issues with OpenCV, related to how it compresses and encodes video data.
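The exact failure mode isn't pinned down here, but being explicit about the codec when writing .avi files sidesteps most encoding surprises. A sketch with placeholder frames:

```python
import cv2
import numpy as np

frames = [np.zeros((240, 320, 3), dtype=np.uint8)] * 450   # placeholder frames

fourcc = cv2.VideoWriter_fourcc(*"MJPG")                   # explicit motion-JPEG codec
writer = cv2.VideoWriter("out.avi", fourcc, 30.0, (320, 240))  # size is (width, height)
for frame in frames:
    writer.write(frame)
writer.release()
```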
See requirements.txt
NumPy: preferably 1.20.0. Later versions remove the numpy.bool alias (it was only ever the builtin bool), which might lead to clashes.
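If pinning 1.20.0 is not an option, a common (if inelegant) shim restores the removed alias:

```python
import numpy as np

# `np.bool` was deprecated in NumPy 1.20 and removed in 1.24; it was only
# ever an alias for the builtin bool, so restoring it is safe.
if not hasattr(np, "bool"):
    np.bool = bool
```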
See the notes for instructions on installing the real-time kernel and libfranka.