We try to detect which product a shopper is looking at in a retail store, using single-view RGB images from a CCTV (third-person) viewpoint. In this project we implement gaze object localization only; the assumption is that, after successful localization, the BBox of the detected object can be passed directly through a classification pipeline to obtain the product category. Taking inspiration from the task similarities between single-stage detectors such as CenterNet and gaze following, we combine both in a single, single-stage model.
The model is based on the one presented in A Modular Multimodal Architecture for Gaze Target Prediction: Application to Privacy-Sensitive Settings by Anshul Gupta, Samy Tafasca and Jean-Marc Odobez. Many thanks to the authors for their excellent work. GitHub repo.
At the time of implementation, the most relevant dataset for this task is the GOO dataset, which contains real (9,552 samples) and synthetic (192,000 samples) images. More details about the dataset can be found below.
The ground-truth images from the GOO-Real dataset below illustrate the end goal. The target product is highlighted with a green BBox and a Gaussian heatmap, since gaze localization is traditionally solved as a regression problem.
- torch
- torchvision
- numpy
- pandas
- cv2
- matplotlib
- csv
- PIL
- datetime
- timm >= 0.6.13
- Instead of using a fixed-size Gaussian (std = 3), as is common in gaze-following tasks, we use a Gaussian whose size is determined by the BBox of the ground-truth gazed-at object, as in CenterNet (see the sketch after this list).
- We find that attaching the CenterNet BBox regression head and training the model does not work out of the box. This is likely due to the long training time CenterNet requires, which becomes a bottleneck because we train only on the GOO-Real images and therefore do not have a large dataset. Instead, we opt for TTFNet, which can be trained in considerably less time than CenterNet.
- Some changes were also based on empirical results, such as replacing ReLUs with GELUs.
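A minimal sketch of the dynamic-size Gaussian mentioned above, where the standard deviations are derived from the GT box dimensions (TTFNet-style) rather than fixed at 3. The function name, the `alpha` shrink factor and the coordinate convention are illustrative assumptions, not the exact training code:

```python
import numpy as np

def draw_box_gaussian(heatmap, center, box_w, box_h, alpha=0.54):
    """Splat a 2D Gaussian whose spread scales with the GT box size,
    instead of a fixed std=3 blob. `alpha` is an illustrative shrink
    factor; the value used in training may differ."""
    # per-axis standard deviations derived from the box dimensions
    sigma_x = alpha * box_w / 6.0
    sigma_y = alpha * box_h / 6.0
    cx, cy = int(center[0]), int(center[1])
    H, W = heatmap.shape
    ys, xs = np.ogrid[:H, :W]
    gauss = np.exp(-(((xs - cx) ** 2) / (2 * sigma_x ** 2)
                     + ((ys - cy) ** 2) / (2 * sigma_y ** 2)))
    # element-wise maximum so overlapping objects do not erase each other
    np.maximum(heatmap, gauss, out=heatmap)
    return heatmap
```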
Below are a few images highlighting results obtained where we can visualize (topk = 3) gazed-at-object detection results. The GT image with the gazed-at-object (green) BBoX and gaze heatmap is shown on the left. The prediction results (with red, topk=3 ) BBoX were obtained directly from the TTFNet regression head and predicted gaze heatmap from the baseline gaze detection model output. For the sake of completeness, we include both success and failure cases below:
The following results are obtained with these initial configurations:
Training Dataset = GOO Real Train
topk = 1
loss function (for baseline model) = l2_loss + dir_loss + att_loss
loss function (for extended model) = l2_loss + dir_loss + att_loss + ttfnet_hm_loss + ttfnet_wh_loss
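As a sketch of how these loss terms could be combined; the weighting factors below are placeholders for illustration and the repository's actual training code may weight the terms differently:

```python
import torch

def combine_losses(l2_loss, dir_loss, att_loss,
                   ttfnet_hm_loss=None, ttfnet_wh_loss=None,
                   w_hm=1.0, w_wh=5.0):
    """Combine per-term losses into the baseline or extended objective.

    Each argument is an already-computed scalar tensor. The weights
    w_hm / w_wh are illustrative assumptions, not the exact values
    used in training.
    """
    total = l2_loss + dir_loss + att_loss          # baseline gaze terms
    if ttfnet_hm_loss is not None:                 # extended model adds the TTFNet detection terms
        total = total + w_hm * ttfnet_hm_loss + w_wh * ttfnet_wh_loss
    return total

# usage with dummy scalar tensors:
# loss = combine_losses(torch.tensor(0.3), torch.tensor(0.1), torch.tensor(0.2),
#                       torch.tensor(0.5), torch.tensor(0.4))
```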
We also calculate the prediction accuracy of the models in two ways:
- energy aggregation accuracy: the predicted BBox is selected as the candidate box in which the predicted heatmap accumulates the maximum energy, and the prediction counts as correct when this box matches the GT BBox.
- BBoX head topk accuracy: here we take the topk BBoxes obtained after processing the output of the BBox prediction head. For topk with n > 1, we apply a simple NMS and then keep only the BBoxes whose score exceeds a threshold (0.2 by default), as in the sketch below.
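A minimal sketch of how these two accuracy criteria could be computed; the function names, the use of torchvision's NMS and the heatmap-coordinate convention are assumptions for illustration:

```python
import torch
from torchvision.ops import nms

def energy_in_box(heatmap, box):
    """Sum of heatmap energy inside a box given as (x1, y1, x2, y2)
    in heatmap coordinates."""
    x1, y1, x2, y2 = [int(round(float(v))) for v in box]
    return heatmap[y1:y2, x1:x2].sum()

def energy_aggregation_hit(pred_heatmap, candidate_boxes, gt_index):
    """Correct when the candidate box that accumulates the most
    predicted-heatmap energy is the ground-truth gazed-at box."""
    energies = torch.stack([energy_in_box(pred_heatmap, b) for b in candidate_boxes])
    return int(torch.argmax(energies)) == gt_index

def topk_boxes(boxes, scores, k=3, iou_thr=0.5, score_thr=0.2):
    """Post-process the BBox head output: NMS, then keep up to k boxes
    whose score exceeds the threshold (0.2 by default in this project)."""
    keep = nms(boxes, scores, iou_thr)              # indices surviving NMS
    keep = keep[scores[keep] > score_thr]           # score threshold
    return boxes[keep[:k]], scores[keep[:k]]
```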
The changes with respect to the baseline configuration are specified separately in the table.
Model type | Test Dataset | AUC | Min. Distance (in pixels) | Avg. Angular Distance (in degrees) | Max energy accuracy (in %) | BBoX head topk accuracy (in %) |
---|---|---|---|---|---|---|
Baseline | Dense GOO-Real | 0.9553 | 0.1160 | 19.0602 | 35.6 | NA |
Extended | Dense GOO-Real | 0.9849 | 0.1115 | 18.9836 | 38 | 33.41 |
Extended | Sparse GOO-Real | 0.9870 | 0.1256 | 22.4442 | 36.23 | 30.72 |
Extended [energy] | Dense GOO-Real | 0.6280 | 0.1056 | 19.5412 | 43.8 | 34.57 |
Extended [energy] | Sparse GOO-Real | 0.6862 | 0.1205 | 23.5692 | 39.5 | 25.51 |
Extended [topk=3] | Dense GOO-Real | 0.9809 | 0.1060 | 20.0947 | 43.97 | 53.28 |
Here, Extended [energy] uses the following loss function for the extended model: l2_loss + dir_loss + att_loss + ttfnet_hm_loss + ttfnet_wh_loss + energy_aggregation_loss, and topk (n > 1) considers the case where all of the topk BBoxes are used for gazed-at-object localization.
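For reference, one way such an energy aggregation loss could look is sketched below; this is an illustrative assumption about the loss design, not necessarily the exact formulation used in training:

```python
import torch

def energy_aggregation_loss(pred_heatmap, gt_box, eps=1e-6):
    """Illustrative sketch: penalise predicted gaze-heatmap energy that
    falls outside the GT gazed-at box. gt_box = (x1, y1, x2, y2) in
    heatmap coordinates; pred_heatmap is a non-negative 2D tensor."""
    x1, y1, x2, y2 = [int(v) for v in gt_box]
    total = pred_heatmap.sum() + eps
    inside = pred_heatmap[y1:y2, x1:x2].sum()
    # loss is the fraction of heatmap energy falling outside the GT box
    return 1.0 - inside / total
```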
From the above results, we draw the following conclusions:
- The extended model does indeed improve the results, so our initial hypothesis that fusing the gaze detection model with a single-stage detector, motivated by the similarity of the tasks, makes some empirical sense.
- It is encouraging that there is no large drop in performance when testing in the sparse setting. The increases in min. distance and avg. angular distance are expected, as the objects are now placed farther apart. Comparable accuracy in both modes also shows that our model can inherently focus on the relevant objects/products in a typical retail scene.
- Adding the energy aggregation loss improves the max-energy BBox prediction accuracy by ~15%, as expected. Counterintuitively, the AUC values decrease by ~36%, which does not make sense at first glance. We conclude that AUC is not an appropriate metric for evaluating the gazed-at-object localization/classification task; following the literature, it is reported here because of its established position in legacy and current gaze-following work.
- The last result, with topk = 3, can be slightly misleading, since we relax the prediction criterion: this configuration improves topk accuracy but also increases the false-positive rate. Hence, just as in other object detection tasks, AP and mAP would be more appropriate metrics.
This work attempts to check whether single-view RGB offers enough information for successful gazed-at-object localization. Since this problem is ill-defined, we expand the available information by starting directly from a multi-modal solution; indeed, using pose and (monocular) depth information along with image data helps resolve some of the ambiguity in the task. Thanks to the authors of A Modular Multimodal Architecture for Gaze Target Prediction: Application to Privacy-Sensitive Settings for their excellent work, which serves as a solid foundation for the current attempt. We feel that single-stage detectors such as CenterNet are a natural extension for this task, and the amalgamation of the two models is what this work presents. Not fully satisfied with the results, we suggest some possible improvements in the following section.
In the single-view RGB case, the current model could still be extended to opportunistically use eye data whenever a person's eyes are not occluded in the current frame. Using the GOO-Synth dataset (which is much larger than GOO-Real) could also help in (pre-)training a larger model. To solve the current problem deterministically, we believe it must be addressed in a multi-view setting. This would not only alleviate the ambiguity of the problem definition (making it better defined) but also resolve issues such as occlusion. The (gazed-at-object) localization pipeline could then be extended to other applications such as self-checkout, where, along with tracking the shopper's head (in 3D), we also track the products throughout the retail store. However, at the time of this work (the first quarter of 2022), we could not find any open-source dataset for multi-view gazed-at-object localization (or classification) in a retail setting. Our work in the multi-view stereo setting, covered here, was a small attempt to address the problem in MVS.
- Add the modality extraction script.
- Provide pre-trained weights for the extended model.