jupyter: python3
---

# Introduction

Litter pollution concerns every part of the globe. Each year, almost ten

# Related works

## AI-automated counting

Counting from images has been an ongoing challenge in computer vision. Most
works can be divided into (i) detection-based methods where objects are

achieve better counts at every frame, none of these methods actually attempt
to produce global video counts.

## Computer vision for macro litter monitoring

Automatic macro litter monitoring in rivers is still a relatively nascent
initiative, yet there have already been several attempts at using DNN-based

proposed to count litter directly in videos.

## Multi-object tracking

Multi-object tracking usually involves object detection, data association and
track management, with a very large number of methods already existing before

objects requires a new movement model, to take into account missing detections
and large camera movements.

# Datasets for training and evaluation

Our main dataset of annotated images is used to train the object detector.
Then, only for evaluation purposes, we provide videos with annotated object
positions and known global counts. Our motivation is to avoid relying on
training data that requires this resource-consuming process.

## Images

### Data collection

With help from volunteers, we compile photographs of litter stranded on river
banks after increased river discharge, shot directly from kayaks navigating at

The resulting pictures depict trash items under the same conditions as the
video footage we wish to count on, while spanning a wide variety of
backgrounds, light conditions, viewing angles and picture quality.

### Bounding box annotation

For object detection applications, the images are annotated using a custom
online platform where each object is located using a bounding box. In this

## Video sequences

### Data collection

For evaluation, an on-field study was conducted with 20 volunteers to manually
count litter along three different riverbank sections in April 2021, on the

videos amount to 20 minutes of footage at 24 frames per second (fps) and a
resolution of 1920x1080 pixels.

### Track annotation

On video footage, we manually recovered all visible object trajectories on
each river section using an online video annotation tool (more details
in @sec-video-dataset-appendix for the precise methodology). From that, we
obtained a collection of distinct object tracks spanning the entire footage.

# Optical flow-based counting via Bayesian filtering and confidence regions

Our counting method is divided into several interacting blocks. First, a
detector outputs a set of predicted positions for objects in the current

tracking module, proposing distinct tracks for each object. A final
postprocessing step only keeps the best tracks which are enumerated to yield
the final count.

## Detector

### Center-based anchor-free detection

In most benchmarks, the prediction quality of object attributes like bounding
boxes is often used to improve tracking. For counting, however, point

detector outputs a set $\mathcal{D}_n = \{z_n^i\}_{1 \leq i \leq D_n}$ where
each $z_n^i = (x_n^i,y_n^i)$ specifies the coordinates of one of the $D_n$
detected objects.
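
As a concrete illustration, point detections can be read off a predicted center heatmap by keeping thresholded local maxima. The sketch below is a mock-up under our own assumptions (function name, window size and threshold are illustrative, not the paper's code):

```python
import numpy as np
from scipy.ndimage import maximum_filter

def extract_detections(heatmap, threshold=0.3):
    """Return point detections z_n^i = (x_n^i, y_n^i) as thresholded
    local maxima of a predicted center heatmap."""
    is_peak = maximum_filter(heatmap, size=3) == heatmap
    ys, xs = np.nonzero(is_peak & (heatmap > threshold))
    return list(zip(xs.tolist(), ys.tolist()))

# Synthetic heatmap with a single strong response at (x=50, y=100).
heatmap = np.zeros((256, 256))
heatmap[100, 50] = 0.9
print(extract_detections(heatmap))  # [(50, 100)]
```
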

## Training {#sec-detector_training}

Training the detector is done similarly to @Proenca2020.

## Bayesian tracking with optical flow {#sec-bayesian_tracking}

### Optical flow

Between two timesteps $n-1$ and $n$, the optical flow $\Delta_n$ is a mapping
satisfying the following consistency constraint (@paragios2006):

coordinate on that grid. To estimate $\Delta_n$, we choose a simple
unsupervised Gunnar Farnebäck algorithm, which does not require further
annotations, see @farneback2003two for details.
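
OpenCV ships an implementation of this algorithm; below is a minimal sketch of estimating a dense flow field between two consecutive frames (the parameter values are illustrative defaults, not necessarily the ones used in the paper):

```python
import cv2
import numpy as np

# Two consecutive grayscale frames (synthetic here; in practice, frames
# extracted from the river footage).
frame_prev = np.random.randint(0, 256, (1080, 1920), dtype=np.uint8)
frame_curr = np.roll(frame_prev, 5, axis=1)  # simulated horizontal motion

# Dense optical flow with the Gunnar Farneback algorithm.
flow = cv2.calcOpticalFlowFarneback(
    frame_prev, frame_curr, None,
    pyr_scale=0.5, levels=3, winsize=15,
    iterations=3, poly_n=5, poly_sigma=1.2, flags=0,
)
# flow[y, x] estimates the displacement of pixel (x, y) between frames.
print(flow.shape)  # (1080, 1920, 2)
```
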

### State space model {#sec-state_space_model}

Using optical flow as a building block, we posit a state space model where
estimates of $\Delta_n$ are used as a time and state-dependent offset for the

centered Gaussian random variables with covariance matrix $R$.
In the following, $Q$ and $R$ are assumed to be diagonal, and are
hyperparameters set to values given in @sec-covariance_matrices.
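
Concretely, one step of these dynamics can be simulated as follows; the value of $Q$ below is an illustrative placeholder rather than the calibrated one:

```python
import numpy as np

rng = np.random.default_rng(0)
Q = np.diag([2.0, 2.0])  # illustrative diagonal state noise covariance

def propagate(x_prev, flow):
    """Sample X_n given X_{n-1} = x_prev: the optical flow evaluated at
    the previous position acts as a state-dependent offset."""
    col, row = int(round(x_prev[0])), int(round(x_prev[1]))
    offset = flow[row, col]  # Delta_n(X_{n-1})
    return x_prev + offset + rng.multivariate_normal(np.zeros(2), Q)

# Uniform 5-pixel rightward motion as a stand-in flow field.
flow = np.zeros((1080, 1920, 2))
flow[..., 0] = 5.0
print(propagate(np.array([50.0, 100.0]), flow))  # approx. [55, 100]
```
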

### Approximations of the filtering distributions

Denoting $u_{1:k} = (u_1,\ldots,u_k)$ for any $k$ and sequence $(u_i)_{i \geq
0}$, Bayesian filtering aims at computing the conditional distribution of

observations, as the contribution of $\Delta_k$ in every transition ensures
that each filter can cope with arbitrary inter-frame motion to keep track of
its target.

### Generating potential object tracks

The full MOT algorithm consists of a set of single-object trackers following
the previous model, but each provided with distinct observations at every
frame. These separate filters provide track proposals for every object
detected in the video.

## Data association using confidence regions {#sec-data_association}

Throughout the video, depending on various conditions on the incoming
detections, existing trackers must be updated (with or without a new

V_\delta(z_n^i))$, which is standard in modern MOT.

Visual representation of the tracking pipeline.
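
To make the association step concrete, the sketch below scores each detection against each tracker by the mass that a diagonal Gaussian predictive distribution (as in the EKF variant) puts on the neighborhood $V_\delta(z_n^i)$, then matches greedily; the greedy rule, $\delta$ and the acceptance threshold are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np
from scipy.stats import norm

def region_mass(pred_mean, pred_var, z, delta):
    """Mass of a diagonal-covariance Gaussian predictive distribution
    inside the square V_delta(z) of half-width delta centered on z."""
    std = np.sqrt(pred_var)
    per_axis = norm.cdf(z + delta, pred_mean, std) - norm.cdf(z - delta, pred_mean, std)
    return float(np.prod(per_axis))

def associate(pred_means, pred_vars, detections, delta=10.0, min_mass=0.1):
    """Greedily match each detection to the unused tracker whose
    confidence-region mass is highest, when that mass is large enough."""
    matches, used = {}, set()
    for i, z in enumerate(detections):
        scores = [region_mass(m, v, z, delta) if l not in used else -1.0
                  for l, (m, v) in enumerate(zip(pred_means, pred_vars))]
        best = int(np.argmax(scores))
        if scores[best] > min_mass:
            matches[i] = best
            used.add(best)
    return matches

# One tracker predicted near (50, 100) and one nearby detection.
print(associate([np.array([50.0, 100.0])], [np.array([4.0, 4.0])],
                [np.array([52.0, 101.0])]))  # {0: 0}
```
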
## Counting

At the end of the video, the previous process returns a set of candidate
tracks. For counting purposes, we find that simple heuristics can be further

optimized for best count performance (see @sec-tau_kappa_appendix for a
more comprehensive study).

# Metrics for MOT-based counting

Counting in videos using embedded moving cameras is not a common task, and as
such it requires a specific evaluation protocol to understand and compare the

even if some do provide insights to assist evaluation of count performance.
Second, considering only raw counts on long videos gives little information on
which of the final counts effectively arise from well-detected objects.

## Count-related MOT metrics

Popular MOT benchmarks usually report several sets of metrics such as ClearMOT
(@bernardin2008) or IDF1 (@RistaniSZCT16) which can account for

and association using the Jaccard index. The following components of their
work are relevant to our task (we provide equation numbers in the original
paper for formal definitions).

### Detection

First, when considering all frames independently, traditional detection recall
($\mathsf{DetRe}$) and precision ($\mathsf{DetPr}$) can be computed to assess the capabilities

Thus, low $\mathsf{DetRe}$ could theoretically be compensated with robust tracking.
Again, this suggests that low $\mathsf{DetPr}$ may allow decent counting performance.
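
In code, these frame-level quantities are ordinary recall and precision pooled over frames; a minimal sketch (the matching rule that yields the TP/FN/FP counts is omitted):

```python
def det_re_pr(tp: int, fn: int, fp: int):
    """DetRe = TP / (TP + FN) and DetPr = TP / (TP + FP)."""
    det_re = tp / (tp + fn) if tp + fn else 0.0
    det_pr = tp / (tp + fp) if tp + fp else 0.0
    return det_re, det_pr

print(det_re_pr(tp=80, fn=20, fp=10))  # (0.8, 0.888...)
```
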

### Association

HOTA association metrics are built to measure tracking performance
irrespective of the detection capabilities, by comparing predicted tracks

Consequently, $\mathsf{AssRe}$ does not account for tracks predicted from streams of false positive detections (arising from rocks, water reflections, etc.).
Since such tracks induce false counts, a tracker that produces as few of them as possible is preferable, but MOT metrics do not measure this.

## Count metrics

Denoting by $\mathsf{\hat{N}}$ and $\mathsf{N}$ the respective predicted and ground truth
counts for the validation material, the error $\mathsf{\hat{N}} - \mathsf{N}$ is misleading as
no information is provided on the quality of the predicted counts.
Additionally, results on the original validation footage do not measure the
statistical variability of the proposed estimators.

### Count decomposition

Define $i \in [\![1, \mathsf{N}]\!]$ and $j \in [\![1, \mathsf{\hat{N}}]\!]$ the labels of the
annotated ground truth tracks and the predicted tracks, respectively. At

For more complicated data, an adaptation of such contributions into proper
counting metrics could be valuable.

### Statistics

Since the original validation set comprises only a few unequally long videos,
only absolute results are available. Splitting the original sequences into

$\hat{\sigma}_{\mathsf{\hat{N}}_\bullet}$ the associated empirical standard deviations
computed on the set of short sequences.
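
Once per-segment counts are available, the reported statistics reduce to an empirical mean and standard deviation; a sketch with made-up counts:

```python
import numpy as np

segment_counts = np.array([4, 7, 5, 6, 8, 5])  # illustrative per-segment counts
print(segment_counts.mean())       # empirical mean of the predicted counts
print(segment_counts.std(ddof=1))  # associated empirical standard deviation
```
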

# Experiments

We denote by $S_1$, $S_2$ and $S_3$ the three river sections of the evaluation material and split the associated footage into independent segments of 30 seconds. We further divide this material into two distinct validation (6min30) and test (7min) splits.

intermediate framerate to capture all objects while reducing the computational
burden.

## Detection

In the following section, we present the performance of the trained detector.
Having annotated all frames of the evaluation videos, we directly compute

To fairly compare the three solutions, we calibrate the hyperparameters of our
postprocessing block on the validation split and keep the values that minimize
the overall count error $\mathsf{\hat{N}}$ for each of them separately (see

#### Detailed results on individual segments

# Practical impact and future goals

We successfully tackled video object counting on river banks, in particular issues that can be addressed independently of detection quality.
Moreover, the methodology developed to assess count quality enables us to precisely highlight the challenges that pertain to video object counting on river banks.

The resulting field data will help better understand litter origin, allowing to
Correlations between macro litter density and environmental parameters will be studied (e.g., population density, catchment size, land use and hydromorphology).
Finally, our work naturally benefits any extension of macrolitter monitoring in other areas (urban, coastal, etc.) that may rely on a similar setup of moving cameras.

# Supplements

## Details on the image dataset {#sec-image_dataset_appendix}

### Categories

In this work, we do not seek to precisely predict the proportions of the
different types of counted litter. However, we build our dataset to allow

Trash categories defined to facilitate porting to a counting system that allows

## Details on the evaluation videos {#sec-video-dataset-appendix}

### River segments

In this section, we provide further details on the evaluation material.
@fig-river-sections shows the setup and positioning of the three river

The following items provide further details on the exact annotation process.

- We do not provide inferred locations when an object is fully occluded, but tracks restart with the same identity whenever the object becomes visible again.
- Tracks stop whenever an object becomes indistinguishable and will not reappear again.

## Implementation details for the tracking module {#sec-tracking_module_appendix}

### Covariance matrices for state and observation noises {#sec-covariance_matrices}

In our state space model, $Q$ models the noise associated with the movement
model we posit in @sec-bayesian_tracking involving optical flow estimates,

As long as values are meaningful relative to the image dimensions and the size of the objects.

### Influence of $\tau$ and $\kappa$ {#sec-tau_kappa_appendix}

The roles of $\kappa$, $\tau$ and $\nu$ can be understood as follows.
For any track, given a value for $\kappa$ and $\nu$, an observation at time $n$ is only kept if there are also $\nu \cdot \kappa$ observations in the temporal window of size $\kappa$ that surrounds $n$ (windows are centered around $n$ except at the start and end of the track).
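
A sketch of this windowed rule applied to the observation times of one track follows; the boundary handling is our assumption, and in the full pipeline tracks whose surviving observations are too few are then discarded, which is where $\tau$ intervenes:

```python
def filter_track(observation_times, kappa=5, nu=0.6):
    """Keep an observation at time n only if the window of size kappa
    around n (clipped at the track boundaries) contains at least
    nu * kappa observations."""
    times = set(observation_times)
    start, end = min(times), max(times)
    kept = []
    for n in observation_times:
        lo = max(start, min(n - kappa // 2, end - kappa + 1))
        support = sum(1 for t in range(lo, lo + kappa) if t in times)
        if support >= nu * kappa:
            kept.append(n)
    return kept

# Isolated detections (here at frame 20) are discarded.
print(filter_track([1, 2, 3, 4, 5, 20]))  # [1, 2, 3, 4, 5]
```
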
Considering a state space model with $(X_k, Z_k)_{k \geq 0}$ the random processes for the states and observations, respectively, the filtering recursions are given by:

$$Q_k = Q, R_k = R,$$
$$B_k = I, b_k = 0.$$
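
The generic recursions elided above are standard Kalman predict/update steps; under the specializations just stated they can be sketched as below, treating the optical-flow offset as locally constant so that the transition Jacobian is the identity (a simplifying assumption of this sketch, with illustrative covariance values):

```python
import numpy as np

def kalman_step(mean, cov, z, offset, Q, R):
    """One predict/update cycle with identity transition plus flow offset
    and identity observation model."""
    pred_mean = mean + offset        # predict with the optical-flow offset
    pred_cov = cov + Q
    if z is None:                    # no detection associated at this frame
        return pred_mean, pred_cov
    S = pred_cov + R                 # innovation covariance
    K = pred_cov @ np.linalg.inv(S)  # Kalman gain
    new_mean = pred_mean + K @ (z - pred_mean)
    new_cov = (np.eye(len(mean)) - K) @ pred_cov
    return new_mean, new_cov

mean, cov = np.array([50.0, 100.0]), 4.0 * np.eye(2)
Q, R = np.diag([5.0, 5.0]), np.diag([2.0, 2.0])  # illustrative values
print(kalman_step(mean, cov, np.array([56.0, 101.0]), np.array([5.0, 0.0]), Q, R))
```
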

## Computing the confidence regions {#sec-confidence_regions_appendix}

In words, $P(i,\ell)$ is the mass in $V_\delta(z_n^i) \subset \mathbb{R}^2$ of the

the corresponding confidence regions (see @sec-tracking_module_appendix above).
this distribution when SMC is used, and performance comparisons between the
EKF, UKF and SMC versions of our trackers are discussed.

### SMC-based tracking

Denote $\mathbb{Q}_k$ the filtering distribution (i.e. that of $Z_k$ given $X_{1:k}$) for the HMM $(X_k,Z_k)_{k \geq 1}$ (omitting the dependency on the observations for notation ease).
Using a set of samples $\{X_k^i\}_{1 \leq i \leq N}$ and importance weights $\{w_k^i\}_{1 \leq i \leq N}$, SMC methods build an approximation of the following form:

recover object identities via $\widehat{\mathbb{L}}_n^{\ell}(V_\delta(z_n^i))$
computed for all incoming detections $\mathcal{D}_n = \{z_n^i\}_{1 \leq i \leq D_n}$ and each of the $1 \leq \ell \leq L_n$ filters, where $\widehat{\mathbb{L}}_n^{\ell}$ is the predictive distribution associated with the $\ell$-th filter.
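
With a particle approximation, evaluating $\widehat{\mathbb{L}}_n^{\ell}(V_\delta(z_n^i))$ reduces to summing the weights of the particles that fall inside the region; a minimal sketch with synthetic particles (the particle generation and weights below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Weighted particle approximation of one filter's predictive distribution.
particles = rng.normal(loc=[50.0, 100.0], scale=5.0, size=(500, 2))
weights = np.full(500, 1.0 / 500)  # uniform weights, e.g. after resampling

def region_mass_smc(particles, weights, z, delta):
    """Estimate the predictive mass of V_delta(z): sum the weights of the
    particles lying in the square of half-width delta centered on z."""
    inside = np.all(np.abs(particles - z) <= delta, axis=1)
    return float(weights[inside].sum())

print(region_mass_smc(particles, weights, np.array([50.0, 100.0]), 10.0))
```
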

### Performance comparison

In theory, sampling-based methods like UKF and SMC are better suited for
nonlinear state space models like the one we propose in @sec-state_space_model.

This supports keeping it as a faster and computationally simpler option.
That said, this conclusion might not hold in scenarios where camera motion is even stronger, which was our main motivation to develop a flexible tracking solution and to provide implementations of UKF and SMC versions.
This allows easier extension of our work to more challenging data.