# Benchmarks

**This is a fair and controlled comparison between the different cell/nuclei-segmentation models that are implemented in this library.**

### Background info

- **<span style="color:green">Cell/Nuclei-segmentation</span>** performance is benchmarked against the [**Pannuke**](https://arxiv.org/abs/2003.10778) and [**Lizard**](http://arxiv.org/abs/2108) datasets.
- **<span style="color:orange">Panoptic-segmentation</span>** performance is benchmarked against the **HGSOC** and **CIN2** datasets.

#### Segmentation Performance Metrics

- Panoptic Quality (PQ)
  - The bPQ (cell-type unaware), mPQ (cell-type aware), and cell-type-specific PQs are reported for all models.
- Mean Intersection over Union (mIoU)
  - mIoU is also reported for the semantic-segmentation results of the panoptic-segmentation models.

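A minimal sketch of how the binary PQ could be computed from a pairwise IoU matrix between predicted and ground-truth instances. This is illustrative code, not this library's metric implementation; the mPQ is obtained by averaging the same quantity per cell type.

```python
import numpy as np

def panoptic_quality(ious: np.ndarray, iou_thresh: float = 0.5) -> float:
    """Toy PQ from a (n_pred, n_gt) pairwise IoU matrix.

    With a 0.5 threshold each instance can match at most once, so matched
    pairs are TPs, unmatched predictions are FPs, and unmatched ground-truth
    instances are FNs. PQ = SQ * RQ.
    """
    matched = ious > iou_thresh
    tp = matched.sum()
    fp = int((matched.sum(axis=1) == 0).sum())   # predictions with no match
    fn = int((matched.sum(axis=0) == 0).sum())   # ground truths with no match
    sq = ious[matched].sum() / max(tp, 1)        # segmentation quality
    rq = tp / (tp + 0.5 * fp + 0.5 * fn + 1e-8)  # recognition quality
    return float(sq * rq)
```
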
#### Latency Metrics for Multipart Models

These models are multipart: each is composed of an encoder-decoder neural network and a post-processing pipeline. Thus, for all of the models, we report:

- Number of parameters in the encoder-decoder architecture
- Encoder-decoder FLOPs
- Encoder-decoder latency (img/s)
- Post-processing latencies (img/s)
- Total latency (img/s)

Note that the post-processing pipelines are often composed of several parts. For nuclei/cell segmentation, the post-processing pipeline consists of a nuclei-instance separation part and a cell-type majority-voting part, and their latencies are benchmarked separately. For panoptic segmentation, the semantic-segmentation post-processing part is also benchmarked separately. **The reported latency metrics are an average over the validation split** (see the timing sketch below).

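As a rough illustration of how these per-stage latencies can be measured: the sketch below uses placeholder `model`, `post_proc`, and `loader` objects rather than this library's actual benchmarking code.

```python
import time
import torch

@torch.no_grad()
def benchmark_latency(model, post_proc, loader, device="cuda"):
    """Time the encoder-decoder forward pass and the post-processing
    separately over the validation split and report images per second."""
    model.eval().to(device)
    n_imgs, net_time, post_time = 0, 0.0, 0.0

    for batch in loader:
        imgs = batch["image"].to(device)

        if device == "cuda":
            torch.cuda.synchronize()        # make sure the GPU queue is empty
        t0 = time.perf_counter()
        soft_masks = model(imgs)            # encoder-decoder forward pass
        if device == "cuda":
            torch.cuda.synchronize()        # wait for the async CUDA kernels
        t1 = time.perf_counter()

        post_proc(soft_masks)               # e.g. instance separation + majority voting
        t2 = time.perf_counter()

        net_time += t1 - t0
        post_time += t2 - t1
        n_imgs += imgs.shape[0]

    return {
        "network img/s": n_imgs / net_time,
        "post-proc img/s": n_imgs / post_time,
        "total img/s": n_imgs / (net_time + post_time),
    }
```
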
#### Devices

The model latencies depend on the hardware, so I benchmark them both on my laptop and on an HPC server.

- Laptop specs:
  - A worn-out NVIDIA GeForce RTX 2080 Mobile (8 GB VRAM)
  - Intel i7-9750H, 6 cores @ 2.60 GHz (32 GiB RAM)
- HPC specs:
  - NVIDIA V100 (32 GB VRAM)
  - Xeon Gold 6230, 2 x 20 cores @ 2.1 GHz (384 GiB RAM)

#### About the Datasets

**Pannuke** is the only dataset that contains fixed-sized (256x256) patches, so the benchmarking is straightforward and not affected by the hyperparameters of the post-processing pipelines. The **Lizard**, **HGSOC**, and **CIN2** datasets, however, contain images of differing sizes. This means, firstly, that the patching strategy of the training split has an effect on model performance and, secondly, that inference requires a sliding-window approach. The segmentation performance is typically quite sensitive to the sliding-window hyperparameters, especially the `patch size` and `stride`. Thus, for these datasets, I also report the training-data patching strategy and grid-search the best sliding-window hyperparameters (a sketch of the search is shown below).

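A sketch of what the sliding-window grid search could look like; the `run_sliding_window_inference` and `compute_mpq` helpers, as well as the grid values, are hypothetical placeholders, not functions or settings of this library.

```python
from itertools import product

PATCH_SIZES = [(256, 256), (320, 320), (512, 512)]  # example grid, not the final one
STRIDES = [80, 128, 160, 256]

def grid_search_sliding_window(model, val_images, val_masks):
    """Pick the patch size/stride pair that maximizes mPQ on the validation split."""
    best = {"mpq": -1.0, "patch_size": None, "stride": None}
    for patch_size, stride in product(PATCH_SIZES, STRIDES):
        if stride > min(patch_size):
            continue  # a stride larger than the patch would leave gaps
        preds = run_sliding_window_inference(  # hypothetical inference helper
            model, val_images, patch_size=patch_size, stride=stride
        )
        mpq = compute_mpq(preds, val_masks)    # hypothetical metric helper
        if mpq > best["mpq"]:
            best = {"mpq": mpq, "patch_size": patch_size, "stride": stride}
    return best
```
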
#### Data Splits

The **Pannuke** and **Lizard** datasets are divided into three splits. For these datasets, we report the mean of the 3-fold cross-validation. The **CIN2** and **HGSOC** datasets contain only a training split and a relatively small validation split, so for those datasets we report the metrics on the validation split.

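For the 3-fold datasets the reported numbers are simply averages over the folds; the snippet below only illustrates the reporting convention, and the zero values are placeholders, not benchmark results.

```python
import numpy as np

# Placeholder per-fold metrics purely to illustrate the reporting convention;
# these are NOT benchmark results.
folds = {"fold1": {"bPQ": 0.0, "mPQ": 0.0},
         "fold2": {"bPQ": 0.0, "mPQ": 0.0},
         "fold3": {"bPQ": 0.0, "mPQ": 0.0}}

for metric in ("bPQ", "mPQ"):
    vals = [fold[metric] for fold in folds.values()]
    print(f"{metric}: {np.mean(vals):.3f} (mean over the 3 folds)")
```
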
#### Regularization methods

The models are regularized during training with multiple techniques to tackle distribution shifts. Besides augmentations, the specific techniques used in this benchmark are listed below (a loss sketch follows the list):

- [Spectral decoupling](https://arxiv.org/abs/2011.09468)
- [Label Smoothing](https://arxiv.org/abs/1512.00567)
- [Spatially Varying Label Smoothing](https://arxiv.org/abs/2104.05788)

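As an illustration, here is a hedged sketch of how two of these regularizers could enter the type-branch loss; this is not the library's loss implementation, and the `sd_lambda` and smoothing values are arbitrary examples.

```python
import torch
import torch.nn.functional as F

def regularized_type_loss(
    logits: torch.Tensor,     # (B, C, H, W) raw type-branch logits
    target: torch.Tensor,     # (B, H, W) integer class map
    sd_lambda: float = 0.01,  # illustrative weight for spectral decoupling
) -> torch.Tensor:
    """Cross-entropy with label smoothing plus a spectral-decoupling penalty."""
    # Label smoothing: soften the one-hot targets via the built-in argument.
    ce = F.cross_entropy(logits, target, label_smoothing=0.1)
    # Spectral decoupling: an L2 penalty on the raw logits.
    sd = 0.5 * sd_lambda * logits.pow(2).mean()
    return ce + sd
```

Spatially varying label smoothing differs from plain label smoothing in that the one-hot targets are smoothed with a small Gaussian kernel over their spatial neighbourhood, so the label uncertainty is concentrated near class boundaries.
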
#### Pre-trained backbone encoders

All the models are trained/fine-tuned with an ImageNet pre-trained backbone encoder; the backbone used for each model is reported.

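As a sketch of how an ImageNet pre-trained encoder can be pulled in (shown here with `timm`, which is an assumption for illustration rather than a statement about how this library loads its backbones):

```python
import timm

# `features_only=True` returns the intermediate feature maps that an
# encoder-decoder segmentation model needs for its skip connections.
encoder = timm.create_model(
    "resnet50",        # illustrative backbone choice
    pretrained=True,   # load the ImageNet weights
    features_only=True,
)
print(encoder.feature_info.channels())  # channel count per feature stage
```
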
#### Training Hyperparams

All the training hyperparameters are reported.

#### Other Notes

Note that even if these benchmark results are not SOTA or differ from the original manuscripts, the reason is likely not the model architecture or the post-processing method (since these are the same here), but rather the weight initialization, loss functions, training hyperparameters, regularization techniques, and other training tricks that affect model performance.

## Baseline models

### <span style="color:green">Cell/Nuclei-segmentation</span>

#### Results Pannuke

##### Training Set-up

| Param                 | Value                                     |
| --------------------- | ----------------------------------------- |
| Optimizer             | [AdamP](https://arxiv.org/abs/2006.08217) |
| Auxiliary Branch Loss | MSE-SSIM                                  |
| Type Branch Loss      | Focal-DICE                                |
| Encoder LR            | 0.00005                                   |
| Decoder LR            | 0.0005                                    |
| Scheduler             | Reduce on plateau                         |
| Batch Size            | 10                                        |
| Training Epochs       | 50                                        |
| Augmentations         | Blur, Hue Saturation                      |

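A hedged sketch of how the set-up in the table could translate into an optimizer/scheduler configuration; the `adamp` package and the `encoder`/`decoder` attribute names are assumptions for illustration, not this library's training code.

```python
import torch
from adamp import AdamP  # https://arxiv.org/abs/2006.08217, `pip install adamp`

def configure_optimization(model: torch.nn.Module):
    """Build the AdamP optimizer with separate encoder/decoder learning rates
    and a reduce-on-plateau scheduler, mirroring the table above."""
    optimizer = AdamP(
        [
            {"params": model.encoder.parameters(), "lr": 5e-5},  # Encoder LR
            {"params": model.decoder.parameters(), "lr": 5e-4},  # Decoder LR
        ]
    )
    # "Reduce on plateau": cut the LRs when the monitored validation metric stalls.
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="max", factor=0.5, patience=5  # illustrative values
    )
    return optimizer, scheduler
```
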
#### Results Lizard

##### Training Set-up

Same as above.

##### Patching Set-up

##### Sliding-window Inference Hyperparams