Data Structure
This page describes the required data formats for the training and predictions to work. You can download some example data here. The information below applies to the example data, so you can cross reference and verify the information based on the example data.
The pipeline can process multiple nuclei in ROIs. One ROI can encapsulate many nuclei at once and is represented by three files: an .h5 file holding the image color information (Y), a corresponding .h5 file holding the label (y) information (for training, if available), and a .csv file for metadata.
A single .h5 file can hold the image information of many nuclei (ideally as 8-bit) in one bundled file. This file stores the image information in a key-value relationship. The example data ESM9_D2.h5 contains:
```
ESM9_D2_0: [r, g, b]
ESM9_D2_1: [r, g, b]
ESM9_D2_2: [r, g, b]
ESM9_D2_3: [r, g, b]
ESM9_D2_4: [r, g, b]
ESM9_D2_5: [r, g, b]
ESM9_D2_6: [r, g, b]
...
```
For example, the entry ESM9_D2_0 is the key for a single nucleus. There is no maximum limit on the number of keys an .h5 file can hold. The [r, g, b] value holds the red, green, and blue image color channels showing the nucleus, stored as an 8-bit m x n x 3 array. By default, m=64 and n=64. The neurite information should be in the red channel, the oligodendrocyte information in the green channel, and the nucleus staining in the blue channel.
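The bundled image format described above can be read with h5py. The sketch below first writes a small stand-in file so it is self-contained; the file name and random 64 x 64 RGB arrays are illustrative, not part of the example data.

```python
import numpy as np
import h5py

# Write a small stand-in image .h5 file (illustrative, not the real example data).
rng = np.random.default_rng(0)
with h5py.File("ESM9_D2.h5", "w") as f:
    for i in range(3):
        # One 64 x 64 RGB nucleus image per key, stored as 8-bit.
        f.create_dataset(
            f"ESM9_D2_{i}",
            data=rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8),
        )

# Read the file back: each key maps to one nucleus image.
with h5py.File("ESM9_D2.h5", "r") as f:
    for key in f.keys():
        img = f[key][()]  # m x n x 3 array
        print(key, img.shape, img.dtype)
```

Iterating over `f.keys()` is all that is needed to process every nucleus in one ROI.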
As stated above, image information and metadata are linked via shared keys. Such a key should match the following regular expression: (\w\d+)_\d+$. This makes it possible to extract metadata about the well from the key. In the example case above, this means the information can be attributed to the well D02.
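Extracting the well from a key with the regular expression above can be sketched as follows. Note that the example keys spell the well "D2" while the metadata file spells it "D02"; the function returns the well exactly as it appears in the key.

```python
import re

def well_from_key(key: str) -> str:
    """Extract the well name from a nucleus key, e.g. 'ESM9_D2_0' -> 'D2'."""
    m = re.search(r"(\w\d+)_\d+$", key)
    if m is None:
        raise ValueError(f"Key does not match the expected pattern: {key!r}")
    return m.group(1)

print(well_from_key("ESM9_D2_0"))  # -> D2
```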
If you want to train a network, you also need to provide binary labels for every nucleus. This file should be named the same as the image data but end with _label.h5; it is not required if you only predict data. A single such .h5 file can hold the labels of many nuclei in one bundled file. This file stores the label information in a key-value relationship. The example data ESM9_D02_Manual-Neurons-1_label.h5 contains the following neuron annotations:
```
ESM9_D2_0: 0
ESM9_D2_1: 0
ESM9_D2_2: 0
ESM9_D2_3: 0
ESM9_D2_4: 0
ESM9_D2_5: 0
ESM9_D2_6: 1
...
```
Since this is a binary classification, every label should be provided in integer format: 0 for a negative label and 1 for a positive label. Note how the keys match the image data above, linking image information and labels. There should be a label for every image.
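Writing and reading such a _label.h5 file mirrors the image file: one scalar integer dataset per key. The file name below is illustrative.

```python
import h5py

# Labels taken from the example annotations above (0 = negative, 1 = positive).
labels = {"ESM9_D2_0": 0, "ESM9_D2_1": 0, "ESM9_D2_6": 1}

# Write one scalar integer dataset per nucleus key (illustrative file name).
with h5py.File("ESM9_D2_label.h5", "w") as f:
    for key, label in labels.items():
        f.create_dataset(key, data=label)

# Read the labels back into a plain dict.
with h5py.File("ESM9_D2_label.h5", "r") as f:
    loaded = {key: int(f[key][()]) for key in f.keys()}

print(loaded)
```

Because the keys are shared with the image .h5 file, the two files can be joined on their keys to build (image, label) training pairs.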
If you want to predict unknown data with a pre-trained network, you also need to provide metadata for every nucleus. This file should be named the same as the image data but end with _overview.csv. A single such .csv file (with ; as the delimiter) can hold the metadata of many nuclei in one bundled file. Each row links a nucleus ID to its coordinates and label. The example data ESM9_D02_Manual-Neurons-1_overview.csv contains the following neuron annotations:
Table notation:

| ID | x | y | label (training version only) | label (prediction version only) |
|---|---|---|---|---|
| ESM9_D02_0 | 1921 | 2821 | 0 | ? |
| ESM9_D02_1 | 1711 | 2830 | 0 | ? |
| ESM9_D02_2 | 1819 | 2830 | 0 | ? |
| ESM9_D02_3 | 1628 | 2832 | 0 | ? |
| ESM9_D02_4 | 1561 | 2837 | 0 | ? |
| ESM9_D02_5 | 1913 | 2842 | 0 | ? |
| ESM9_D02_6 | 1806 | 2853 | 1 | ? |
| ... | ... | ... | ... | ... |
Raw notation (training version):

```
ID;x;y;label
ESM9_D02_0;1921;2821;0
ESM9_D02_1;1711;2830;0
ESM9_D02_2;1819;2830;0
ESM9_D02_3;1628;2832;0
ESM9_D02_4;1561;2837;0
ESM9_D02_5;1913;2842;0
ESM9_D02_6;1806;2853;1
...
```
This file matches the image IDs with the x and y coordinates for every nucleus in the ROI. Depending on whether you are predicting data or training a model, this file comes in two versions. For the training version, the binary label is entered in integer format (see above). If you want to predict data, the label entry should be replaced with a ?. Note how the IDs and keys match the image data above, linking image information and labels. There should be a label and coordinates for every image.
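The semicolon-delimited overview file can be produced and parsed with Python's standard csv module. The sketch below writes a training-version file with two of the example rows (the file name is illustrative); note that DictReader returns every field as a string.

```python
import csv

# Two rows from the example overview data (training version).
rows = [
    {"ID": "ESM9_D02_0", "x": 1921, "y": 2821, "label": 0},
    {"ID": "ESM9_D02_6", "x": 1806, "y": 2853, "label": 1},
]

# Write the overview .csv with ';' as the delimiter (illustrative file name).
with open("overview.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["ID", "x", "y", "label"], delimiter=";")
    writer.writeheader()
    writer.writerows(rows)

# Read it back; all values come back as strings.
with open("overview.csv", newline="") as f:
    loaded = list(csv.DictReader(f, delimiter=";"))

print(loaded[0]["ID"], loaded[0]["x"], loaded[0]["label"])
```

For the prediction version, the label field would simply hold the string "?" instead of an integer.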
You can learn more by reading our corresponding publication. This model, its results, and the underlying research have been published in Cytometry Part A on Nov 7th, 2021. Read the open-access article here: https://doi.org/10.1002/cyto.a.24514
Correspondence:
Prof. Dr. Axel Mosig, Bioinformatics Group, Ruhr Universität Bochum, Germany