Data Structure

This page describes the data formats required for training and prediction. You can download some example data here. The information below applies to the example data, so you can cross-reference it against the example files.

Data Splitting

The pipeline can process multiple nuclei in ROIs. Each ROI can contain many nuclei and is represented by three files: a .h5 file holding the image color information (Y), a corresponding .h5 file holding the label (y) information (for training, if available), and a .csv file holding the metadata.
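As a minimal sketch of how the three files of one ROI relate, the helper below derives the label and metadata paths from an image path. It assumes the three files sit in the same directory and follow the naming rules given in the Label Data and Metadata sections (suffixes `_label.h5` and `_overview.csv`, the latter taken from the example file name). The function name `roi_files` and the `data/` directory are hypothetical and not part of the pipeline.

```python
from pathlib import Path


def roi_files(image_path):
    """Return the (image, label, metadata) paths expected for one ROI.

    Assumption: the label and metadata files sit next to the image file
    and use the suffixes "_label.h5" and "_overview.csv".
    """
    image_path = Path(image_path)
    stem = image_path.stem  # file name without the ".h5" extension
    label_path = image_path.with_name(stem + "_label.h5")
    meta_path = image_path.with_name(stem + "_overview.csv")
    return image_path, label_path, meta_path


if __name__ == "__main__":
    # "data/" is a made-up directory used only for illustration.
    for path in roi_files("data/ESM9_D2.h5"):
        print(path)
```

Checking that all three paths exist before training (or the image and metadata paths before prediction) can catch misnamed files early.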

Image Data

A single .h5 file bundles the image information of many nuclei (preferably 8-bit) as key-value pairs. The example data ESM9_D2.h5 contains:

ESM9_D2_0:
   [r, g, b]
ESM9_D2_1:
   [r, g, b]
ESM9_D2_2:
   [r, g, b]
ESM9_D2_3:
   [r, g, b]
ESM9_D2_4:
   [r, g, b]
ESM9_D2_5:
   [r, g, b]
ESM9_D2_6:
   [r, g, b]

...

For example, the entry ESM9_D2_0 is the key for a single nucleus. There is no limit on the number of keys a .h5 file can hold.

The [r, g, b] entry holds the red, green, and blue color channels showing the nucleus, stored as an 8-bit m x n x 3 float array. By default, m=64 and n=64. The neurite information should be in the red channel, the oligodendrocyte information in the green channel, and the nucleus staining in the blue channel.
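The layout described above can be sanity-checked with a short script. The following is a sketch, not the pipeline's own loader: it assumes every top-level key of the .h5 file maps directly to a dataset (not a nested group) and uses the third-party `h5py` package to open the file. The function names `shape_problems` and `check_image_file` are hypothetical.

```python
def shape_problems(entries, expected_shape=(64, 64, 3)):
    """Map each key to its shape if that shape differs from expected_shape.

    `entries` is any mapping from key to an object with a `.shape`
    attribute, for example an open h5py.File or a dict of NumPy arrays.
    """
    problems = {}
    for key in entries:
        shape = tuple(entries[key].shape)
        if shape != tuple(expected_shape):
            problems[key] = shape
    return problems


def check_image_file(path, expected_shape=(64, 64, 3)):
    """Open an image .h5 file read-only and report unexpected shapes."""
    # Third-party dependency, imported here so that shape_problems
    # stays usable without h5py installed.
    import h5py

    with h5py.File(path, "r") as h5_file:
        return shape_problems(h5_file, expected_shape)
```

Calling `check_image_file("ESM9_D2.h5")` on the example data should return an empty dict if every nucleus uses the default 64 x 64 x 3 layout.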

Keys

As stated above, image information and metadata are linked via shared "keys". A key should match the following regular expression: (\w\d+)_\d+$. This also allows metadata about the well to be extracted from the key. In the example above, this means the information can be attributed to the well D02.
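To illustrate how the regular expression above isolates the well, here is a small sketch using Python's `re` module. Whether the pipeline itself applies the pattern with a search (as done here) is an assumption, and the function name `well_from_key` is hypothetical.

```python
import re

# Pattern quoted from the text above: a well ID, an underscore, and a
# running index at the end of the key.
KEY_PATTERN = re.compile(r"(\w\d+)_\d+$")


def well_from_key(key):
    """Return the well part of a key, or None if the key does not match."""
    # search() is used because the well does not start at the beginning
    # of the key (the experiment prefix comes first).
    match = KEY_PATTERN.search(key)
    return match.group(1) if match else None


if __name__ == "__main__":
    for key in ["ESM9_D2_0", "ESM9_D02_6", "no-index"]:
        print(key, "->", well_from_key(key))
```

With this sketch, the key ESM9_D2_0 yields D2 and ESM9_D02_6 yields D02, while a key without a trailing index yields None.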

Label Data

If you want to train a network, you also need to provide binary labels for every nucleus. The label file should have the same name as the image data file but end with _label.h5; it is not required for prediction. A single such .h5 file bundles the labels of many nuclei as key-value pairs. The example data ESM9_D02_Manual-Neurons-1_label.h5 contains the following neuron annotations:

ESM9_D2_0: 0
ESM9_D2_1: 0
ESM9_D2_2: 0
ESM9_D2_3: 0
ESM9_D2_4: 0
ESM9_D2_5: 0
ESM9_D2_6: 1

...

Since this is a binary classification, every label should be provided as an integer, with 0 for a negative label and 1 for a positive one. Note how the keys match those of the image data above, linking image information and labels. There should be a label for every image.
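The two rules above (one label per image, labels restricted to 0 and 1) can be checked before training. The sketch below is not part of the pipeline: the function names are hypothetical, `read_labels` relies on the third-party `h5py` package, and it assumes each label is stored as a scalar dataset under its key.

```python
def label_problems(image_keys, labels):
    """Return a list of readable problems; an empty list means consistent.

    `image_keys` is an iterable of keys from the image file and `labels`
    maps keys to integer labels already read from the label file.
    """
    problems = []
    image_keys = set(image_keys)
    label_keys = set(labels)
    for key in sorted(image_keys - label_keys):
        problems.append(f"{key}: image has no label")
    for key in sorted(label_keys - image_keys):
        problems.append(f"{key}: label has no image")
    for key in sorted(image_keys & label_keys):
        if int(labels[key]) not in (0, 1):
            problems.append(f"{key}: label {labels[key]!r} is not 0 or 1")
    return problems


def read_labels(path):
    """Read every label from a *_label.h5 file into a plain dict."""
    # Third-party dependency, imported here so that label_problems
    # stays usable without h5py installed.
    import h5py

    with h5py.File(path, "r") as h5_file:
        # Assumption: each key holds a scalar dataset, read with [()].
        return {key: int(h5_file[key][()]) for key in h5_file}
```

A typical call would pass the keys of the image file together with `read_labels(...)` of the matching label file and stop if the returned list is not empty.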

Metadata

If you want to predict unknown data with a pre-trained network, you also need to provide metadata for every nucleus. The metadata file should have the same name as the image data file but end with _overview.csv. A single such .csv file (with ; as the delimiter) bundles the metadata of many nuclei. The example data ESM9_D02_Manual-Neurons-1_overview.csv contains the following neuron annotations:

Table notation:

ID	x	y	label (training version only)	label (prediction version only)
ESM9_D02_0	1921	2821	0	?
ESM9_D02_1	1711	2830	0	?
ESM9_D02_2	1819	2830	0	?
ESM9_D02_3	1628	2832	0	?
ESM9_D02_4	1561	2837	0	?
ESM9_D02_5	1913	2842	0	?
ESM9_D02_6	1806	2853	1	?
...	...	...	...	...

Raw notation (training version):

ID;x;y;label
ESM9_D02_0;1921;2821;0
ESM9_D02_1;1711;2830;0
ESM9_D02_2;1819;2830;0
ESM9_D02_3;1628;2832;0
ESM9_D02_4;1561;2837;0
ESM9_D02_5;1913;2842;0
ESM9_D02_6;1806;2853;1

...

This file matches the image IDs with the x and y coordinates for every nucleus in the ROI.

Depending on whether you are training a model or predicting data, this file comes in two versions. In the training version, the binary label is entered as an integer (see above). In the prediction version, the label entry should be replaced with a ?.

Note how the IDs match the keys of the image data above, linking image information, coordinates, and labels. There should be a label entry and coordinates for every image.
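Both versions of the metadata file can be read with Python's standard `csv` module. The sketch below assumes the header row `ID;x;y;label` shown in the raw notation and returns a `?` label as `None`; the function name `read_overview` is hypothetical and this is not the pipeline's own reader.

```python
import csv
import io


def read_overview(handle):
    """Parse a ';'-delimited overview file into a list of dicts.

    A label of "?" (prediction version) is returned as None.
    """
    rows = []
    for row in csv.DictReader(handle, delimiter=";"):
        label = row["label"].strip()
        rows.append({
            "ID": row["ID"],
            "x": int(row["x"]),
            "y": int(row["y"]),
            "label": None if label == "?" else int(label),
        })
    return rows


if __name__ == "__main__":
    # This sample mixes both versions only to exercise both branches;
    # a real file uses either integer labels or "?" throughout.
    sample = io.StringIO(
        "ID;x;y;label\n"
        "ESM9_D02_0;1921;2821;0\n"
        "ESM9_D02_1;1711;2830;?\n"
    )
    for entry in read_overview(sample):
        print(entry)
```

For a file on disk, the handle would come from `open(path, newline="")`, as recommended by the `csv` module documentation.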

More Information

This model, its results, and the underlying research were published in Cytometry Part A on Nov 7th, 2021. You can learn more by reading the open access article: https://doi.org/10.1002/cyto.a.24514

Correspondence:

Prof. Dr. Axel Mosig, Bioinformatics Group, Ruhr Universität Bochum, Germany
