
Data Structure

Nils edited this page Dec 3, 2021 · 3 revisions

This page describes the data formats required for training and prediction. You can download some example data here. Everything below applies to the example data, so you can cross-reference and verify it against those files.

Data Splitting

The pipeline processes nuclei grouped into regions of interest (ROIs). One ROI can contain many nuclei and is represented by three files: a .h5 file with the image color information (X), a corresponding .h5 file with the label information (y; needed for training, if available), and a .csv file with metadata.

Image Data

A single .h5 file can bundle the image information of many nuclei (preferably 8-bit). The file stores this information as key-value pairs. The example data ESM9_D2.h5 contains:

ESM9_D2_0:
   [r, g, b]
ESM9_D2_1:
   [r, g, b]
ESM9_D2_2:
   [r, g, b]
ESM9_D2_3:
   [r, g, b]
ESM9_D2_4:
   [r, g, b]
ESM9_D2_5:
   [r, g, b]
ESM9_D2_6:
   [r, g, b]

...

For example, the entry ESM9_D2_0 is the key for a single nucleus. There is no upper limit on the number of keys a .h5 file can hold.

[r, g, b] stands for the red, green and blue color channels of the image showing the nucleus, stored as an 8-bit m x n x 3 float array. By default, m=64 and n=64. The neurite information should be in the red channel, the oligodendrocyte information in the green channel and the nucleus staining in the blue channel.

Keys

As stated above, image information and metadata are linked via shared "keys". Each key should match the following regular expression: (\w\d+)_\d+$. The pattern is also used to extract metadata about the well from the key. In the example above, this means the information can be attributed to the well D02.
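The key convention can be tried out with Python's standard re module. This is a minimal sketch, not the pipeline's own code: the helper name well_from_key is hypothetical, and using re.search (rather than matching from the start of the key) is an assumption about how the expression is applied.

```python
import re

# Regular expression quoted in the text: a well identifier followed by "_<index>" at the end of the key.
KEY_PATTERN = re.compile(r"(\w\d+)_\d+$")


def well_from_key(key):
    """Return the captured well part of a key, or None if the key does not match.

    Hypothetical helper for illustration only.
    """
    match = KEY_PATTERN.search(key)
    return match.group(1) if match else None


print(well_from_key("ESM9_D02_0"))  # D02
print(well_from_key("ESM9_D2_6"))   # D2
print(well_from_key("not-a-key"))   # None
```

The helper returns the captured text unchanged, so a key written with D2 yields D2. Any normalisation to a zero-padded form such as D02 is not shown here.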

Label Data

If you want to train a network, you also need to provide a binary label for every nucleus. The label file should have the same name as the image data file, but end with _label.h5. It is not required for prediction. A single such .h5 file can bundle the labels of many nuclei, stored as key-value pairs. The example data ESM9_D02_Manual-Neurons-1_label.h5 contains the following neuron annotations:

ESM9_D2_0: 0
ESM9_D2_1: 0
ESM9_D2_2: 0
ESM9_D2_3: 0
ESM9_D2_4: 0
ESM9_D2_5: 0
ESM9_D2_6: 1

...

Since this is a binary classification, every label should be an integer: 0 for a negative label and 1 for a positive one. Note how the keys match those of the image data above, linking image information and labels. There should be a label for every image.
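The two rules above (one label per image key, and only 0 or 1 as values) can be checked before training. The following is a rough sketch that works on plain Python collections: it assumes the keys and labels have already been read from the .h5 files, and check_labels is a hypothetical helper, not part of the pipeline.

```python
def check_labels(image_keys, labels):
    """Return a list of problems; an empty list means keys and labels are consistent.

    image_keys: iterable of keys found in the image file (assumed already loaded).
    labels: dict mapping key -> label value (assumed already loaded).
    """
    problems = []
    image_keys = set(image_keys)
    # Every image needs a label.
    for key in sorted(image_keys - set(labels)):
        problems.append(f"missing label for {key}")
    # A label without a matching image key cannot be linked to any image.
    for key in sorted(set(labels) - image_keys):
        problems.append(f"label without image: {key}")
    # Binary classification: only 0 and 1 are valid.
    for key, value in sorted(labels.items()):
        if value not in (0, 1):
            problems.append(f"label for {key} is {value!r}, expected 0 or 1")
    return problems


image_keys = ["ESM9_D2_0", "ESM9_D2_1", "ESM9_D2_6"]
labels = {"ESM9_D2_0": 0, "ESM9_D2_6": 1, "ESM9_D2_7": 2}
for problem in check_labels(image_keys, labels):
    print(problem)
```

With the deliberately inconsistent sample input, the check reports the missing label for ESM9_D2_1, the unmatched key ESM9_D2_7 and its invalid value 2.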

Metadata

If you want to predict unknown data with a pre-trained network, you also need to provide metadata for every nucleus. The metadata file should have the same name as the image data file, but end with _overview.csv. A single such .csv file (with ; as the delimiter) can bundle the metadata of many nuclei, one row per key. The example data ESM9_D02_Manual-Neurons-1_overview.csv contains the following entries:

Table notation:

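| ID | x | y | label (training version only) | label (prediction version only) |
|---|---|---|---|---|
| ESM9_D02_0 | 1921 | 2821 | 0 | ? |
| ESM9_D02_1 | 1711 | 2830 | 0 | ? |
| ESM9_D02_2 | 1819 | 2830 | 0 | ? |
| ESM9_D02_3 | 1628 | 2832 | 0 | ? |
| ESM9_D02_4 | 1561 | 2837 | 0 | ? |
| ESM9_D02_5 | 1913 | 2842 | 0 | ? |
| ESM9_D02_6 | 1806 | 2853 | 1 | ? |
| ... | ... | ... | ... | ... |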

Raw notation (training version):

ID;x;y;label
ESM9_D02_0;1921;2821;0
ESM9_D02_1;1711;2830;0
ESM9_D02_2;1819;2830;0
ESM9_D02_3;1628;2832;0
ESM9_D02_4;1561;2837;0
ESM9_D02_5;1913;2842;0
ESM9_D02_6;1806;2853;1

...

This file matches the image IDs with the x and y coordinates for every nucleus in the ROI.

Depending on whether you are predicting data or training a model, this file comes in two versions. In the training version, the binary label is entered as an integer (see above). In the prediction version, the label entry is replaced with a ?.

Note how the IDs match the keys of the image data above, linking image information, labels and coordinates. There should be a label and coordinates for every image.
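Both versions of the overview file can be read with Python's standard csv module. This is a minimal sketch under a few assumptions: the column names ID, x, y and label are taken from the raw notation above, coordinates are treated as integers, and a ? label is mapped to None. The function name read_overview is hypothetical and not part of the pipeline.

```python
import csv
import io


def read_overview(text):
    """Parse overview CSV text into a dict mapping ID -> (x, y, label).

    Hypothetical helper; label is None where the file contains "?".
    """
    rows = {}
    reader = csv.DictReader(io.StringIO(text), delimiter=";")
    for row in reader:
        # "?" marks an unknown label in the prediction version of the file.
        label = None if row["label"] == "?" else int(row["label"])
        rows[row["ID"]] = (int(row["x"]), int(row["y"]), label)
    return rows


training = "ID;x;y;label\nESM9_D02_0;1921;2821;0\nESM9_D02_6;1806;2853;1\n"
prediction = "ID;x;y;label\nESM9_D02_0;1921;2821;?\n"

print(read_overview(training))
print(read_overview(prediction))
```

To read a file from disk instead of a string, the same DictReader call can be given a file object opened with newline="".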

More Information

You can learn more by reading our corresponding publication: https://doi.org/10.1002/cyto.a.24514
