CefasRepRes/flowcytometertool


Context

Flow cytometry is a powerful technology used to analyse the characteristics of particles—typically cells or microscopic organisms—as they flow in a fluid stream through a beam of light. On the RV Cefas Endeavour, a flow cytometer is used to collect high-resolution data on particles in seawater, capturing parameters like size, granularity, and fluorescence intensity for thousands of particles per second.

The raw data generated by the flow cytometer is stored in .CYZ files, a proprietary format used by the CytoClus software. These files contain complex, multidimensional data that must be decoded, structured, and interpreted before it can be used for scientific analysis or machine learning.

This repository exists for the development of Python tools to process, visualize, and classify flow cytometry data. It supports workflows such as:

  • Converting .CYZ files to JSON using cyz2json
  • Training machine learning models (e.g., random forests) to classify particles
  • Visualizing and labeling data interactively
  • Processing large datasets stored in Azure Blob containers
  • Monitoring local directories for new data and applying trained models automatically

This Python code is being developed with marine research in mind, where understanding the composition of microscopic life in water samples can inform studies on biodiversity and environmental change. We also expect the analysis of flow cytometry data to feed into indicators of ecosystem health.

Python Tools for CYZ File Processing

There are a few Python tools here in various states. This repository represents an attempt to put them in one place, anticipating that we will use a random forest model to classify the flow cytometer data being generated on the RV Cefas Endeavour. It is developed around a handful of labelled flow cytometer files held in https://citprodflowcytosa.blob.core.windows.net/public/exampledata/, but you could export your own data from the CytoClus software and put them in flowcytometertools/exampledata/. To do this in CytoClus, first select your file, with sets defined; under "Database", click Exports and check the box for "CYZ file (for set)". This was developed on Windows, but a GitHub Actions workflow tests whether the Download & Train tab will work on a Linux machine. Ideally, users should be familiar with Python because in all likelihood something will break.

Download

Check the releases page to download a "distributable". Unzip this, then open the software from the extracted folder. Whilst the code is tested on Linux, PyInstaller builds of the software are only made for Windows.

Compilation

This was developed in a Miniforge3 prompt. You can download Miniforge3 to your machine and compile the program with the command "pyinstaller flow_cytometer_tool.spec".

GitHub Actions

For reproducibility, and to test whether this can run on another machine, we use GitHub Actions on an Ubuntu runner. From a Git Bash terminal you can trigger a test build on GitHub by pushing a new VERSION tag, e.g.: VERSION="0.0.0.1"; git tag -a v$VERSION -m "Release version $VERSION"; git push origin v$VERSION

Tabs Overview

1: Download & Train

This tab wraps the model training functions developed during Lucinda Lanoy's Masters research in a simplified tkinter GUI (replacing the original R Markdown interface). It allows users to train machine learning models, including Random Forests, using scikit-learn. Click on the buttons in sequence from top to bottom; these will:

  1. Download the data from a blob store. By default this should be set to the public data in https://citprodflowcytosa.blob.core.windows.net/public/exampledata/, which needs no SAS authentication. If you change it to a blob store that needs authenticating, put a path to your SAS key, saved as a plain .txt file, in the "blob tools" tab (see the download sketch after this list).
  2. Download the cyz2json executable you need.
  3. Apply cyz2json to the downloaded data. Your CYZ files should now be JSON files instead (this step is invoked via subprocess and no check is implemented to ensure it has worked).
  4. Convert your JSON files to listmode CSVs. "Listmode" is the part of the JSON file which pertains to the laser summaries. This step therefore leaves behind any images taken, and the full pulse shape is not taken out of the JSON files either.
  5. Combine CSVs, specifying the zone you want to train for. Training across multiple zones is not yet implemented. Note the expertise matrix, which assigns a level of expertise (1 being non-expert, 2 being intermediate, 3 being expert). If there is a disagreement on a label, the expert will be prioritised (see the label-resolution sketch after this list).
  6. Train the model. A split of your data will be taken for training and some will be retained for testing. This trains a LOT of models in sklearn, searching for the best model variables in your data and the best hyperparameters (see the training sketch after this list).
  7. Test the classifier against the test dataset.
  • You can run the training process on both Windows and Linux (tested via GitHub Actions on a Linux runner).
  • Note: building a release hasn't been tested recently.
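A minimal sketch of the download in step 1, assuming the public example container above and the azure-storage-blob package; the local folder name is an assumption:

```python
from pathlib import Path
from azure.storage.blob import ContainerClient

# Public container holding the labelled example data (no SAS token needed).
container = ContainerClient(
    account_url="https://citprodflowcytosa.blob.core.windows.net",
    container_name="public",
)

out_dir = Path("exampledata")  # assumed local folder name
out_dir.mkdir(exist_ok=True)

# Download every blob under the exampledata/ prefix.
for blob in container.list_blobs(name_starts_with="exampledata/"):
    target = out_dir / Path(blob.name).name
    with open(target, "wb") as f:
        f.write(container.download_blob(blob.name).readall())
    print(f"Downloaded {blob.name} -> {target}")
```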
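Step 5's expert-prioritisation could look roughly like this pandas sketch; the column names (particle_id, labeller, label) and the expertise lookup are illustrative assumptions, not the tool's actual schema:

```python
import pandas as pd

# Hypothetical expertise lookup: labeller code -> expertise level
# (1 = non-expert, 2 = intermediate, 3 = expert), as in the expertise matrix.
EXPERTISE = {"EXP1": 3, "EXP2": 2, "EXP6": 1}

def resolve_labels(df: pd.DataFrame) -> pd.DataFrame:
    """Keep, for each particle, the label given by the most expert labeller."""
    df = df.assign(expertise=df["labeller"].map(EXPERTISE))
    # Highest expertise first, then keep one row per particle.
    df = df.sort_values("expertise", ascending=False)
    return df.drop_duplicates(subset="particle_id", keep="first")
```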
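And a minimal sketch of the kind of scikit-learn search step 6 performs, assuming the combined CSV from step 5 has a label column; the file name, column names and the (deliberately small) hyperparameter grid are assumptions:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

df = pd.read_csv("combined_listmode.csv")   # assumed output of step 5
X = df.drop(columns=["label"])              # assumed label column name
y = df["label"]

# Hold back a test split, as in steps 6 and 7.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Small illustrative grid; the real tool searches many more combinations.
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 10, 20]},
    cv=3,
    n_jobs=-1,
)
search.fit(X_train, y_train)
print("Best params:", search.best_params_)
print("Held-out accuracy:", search.best_estimator_.score(X_test, y_test))
```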

2: Visualise & Label

Once a model is trained and tested, this tab allows users to explore and label the data interactively.

  • Visualizes predictions and other data columns using scatter plots.
  • Allows relabeling of data points directly in the interface.
  • Supports loading additional CSV files (e.g., the mixfile) for exploration.
  • Known issues:
    • Crashes if you try to color by a column with too many unique values.
    • Some functionality is currently broken, especially when loading external CSVs.
    • You must select both X and Y axes before clicking "Update Plot" — otherwise, a KeyError will occur.

3: Make Mixfile

This tool creates a representative training dataset by sampling from processed predictions.

  • Performs a 1-in-1000 random subsample of particles from multiple processed files stored in the blob container (a minimal sketch follows this list).
  • Saves the result as a CSV file that can be visualized in the previous tab.
  • Intended to capture environmental variability by aggregating data across many samples.
  • Known issues:
    • Not recently tested.
    • Relies on SAS token access, but how to do this may not be clear to the user. The app needs some signposting: an explanation of how a SAS token is generated and saved, and how to copy and paste the path in. Alternatively, we could use an encrypted approach that does not persist across sessions.
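A minimal sketch of the 1-in-1000 sampling idea, assuming the processed prediction CSVs have already been copied locally; the file pattern and output name are assumptions:

```python
import glob
import pandas as pd

frames = []
for path in glob.glob("processed/*.csv"):   # assumed local copies of processed predictions
    df = pd.read_csv(path)
    # Keep roughly 1 particle in 1000 from each file.
    frames.append(df.sample(frac=0.001, random_state=42))

mixfile = pd.concat(frames, ignore_index=True)
mixfile.to_csv("mixfile.csv", index=False)  # can then be loaded in the Visualise & Label tab
```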

4: Process Blob Container

This tab automates the full pipeline for processing .CYZ files stored in an Azure Blob container.

  • Downloads .CYZ files from the blob store.
  • Converts them to JSON using cyz2json.
  • Extracts listmode parameters and applies the trained model (using R Random Forest).
  • Generates 3D plots and uploads results back to the blob store (a rough sketch of applying a model and plotting follows this list).
  • Known issues:
    • Requires manual input of source directory, destination directory, and SAS token. Not user-friendly due to SAS token handling.
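A rough sketch of the model-application and plotting steps, assuming a scikit-learn model saved with joblib and typical CytoSense listmode column names (FWS, SWS, FL Red); the file names and columns are assumptions, not the tool's actual artefacts:

```python
import joblib
import pandas as pd
import matplotlib.pyplot as plt

# Assumed artefacts: a scikit-learn model saved by the Download & Train tab and a
# listmode CSV extracted from a converted .CYZ file with matching feature columns.
model = joblib.load("trained_model.joblib")
listmode = pd.read_csv("listmode.csv")

predictions = model.predict(listmode)

# Simple 3D scatter of three listmode parameters, coloured by predicted class.
fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(listmode["FWS"], listmode["SWS"], listmode["FL Red"],
           c=pd.factorize(predictions)[0], s=2)
ax.set_xlabel("FWS")
ax.set_ylabel("SWS")
ax.set_zlabel("FL Red")
fig.savefig("prediction_3d.png")
```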

5: Local Watcher

This utility monitors a local directory for new .CYZ files and automatically processes them (a minimal directory-watching sketch follows the list below).

  • Applies the trained model to each new file as it appears.
  • Runs the same processing steps as the blob container tab, including 3D plotting.
  • Outputs results to a specified destination directory.
  • Known issues:
    • Not recently tested.
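One common way to watch a directory in Python is the watchdog package; whether this tool uses watchdog or simple polling is not stated here, so treat this as an illustrative sketch only (the directory name is an assumption):

```python
import time
from pathlib import Path
from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

WATCH_DIR = "incoming"  # assumed directory to monitor for new .CYZ files

class CyzHandler(FileSystemEventHandler):
    def on_created(self, event):
        path = Path(event.src_path)
        if path.suffix.lower() == ".cyz":
            print(f"New file detected: {path}")
            # Here the real tool would run cyz2json, extract listmode data,
            # apply the trained model and write results to the destination directory.

observer = Observer()
observer.schedule(CyzHandler(), WATCH_DIR, recursive=False)
observer.start()
try:
    while True:
        time.sleep(1)
finally:
    observer.stop()
    observer.join()
```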

Acknowledgements to

Lucinda Lanoy for her Masters work on model training in custom_functions_for_python.py (https://github.com/CefasRepRes/lucinda-flow-cytometry).

Sebastien Galvagno, Eric Payne and Rob Blackwell for their parts played in cyz2json (flowcytometertool uses https://github.com/OBAMANEXT/cyz2json/releases/tag/v0.0.5).

OBAMA-NEXT Data labellers Veronique, Lumi, Zeline, Lotty and Clementine.

  • Lotty = EXP1, "expert" level on Mediterranean data, considered "non expert" for the other zones
  • Clementine = EXP2, "advanced" level on Mediterranean data, considered "non expert" for the other zones
  • Lumi = EXP3, "expert" level on Baltic data, considered "non expert" for the other zones
  • Zeline = EXP4, "expert" level on English Channel data, considered "non expert" for the other zones
  • Veronique = EXP5, "expert" level on Celtic data, considered "non expert" for the other zones
  • Joe = EXP6, "non expert" for all zones

Laser icon by Icons8
