A compact, friendly, and reproducible sandbox for experimenting with Gaussian Naive Bayes (GNB) on cell-level classification tasks in biomedical imaging.
Why this repo? Because strong baselines and clear docs beat magic black boxes for small, sensitive datasets.
- 🔬 Focused on cell-level classification tasks and small biomedical datasets
- ⚡ Fast, reproducible GNB baseline using scikit-learn Pipelines
- 🧠 Clear guidance on data layout, preprocessing, evaluation, and uncertainty reporting
- 🧪 Notebook-driven examples and suggested scripts for reproducible experiments
Quick summary
This project provides:
- A minimal, reproducible pipeline: data → preprocessing → features → StandardScaler → GaussianNB → evaluation.
- Jupyter notebook examples and recommended scripts for preparing data, training, and evaluation.
- Practical advice for small and imbalanced biomedical datasets (stratified splits, transforms, SMOTE when appropriate).
Below is a small visual overview of the recommended experiment flow. The SVG is included in the repo at `assets/diagram-pipeline.svg`; a PNG fallback (recommended for renderers that don't show SVG) is available at `assets/diagram-pipeline.png`.
Raw SVG source:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<svg xmlns="http://www.w3.org/2000/svg" width="900" height="220" viewBox="0 0 900 220">
  <style>
    .box { fill: #f6f9ff; stroke: #3b82f6; stroke-width:2; rx:8; }
    .text { font-family: Arial, Helvetica, sans-serif; font-size:14px; fill:#0f172a }
    .title { font-weight:700; font-size:13px; }
    .arrow { stroke:#334155; stroke-width:2; fill:none; marker-end: url(#arrowhead); }
  </style>
  <defs>
    <marker id="arrowhead" markerWidth="10" markerHeight="7" refX="10" refY="3.5" orient="auto">
      <polygon points="0 0, 10 3.5, 0 7" fill="#334155" />
    </marker>
  </defs>
  <!-- Boxes -->
  <rect x="20" y="40" width="150" height="60" class="box"/>
  <text x="95" y="70" class="text" text-anchor="middle">Data</text>
  <text x="95" y="87" class="text" text-anchor="middle">(images / labels)</text>
  <rect x="210" y="20" width="170" height="100" class="box"/>
  <text x="295" y="55" class="text" text-anchor="middle">Preprocessing</text>
  <text x="295" y="73" class="text" text-anchor="middle">resize • normalize • impute</text>
  <rect x="410" y="40" width="170" height="60" class="box"/>
  <text x="495" y="70" class="text" text-anchor="middle">Feature Extraction</text>
  <rect x="610" y="20" width="220" height="100" class="box"/>
  <text x="720" y="55" class="text" text-anchor="middle">Model Pipeline</text>
  <text x="720" y="73" class="text" text-anchor="middle">StandardScaler → GaussianNB</text>
  <rect x="350" y="150" width="200" height="50" class="box"/>
  <text x="450" y="180" class="text" text-anchor="middle">Evaluation &amp; Reporting</text>
  <!-- Arrows -->
  <path d="M170 70 L210 70" class="arrow" />
  <path d="M380 70 L410 70" class="arrow" />
  <path d="M580 70 L610 70" class="arrow" />
  <path d="M500 100 L450 150" class="arrow" />
  <path d="M450 200 L720 200" class="arrow" />
  <!-- Small captions -->
  <text x="120" y="35" class="text">1</text>
  <text x="295" y="10" class="text">2</text>
  <text x="495" y="35" class="text">3</text>
  <text x="720" y="10" class="text">4</text>
</svg>
```
Contents:
- Project overview
- Quick start (Windows PowerShell)
- Data layout and preparation
- Notebook and example scripts
- Modeling notes (Gaussian Naive Bayes)
- Evaluation and experiment protocol
- Reproducibility and environment
- Code organization and recommended files
- Contributing
- License and citation
- Further reading
- Contact and next steps
Purpose: provide a compact, well-documented baseline and learning resource for researchers, students, and practitioners who want to explore classical machine learning approaches for cell-level classification tasks in biomedical imaging. The focus is on clarity, reproducibility, and practical guidance rather than on providing large datasets or complex deep-learning pipelines.
Goals:
- Explain common dataset layouts and minimal preprocessing required to run tabular or simple image-based experiments.
- Provide a reproducible baseline pipeline using scikit-learn's GaussianNB with sensible preprocessing.
- Document recommended evaluation metrics and experiment protocols for small and imbalanced datasets.
- Offer an interactive notebook and a few scripts to help users get started quickly.
These steps assume Windows PowerShell. Adjust for other shells if needed.
- Create and activate a Python virtual environment (recommended Python 3.8+):
```powershell
python -m venv .venv; .\.venv\Scripts\Activate.ps1
```
- Install core dependencies used by the examples and notebook:
```powershell
pip install --upgrade pip
pip install numpy pandas scikit-learn matplotlib seaborn notebook joblib imbalanced-learn
```
- Launch the interactive notebook (optional but recommended):
```powershell
jupyter notebook notebook.ipynb
```
Notes: If you prefer VS Code, use the Jupyter integration there to open and run `notebook.ipynb`.
This repository does not include private clinical datasets. Use your own data or public datasets and follow the layout below for reproducible experiments.
Recommended directory layout (local):
```
data/
  images/       # raw images (optional), or a folder per class depending on your loader
  labels.csv    # minimal CSV: id,filename,label,[optional metadata columns]
  features.csv  # optional: precomputed features for tabular experiments (id, feat_1, ..., label)
```
Minimal `labels.csv` example:

```
id,filename,label
1,cell_0001.png,benign
2,cell_0002.png,malignant
```
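For tabular experiments, the feature table can be loaded straight into arrays for scikit-learn. A minimal sketch, assuming a `features.csv` laid out as above (the path and column names are illustrative):

```python
import pandas as pd

# Load the precomputed feature table described above: id, feat_1, ..., label.
df = pd.read_csv("data/features.csv")

# Everything except the identifier and label columns is a feature.
X = df.drop(columns=["id", "label"]).to_numpy()
y = df["label"].to_numpy()
```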
Guidelines and common preprocessing steps:
- Tabular features: standardize numeric features (zero mean and unit variance) before applying GNB when feature scales differ.
- Missing values: choose targeted imputation strategies (mean/median for continuous features, or model-based imputers) rather than blanket dropping if you have few samples.
- Images: resize to a consistent shape; normalize pixel intensities; if using microscopy images, consider stain/illumination normalization and morphological feature extraction.
- Imbalanced classes: prefer stratified splitting, and consider resampling (e.g., SMOTE) or thresholding strategies when reporting metrics (a minimal resampling sketch follows this list).
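If you do resample, keep the resampler inside the cross-validation loop so synthetic samples never leak into validation folds. A minimal sketch using imbalanced-learn's pipeline (which, unlike scikit-learn's, accepts resampling steps); `X` and `y` are assumed to be loaded as in the snippet above:

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline  # accepts resampling steps
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler

# SMOTE runs only on each training fold; validation folds keep
# their original class balance, so the scores stay honest.
pipe = make_pipeline(StandardScaler(), SMOTE(random_state=42), GaussianNB())
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="roc_auc")
print(f"ROC AUC: {scores.mean():.3f} +/- {scores.std():.3f}")
```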
See `docs/dataset.md` for more detailed notes, and the notebook for example loaders and extractors.
- `notebook.ipynb`: Guided walkthrough demonstrating data loading, simple feature extraction, training a GNB baseline with scikit-learn, cross-validation, and evaluation (ROC/PR, confusion matrix, classification report).
- `scripts/prepare_data.py` (recommended): create `features.csv` from raw images or raw tabular inputs.
- `scripts/train.py` (recommended): train a model and save a serialized pipeline (joblib) for reuse.
- `scripts/evaluate.py` (recommended): load a saved pipeline and compute test-set metrics and visualizations.
Example training pipeline (adapt in scripts or notebook):
```python
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import roc_auc_score

# X: feature matrix (n_samples, n_features); y: class labels.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

pipe = make_pipeline(StandardScaler(), GaussianNB())
pipe.fit(X_train, y_train)

probs = pipe.predict_proba(X_test)[:, 1]  # probability of the positive class
print('Test ROC AUC:', roc_auc_score(y_test, probs))
```
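Because the scaler and model live in one pipeline, serializing that single object captures all preprocessing. A minimal save/load sketch (the file path is illustrative), along the lines of what `scripts/train.py` and `scripts/evaluate.py` would do:

```python
import joblib

# Persist the fitted pipeline: scaler statistics and GNB parameters together.
joblib.dump(pipe, "models/gnb_pipeline.joblib")

# Later, e.g. in an evaluation script: reload and predict without refitting.
loaded = joblib.load("models/gnb_pipeline.joblib")
probs = loaded.predict_proba(X_test)[:, 1]
```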
Why GNB:
- Fast, low-overhead baseline that works well for low-to-medium dimensional numeric features and small datasets.
- Good pedagogical model to establish a performance floor before trying more complex algorithms.
When to reconsider GNB:
- Strongly correlated features: GNB's conditional-independence assumption is violated and performance may suffer.
- Highly non-Gaussian features: consider simple transforms (log, Box-Cox; see the sketch after this list) or more flexible models (RandomForest, XGBoost, or neural embeddings).
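One low-effort option for non-Gaussian features, sketched here rather than prescribed, is to swap StandardScaler for scikit-learn's PowerTransformer, which applies a Yeo-Johnson (or Box-Cox) transform and standardizes the output in one step:

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PowerTransformer

# Yeo-Johnson handles zero and negative values; switch to
# method="box-cox" only if every feature is strictly positive.
pipe = make_pipeline(
    PowerTransformer(method="yeo-johnson", standardize=True),
    GaussianNB(),
)
```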
Practical tips:
- Always scale features before GNB if values have different units or magnitudes.
- Use stratified CV and repeated runs to understand variance on small datasets (see the sketch after this list).
- If using image-derived features, consider using pretrained CNN embeddings as input features rather than raw pixels.
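A minimal sketch of the repeated stratified CV mentioned above, assuming `pipe`, `X`, and `y` from the earlier snippets:

```python
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# 5-fold CV repeated 10 times yields 50 scores: a distribution rather
# than a single number, which matters on small datasets.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=42)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="roc_auc")
print(f"ROC AUC: {scores.mean():.3f} +/- {scores.std():.3f}")
```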
Recommended metrics and practices for reporting:
- Use ROC AUC for overall ranking performance and PR AUC for imbalanced problems.
- Report confusion matrix, precision, recall (sensitivity), specificity, and F1-score.
- Use stratified splits and nested CV for hyperparameter tuning when appropriate; otherwise use a held-out test set reserved until final evaluation.
- Quantify uncertainty: report confidence intervals (bootstrap) or repeated-CV statistics where feasible; a minimal bootstrap sketch follows this list.
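A minimal bootstrap sketch for a test-set ROC AUC confidence interval, assuming `y_test` and `probs` from the training example above:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_test = np.asarray(y_test)
rng = np.random.default_rng(42)
boot_aucs = []
for _ in range(1000):
    # Resample the test set with replacement and rescore.
    idx = rng.integers(0, len(y_test), size=len(y_test))
    if np.unique(y_test[idx]).size < 2:
        continue  # AUC needs both classes present
    boot_aucs.append(roc_auc_score(y_test[idx], probs[idx]))
lo, hi = np.percentile(boot_aucs, [2.5, 97.5])
print(f"ROC AUC 95% CI: [{lo:.3f}, {hi:.3f}]")
```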
Suggested step-by-step experiment workflow:
- Define the dataset and a reproducible split strategy (specify random seeds).
- Preprocess and fit baseline models (GNB) on training folds only.
- Tune hyperparameters using nested CV if you are comparing many model or preprocessing choices (see the sketch after this list).
- Evaluate final model on a single held-out test fold and report metrics with uncertainty estimates.
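scikit-learn's GaussianNB exposes essentially one hyperparameter, `var_smoothing`, so a nested-CV sketch stays small: the inner loop tunes, the outer loop scores the whole tuning procedure (`X` and `y` as in the earlier snippets):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([("scale", StandardScaler()), ("gnb", GaussianNB())])
grid = {"gnb__var_smoothing": np.logspace(-12, -3, 10)}

inner = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# The outer score reflects generalization of "tune, then predict",
# so no tuning information leaks into the reported metric.
search = GridSearchCV(pipe, grid, cv=inner, scoring="roc_auc")
scores = cross_val_score(search, X, y, cv=outer, scoring="roc_auc")
print(f"Nested CV ROC AUC: {scores.mean():.3f} +/- {scores.std():.3f}")
```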
To make experiments reproducible, log and fix the following:
- Python version (recommend 3.8+)
- Package versions for critical libraries: numpy, pandas, scikit-learn, imbalanced-learn, matplotlib, joblib
- Random seeds for numpy and scikit-learn (use `random_state` in API calls)
An example reproducibility snippet:
```python
import numpy as np

np.random.seed(42)  # fix the global NumPy seed; also pass random_state to sklearn calls
```
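To log the package versions listed above alongside each run, a small sketch:

```python
import sys
import numpy, pandas, sklearn, joblib

# Print (or write to a run log) the exact environment used.
print("python      :", sys.version.split()[0])
print("numpy       :", numpy.__version__)
print("pandas      :", pandas.__version__)
print("scikit-learn:", sklearn.__version__)
print("joblib      :", joblib.__version__)
```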
If you want pinned dependencies, a `requirements.txt` (pip) or an `environment.yml` (conda) can be added; open an issue with your preference.
Suggested layout for expanding the project:
```
src/            # project source code (data loaders, feature extraction, models, metrics)
scripts/        # small CLI scripts: prepare_data.py, train.py, evaluate.py
docs/           # user-facing documentation and tutorials
notebook.ipynb  # interactive exploration and examples
data/           # not committed: store local copies of datasets during experiments
models/         # saved model artifacts (git-ignored)
```
Naming and style recommendations:
- Keep data loading and feature extraction separate from model code.
- Use scikit-learn Pipelines for preprocessing + model so saved artifacts include all steps.
- Add unit tests for small processing functions (feature extractors, CSV readers) as you expand the codebase.
Contributions are welcome. A minimal contributor guideline:
- Open an issue describing what you plan to change.
- Create a branch and implement changes with small, testable commits.
- Add or update documentation in `docs/` for visible changes.
- Open a pull request with a description and list of changes.
Code quality:
- Follow PEP8 and add docstrings for public functions.
- Keep functions small and add tests for data processing logic.
This repository includes a `LICENSE` file in the project root. Please review it for reuse and distribution terms.
If you use these materials in research or teaching, include a brief citation in your methods section; a formal citation snippet can be added on request.
- scikit-learn: https://scikit-learn.org
- imbalanced-learn: https://imbalanced-learn.org
- Bishop, C. M., Pattern Recognition and Machine Learning (for probabilistic models and Naive Bayes theory)
Possible follow-ups (open an issue to request one, or suggest something else):
- A pinned `requirements.txt` or `environment.yml` with the versions used while authoring the docs (pip or conda).
- Minimal `scripts/train.py` and `scripts/evaluate.py` that accept CLI arguments and are tested with a tiny synthetic dataset.
- A markdown lint pass over `docs/` and this README to fix style issues.
This project was authored and maintained by:
- NhanPhamThanh-IT (original author)
Contributions welcome: please open an issue or send a PR. Add yourself to this list when you make a notable contribution.