A compact, friendly, and reproducible sandbox for experimenting with Gaussian Naive Bayes (GNB) on cell-level classification tasks in biomedical imaging.
Why this repo? Because strong baselines and clear docs beat magic black boxes for small, sensitive datasets.
- 🔬 Focused on cell-level classification tasks and small biomedical datasets
- ⚡ Fast, reproducible GNB baseline using scikit-learn Pipelines
- 🧠 Clear guidance on data layout, preprocessing, evaluation, and uncertainty reporting
- 🧪 Notebook-driven examples and suggested scripts for reproducible experiments
Quick summary
This project provides:
- A minimal, reproducible pipeline: data → preprocessing → features → StandardScaler → GaussianNB → evaluation.
- Jupyter notebook examples and recommended scripts for preparing data, training, and evaluation.
- Practical advice for small and imbalanced biomedical datasets (stratified splits, transforms, SMOTE when appropriate).
Below is a small visual overview of the recommended experiment flow. The SVG is included in the repo at `assets/diagram-pipeline.svg`; a PNG fallback (recommended for renderers that don't show SVG) is available at `assets/diagram-pipeline.png`.
Raw SVG source:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<svg xmlns="http://www.w3.org/2000/svg" width="900" height="220" viewBox="0 0 900 220">
  <style>
    .box { fill: #f6f9ff; stroke: #3b82f6; stroke-width:2; rx:8; }
    .text { font-family: Arial, Helvetica, sans-serif; font-size:14px; fill:#0f172a }
    .title { font-weight:700; font-size:13px; }
    .arrow { stroke:#334155; stroke-width:2; fill:none; marker-end: url(#arrowhead); }
  </style>
  <defs>
    <marker id="arrowhead" markerWidth="10" markerHeight="7" refX="10" refY="3.5" orient="auto">
      <polygon points="0 0, 10 3.5, 0 7" fill="#334155" />
    </marker>
  </defs>
  <!-- Boxes -->
  <rect x="20" y="40" width="150" height="60" class="box"/>
  <text x="95" y="70" class="text" text-anchor="middle">Data</text>
  <text x="95" y="87" class="text" text-anchor="middle">(images / labels)</text>
  <rect x="210" y="20" width="170" height="100" class="box"/>
  <text x="295" y="55" class="text" text-anchor="middle">Preprocessing</text>
  <text x="295" y="73" class="text" text-anchor="middle">resize • normalize • impute</text>
  <rect x="410" y="40" width="170" height="60" class="box"/>
  <text x="495" y="70" class="text" text-anchor="middle">Feature Extraction</text>
  <rect x="610" y="20" width="220" height="100" class="box"/>
  <text x="720" y="55" class="text" text-anchor="middle">Model Pipeline</text>
  <text x="720" y="73" class="text" text-anchor="middle">StandardScaler → GaussianNB</text>
  <rect x="350" y="150" width="200" height="50" class="box"/>
  <text x="450" y="180" class="text" text-anchor="middle">Evaluation &amp; Reporting</text>
  <!-- Arrows -->
  <path d="M170 70 L210 70" class="arrow" />
  <path d="M380 70 L410 70" class="arrow" />
  <path d="M580 70 L610 70" class="arrow" />
  <path d="M500 100 L450 150" class="arrow" />
  <path d="M450 200 L720 200" class="arrow" />
  <!-- Small captions -->
  <text x="120" y="35" class="text">1</text>
  <text x="295" y="10" class="text">2</text>
  <text x="495" y="35" class="text">3</text>
  <text x="720" y="10" class="text">4</text>
</svg>
```
Contents:
- Project overview
- Quick start (Windows PowerShell)
- Data layout and preparation
- Notebook and example scripts
- Modeling notes (Gaussian Naive Bayes)
- Evaluation and experiment protocol
- Reproducibility and environment
- Code organization and recommended files
- Contributing
- License and citation
- Further reading
- Contact and next steps
Purpose: provide a compact, well-documented baseline and learning resource for researchers, students, and practitioners who want to explore classical machine learning approaches for cell-level classification tasks in biomedical imaging. The focus is on clarity, reproducibility, and practical guidance rather than on providing large datasets or complex deep-learning pipelines.
Goals:
- Explain common dataset layouts and minimal preprocessing required to run tabular or simple image-based experiments.
- Provide a reproducible baseline pipeline using scikit-learn's GaussianNB with sensible preprocessing.
- Document recommended evaluation metrics and experiment protocols for small and imbalanced datasets.
- Offer an interactive notebook and a few scripts to help users get started quickly.
These steps assume Windows PowerShell. Adjust for other shells if needed.
- Create and activate a Python virtual environment (recommended Python 3.8+):
```powershell
python -m venv .venv; .\.venv\Scripts\Activate.ps1
```
- Install core dependencies used by the examples and notebook:
```powershell
pip install --upgrade pip
pip install numpy pandas scikit-learn matplotlib seaborn notebook joblib imbalanced-learn
```
- Launch the interactive notebook (optional but recommended):
```powershell
jupyter notebook notebook.ipynb
```
Notes: If you prefer VS Code, use the Jupyter integration there to open and run `notebook.ipynb`.
This repository does not include private clinical datasets. Use your own data or public datasets and follow the layout below for reproducible experiments.
Recommended directory layout (local):
```
data/
  images/       # raw images (optional), or a folder per class depending on your loader
  labels.csv    # minimal CSV: id,filename,label,[optional metadata columns]
  features.csv  # optional: precomputed features for tabular experiments (id, feat_1, ..., label)
```
Minimal `labels.csv` example:

```
id,filename,label
1,cell_0001.png,benign
2,cell_0002.png,malignant
```
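For tabular experiments, the feature table can be loaded straight into arrays for scikit-learn. A minimal sketch, assuming a `features.csv` laid out as above (the path and column names are illustrative):

```python
import pandas as pd

# Load the precomputed feature table described above: id, feat_1, ..., label.
df = pd.read_csv("data/features.csv")

# Everything except the identifier and label columns is a feature.
X = df.drop(columns=["id", "label"]).to_numpy()
y = df["label"].to_numpy()
```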
Guidelines and common preprocessing steps:
- Tabular features: standardize numeric features (zero mean and unit variance) before applying GNB when feature scales differ.
- Missing values: choose targeted imputation strategies (mean/median for continuous features, or model-based imputers) rather than blanket dropping if you have few samples.
- Images: resize to a consistent shape; normalize pixel intensities; if using microscopy images, consider stain/illumination normalization and morphological feature extraction.
- Imbalanced classes: prefer stratified splitting, and consider resampling (e.g., SMOTE) or thresholding strategies when reporting metrics (a minimal resampling sketch follows this list).
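If you do resample, keep the resampler inside the cross-validation loop so synthetic samples never leak into validation folds. A minimal sketch using imbalanced-learn's pipeline (which, unlike scikit-learn's, accepts resampling steps); `X` and `y` are assumed to be loaded as in the snippet above:

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline  # accepts resampling steps
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler

# SMOTE runs only on each training fold; validation folds keep
# their original class balance, so the scores stay honest.
pipe = make_pipeline(StandardScaler(), SMOTE(random_state=42), GaussianNB())
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="roc_auc")
print(f"ROC AUC: {scores.mean():.3f} +/- {scores.std():.3f}")
```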
See `docs/dataset.md` for more detailed notes, and the notebook for example loaders and extractors.
- `notebook.ipynb`: Guided walkthrough demonstrating data loading, simple feature extraction, training a GNB baseline with scikit-learn, cross-validation, and evaluation (ROC/PR, confusion matrix, classification report).
- `scripts/prepare_data.py` (recommended): create `features.csv` from raw images or raw tabular inputs.
- `scripts/train.py` (recommended): train a model and save a serialized pipeline (joblib) for reuse.
- `scripts/evaluate.py` (recommended): load a saved pipeline and compute test-set metrics and visualizations.
Example training pipeline (adapt in scripts or notebook):
```python
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import roc_auc_score

# X: feature matrix (n_samples, n_features); y: class labels.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

pipe = make_pipeline(StandardScaler(), GaussianNB())
pipe.fit(X_train, y_train)

probs = pipe.predict_proba(X_test)[:, 1]  # probability of the positive class
print('Test ROC AUC:', roc_auc_score(y_test, probs))
```
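Because the scaler and model live in one pipeline, serializing that single object captures all preprocessing. A minimal save/load sketch (the file path is illustrative), along the lines of what `scripts/train.py` and `scripts/evaluate.py` would do:

```python
import joblib

# Persist the fitted pipeline: scaler statistics and GNB parameters together.
joblib.dump(pipe, "models/gnb_pipeline.joblib")

# Later, e.g. in an evaluation script: reload and predict without refitting.
loaded = joblib.load("models/gnb_pipeline.joblib")
probs = loaded.predict_proba(X_test)[:, 1]
```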
Why GNB:
- Fast, low-overhead baseline that works well for low-to-medium dimensional numeric features and small datasets.
- Good pedagogical model to establish a performance floor before trying more complex algorithms.
When to reconsider GNB:
- Strongly correlated features: GNB's conditional-independence assumption is violated and performance may suffer.
- Highly non-Gaussian features: consider simple transforms (log, Box-Cox; see the sketch after this list) or more flexible models (RandomForest, XGBoost, or neural embeddings).
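One low-effort option for non-Gaussian features, sketched here rather than prescribed, is to swap StandardScaler for scikit-learn's PowerTransformer, which applies a Yeo-Johnson (or Box-Cox) transform and standardizes the output in one step:

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PowerTransformer

# Yeo-Johnson handles zero and negative values; switch to
# method="box-cox" only if every feature is strictly positive.
pipe = make_pipeline(
    PowerTransformer(method="yeo-johnson", standardize=True),
    GaussianNB(),
)
```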
Practical tips:
- Always scale features before GNB if values have different units or magnitudes.
- Use stratified CV and repeated runs to understand variance on small datasets (see the sketch after this list).
- If using image-derived features, consider using pretrained CNN embeddings as input features rather than raw pixels.
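A minimal sketch of the repeated stratified CV mentioned above, assuming `pipe`, `X`, and `y` from the earlier snippets:

```python
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# 5-fold CV repeated 10 times yields 50 scores: a distribution rather
# than a single number, which matters on small datasets.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=42)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="roc_auc")
print(f"ROC AUC: {scores.mean():.3f} +/- {scores.std():.3f}")
```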
Recommended metrics and practices for reporting:
- Use ROC AUC for overall ranking performance and PR AUC for imbalanced problems.
- Report confusion matrix, precision, recall (sensitivity), specificity, and F1-score.
- Use stratified splits and nested CV for hyperparameter tuning when appropriate; otherwise use a held-out test set reserved until final evaluation.
- Quantify uncertainty: report confidence intervals (bootstrap) or repeated-CV statistics where feasible; a minimal bootstrap sketch follows this list.
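A minimal bootstrap sketch for a test-set ROC AUC confidence interval, assuming `y_test` and `probs` from the training example above:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_test = np.asarray(y_test)
rng = np.random.default_rng(42)
boot_aucs = []
for _ in range(1000):
    # Resample the test set with replacement and rescore.
    idx = rng.integers(0, len(y_test), size=len(y_test))
    if np.unique(y_test[idx]).size < 2:
        continue  # AUC needs both classes present
    boot_aucs.append(roc_auc_score(y_test[idx], probs[idx]))
lo, hi = np.percentile(boot_aucs, [2.5, 97.5])
print(f"ROC AUC 95% CI: [{lo:.3f}, {hi:.3f}]")
```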
Suggested step-by-step experiment workflow:
- Define the dataset and a reproducible split strategy (specify random seeds).
- Preprocess and fit baseline models (GNB) on training folds only.
- Tune hyperparameters using nested CV if you are comparing many model or preprocessing choices (see the sketch after this list).
- Evaluate final model on a single held-out test fold and report metrics with uncertainty estimates.
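scikit-learn's GaussianNB exposes essentially one hyperparameter, `var_smoothing`, so a nested-CV sketch stays small: the inner loop tunes, the outer loop scores the whole tuning procedure (`X` and `y` as in the earlier snippets):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([("scale", StandardScaler()), ("gnb", GaussianNB())])
grid = {"gnb__var_smoothing": np.logspace(-12, -3, 10)}

inner = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# The outer score reflects generalization of "tune, then predict",
# so no tuning information leaks into the reported metric.
search = GridSearchCV(pipe, grid, cv=inner, scoring="roc_auc")
scores = cross_val_score(search, X, y, cv=outer, scoring="roc_auc")
print(f"Nested CV ROC AUC: {scores.mean():.3f} +/- {scores.std():.3f}")
```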
To make experiments reproducible, log and fix the following:
- Python version (recommend 3.8+)
- Package versions for critical libraries: numpy, pandas, scikit-learn, imbalanced-learn, matplotlib, joblib
- Random seeds for numpy and scikit-learn (use `random_state` in API calls)
An example reproducibility snippet:
```python
import numpy as np

np.random.seed(42)  # fix the global NumPy seed; also pass random_state to sklearn calls
```
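To log the package versions listed above alongside each run, a small sketch:

```python
import sys
import numpy, pandas, sklearn, joblib

# Print (or write to a run log) the exact environment used.
print("python      :", sys.version.split()[0])
print("numpy       :", numpy.__version__)
print("pandas      :", pandas.__version__)
print("scikit-learn:", sklearn.__version__)
print("joblib      :", joblib.__version__)
```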
If you want pinned dependencies, a `requirements.txt` (pip) or an `environment.yml` (conda) can be added; open an issue with your preference.
Suggested layout for expanding the project:
```
src/            # project source code (data loaders, feature extraction, models, metrics)
scripts/        # small CLI scripts: prepare_data.py, train.py, evaluate.py
docs/           # user-facing documentation and tutorials
notebook.ipynb  # interactive exploration and examples
data/           # not committed: store local copies of datasets during experiments
models/         # saved model artifacts (git-ignored)
```
Naming and style recommendations:
- Keep data loading and feature extraction separate from model code.
- Use scikit-learn Pipelines for preprocessing + model so saved artifacts include all steps.
- Add unit tests for small processing functions (feature extractors, CSV readers) as you expand the codebase.
Contributions are welcome. A minimal contributor guideline:
- Open an issue describing what you plan to change.
- Create a branch and implement changes with small, testable commits.
- Add or update documentation in `docs/` for visible changes.
- Open a pull request with a description and list of changes.
Code quality:
- Follow PEP8 and add docstrings for public functions.
- Keep functions small and add tests for data processing logic.
This repository includes a `LICENSE` file in the project root. Please review it for reuse and distribution terms.
If you use these materials in research or teaching, include a brief citation in your methods section; a formal citation snippet can be added on request.
- scikit-learn: https://scikit-learn.org
- imbalanced-learn: https://imbalanced-learn.org
- Bishop, C. M., Pattern Recognition and Machine Learning (for probabilistic models and Naive Bayes theory)
Possible follow-ups (open an issue to request one, or suggest something else):
- A pinned `requirements.txt` or `environment.yml` with the versions used while authoring the docs (pip or conda).
- Minimal `scripts/train.py` and `scripts/evaluate.py` that accept CLI arguments and are tested with a tiny synthetic dataset.
- A markdown lint pass over `docs/` and this README to fix style issues.
This project was authored and maintained by:
- NhanPhamThanh-IT (original author)
Contributions welcome: please open an issue or send a PR. Add yourself to this list when you make a notable contribution.