This repository was archived by the owner on Aug 17, 2025. It is now read-only.

🔬 Reproducible sandbox for Gaussian Naive Bayes (GNB) applied to cancer cell classification: includes an interactive notebook, data layout and preprocessing guidance, feature-extraction tips, a lightweight scikit-learn pipeline, evaluation protocols for small/imbalanced biomedical datasets, and example scripts for prepare/train/evaluate.


NhanPhamThanh-IT/GNB-Cancer-Cell-Classification


✨ GNB-Cancer-Cell-Classification


A compact, friendly, and reproducible sandbox for experimenting with Gaussian Naive Bayes (GNB) on cell-level classification tasks in biomedical imaging.

Why this repo? Because strong baselines and clear docs beat magic black boxes for small, sensitive datasets.

Highlights

  • 🔬 Focused on cell-level classification tasks and small biomedical datasets
  • ⚡ Fast reproducible GNB baseline using scikit-learn Pipelines
  • 🧭 Clear guidance on data layout, preprocessing, evaluation, and uncertainty reporting
  • 🧪 Notebook-driven examples and suggested scripts for reproducible experiments

Quick summary

This project provides:

  • A minimal, reproducible pipeline: data → preprocessing → features → StandardScaler → GaussianNB → evaluation.
  • Jupyter notebook examples and recommended scripts for preparing data, training, and evaluation.
  • Practical advice for small and imbalanced biomedical datasets (stratified splits, transforms, SMOTE when appropriate).

Pipeline diagram

Below is a small visual overview of the recommended experiment flow. The SVG is included in the repo at assets/diagram-pipeline.svg; a PNG fallback (recommended for renderers that don't show SVG) is available at assets/diagram-pipeline.png.

Pipeline diagram (PNG fallback)

Raw SVG source:

<?xml version="1.0" encoding="UTF-8"?>
<svg xmlns="http://www.w3.org/2000/svg" width="900" height="220" viewBox="0 0 900 220">
  <style>
    .box { fill: #f6f9ff; stroke: #3b82f6; stroke-width:2; rx:8; }
    .text { font-family: Arial, Helvetica, sans-serif; font-size:14px; fill:#0f172a }
    .title { font-weight:700; font-size:13px; }
    .arrow { stroke:#334155; stroke-width:2; fill:none; marker-end: url(#arrowhead);}
  </style>
  <defs>
    <marker id="arrowhead" markerWidth="10" markerHeight="7" refX="10" refY="3.5" orient="auto">
      <polygon points="0 0, 10 3.5, 0 7" fill="#334155" />
    </marker>
  </defs>

  <!-- Boxes -->
  <rect x="20" y="40" width="150" height="60" class="box"/>
  <text x="95" y="70" class="text" text-anchor="middle">Data</text>
  <text x="95" y="87" class="text" text-anchor="middle">(images / labels)</text>

  <rect x="210" y="20" width="170" height="100" class="box"/>
  <text x="295" y="55" class="text" text-anchor="middle">Preprocessing</text>
  <text x="295" y="73" class="text" text-anchor="middle">resize • normalize • impute</text>

  <rect x="410" y="40" width="170" height="60" class="box"/>
  <text x="495" y="70" class="text" text-anchor="middle">Feature Extraction</text>

  <rect x="610" y="20" width="220" height="100" class="box"/>
  <text x="720" y="55" class="text" text-anchor="middle">Model Pipeline</text>
  <text x="720" y="73" class="text" text-anchor="middle">StandardScaler ➜ GaussianNB</text>

  <rect x="350" y="150" width="200" height="50" class="box"/>
  <text x="450" y="180" class="text" text-anchor="middle">Evaluation &amp; Reporting</text>

  <!-- Arrows -->
  <path d="M170 70 L210 70" class="arrow" />
  <path d="M380 70 L410 70" class="arrow" />
  <path d="M580 70 L610 70" class="arrow" />
  <path d="M500 100 L450 150" class="arrow" />
  <path d="M450 200 L720 200" class="arrow" />

  <!-- Small captions -->
  <text x="120" y="35" class="text">1</text>
  <text x="295" y="10" class="text">2</text>
  <text x="495" y="35" class="text">3</text>
  <text x="720" y="10" class="text">4</text>

</svg>

Table of contents

  • Project overview
  • Quick start (Windows PowerShell)
  • Data layout and preparation
  • Notebook and example scripts
  • Modeling notes (Gaussian Naive Bayes)
  • Evaluation and experiment protocol
  • Reproducibility and environment
  • Code organization and recommended files
  • Contributing
  • License and citation
  • Further reading
  • Contact and next steps

Project overview

Purpose: provide a compact, well-documented baseline and learning resource for researchers, students, and practitioners who want to explore classical machine learning approaches for cell-level classification tasks in biomedical imaging. The focus is on clarity, reproducibility, and practical guidance rather than on providing large datasets or complex deep-learning pipelines.

Goals:

  • Explain common dataset layouts and minimal preprocessing required to run tabular or simple image-based experiments.
  • Provide a reproducible baseline pipeline using scikit-learn's GaussianNB with sensible preprocessing.
  • Document recommended evaluation metrics and experiment protocols for small and imbalanced datasets.
  • Offer an interactive notebook and a few scripts to help users get started quickly.

Quick start (Windows PowerShell)

These steps assume Windows PowerShell. Adjust for other shells if needed.

  1. Create and activate a Python virtual environment (Python 3.8+ recommended):
python -m venv .venv; .\.venv\Scripts\Activate.ps1
  2. Install the core dependencies used by the examples and notebook:
pip install --upgrade pip
pip install numpy pandas scikit-learn matplotlib seaborn notebook joblib imbalanced-learn
  3. Launch the interactive notebook (optional but recommended):
jupyter notebook notebook.ipynb

Notes: If you prefer VS Code, use the Jupyter integration there to open and run notebook.ipynb.

Data layout and preparation

This repository does not include private clinical datasets. Use your own data or public datasets and follow the layout below for reproducible experiments.

Recommended directory layout (local):

data/
  images/         # raw images (optional), or a folder per class depending on your loader
  labels.csv      # minimal CSV: id,filename,label,[optional metadata columns]
  features.csv    # optional: precomputed features for tabular experiments (id, feat_1, ..., label)

Minimal labels.csv example:

id,filename,label
1,cell_0001.png,benign
2,cell_0002.png,malignant
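
A minimal sketch of reading and sanity-checking a labels file with this layout. The helper name `load_labels` and the specific checks are illustrative assumptions, not part of the repository; an inline string stands in for data/labels.csv so the snippet runs without any files.

```python
import io

import pandas as pd

# Inline stand-in for data/labels.csv so the sketch runs without any files.
csv_text = """id,filename,label
1,cell_0001.png,benign
2,cell_0002.png,malignant
3,cell_0003.png,benign
"""


def load_labels(source):
    """Read a labels table and run basic sanity checks (hypothetical helper)."""
    df = pd.read_csv(source)
    required = {"id", "filename", "label"}
    missing = required - set(df.columns)
    if missing:
        raise ValueError(f"labels file is missing columns: {sorted(missing)}")
    if df["id"].duplicated().any():
        raise ValueError("duplicate ids found in labels file")
    return df


labels = load_labels(io.StringIO(csv_text))
# Class counts are worth checking early, since imbalance drives later choices.
print(labels["label"].value_counts())
```

For a real run, pass the path to your own labels file instead of the `io.StringIO` object.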

Guidelines and common preprocessing steps:

  • Tabular features: standardize numeric features (zero mean and unit variance) before applying GNB when feature scales differ.
  • Missing values: choose targeted imputation strategies (mean/median for continuous features, or model-based imputers) rather than blanket dropping if you have few samples.
  • Images: resize to a consistent shape; normalize pixel intensities; if using microscopy images, consider stain/illumination normalization and morphological feature extraction.
  • Imbalanced classes: prefer stratified splitting, and consider resampling (e.g., SMOTE) or thresholding strategies when reporting metrics.
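
The guidelines above can be sketched as a single scikit-learn pipeline: median imputation, standardization, and a stratified split. The data here is synthetic, and the imputation strategy and split size are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic, imbalanced stand-in data: 80 class-0 rows and 20 class-1 rows.
X = np.vstack([rng.normal(0.0, 1.0, (80, 4)), rng.normal(1.5, 1.0, (20, 4))])
y = np.array([0] * 80 + [1] * 20)
# Blank out about 5% of the entries to imitate missing measurements.
X[rng.random(X.shape) < 0.05] = np.nan

# stratify=y keeps the 80/20 class ratio in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Imputer and scaler statistics are learned from the training split only.
pipe = make_pipeline(SimpleImputer(strategy="median"), StandardScaler(), GaussianNB())
pipe.fit(X_train, y_train)

print("class counts in test split:", np.bincount(y_test))
print("test accuracy:", pipe.score(X_test, y_test))
```

Keeping imputation and scaling inside the pipeline means cross-validation refits them per fold, which avoids leaking test statistics into training.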

See docs/dataset.md for more detailed notes and the notebook for example loaders and extractors.
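
As a rough idea of what a simple extractor could look like, the sketch below turns one grayscale image into a few intensity features using only NumPy. The function name, the feature set, and the fixed threshold are hypothetical and are not taken from the notebook or docs/dataset.md.

```python
import numpy as np


def intensity_features(image, threshold=0.5):
    """Return a few summary features for one grayscale image scaled to [0, 1]."""
    image = np.asarray(image, dtype=float)
    mask = image > threshold  # crude foreground mask; real data needs a better segmentation
    return {
        "mean_intensity": float(image.mean()),
        "std_intensity": float(image.std()),
        "foreground_fraction": float(mask.mean()),
    }


# Synthetic 8x8 "cell": a bright 4x4 square on a dark background.
img = np.zeros((8, 8))
img[2:6, 2:6] = 1.0

feats = intensity_features(img)
print(feats)
```

One row of such features per image, joined with the label column, gives the tabular layout described for features.csv.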

Notebook and example scripts

  • notebook.ipynb: Guided walkthrough demonstrating data loading, simple feature extraction, training a GNB baseline with scikit-learn, cross-validation, and evaluation (ROC/PR, confusion matrix, classification report).
  • scripts/prepare_data.py: (recommended) create features.csv from raw images or raw tabular inputs.
  • scripts/train.py: (recommended) train a model and save a serialized pipeline (joblib) for reuse.
  • scripts/evaluate.py: (recommended) load a saved pipeline and compute test-set metrics and visualizations.

Example training pipeline (adapt in scripts or notebook):

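import pandas as pd
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumes data/features.csv follows the layout above (id, feat_1, ..., label) with two classes.
df = pd.read_csv("data/features.csv")
X = df.drop(columns=["id", "label"])
y = df["label"]

# Stratify so both splits keep the original class ratio.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# The scaler is fitted on the training split only and reused on the test split.
pipe = make_pipeline(StandardScaler(), GaussianNB())
pipe.fit(X_train, y_train)
# Column 1 holds the probability of pipe.classes_[1], the class scored as positive.
probs = pipe.predict_proba(X_test)[:, 1]
print('Test ROC AUC:', roc_auc_score(y_test, probs))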
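
Since scripts/train.py and scripts/evaluate.py are described as saving and reloading a serialized pipeline, here is a hedged sketch of that round trip with joblib. The file name `gnb_pipeline.joblib` and the synthetic data are assumptions; a temporary directory stands in for models/.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Synthetic two-class stand-in data.
X = np.vstack([rng.normal(0.0, 1.0, (30, 3)), rng.normal(2.0, 1.0, (30, 3))])
y = np.array([0] * 30 + [1] * 30)

pipe = make_pipeline(StandardScaler(), GaussianNB()).fit(X, y)

# Save the whole pipeline so the fitted scaler travels with the model.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "gnb_pipeline.joblib")
    joblib.dump(pipe, path)
    restored = joblib.load(path)

same = np.allclose(pipe.predict_proba(X), restored.predict_proba(X))
print("restored pipeline reproduces probabilities:", same)
```

Serialized pipelines are tied to the library versions that produced them, so it helps to record the scikit-learn version next to the artifact.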

Modeling notes (Gaussian Naive Bayes)

Why GNB:

  • Fast, low-overhead baseline that works well for low-to-medium dimensional numeric features and small datasets.
  • Good pedagogical model to establish a performance floor before trying more complex algorithms.
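
For reference, the standard GNB decision rule can be written as follows, where $d$ is the number of features, $P(c)$ is the class prior, and $\mu_{c,j}$ and $\sigma_{c,j}^{2}$ are the per-class mean and variance of feature $j$ estimated from the training data. The notation is introduced here for illustration only.

```latex
\hat{y} = \arg\max_{c} \Bigl[ \log P(c) + \sum_{j=1}^{d} \log \mathcal{N}\bigl(x_j \mid \mu_{c,j}, \sigma_{c,j}^{2}\bigr) \Bigr],
\qquad
\mathcal{N}(x \mid \mu, \sigma^{2}) = \frac{1}{\sqrt{2\pi\sigma^{2}}} \exp\!\Bigl(-\frac{(x-\mu)^{2}}{2\sigma^{2}}\Bigr)
```

The sum over features is where the conditional-independence assumption enters: each feature contributes its own one-dimensional Gaussian term, with no covariance between features.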

When to reconsider GNB:

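  • Strong feature correlation: GNB assumes features are conditionally independent given the class (a diagonal covariance per class), so correlated features are effectively counted more than once and predicted probabilities can become overconfident.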
  • Highly non-Gaussian features: consider simple transforms (log, Box-Cox) or more flexible models (RandomForest, XGBoost, or neural embeddings).
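
One way to apply such a transform is scikit-learn's `PowerTransformer`, sketched below on synthetic log-normal features. The data and the small `skewness` helper are assumptions for illustration; Box-Cox needs strictly positive inputs, and `method="yeo-johnson"` is the alternative when values can be zero or negative.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PowerTransformer


def skewness(a):
    """Sample skewness of a 1-D array (third standardized moment)."""
    a = np.asarray(a, dtype=float)
    return float(np.mean((a - a.mean()) ** 3) / a.std() ** 3)


rng = np.random.default_rng(0)
# Synthetic right-skewed, strictly positive features (log-normal).
X = np.vstack([rng.lognormal(0.0, 1.0, (100, 2)), rng.lognormal(1.0, 1.0, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

pt = PowerTransformer(method="box-cox")  # Box-Cox requires strictly positive inputs
X_t = pt.fit_transform(X)
print("skewness before:", skewness(X[:, 0]))
print("skewness after: ", skewness(X_t[:, 0]))

# Inside a pipeline the transform is re-fitted on each training split.
pipe = make_pipeline(PowerTransformer(method="box-cox"), GaussianNB()).fit(X, y)
print("training accuracy:", pipe.score(X, y))
```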

Practical tips:

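  • Standardize features before GNB when values have different units or magnitudes. GaussianNB fits each feature's mean and variance separately, so scaling matters mainly through var_smoothing, which scikit-learn scales by the largest feature variance; unscaled data can over-smooth small-magnitude features.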
  • Use stratified CV and repeated runs to understand variance from small datasets.
  • If using image-derived features, consider using pretrained CNN embeddings as input features rather than raw pixels.
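
The stratified, repeated cross-validation tip can be sketched with `RepeatedStratifiedKFold`. The synthetic data and the choice of 5 folds and 10 repeats are assumptions; adjust them to your sample size.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Small, imbalanced synthetic dataset: 45 negatives and 15 positives.
X = np.vstack([rng.normal(0.0, 1.0, (45, 4)), rng.normal(1.0, 1.0, (15, 4))])
y = np.array([0] * 45 + [1] * 15)

pipe = make_pipeline(StandardScaler(), GaussianNB())

# 5 folds repeated 10 times with different shuffles gives 50 scores.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="roc_auc")

print(f"ROC AUC: {scores.mean():.3f} +/- {scores.std():.3f} over {len(scores)} folds")
```

The spread across folds is often the more informative number on small datasets, since a single split can look much better or worse than the average.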

Evaluation and experiment protocol

Recommended metrics and practices for reporting:

  • Use ROC AUC for overall ranking performance and PR AUC for imbalanced problems.
  • Report confusion matrix, precision, recall (sensitivity), specificity, and F1-score.
  • Use stratified splits and nested CV for hyperparameter tuning when appropriate; otherwise use a held-out test set reserved until final evaluation.
  • Quantify uncertainty: report confidence intervals (bootstrap) or repeated CV statistics where feasible.
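
The metrics above can be computed as in the following sketch, which also shows a percentile bootstrap for the ROC AUC. The labels and scores are synthetic, the 0.5 threshold and 1000 resamples are arbitrary choices, and `average_precision_score` is used as the PR AUC summary.

```python
import numpy as np
from sklearn.metrics import average_precision_score, confusion_matrix, roc_auc_score

rng = np.random.default_rng(0)
# Synthetic stand-ins for test labels and predicted positive-class probabilities.
y_true = np.array([0] * 70 + [1] * 30)
scores = np.concatenate([rng.beta(2, 5, 70), rng.beta(5, 2, 30)])
y_pred = (scores >= 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
sensitivity = tp / (tp + fn)  # recall on the positive class
specificity = tn / (tn + fp)  # recall on the negative class
print(f"sensitivity={sensitivity:.3f} specificity={specificity:.3f}")
print(f"ROC AUC={roc_auc_score(y_true, scores):.3f}")
print(f"PR AUC (average precision)={average_precision_score(y_true, scores):.3f}")

# Percentile bootstrap over test cases for the ROC AUC.
boot = []
for _ in range(1000):
    idx = rng.integers(0, len(y_true), len(y_true))
    if len(np.unique(y_true[idx])) < 2:  # AUC is undefined with a single class
        continue
    boot.append(roc_auc_score(y_true[idx], scores[idx]))
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"95% bootstrap CI for ROC AUC: [{lo:.3f}, {hi:.3f}]")
```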

Suggested step-by-step experiment workflow:

  1. Define the dataset and a reproducible split strategy (specify random seeds).
  2. Preprocess and fit baseline models (GNB) on training folds only.
  3. Tune hyperparameters using nested CV if you are tuning many model choices.
  4. Evaluate the final model once on the held-out test set and report metrics with uncertainty estimates.
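
A hedged sketch of step 3 as nested cross-validation: an inner `GridSearchCV` tunes `var_smoothing` (the main GaussianNB hyperparameter) and an outer loop estimates performance. The grid values, fold counts, and synthetic data are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic imbalanced data: 60 negatives and 30 positives.
X = np.vstack([rng.normal(0.0, 1.0, (60, 5)), rng.normal(1.0, 1.0, (30, 5))])
y = np.array([0] * 60 + [1] * 30)

pipe = Pipeline([("scale", StandardScaler()), ("gnb", GaussianNB())])
param_grid = {"gnb__var_smoothing": [1e-9, 1e-6, 1e-3, 1e-1]}

# Fixed seeds make both the inner and outer splits reproducible.
inner = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# The inner search runs inside every outer training fold, so the outer
# test folds never influence hyperparameter selection.
search = GridSearchCV(pipe, param_grid, cv=inner, scoring="roc_auc")
nested_scores = cross_val_score(search, X, y, cv=outer, scoring="roc_auc")

print(f"nested CV ROC AUC: {nested_scores.mean():.3f} +/- {nested_scores.std():.3f}")
```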

Reproducibility and environment

To make experiments reproducible, log and fix the following:

  • Python version (recommend 3.8+)
  • Package versions for critical libraries: numpy, pandas, scikit-learn, imbalanced-learn, matplotlib, joblib
  • Random seeds for numpy and scikit-learn (use random_state in API calls)

An example reproducibility snippet:

import numpy as np
np.random.seed(42)
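
A slightly fuller sketch that also records the interpreter and library versions alongside the seed. The `run_info` dictionary and its keys are hypothetical; store the output wherever you keep experiment logs.

```python
import json
import platform
import random

import numpy as np
import sklearn

SEED = 42
random.seed(SEED)     # Python's built-in RNG
np.random.seed(SEED)  # NumPy's legacy global RNG, used when random_state is None

run_info = {
    "python": platform.python_version(),
    "numpy": np.__version__,
    "scikit-learn": sklearn.__version__,
    "seed": SEED,
}
print(json.dumps(run_info, indent=2))
```

Passing `random_state=SEED` explicitly to splitters and estimators is still the more robust habit, because it does not depend on global state.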

Dependencies are not pinned yet. A requirements.txt (pip) or an environment.yml (conda) with exact versions would be a natural addition.

Code organization and recommended files

Suggested layout for expanding the project:

src/               # project source code (data loaders, feature extraction, models, metrics)
scripts/           # small CLI scripts: prepare_data.py, train.py, evaluate.py
docs/              # user-facing documentation and tutorials
notebook.ipynb     # interactive exploration and examples
data/              # not committed: store local copies of datasets during experiments
models/            # saved model artifacts (git-ignored)

Naming and style recommendations:

  • Keep data loading and feature extraction separate from model code.
  • Use scikit-learn Pipelines for preprocessing + model so saved artifacts include all steps.
  • Add unit tests for small processing functions (feature extractors, CSV readers) as you expand the codebase.

Contributing

Contributions are welcome. A minimal contributor guideline:

  1. Open an issue describing what you plan to change.
  2. Create a branch and implement changes with small, testable commits.
  3. Add or update documentation in docs/ for visible changes.
  4. Open a pull request with a description and list of changes.

Code quality:

  • Follow PEP8 and add docstrings for public functions.
  • Keep functions small and add tests for data processing logic.

License and citation

This repository includes a LICENSE file in the project root. Please review it for reuse and distribution terms.

If you use these materials in research or teaching, include a brief citation in your methods section.

Further reading

Contact and next steps

Possible follow-ups for this repository:

  • Create a pinned requirements.txt or environment.yml with the versions used while authoring the docs (pip or conda).
  • Implement minimal scripts/train.py and scripts/evaluate.py that accept CLI arguments and are tested with a tiny synthetic dataset.
  • Run a markdown linter across docs/ and this README and fix style issues.

Contributors

This project was authored and maintained by:

  • NhanPhamThanh-IT (original author)

Contributions are welcome; please open an issue or send a PR. Add yourself to this list when you make a notable contribution.
