Reproducible deep learning pipelines for lung cancer detection using the public IQ-OTH/NCCD CT scan dataset. Includes modular Jupyter Notebooks, custom preprocessing, data split strategies (with/without CT order), and experiment tracking. Paper under review (PRICAI 2025).

Lung Cancer Detection with Deep Learning 🔍

This repository accompanies the paper "Data Splitting Bias in Public CT Datasets: Lessons from IQ-OTH/NCCD".

We provide reproducible Deep Learning (DL) experimental pipelines for lung cancer detection, covering custom data preprocessing strategies, data split considerations, multiple architecture designs, and rigorous model evaluation setups.



📦 Project Setup

  1. Install and launch Docker Desktop.
  2. Clone this repository:
    git clone https://github.com/MyAiSpot/LungCancer_PRICAI2025.git
    cd LungCancer_PRICAI2025
  3. Start JupyterLab via Docker Compose:
    docker compose --compatibility up
  4. Open your browser and go to: http://localhost:8000/lab
    • JupyterLab password: password

Build Time: The Docker image takes ~813 seconds (roughly 14 minutes) to build on a 60 Mbps connection.

📂 Project Folder Structure & Naming Convention

The repository follows a modular folder structure:

.
├── data/                # Contains original and preprocessed datasets
│   ├── 01_original/     # Original dataset (see Expected Raw Data Structure)
│   └── {pipe_id}/       # Output from preprocessing notebooks
├── notebooks/           # Main experimentation area
│   └── {task ID}/       # Task-focused goal
│       └── {exp ID}/    # Experiment-specific steps to reach task goal (.ipynb or .py)
│           └── (Optional: quick viz, other)
├── results/             # Stores all experiment tracking outputs
├── env files            # Dockerfile, docker-compose.yml, requirements.txt
└── README.md            # Project documentation
This organization allows independent prototyping under each task/experiment scope, enabling iterative and reproducible development.

  • Task ID: tsk01 — Focused on image-level detection of lung cancer labels using the IQ-OTH/NCCD dataset.
  • Experiment ID: exp01 — Compares data split strategies: with vs. without consideration of CT scan order during train-test splitting. This experiment also compares performance across Vanilla 2DCNN, ResNet, and AlexNet architectures.

All notebooks follow the naming structure:

{AI DeveloperInitial}_{TaskID}_{ExperimentID}_{NotebookID}_{ShortDesc}_{VariationsID}.ipynb
  • Each notebook output uses the same naming convention as an ID, so it can be traced back to its source notebook, e.g.:
    • {AI DeveloperInitial}_{TaskID}{ExperimentID}{NotebookID}{VariationsID}_Boxplot_name.png.
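To make the convention concrete, the filename components can be pulled apart with a regular expression. The helper below is a hypothetical sketch (not part of the repository), assuming the underscore-separated pattern shown above:

```python
import re

# Hypothetical parser for the notebook naming convention described above.
NAME_PATTERN = re.compile(
    r"^(?P<initials>[A-Za-z]+)"  # AI developer initials, e.g. FR
    r"_t(?P<task>\d+)"           # task ID, e.g. t01
    r"e(?P<exp>\d+)"             # experiment ID, e.g. e01
    r"nb(?P<nb>\d+)"             # notebook ID, e.g. nb03
    r"_(?P<desc>.+)"             # short description
    r"_v(?P<var>\d+)"            # variation ID, e.g. v2
    r".*\.ipynb$"
)

def parse_notebook_name(filename: str) -> dict:
    """Split a notebook filename into its naming-convention components."""
    match = NAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"Filename does not follow the convention: {filename}")
    return match.groupdict()
```

For example, `parse_notebook_name("FR_t01e01nb03_Training_v2_ResNet.ipynb")` yields task `01`, experiment `01`, notebook `03`, and variation `2`.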

🗃️ Expected Raw Data Structure

Download the dataset from Mendeley Data: IQ-OTH/NCCD Lung Cancer Dataset

Place the raw dataset in the following path:

data/01_original/IQ_OTHNCCD_LungCancer/IQ_OTHNCCD/
  ├── Bengin cases/
  ├── Malignant cases/
  └── Normal cases/

*Figure: CT scan counts per class and image size.*
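Before running the notebooks, it can help to verify the layout programmatically. This is a small illustrative check (not part of the repository), assuming the folder names above and common image extensions:

```python
from pathlib import Path

# Class folders expected under the raw data root (note the dataset's own
# "Bengin" spelling for the benign class).
EXPECTED_CLASSES = ("Bengin cases", "Malignant cases", "Normal cases")

def check_raw_layout(root) -> dict:
    """Return per-class image counts, raising if a class folder is missing."""
    root = Path(root)
    counts = {}
    for cls in EXPECTED_CLASSES:
        folder = root / cls
        if not folder.is_dir():
            raise FileNotFoundError(f"Missing expected folder: {folder}")
        counts[cls] = sum(
            1 for p in folder.iterdir() if p.suffix.lower() in {".jpg", ".jpeg", ".png"}
        )
    return counts
```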

📈 Initial Data Exploration

  • FR_t01e01nb01_initial_tasteExploration_of_raw_data_v1.ipynb

πŸ” Deep Learning Pipeline Overview

All steps are implemented in modular Jupyter Notebooks. The project uses a pragmatic structure for AI researchers and practitioners.

🔧 Raw Data Preprocessing Pipelines

Each preprocessing pipeline loads raw data, transforms it, and saves the output as .npy files. Additionally, each pipeline generates a metadata file containing details about the processed .npy files for later use.
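In spirit, each pipeline's save step resembles the sketch below (illustrative only; the actual array names, output paths, and metadata schema in the notebooks may differ):

```python
import json
from pathlib import Path

import numpy as np

def save_preprocessed(images: np.ndarray, labels: np.ndarray,
                      pipe_id: str, out_root: str = "data") -> Path:
    """Save transformed arrays as .npy files plus a JSON metadata file
    describing them for later loading by the training notebooks."""
    out_dir = Path(out_root) / pipe_id
    out_dir.mkdir(parents=True, exist_ok=True)
    np.save(out_dir / "images.npy", images)
    np.save(out_dir / "labels.npy", labels)
    metadata = {
        "pipe_id": pipe_id,
        "n_samples": int(images.shape[0]),
        "image_shape": list(images.shape[1:]),
        "dtype": str(images.dtype),
    }
    (out_dir / "metadata.json").write_text(json.dumps(metadata, indent=2))
    return out_dir
```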

The following notebooks represent distinct preprocessing steps:

  • FR_t01e01nb02_pre-processing_pipeline_v1_pipe001_.ipynb
  • FR_t01e01nb02_pre-processing_pipeline_v2_pipe002.ipynb
  • FR_t01e01nb02_pre-processing_pipeline_v3_pipe003.ipynb
  • FR_t01e01nb02_pre-processing_pipeline_v4_pipe004.ipynb

*Figure: Example of a CT scan image after being loaded and transformed by each preprocessing pipeline.*

πŸ” Output images from each preprocessing strategy are illustrated within the notebooks.

πŸ—οΈ Architecture, Splitting & Training

Each notebook follows the same internal section structure:

  • Import Libraries
  • Config: Controls the experiment ID and data paths.
  • Utility Functions: Includes the train() function, which tracks DL experiment results.
  • Arch Designs: Defines the models (2DCNN, ResNet, AlexNet).
  • Training Steps: Includes a Training configuration subsection covering hyperparameter tuning for the whole DL pipeline.
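Roughly, the Config and train() pieces fit together as in this simplified stand-in (names, fields, and the JSON log format are illustrative assumptions, not the notebooks' actual code):

```python
import json
from pathlib import Path

# Config: one place to control the experiment ID, data paths, and run length.
CONFIG = {"exp_id": "t01e01", "data_dir": "data/pipe001", "epochs": 3}

def train(step_fn, config, out_dir):
    """Stand-in for the notebooks' train() utility: runs one step per epoch
    and logs the returned metrics to JSON so results stay traceable."""
    log = []
    for epoch in range(config["epochs"]):
        metrics = step_fn(epoch)  # a real step would train/evaluate the model
        metrics["epoch"] = epoch
        log.append(metrics)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / f"{config['exp_id']}_epochs.json").write_text(json.dumps(log, indent=2))
    return log
```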

📂 Results Storage Format

After training, all outputs are saved to:

results/exp_track/TrainTrack_{id}/
  ├── ML_pipe_params/             # DL pipeline hyperparameters
  ├── models/                     # (Optional) Trained weights
  ├── performance_across_epochs/  # Metric logs for each epoch
  └── predictions/                # Final predictions for test/train sets
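Preparing that layout takes only a few lines; the helper below is an illustrative sketch of how such folders could be created (not the repository's actual code):

```python
from pathlib import Path

def make_track_dirs(track_id: str, root: str = "results/exp_track") -> Path:
    """Create the experiment-tracking folder layout for one training run."""
    base = Path(root) / f"TrainTrack_{track_id}"
    for sub in ("ML_pipe_params", "models",
                "performance_across_epochs", "predictions"):
        (base / sub).mkdir(parents=True, exist_ok=True)
    return base
```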

📓 Training Notebooks

First Group — Single Stratified 5-fold CV:

  • FR_t01e01nb03_Training_v1_vanillaCNN.ipynb
  • FR_t01e01nb03_Training_v2_ResNet.ipynb
  • FR_t01e01nb03_Training_v3_AlexNet.ipynb
    • The training hyperparameter pipe_params['split_strategy']['shuffle_instances'] controls the "Without Order" vs. "With Order" handling of CT scans during splitting.
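The intuition behind this toggle can be sketched as follows (a simplified illustration assuming consecutive indices correspond to slices of the same CT scan; this is not the repository's custom_StratifiedKFold implementation):

```python
import random

def split_indices(n: int, test_frac: float = 0.2,
                  shuffle_instances: bool = True, seed: int = 0):
    """Return (train, test) index lists.

    shuffle_instances=True ("Without Order") mixes slices freely, so slices
    of the same CT scan can land in both sets and leak information.
    With False ("With Order"), contiguous blocks of slices stay together.
    """
    idx = list(range(n))
    if shuffle_instances:
        random.Random(seed).shuffle(idx)
    n_test = int(n * test_frac)
    return idx[n_test:], idx[:n_test]
```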

Second Group — Repeated Stratified 5-fold CV:

  • FR_t01e01nb04_Training_RepeatKFoldCV_v1_vanillaCNN.ipynb
  • FR_t01e01nb04_Training_RepeatKFoldCV_v2_ResNet.ipynb
    • Extends custom_StratifiedKFold() and the Training Step to handle multiple repetitions of 5-fold CV.
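The repetition extension amounts to re-running a stratified K-fold with a fresh shuffle each time. A minimal self-contained sketch of the idea (not the repository's custom_StratifiedKFold()) is:

```python
import random
from collections import defaultdict

def repeated_stratified_kfold(labels, k=5, n_repeats=2, seed=0):
    """Yield (repeat, fold, test_indices) tuples. Each repeat reshuffles the
    indices within every class, then deals them round-robin into k folds so
    class proportions are preserved per fold."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    for rep in range(n_repeats):
        folds = [[] for _ in range(k)]
        for idxs in by_class.values():
            shuffled = idxs[:]
            rng.shuffle(shuffled)
            for j, i in enumerate(shuffled):
                folds[j % k].append(i)
        for fold, test in enumerate(folds):
            yield rep, fold, sorted(test)
```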

📊 Post-Performance Analysis

This flexible section helps identify directions for improvement:

  • Data preprocessing: CT scan image transformations, feature engineering, feature selection, etc.
  • Training strategy: Modifying/controlling how the DL architecture learns patterns.
  • DL architecture: Modifying/adding/connecting layers, DL structures, etc., to better fit the input data patterns.

All post-performance analysis can be found here:

  • FR_t01e01nb100_VIZ_Kfold_results_analysis_v1.ipynb

*Figure: K-fold performance comparison between the "Without Order" and "With Order" conditions.*

Based on these results:

  • Condition "Without Order": Ignoring CT scan order during data splitting appears to inflate performance metrics.
  • Condition "With Order": When CT scan order is respected, the ResNet pipeline shows the best performance, achieving a mean accuracy of 86% (95% confidence interval: 77%–95%) on unseen CT scan images similar to the lung cancer cases represented in the IQ-OTH/NCCD dataset.
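For reference, a mean accuracy with a normal-approximation 95% confidence interval across folds can be computed as below (the fold accuracies here are made-up illustration values, not the paper's results):

```python
import math
import statistics

def mean_ci(values, z=1.96):
    """Mean and normal-approximation 95% CI for per-fold accuracies."""
    m = statistics.mean(values)
    half = z * statistics.stdev(values) / math.sqrt(len(values))
    return m, m - half, m + half

# Made-up fold accuracies for illustration:
fold_acc = [0.80, 0.85, 0.90, 0.88, 0.84]
mean_acc, ci_low, ci_high = mean_ci(fold_acc)
```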

🧱 Docker & Environment

Add Python Libraries (inside the container)

Open a terminal in JupyterLab and run:

pip install "your_library"
pwd  # ensure you are in /workspace
pip freeze | grep -v "feedstock_root" > requirements.txt

Base Image

Current: nvidia/cuda:12.2.2-runtime-ubuntu22.04

📌 Notes on PyTorch + CUDA Compatibility

If you change the CUDA version, ensure PyTorch matches it. Example installation for CUDA 11.8:

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
  • torch==2.7.0+cu118
  • torchvision==0.22.0+cu118
  • torchaudio==2.7.0+cu118
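A quick sanity check is to confirm that all three wheels carry the same `+cuXXX` build tag; the helper below is a small illustrative check (not an official PyTorch utility):

```python
def cuda_build_tag(version: str) -> str:
    """Return the build tag of a PyTorch-style version string,
    e.g. '2.7.0+cu118' -> 'cu118'; plain versions are treated as CPU builds."""
    return version.split("+", 1)[1] if "+" in version else "cpu"

# Inside the container, the real strings come from torch.__version__,
# torchvision.__version__, and torchaudio.__version__.
tags = {cuda_build_tag(v) for v in ("2.7.0+cu118", "0.22.0+cu118", "2.7.0+cu118")}
assert len(tags) == 1  # mismatched tags usually indicate a broken install
```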

For any questions or issues, feel free to open an issue or contact the first author.
