Reproducible deep learning pipelines for lung cancer detection using the public IQ-OTH/NCCD CT scan dataset. Includes modular Jupyter Notebooks, custom preprocessing, data split strategies (with/without CT order), and experiment tracking. Paper under review (PRICAI 2025).

Lung Cancer Detection with Deep Learning 🔍

This repository accompanies the paper "Data Splitting Bias in Public CT Datasets: Lessons from IQ-OTH/NCCD".

We provide reproducible Deep Learning (DL) experimental pipelines for lung cancer detection, covering custom data preprocessing strategies, data split considerations, multiple architecture designs, and rigorous model evaluation setups.



📦 Project Setup

  1. Install and launch Docker Desktop.
  2. Clone this repository:
    git clone https://github.com/MyAiSpot/LungCancer_PRICAI2025.git
    cd LungCancer_PRICAI2025
  3. Start JupyterLab via Docker Compose:
    docker compose --compatibility up
  4. Open your browser and go to: http://localhost:8000/lab
    • JupyterLab password: password

Build Time: The Docker image takes ~813 seconds (roughly 14 minutes) to build on a 60 Mbps connection.

📂 Project Folder Structure & Naming Convention

The repository follows a modular folder structure:

.
├── data/                # Contains original and preprocessed datasets
│   ├── 01_original/     # Original dataset (see Expected Raw Data Structure)
│   └── {pipe_id}/       # Output from preprocessing notebooks
├── notebooks/           # Main experimentation area
│   └── {task ID}/       # Task-focused goal
│       └── {exp ID}/    # Experiment-specific steps to reach task goal (.ipynb or .py)
│           └── (Optional: quick viz, other)
├── results/             # Stores all experiment tracking outputs
├── env files            # Dockerfile, docker-compose.yml, requirements.txt
└── README.md            # Project documentation
This organization allows independent prototyping under each task/experiment scope, enabling iterative and reproducible development.

  • Task ID: tsk01 — Focused on image-level detection of lung cancer labels using the IQ-OTH/NCCD dataset.
  • Experiment ID: exp01 — Compares data split strategies: with vs. without consideration of CT scan order during train-test splitting. This experiment also compares performance across Vanilla 2DCNN, ResNet, and AlexNet architectures.

All notebooks follow the naming structure:

{AI DeveloperInitial}_{TaskID}_{ExperimentID}_{NotebookID}_{ShortDesc}_{VariationsID}.ipynb
  • Each notebook output uses the same naming convention as an ID, so it can be traced back to its source notebook, e.g.:
    • {AI DeveloperInitial}_{TaskID}{ExperimentID}{NotebookID}{VariationsID}_Boxplot_name.png.
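To make the convention concrete, the filename components can be pulled apart with a regular expression. The helper below is a hypothetical sketch (not part of the repository), assuming the underscore-separated pattern shown above:

```python
import re

# Hypothetical parser for the notebook naming convention described above.
NAME_PATTERN = re.compile(
    r"^(?P<initials>[A-Za-z]+)"  # AI developer initials, e.g. FR
    r"_t(?P<task>\d+)"           # task ID, e.g. t01
    r"e(?P<exp>\d+)"             # experiment ID, e.g. e01
    r"nb(?P<nb>\d+)"             # notebook ID, e.g. nb03
    r"_(?P<desc>.+)"             # short description
    r"_v(?P<var>\d+)"            # variation ID, e.g. v2
    r".*\.ipynb$"
)

def parse_notebook_name(filename: str) -> dict:
    """Split a notebook filename into its naming-convention components."""
    match = NAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"Filename does not follow the convention: {filename}")
    return match.groupdict()
```

For example, `parse_notebook_name("FR_t01e01nb03_Training_v2_ResNet.ipynb")` yields task `01`, experiment `01`, notebook `03`, and variation `2`.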

🗃️ Expected Raw Data Structure

Download the dataset from Mendeley Data: IQ-OTH/NCCD Lung Cancer Dataset

Place the raw dataset in the following path:

data/01_original/IQ_OTHNCCD_LungCancer/IQ_OTHNCCD/
  ├── Bengin cases/
  ├── Malignant cases/
  └── Normal cases/

*Figure: CT scan counts per class and image size.*
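Before running the notebooks, it can help to verify the layout programmatically. This is a small illustrative check (not part of the repository), assuming the folder names above and common image extensions:

```python
from pathlib import Path

# Class folders expected under the raw data root (note the dataset's own
# "Bengin" spelling for the benign class).
EXPECTED_CLASSES = ("Bengin cases", "Malignant cases", "Normal cases")

def check_raw_layout(root) -> dict:
    """Return per-class image counts, raising if a class folder is missing."""
    root = Path(root)
    counts = {}
    for cls in EXPECTED_CLASSES:
        folder = root / cls
        if not folder.is_dir():
            raise FileNotFoundError(f"Missing expected folder: {folder}")
        counts[cls] = sum(
            1 for p in folder.iterdir() if p.suffix.lower() in {".jpg", ".jpeg", ".png"}
        )
    return counts
```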

📈 Initial Data Exploration

  • FR_t01e01nb01_initial_tasteExploration_of_raw_data_v1.ipynb

πŸ” Deep Learning Pipeline Overview

All steps are implemented in modular Jupyter Notebooks. The project uses a pragmatic structure for AI researchers and practitioners.

🔧 Raw Data Preprocessing Pipelines

Each preprocessing pipeline loads raw data, transforms it, and saves the output as .npy files. Additionally, each pipeline generates a metadata file containing details about the processed .npy files for later use.
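In spirit, each pipeline's save step resembles the sketch below (illustrative only; the actual array names, output paths, and metadata schema in the notebooks may differ):

```python
import json
from pathlib import Path

import numpy as np

def save_preprocessed(images: np.ndarray, labels: np.ndarray,
                      pipe_id: str, out_root: str = "data") -> Path:
    """Save transformed arrays as .npy files plus a JSON metadata file
    describing them for later loading by the training notebooks."""
    out_dir = Path(out_root) / pipe_id
    out_dir.mkdir(parents=True, exist_ok=True)
    np.save(out_dir / "images.npy", images)
    np.save(out_dir / "labels.npy", labels)
    metadata = {
        "pipe_id": pipe_id,
        "n_samples": int(images.shape[0]),
        "image_shape": list(images.shape[1:]),
        "dtype": str(images.dtype),
    }
    (out_dir / "metadata.json").write_text(json.dumps(metadata, indent=2))
    return out_dir
```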

The following notebooks represent distinct preprocessing steps:

  • FR_t01e01nb02_pre-processing_pipeline_v1_pipe001_.ipynb
  • FR_t01e01nb02_pre-processing_pipeline_v2_pipe002.ipynb
  • FR_t01e01nb02_pre-processing_pipeline_v3_pipe003.ipynb
  • FR_t01e01nb02_pre-processing_pipeline_v4_pipe004.ipynb

*Figure: Example of a CT scan image after being loaded and transformed by each preprocessing pipeline.*

πŸ” Output images from each preprocessing strategy are illustrated within the notebooks.

πŸ—οΈ Architecture, Splitting & Training

Each notebook follows the same internal section structure:

  • Import Libraries
  • Config: Controls the experiment ID and data paths.
  • Utility Functions: Includes the train() function, which tracks DL experiment results.
  • Arch Designs: Defines the models (2DCNN, ResNet, AlexNet).
  • Training Steps: Includes a Training configuration subsection covering hyperparameter tuning for the whole DL pipeline.
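Roughly, the Config and train() pieces fit together as in this simplified stand-in (names, fields, and the JSON log format are illustrative assumptions, not the notebooks' actual code):

```python
import json
from pathlib import Path

# Config: one place to control the experiment ID, data paths, and run length.
CONFIG = {"exp_id": "t01e01", "data_dir": "data/pipe001", "epochs": 3}

def train(step_fn, config, out_dir):
    """Stand-in for the notebooks' train() utility: runs one step per epoch
    and logs the returned metrics to JSON so results stay traceable."""
    log = []
    for epoch in range(config["epochs"]):
        metrics = step_fn(epoch)  # a real step would train/evaluate the model
        metrics["epoch"] = epoch
        log.append(metrics)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / f"{config['exp_id']}_epochs.json").write_text(json.dumps(log, indent=2))
    return log
```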

📂 Results Storage Format

After training, all outputs are saved to:

results/exp_track/TrainTrack_{id}/
  ├── ML_pipe_params/             # DL pipeline hyperparameters
  ├── models/                     # (Optional) Trained weights
  ├── performance_across_epochs/  # Metric logs for each epoch
  └── predictions/                # Final predictions for test/train sets
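Preparing that layout takes only a few lines; the helper below is an illustrative sketch of how such folders could be created (not the repository's actual code):

```python
from pathlib import Path

def make_track_dirs(track_id: str, root: str = "results/exp_track") -> Path:
    """Create the experiment-tracking folder layout for one training run."""
    base = Path(root) / f"TrainTrack_{track_id}"
    for sub in ("ML_pipe_params", "models",
                "performance_across_epochs", "predictions"):
        (base / sub).mkdir(parents=True, exist_ok=True)
    return base
```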

📓 Training Notebooks

First Group — Single Stratified 5-fold CV:

  • FR_t01e01nb03_Training_v1_vanillaCNN.ipynb
  • FR_t01e01nb03_Training_v2_ResNet.ipynb
  • FR_t01e01nb03_Training_v3_AlexNet.ipynb
    • The training hyperparameter pipe_params['split_strategy']['shuffle_instances'] controls the "Without Order" vs. "With Order" handling of CT scans during splitting.
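The intuition behind this toggle can be sketched as follows (a simplified illustration assuming consecutive indices correspond to slices of the same CT scan; this is not the repository's custom_StratifiedKFold implementation):

```python
import random

def split_indices(n: int, test_frac: float = 0.2,
                  shuffle_instances: bool = True, seed: int = 0):
    """Return (train, test) index lists.

    shuffle_instances=True ("Without Order") mixes slices freely, so slices
    of the same CT scan can land in both sets and leak information.
    With False ("With Order"), contiguous blocks of slices stay together.
    """
    idx = list(range(n))
    if shuffle_instances:
        random.Random(seed).shuffle(idx)
    n_test = int(n * test_frac)
    return idx[n_test:], idx[:n_test]
```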

Second Group — Repeated Stratified 5-fold CV:

  • FR_t01e01nb04_Training_RepeatKFoldCV_v1_vanillaCNN.ipynb
  • FR_t01e01nb04_Training_RepeatKFoldCV_v2_ResNet.ipynb
    • Extends custom_StratifiedKFold() and the Training Step to handle multiple repetitions of 5-fold CV.
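The repetition extension amounts to re-running a stratified K-fold with a fresh shuffle each time. A minimal self-contained sketch of the idea (not the repository's custom_StratifiedKFold()) is:

```python
import random
from collections import defaultdict

def repeated_stratified_kfold(labels, k=5, n_repeats=2, seed=0):
    """Yield (repeat, fold, test_indices) tuples. Each repeat reshuffles the
    indices within every class, then deals them round-robin into k folds so
    class proportions are preserved per fold."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    for rep in range(n_repeats):
        folds = [[] for _ in range(k)]
        for idxs in by_class.values():
            shuffled = idxs[:]
            rng.shuffle(shuffled)
            for j, i in enumerate(shuffled):
                folds[j % k].append(i)
        for fold, test in enumerate(folds):
            yield rep, fold, sorted(test)
```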

📊 Post-Performance Analysis

This flexible section helps identify directions for improvement:

  • Data preprocessing: CT scan image transformations, feature engineering, feature selection, etc.
  • Training strategy: Modifying/controlling how the DL architecture learns patterns.
  • DL architecture: Modifying/adding/connecting layers, DL structures, etc., to better fit the input data patterns.

All post-performance analysis can be found here:

  • FR_t01e01nb100_VIZ_Kfold_results_analysis_v1.ipynb

*Figure: K-fold performance comparison between the "Without Order" and "With Order" conditions.*

Based on these results:

  • Condition "Without Order": Ignoring CT scan order during data splitting appears to inflate performance metrics.
  • Condition "With Order": When CT scan order is respected, the ResNet pipeline shows the best performance, achieving a mean accuracy of 86% (95% confidence interval: 77%–95%) on unseen CT scan images similar to the lung cancer cases represented in the IQ-OTH/NCCD dataset.
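For reference, a mean accuracy with a normal-approximation 95% confidence interval across folds can be computed as below (the fold accuracies here are made-up illustration values, not the paper's results):

```python
import math
import statistics

def mean_ci(values, z=1.96):
    """Mean and normal-approximation 95% CI for per-fold accuracies."""
    m = statistics.mean(values)
    half = z * statistics.stdev(values) / math.sqrt(len(values))
    return m, m - half, m + half

# Made-up fold accuracies for illustration:
fold_acc = [0.80, 0.85, 0.90, 0.88, 0.84]
mean_acc, ci_low, ci_high = mean_ci(fold_acc)
```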

🧱 Docker & Environment

Add Python Libraries (inside the container)

Open a terminal in JupyterLab and run:

pip install "your_library"
pwd  # ensure you are in /workspace
pip freeze | grep -v "feedstock_root" > requirements.txt

Base Image

Current: nvidia/cuda:12.2.2-runtime-ubuntu22.04

📌 Notes on PyTorch + CUDA Compatibility

If you change the CUDA version, ensure PyTorch matches it. Example installation for CUDA 11.8:

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
  • torch==2.7.0+cu118
  • torchvision==0.22.0+cu118
  • torchaudio==2.7.0+cu118
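A quick sanity check is to confirm that all three wheels carry the same `+cuXXX` build tag; the helper below is a small illustrative check (not an official PyTorch utility):

```python
def cuda_build_tag(version: str) -> str:
    """Return the build tag of a PyTorch-style version string,
    e.g. '2.7.0+cu118' -> 'cu118'; plain versions are treated as CPU builds."""
    return version.split("+", 1)[1] if "+" in version else "cpu"

# Inside the container, the real strings come from torch.__version__,
# torchvision.__version__, and torchaudio.__version__.
tags = {cuda_build_tag(v) for v in ("2.7.0+cu118", "0.22.0+cu118", "2.7.0+cu118")}
assert len(tags) == 1  # mismatched tags usually indicate a broken install
```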

For any questions or issues, feel free to open an issue or contact the first author.
