Lung Cancer Detection with Deep Learning
This repository accompanies the paper "Data Splitting Bias in Public CT Datasets: Lessons from IQ-OTH/NCCD".
We provide a reproducible deep learning (DL) experimental pipeline for lung cancer detection. This includes custom data preprocessing strategies, data split considerations, multiple architecture designs, and a rigorous model evaluation setup.
- Project Setup
- Project Folder Structure & Naming Convention
- Expected Raw Data Structure
- Initial Data Exploration
- Deep Learning Pipeline Overview
- Post-Performance Analysis
- Docker & Environment
- Notes on PyTorch + CUDA Compatibility
- Install and launch Docker Desktop.
- Clone this repository:
git clone https://github.com/your_username/LungCancer_PRICAI2025.git
cd LungCancer_PRICAI2025
- Start JupyterLab via Docker Compose:
docker compose --compatibility up
- Open your browser and go to: http://localhost:8000/lab
- JupyterLab password:
password
Build Time: The Docker image takes ~813 seconds to build (with 60 Mbps bandwidth).
The repository follows a modular folder structure:
.
├── data/              # Contains original and preprocessed datasets
│   ├── 01_original/   # Original dataset (see Expected Raw Data Structure)
│   └── {pipe_id}/     # Output from preprocessing notebooks
├── notebooks/         # Main experimentation area
│   └── {task ID}/     # Task-focused goal
│       └── {exp ID}/  # Experiment-specific steps to reach task goal (.ipynb or .py)
│           └── (Optional: quick viz, other)
├── results/           # Stores all experiment tracking outputs
├── env files          # Dockerfile, docker-compose.yml, requirements.txt
└── README.md          # Project documentation
This organization allows independent prototyping under each task/experiment scope, enabling iterative and reproducible development.
- Task ID: tsk01 → Focused on image-level detection of lung cancer labels using the IQ-OTH/NCCD dataset.
- Experiment ID: exp01 → Compares the performance of data split strategies: with vs. without consideration of CT scan order during train-test splitting. This experiment also explores performance improvements across Vanilla 2DCNN, ResNet, and AlexNet architectures.
All notebooks follow the naming structure:
{AI DeveloperInitial}_{TaskID}_{ExperimentID}_{NotebookID}_{ShortDesc}_{VariationsID}.ipynb
- Each notebook output uses this naming convention as an ID to trace it back to its source code, e.g.:
{AI DeveloperInitial}_{TaskID}{ExperimentID}{NotebookID}{VariationsID}_Boxplot_name.png
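As a sketch, the convention can be composed programmatically. The helper below is illustrative and not part of the repository; note that in the actual notebook filenames the task, experiment, and notebook IDs appear concatenated:

```python
def notebook_name(dev_initials, task_id, exp_id, nb_id, short_desc, variation):
    """Compose a notebook filename per the repo's naming convention
    (illustrative helper, not part of the repository)."""
    return f"{dev_initials}_{task_id}{exp_id}{nb_id}_{short_desc}_{variation}.ipynb"

notebook_name("FR", "t01", "e01", "nb02", "pre-processing_pipeline", "v2_pipe002")
# → "FR_t01e01nb02_pre-processing_pipeline_v2_pipe002.ipynb"
```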
Download the dataset from Mendeley Data: IQ-OTH/NCCD Lung Cancer Dataset
Place the raw dataset in the following path:
data/01_original/IQ_OTHNCCD_LungCancer/IQ_OTHNCCD/
├── Bengin cases/      # note: "Bengin" is the dataset's original spelling
├── Malignant cases/
└── Normal cases/
CT scan counts per class and image sizes are explored in:
FR_t01e01nb01_initial_tasteExploration_of_raw_data_v1.ipynb
All steps are implemented in modular Jupyter Notebooks. The project uses a pragmatic structure for AI researchers and practitioners.
Each preprocessing pipeline loads the raw data, transforms it, and saves the output as .npy files. Each pipeline also generates a metadata file describing the processed .npy files for later use.
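A minimal sketch of this save-plus-metadata pattern (the function, file names, and metadata fields below are assumptions for illustration, not the repo's actual pipeline code):

```python
import json
import numpy as np
from pathlib import Path

def run_pipeline(images, out_dir, pipe_id="pipe001"):
    """Normalize raw images, save them as a single .npy file, and write a
    metadata JSON describing the output (illustrative sketch)."""
    out = Path(out_dir) / pipe_id
    out.mkdir(parents=True, exist_ok=True)
    # Stack into one float array scaled to [0, 1]
    arr = np.stack([img.astype(np.float32) / 255.0 for img in images])
    npy_path = out / f"{pipe_id}_images.npy"
    np.save(npy_path, arr)
    # Metadata file lets later notebooks load the .npy without guessing shapes
    metadata = {
        "pipe_id": pipe_id,
        "n_images": int(arr.shape[0]),
        "image_shape": list(arr.shape[1:]),
        "dtype": str(arr.dtype),
        "file": npy_path.name,
    }
    (out / f"{pipe_id}_metadata.json").write_text(json.dumps(metadata, indent=2))
    return arr, metadata
```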
The following notebooks represent distinct preprocessing steps:
FR_t01e01nb02_pre-processing_pipeline_v1_pipe001.ipynb
FR_t01e01nb02_pre-processing_pipeline_v2_pipe002.ipynb
FR_t01e01nb02_pre-processing_pipeline_v3_pipe003.ipynb
FR_t01e01nb02_pre-processing_pipeline_v4_pipe004.ipynb
Example of CT scan image after being loaded and transformed by each preprocessing pipeline
Output images from each preprocessing strategy are illustrated within the notebooks.
Each notebook follows this inner section structure:
- Import Libraries
- Config: Controls experiment ID and data paths.
- Utility Functions: Includes the train() function, which tracks DL experiment results.
- Architecture Designs: Defines the models (2DCNN, ResNet, AlexNet).
- Training Steps: Includes a Training configuration subsection covering the whole DL pipeline's hyperparameter tuning.
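For orientation, a Training configuration block typically boils down to a nested hyperparameter dictionary. In the sketch below, only `pipe_params['split_strategy']['shuffle_instances']` is documented in this README; the remaining keys and values are assumptions:

```python
# Illustrative pipe_params layout; only split_strategy/shuffle_instances is
# documented in the README, the remaining keys are assumptions for this sketch.
pipe_params = {
    "exp_id": "t01e01",
    "arch": "ResNet",
    "split_strategy": {"shuffle_instances": False, "test_size": 0.2},
    "training": {"epochs": 50, "batch_size": 32, "lr": 1e-3, "optimizer": "Adam"},
}
```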
After training, all outputs are saved to:
results/exp_track/TrainTrack_{id}/
├── ML_pipe_params/             # DL pipeline hyperparameters
├── models/                     # (Optional) Trained weights
├── performance_across_epochs/  # Metric logs for each epoch
└── predictions/                # Final predictions for test/train sets
FR_t01e01nb03_Training_v1_vanillaCNN.ipynb
FR_t01e01nb03_Training_v2_ResNet.ipynb
FR_t01e01nb03_Training_v3_AlexNet.ipynb
- Training hyperparameter controlling the "Without Order" vs. "With Order" conditions for CT scans:
pipe_params['split_strategy']['shuffle_instances']
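Conceptually, the flag toggles between shuffling image instances before the split and holding out the ordered tail of the sequence. A toy sketch of that distinction (not the repo's implementation):

```python
import random

def split_indices(n, test_frac=0.2, shuffle_instances=False, seed=0):
    """Split indices 0..n-1 into train/test. With shuffle_instances=True,
    slices from the same CT scan can leak across the split; with False,
    the ordered tail is held out, keeping scan order intact (sketch)."""
    idx = list(range(n))
    if shuffle_instances:
        random.Random(seed).shuffle(idx)  # "Without Order" condition
    cut = int(n * (1 - test_frac))
    return idx[:cut], idx[cut:]
```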
FR_t01e01nb04_Training_RepeatKFoldCV_v1_vanillaCNN.ipynb
FR_t01e01nb04_Training_RepeatKFoldCV_v2_ResNet.ipynb
- Extends custom_StratifiedKFold() and the Training Step to handle multiple repetitions of 5-fold CV.
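One way to sketch repeated stratified k-fold splitting in plain Python (the repo's custom_StratifiedKFold() may differ in detail):

```python
import random
from collections import defaultdict

def repeated_stratified_kfold(labels, n_splits=5, n_repeats=2, seed=0):
    """Yield (repeat, fold, test_indices): each repeat reshuffles indices
    within each class, then deals them round-robin into n_splits folds so
    every fold keeps roughly the class proportions (illustrative sketch)."""
    rng = random.Random(seed)
    for rep in range(n_repeats):
        by_class = defaultdict(list)
        for i, y in enumerate(labels):
            by_class[y].append(i)
        folds = [[] for _ in range(n_splits)]
        for idxs in by_class.values():
            rng.shuffle(idxs)
            for j, i in enumerate(idxs):
                folds[j % n_splits].append(i)
        for k, test_idx in enumerate(folds):
            yield rep, k, sorted(test_idx)
```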
This flexible section supports identifying directions for improving:
- Data preprocessing: CT scan image transformations, feature engineering, feature selection, etc.
- Training strategy: Modifying/controlling how the DL architecture learns patterns.
- DL architecture: Modifying/adding/connecting layers, DL structures, etc., to better fit the input data patterns.
All post-performance analysis can be found here:
FR_t01e01nb100_VIZ_Kfold_results_analysis_v1.ipynb
Based on the previous image results:
- Condition "Without Order": Ignoring CT scan order during data splitting appears to inflate performance metrics.
- Condition "With Order": When CT scan order is considered, the RESNET pipeline shows improved performance, achieving a mean Accuracy of 86%, with a 95% confidence interval ranging from 77% to 95% for unseen CT scan images that are similar to the lung cancer cases represented in the IQ-OTH/NCCD dataset.
Open a terminal in JupyterLab:
pip install "your_library"
pwd # ensure you are in /workspace
pip freeze | grep -v "feedstock_root" > requirements.txt
Current base image: nvidia/cuda:12.2.2-runtime-ubuntu22.04
If you change the CUDA version, ensure PyTorch matches it. Example installation for CUDA 11.8:
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
torch==2.7.0+cu118
torchvision==0.22.0+cu118
torchaudio==2.7.0+cu118
For any questions or issues, feel free to open an issue or contact the first author.