sfu-cmpt340/2025_1_project_02

SFU CMPT 340 Project: Lung Cancer Early Detection Using Machine Learning (LCED)

This project aims to build a machine learning pipeline to predict the early onset of lung cancer based on publicly available medical datasets. The workflow covers the entire pipeline: data collection, preprocessing, feature selection, architecture, training, and evaluation.

Important Links

| Timesheet | Slack channel | Project report |

Video/demo/GIF

YouTube: | Demo |

Table of Contents

  1. File Directory Infrastructure
  2. Project Setup
  3. Methods Overview
  4. Bonus: Project Overview – CT Scan Image Classification (Notebook)

1. File Directory Infrastructure

The repository is organized as follows. Comments next to each entry describe what the file or folder contains.

2025_1_project_02/
├── LICENSE
├── README.md
├── requirements.txt                        # Full list of required packages for the whole project
├── driver.py
├── ct_scan_prediction/
│   ├── classification_model.ipynb          # Jupyter notebook to train and evaluate the CNN model
│   ├── classification_model.py             # Python script version of the notebook
│   ├── run_lung_cancer_model.py            # Loads trained model and predicts lung cancer class from input image
│   ├── final_trained_model_accuracy.txt    # Output log showing final test accuracy after model training
│   ├── requirements.txt                    # Auto-generated using pipreqs for only ct_scan_prediction dependencies
│   ├── test_input_images/                  # Folder containing test CT scan images
│   │   ├── test_benign.jpg
│   │   ├── test_malignant.jpg
│   │   └── test_normal.jpg
│   └── readme.md
├── data_analysis/
│   ├── __init__.py
│   ├── __pycache__/
│   ├── visualize.py                        # Plots histograms and heatmaps from optimized_lung_cancer_data.csv
│   └── visualize_feature_distribution.py   # Plots histograms for original Kaggle datasets
├── data_manipulation/
│   ├── __init__.py
│   ├── __pycache__/
│   ├── combine_datasets.py                 # Merges and cleans original datasets
│   ├── optimized_lung_cancer_data.csv      # Fully cleaned and optimized dataset
│   ├── questionaire.py                     # CLI-based user questionnaire and prediction demo
│   ├── readme.md
│   └── training_model.py                   # Logistic regression model training and evaluation
├── datasets/
│   ├── patientdata1_kaggle.csv             # Tabular dataset 1
│   ├── patientdata2_kaggle.csv             # Tabular dataset 2
│   └── image_dataset/
│       ├── Test cases/                     # Images for training object detection model (not used)
│       │   └── (multiple CT scan images)
│       └── The IQ-OTHNCCD lung cancer dataset/
│           └── (multiple CT scan images)   # Images for CNN model training
├── distribution_of_original_dataset_features/
│   └── (multiple histogram .png files)     # Histograms of feature distributions, used to guide imputation
└── optimized_csv_plots/
    ├── combined_histogram.png
    └── correlation_heatmap.png





2. Project Setup

This section provides the steps needed to set up and run the project on CSIL workstations or on any local machine with Anaconda installed. Following them keeps package versions consistent across platforms.

git clone https://github.com/sfu-cmpt340/2025_1_project_02.git
cd 2025_1_project_02

# Optional: Creating virtual environment
python3 -m venv venv
source venv/bin/activate 

# Install project dependencies (does not include the dependencies and model needed for the notebook in ct_scan_prediction):
pip install -r requirements.txt

# Run the project (comment out function calls in driver.py if you would like to run the steps one at a time):
python driver.py

2.1 Libraries

Libraries used in this project:

scikit-learn, matplotlib, numpy, pandas, seaborn

2.2 Limitations

The study relied on datasets obtained from Kaggle, which may not represent global patient populations or diverse demographic groups. This limits how well the model generalizes to broader populations.

Several features were dropped during preprocessing (e.g., Occupational Hazards, Anxiety, Peer Pressure) because they were deemed vague or redundant. This reduction in dimensionality may exclude potentially relevant predictors and make the model less comprehensive.

3. Methods Overview

This section summarizes the key functions used throughout the project along with their responsibilities and output.


🔹 original_features_histogram_maker()

📍 data_analysis/visualize_feature_distribution.py
Description:

  • Loads the original Kaggle datasets
  • Standardizes and merges overlapping features
  • Scales numeric features
  • Generates histograms showing the distribution of each feature

Output:

  • Saves individual histograms to distribution_of_original_dataset_features/

🔹 combining_datasets()

📍 data_manipulation/combine_datasets.py
Description:

  • Merges patientdata1_kaggle.csv and patientdata2_kaggle.csv
  • Cleans and standardizes column names
  • Handles missing values with a custom ThresholdImputer (based on skewness)
  • Normalizes all numeric values with MinMaxScaler

Output:

  • Saves the cleaned and processed dataset as data_manipulation/optimized_lung_cancer_data.csv
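The merge, impute, and scale steps above could look roughly like the sketch below. It is not the code in combine_datasets.py: the column names, the toy data, the `SKEW_THRESHOLD` value, and the rule "median for strongly skewed columns, mean otherwise" are all assumptions, since this README only states that the custom ThresholdImputer is based on skewness.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

SKEW_THRESHOLD = 1.0  # hypothetical cutoff; the project's actual value is not documented here


def impute_by_skewness(df: pd.DataFrame, threshold: float = SKEW_THRESHOLD) -> pd.DataFrame:
    """Fill missing numeric values: median for strongly skewed columns, mean otherwise."""
    out = df.copy()
    for col in out.select_dtypes(include="number").columns:
        skew = out[col].skew()  # missing values are skipped by default
        fill = out[col].median() if abs(skew) > threshold else out[col].mean()
        out[col] = out[col].fillna(fill)
    return out


def combine_and_scale(df1: pd.DataFrame, df2: pd.DataFrame) -> pd.DataFrame:
    """Standardize column names, stack both datasets, impute, and scale to [0, 1]."""
    frames = []
    for df in (df1, df2):
        renamed = df.rename(columns=lambda c: c.strip().lower().replace(" ", "_"))
        frames.append(renamed)
    combined = pd.concat(frames, ignore_index=True)
    combined = impute_by_skewness(combined)
    scaled = MinMaxScaler().fit_transform(combined)
    return pd.DataFrame(scaled, columns=combined.columns)


# Toy stand-ins for the two Kaggle CSVs (made-up values and column names).
a = pd.DataFrame({"Age": [40, 55, None, 70], "Smoking": [1, 0, 1, 1]})
b = pd.DataFrame({" age": [30, 65], "smoking ": [0, 1]})
result = combine_and_scale(a, b)
print(result)
```

In the real pipeline the result would be written out with `result.to_csv(...)` to produce optimized_lung_cancer_data.csv.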

🔹 optimized_csv_histogram_maker()

📍 data_analysis/visualize.py
Description:

  • Loads the optimized dataset
  • Splits data by lung cancer diagnosis (Yes/No)
  • Creates side-by-side grouped histograms for all features
  • Visualizes feature correlation with a heatmap

Output:

  • Combined histogram: optimized_csv_plots/combined_histogram.png
  • Correlation heatmap: optimized_csv_plots/correlation_heatmap.png
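As a rough illustration of the two plots described above, the sketch below draws per-feature histograms grouped by diagnosis and a correlation heatmap. It uses synthetic data and plain matplotlib (`imshow`) rather than whatever plotting calls visualize.py actually uses; the column names and the temporary output folder are assumptions.

```python
import tempfile
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for optimized_lung_cancer_data.csv (column names are assumptions).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.random(200),
    "smoking": rng.integers(0, 2, 200).astype(float),
    "lung_cancer": rng.integers(0, 2, 200),
})
features = [c for c in df.columns if c != "lung_cancer"]
out_dir = Path(tempfile.mkdtemp())  # the project writes to optimized_csv_plots/ instead

# One panel per feature, with the two diagnosis groups drawn side by side.
fig, axes = plt.subplots(1, len(features), figsize=(5 * len(features), 4), squeeze=False)
for ax, col in zip(axes[0], features):
    groups = [df.loc[df["lung_cancer"] == label, col].to_numpy() for label in (0, 1)]
    ax.hist(groups, bins=10, label=["No", "Yes"])
    ax.set_title(col)
    ax.legend(title="Lung cancer")
fig.savefig(out_dir / "combined_histogram.png")
plt.close(fig)

# Pearson correlation matrix rendered as a heatmap.
corr = df.corr()
fig, ax = plt.subplots(figsize=(4, 4))
im = ax.imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
ax.set_xticks(range(len(corr)))
ax.set_xticklabels(corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr)))
ax.set_yticklabels(corr.columns)
fig.colorbar(im, ax=ax)
fig.savefig(out_dir / "correlation_heatmap.png", bbox_inches="tight")
plt.close(fig)
```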

🔹 create_classifier()

📍 data_manipulation/training_model.py
Description:

  • Loads the optimized dataset
  • Constructs a pipeline for preprocessing and classification using Logistic Regression
  • Trains the model and evaluates it using accuracy, confusion matrix, and classification report
  • Performs 5-fold cross-validation

Output:

  • Prints training/testing scores and model evaluation in terminal
  • Returns trained model pipeline clf
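The training and evaluation flow described above can be sketched with scikit-learn as follows. This is a minimal sketch on synthetic data, not the contents of training_model.py: the `StandardScaler` preprocessing step, the 80/20 split, and the hyperparameters are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary data standing in for the optimized dataset.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

clf = Pipeline([
    ("scale", StandardScaler()),                   # assumed preprocessing step
    ("model", LogisticRegression(max_iter=1000)),
])
clf.fit(X_train, y_train)

# Hold-out evaluation: accuracy, confusion matrix, per-class report.
y_pred = clf.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
print("Train accuracy:", clf.score(X_train, y_train))
print("Test accuracy:", test_accuracy)
print(cm)
print(classification_report(y_test, y_pred))

# 5-fold cross-validation; the pipeline is refit on each fold.
cv_scores = cross_val_score(clf, X, y, cv=5)
print("5-fold CV accuracy: %.3f +/- %.3f" % (cv_scores.mean(), cv_scores.std()))
```

Wrapping the scaler and the classifier in one `Pipeline` means the scaler is fit only on the training portion of each fold, which avoids leaking test data into preprocessing.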

🔹 input_predictor()

📍 data_manipulation/questionaire.py
Description:

  • Launches a CLI-based questionnaire to collect user health information
  • Normalizes and formats inputs for prediction
  • Uses trained model (create_classifier()) to predict lung cancer risk

Output:

  • Prints a human-readable diagnosis and probability estimate in terminal
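The step from questionnaire answers to a printed probability might look like the sketch below. The feature names, their order, the age range used for scaling, and the 0.5 cutoff are hypothetical, and the interactive prompts are left out so the example runs non-interactively; questionaire.py may differ on all of these points.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

FEATURES = ["age", "smoking", "coughing"]  # hypothetical feature order
AGE_MIN, AGE_MAX = 20, 90                  # hypothetical range used for min-max scaling


def encode_answers(answers: dict) -> np.ndarray:
    """Turn raw questionnaire answers into one normalized feature row."""
    age = min(max(float(answers["age"]), AGE_MIN), AGE_MAX)
    row = [(age - AGE_MIN) / (AGE_MAX - AGE_MIN)]
    for name in FEATURES[1:]:
        is_yes = str(answers[name]).strip().lower() in {"y", "yes"}
        row.append(1.0 if is_yes else 0.0)
    return np.array([row])


def describe_risk(model, answers: dict) -> str:
    """Return a human-readable message with the predicted probability of class 1."""
    proba = model.predict_proba(encode_answers(answers))[0, 1]
    verdict = "elevated" if proba >= 0.5 else "low"
    return f"Estimated lung cancer risk: {verdict} ({proba:.0%} probability)"


# Tiny made-up training set so the example runs on its own.
X = np.array([[0.1, 0, 0], [0.2, 0, 1], [0.8, 1, 1], [0.9, 1, 0], [0.5, 1, 1], [0.3, 0, 0]])
y = np.array([0, 0, 1, 1, 1, 0])
model = LogisticRegression().fit(X, y)

message = describe_risk(model, {"age": 62, "smoking": "Yes", "coughing": "no"})
print(message)
```

In the project, the `model` argument would be the pipeline returned by create_classifier() rather than the toy model fitted here.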

4. Bonus: Project Overview – CT Scan Image Classification (Notebook)

In addition to analyzing tabular patient data, this project includes an image-based approach to lung cancer detection using a Convolutional Neural Network (CNN).

The notebook classification_model.ipynb (in ct_scan_prediction/) walks through the full image classification pipeline using CT scan data.


Objectives

  • Load and preprocess the IQ-OTHNCCD lung cancer dataset
  • Apply image augmentation (flipping, brightness/contrast, rotation)
  • Build and train a CNN model using DenseNet121 via Keras and TensorFlow
  • Evaluate the model on validation data using accuracy, precision, and recall
  • Save the trained model to .h5 and store final accuracy in a log file
  • Visualize predictions and training metrics directly in the notebook
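To make the augmentation objective concrete, here is a small sketch of the three operations (flipping, brightness/contrast, rotation) written in plain NumPy. The project lists Augmentor for this step, so this is a stand-in technique rather than the notebook's code; the function names are hypothetical, and rotation is simplified to multiples of 90 degrees.

```python
import numpy as np


def flip_horizontal(img: np.ndarray) -> np.ndarray:
    """Mirror the image left to right."""
    return img[:, ::-1]


def adjust_brightness_contrast(
    img: np.ndarray, brightness: float = 0.0, contrast: float = 1.0
) -> np.ndarray:
    """Scale pixels around the mean (contrast), shift them (brightness), clip to [0, 255]."""
    mean = img.mean()
    out = (img.astype(np.float32) - mean) * contrast + mean + brightness
    return np.clip(out, 0, 255).astype(np.uint8)


def rotate_quarter_turns(img: np.ndarray, k: int = 1) -> np.ndarray:
    """Rotate by k * 90 degrees counter-clockwise."""
    return np.rot90(img, k)


# Random array standing in for a grayscale CT slice.
rng = np.random.default_rng(42)
scan = rng.integers(0, 256, size=(64, 48), dtype=np.uint8)

augmented = [
    flip_horizontal(scan),
    adjust_brightness_contrast(scan, brightness=20, contrast=1.2),
    rotate_quarter_turns(scan),
]
for image in augmented:
    print(image.shape, image.dtype)
```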

Requirements

Make sure the following libraries are installed:

  • tensorflow
  • keras
  • numpy
  • matplotlib
  • scikit-learn
  • Pillow
  • Augmentor

Also ensure:

  • The IQ-OTHNCCD image dataset is placed inside datasets/image_dataset/
  • You have mounted Google Drive if running in Colab

Setup/Execution

Make sure you are working in a virtual environment or a Conda environment, then install all dependencies and necessary libraries with:

pip install -r requirements.txt

If using Google Colab, mount Google Drive by running from google.colab import drive followed by drive.mount('/content/drive').

Then, place the IQ-OTHNCCD lung cancer dataset in your Drive under

/MyDrive/image_dataset/The IQ-OTHNCCD lung cancer dataset/

with the following subfolders: Benign cases/, Malignant cases/, and Normal cases/. The overview of directories should look like:

└──MyDrive/
   └── image_dataset/
      └── The IQ-OTHNCCD lung cancer dataset/
          ├── Benign cases
          ├── Malignant cases
          └── Normal cases

If running the project outside of Colab, be sure to update all file paths in the scripts (e.g., paths starting with /content/ or /MyDrive/) to match your local environment.
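One way to avoid editing paths by hand is a small helper that picks the dataset location depending on where the code runs. The sketch below is a suggestion, not something the scripts currently contain: `resolve_dataset_dir` and `missing_class_folders` are hypothetical names, and the local fallback assumes the `datasets/image_dataset/` layout from the file tree in section 1.

```python
from pathlib import Path

COLAB_DRIVE_ROOT = Path("/content/drive/MyDrive")  # where drive.mount('/content/drive') exposes Drive
DATASET_NAME = "The IQ-OTHNCCD lung cancer dataset"
CLASS_FOLDERS = ("Benign cases", "Malignant cases", "Normal cases")


def resolve_dataset_dir(local_root: Path = Path("datasets")) -> Path:
    """Use the mounted Drive folder when it exists, otherwise the local checkout."""
    base = COLAB_DRIVE_ROOT if COLAB_DRIVE_ROOT.is_dir() else local_root
    return base / "image_dataset" / DATASET_NAME


def missing_class_folders(dataset_dir: Path) -> list:
    """Return the expected class subfolders that are not present."""
    return [name for name in CLASS_FOLDERS if not (dataset_dir / name).is_dir()]


dataset_dir = resolve_dataset_dir()
print("Dataset directory:", dataset_dir)
print("Missing class folders:", missing_class_folders(dataset_dir))
```

An empty "Missing class folders" list means the three expected subfolders are in place.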

Afterward, train the model by running:

python classification_model.py

This script performs data augmentation, renames and balances the image datasets, splits the data into training and validation sets, and trains a CNN model based on DenseNet121. The trained model is saved to:

/MyDrive/lung_cancer_classification_model.h5

To classify a CT scan image with the trained model, run:

python run_lung_cancer_model.py

Make sure the model file exists at the specified path and that test_img_path in the script points to the CT scan image you want to classify.
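For context, turning the model's raw output into one of the three class names usually comes down to an argmax over the predicted probabilities. The sketch below shows that step in isolation with NumPy. The `CLASS_NAMES` order and the `decode_prediction` helper are assumptions; the order must match the label order used during training, which this README does not state.

```python
import numpy as np

# Assumed label order; it must match the order used when the model was trained.
CLASS_NAMES = ("benign", "malignant", "normal")


def decode_prediction(probabilities) -> tuple:
    """Map a (1, 3) or (3,) array of class probabilities to (label, confidence)."""
    probs = np.asarray(probabilities, dtype=float).ravel()
    if probs.size != len(CLASS_NAMES):
        raise ValueError(f"expected {len(CLASS_NAMES)} probabilities, got {probs.size}")
    idx = int(np.argmax(probs))
    return CLASS_NAMES[idx], float(probs[idx])


# Made-up softmax output shaped like a single-image batch.
label, confidence = decode_prediction([[0.08, 0.85, 0.07]])
print(f"{label} ({confidence:.0%})")
```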

Output

  • Trained CNN model saved to: /MyDrive/lung_cancer_classification_model.h5

  • Final accuracy log: ct_scan_prediction/final_trained_model_accuracy.txt

  • Prediction tool: run_lung_cancer_model.py classifies an individual CT scan image with the trained model and returns one of three classes: "benign", "normal", or "malignant".


Example Prediction

python run_lung_cancer_model.py
