SFU CMPT 340 Project: Lung Cancer Early Detection Using Machine Learning (LCED)

This project builds a machine learning pipeline to predict the early onset of lung cancer from publicly available medical datasets. The workflow covers data collection, preprocessing, feature selection, model architecture, training, and evaluation.

Important Links

Timesheet	Slack channel	Project report

Video/demo/GIF

YouTube: Demo

1. File Directory Infrastructure

2025_1_project_02/
├── LICENSE
├── README.md
├── requirements.txt                        # Full list of required packages for the whole project
├── driver.py
├── ct_scan_prediction/
│   ├── classification_model.ipynb          # Jupyter notebook to train and evaluate the CNN model
│   ├── classification_model.py             # Python script version of the notebook
│   ├── run_lung_cancer_model.py            # Loads trained model and predicts lung cancer class from input image
│   ├── final_trained_model_accuracy.txt    # Output log showing final test accuracy after model training
│   ├── requirements.txt                    # Auto-generated using pipreqs for only ct_scan_prediction dependencies
│   ├── test_input_images/                  # Folder containing test CT scan images
│   │   ├── test_benign.jpg
│   │   ├── test_malignant.jpg
│   │   └── test_normal.jpg
│   └── readme.md
├── data_analysis/
│   ├── __init__.py
│   ├── __pycache__/
│   ├── visualize.py                        # Plots histograms and heatmaps from optimized_lung_cancer_data.csv
│   └── visualize_feature_distribution.py   # Plots histograms for original Kaggle datasets
├── data_manipulation/
│   ├── __init__.py
│   ├── __pycache__/
│   ├── combine_datasets.py                 # Merges and cleans original datasets
│   ├── optimized_lung_cancer_data.csv      # Fully cleaned and optimized dataset
│   ├── questionaire.py                     # CLI-based user questionnaire and prediction demo
│   ├── readme.md
│   └── training_model.py                   # Logistic regression model training and evaluation
├── datasets/
│   ├── patientdata1_kaggle.csv             # Tabular dataset 1
│   ├── patientdata2_kaggle.csv             # Tabular dataset 2
│   └── image_dataset/
│       ├── Test cases/                     # Images for training object detection model (not used)
│       │   └── (multiple CT scan images)
│       └── The IQ-OTHNCCD lung cancer dataset/
│           └── (multiple CT scan images)   # Images for CNN model training
├── distribution_of_original_dataset_features/
│   └── (multiple histogram .png files)     # Histogram of feature distributions for imputation insights
└── optimized_csv_plots/
    ├── combined_histogram.png
    └── correlation_heatmap.png

2. Project Setup

This section provides the steps needed to set up and run the project on CSIL workstations or any local machine with Anaconda installed. The instructions ensure consistent versions and compatibility across platforms.

git clone https://github.com/sfu-cmpt340/2025_1_project_02.git
cd 2025_1_project_02

# Optional: create a virtual environment
python3 -m venv venv
source venv/bin/activate

# Install project dependencies (does not include the dependencies and model needed for the notebook in ct_scan_prediction)
pip install -r requirements.txt

# Run the project (comment out function calls in driver.py to run the steps one at a time)
python driver.py

2.1 Libraries

Libraries used in this project:

scikit-learn, matplotlib, numpy, pandas, seaborn

2.2 Limitations

The study relied on datasets obtained from Kaggle, which may not represent global patient populations or diverse demographic groups. This limits how well the model generalizes to broader populations.

Several features were dropped during preprocessing (e.g., Occupational Hazards, Anxiety, Peer Pressure) because they were deemed vague or redundant. This reduction in dimensionality might exclude potentially relevant predictors and reduce the model's comprehensiveness.

3. Methods Overview

This section summarizes the key functions used throughout the project, along with their responsibilities and outputs.

🔹 `original_features_histogram_maker()`

📍 data_analysis/visualize_feature_distribution.py
Description:

Loads the original Kaggle datasets
Standardizes and merges overlapping features
Scales numeric features
Generates histograms showing the distribution of each feature

Output:

Saves individual histograms to distribution_of_original_dataset_features/

🔹 `combining_datasets()`

📍 data_manipulation/combine_datasets.py
Description:

Merges patientdata1_kaggle.csv and patientdata2_kaggle.csv
Cleans and standardizes column names
Handles missing values with a custom ThresholdImputer (based on skewness)
Normalizes all numeric values with MinMaxScaler

Output:

Saves the cleaned and processed dataset as data_manipulation/optimized_lung_cancer_data.csv
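
The cleaning steps above can be sketched on toy data. This is only an illustration, not the project's ThresholdImputer: the function name impute_by_skewness, the skewness threshold of 1.0, the median-versus-mean rule, the use of row-wise concatenation for the merge, and the column names are all assumptions made for the example.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler


def impute_by_skewness(df, skew_threshold=1.0):
    """Fill NaNs per column: median if the column is strongly skewed, else mean.

    The threshold and the median/mean rule are assumptions for this sketch.
    """
    filled = df.copy()
    for col in filled.columns:
        skew = filled[col].skew()  # NaNs are skipped by default
        use_median = abs(skew) > skew_threshold
        fill_value = filled[col].median() if use_median else filled[col].mean()
        filled[col] = filled[col].fillna(fill_value)
    return filled


# Toy stand-ins for the two Kaggle tables (column names are made up).
df1 = pd.DataFrame({"Age": [40, 55, None, 63], "Smoking": [1, 2, 2, None]})
df2 = pd.DataFrame({"AGE ": [70, 35], "SMOKING": [2, 1]})

# Standardize column names so overlapping features line up, then stack the rows.
for frame in (df1, df2):
    frame.columns = frame.columns.str.strip().str.lower()
combined = pd.concat([df1, df2], ignore_index=True)

imputed = impute_by_skewness(combined)

# Rescale every numeric column to the [0, 1] range.
scaled = pd.DataFrame(MinMaxScaler().fit_transform(imputed), columns=imputed.columns)
print(scaled)
```

Choosing the median for skewed columns keeps a few extreme values from dragging the fill value, which is one common reason to make imputation depend on skewness.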

🔹 `optimized_csv_histogram_maker()`

📍 data_analysis/visualize.py
Description:

Loads the optimized dataset
Splits data by lung cancer diagnosis (Yes/No)
Creates side-by-side grouped histograms for all features
Visualizes feature correlation with a heatmap

Output:

Combined histogram: optimized_csv_plots/combined_histogram.png
Correlation heatmap: optimized_csv_plots/correlation_heatmap.png
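
A minimal sketch of the plots described above, assuming a binary label column named lung_cancer and synthetic feature values. The column names and data are hypothetical, and plain matplotlib is used here for both plots, which may differ from how visualize.py draws them.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for the optimized dataset; column names are assumptions.
df = pd.DataFrame({"age": rng.random(200), "smoking": rng.random(200)})
df["lung_cancer"] = (df["smoking"] + 0.2 * rng.random(200) > 0.6).astype(int)

features = [c for c in df.columns if c != "lung_cancer"]
fig, axes = plt.subplots(1, len(features) + 1, figsize=(12, 3.5))

# One panel per feature, with the two diagnosis groups drawn side by side.
for ax, col in zip(axes, features):
    groups = [df.loc[df["lung_cancer"] == label, col].to_numpy() for label in (0, 1)]
    ax.hist(groups, bins=10, label=["No", "Yes"])
    ax.set_title(col)
    ax.legend(title="lung cancer")

# Last panel: Pearson correlation matrix rendered as a heatmap.
corr = df.corr()
image = axes[-1].imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
axes[-1].set_xticks(range(len(corr)))
axes[-1].set_xticklabels(corr.columns, rotation=45)
axes[-1].set_yticks(range(len(corr)))
axes[-1].set_yticklabels(corr.columns)
fig.colorbar(image, ax=axes[-1])
fig.tight_layout()
```

Calling fig.savefig with a file path would write the figure to disk in the same way the project saves its PNG files.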

🔹 `create_classifier()`

📍 data_manipulation/training_model.py
Description:

Loads the optimized dataset
Constructs a pipeline for preprocessing and classification using Logistic Regression
Trains the model and evaluates it using accuracy, confusion matrix, and classification report
Performs 5-fold cross-validation

Output:

Prints training/testing scores and model evaluation in terminal
Returns trained model pipeline clf
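
The training and evaluation flow above can be sketched with scikit-learn. The synthetic data from make_classification, the StandardScaler preprocessing step, the 80/20 split, and the step names are assumptions for illustration; the actual preprocessing inside create_classifier() may differ.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary-classification data standing in for the optimized CSV.
X, y = make_classification(n_samples=400, n_features=8, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Preprocessing and classifier chained into one estimator.
clf = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
clf.fit(X_train, y_train)

# Hold-out evaluation: accuracy, confusion matrix, per-class report.
y_pred = clf.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
print("Train accuracy:", clf.score(X_train, y_train))
print("Test accuracy:", test_accuracy)
print(cm)
print(classification_report(y_test, y_pred))

# 5-fold cross-validation refits the whole pipeline on each fold.
cv_scores = cross_val_score(clf, X, y, cv=5)
print("CV accuracy per fold:", cv_scores)
```

Putting the scaler inside the pipeline means each cross-validation fold fits its preprocessing on that fold's training portion only, which avoids leaking information from the held-out portion.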

🔹 `input_predictor()`

📍 data_manipulation/questionaire.py
Description:

Launches a CLI-based questionnaire to collect user health information
Normalizes and formats inputs for prediction
Uses trained model (create_classifier()) to predict lung cancer risk

Output:

Prints a human-readable diagnosis and probability estimate in terminal
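
A minimal sketch of how questionnaire answers might be normalized and passed to a classifier. The feature names, value ranges, toy training data, and the helper normalize_answers are all hypothetical; the real questionnaire collects its answers interactively and uses the model returned by create_classifier().

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical features with assumed (min, max) ranges used for scaling.
FEATURE_RANGES = {"age": (18, 90), "smoking": (1, 2), "coughing": (1, 2)}


def normalize_answers(answers):
    """Clamp each answer to its range and rescale it to [0, 1]."""
    row = []
    for name, (low, high) in FEATURE_RANGES.items():
        value = min(max(float(answers[name]), low), high)
        row.append((value - low) / (high - low))
    return np.array([row])  # shape (1, n_features), as predict_proba expects


# Toy model trained on random data, standing in for the project's classifier.
rng = np.random.default_rng(0)
X = rng.random((200, 3))
y = (X[:, 1] + X[:, 2] > 1.0).astype(int)
model = LogisticRegression().fit(X, y)

# In the real tool these values would come from the CLI prompts.
answers = {"age": 54, "smoking": 2, "coughing": 2}
features = normalize_answers(answers)
probability = model.predict_proba(features)[0, 1]  # probability of class 1

label = "elevated risk" if probability >= 0.5 else "lower risk"
print(f"Prediction: {label} (estimated probability {probability:.2f})")
```

The inputs must be scaled the same way as the training data, otherwise the probability estimate is not meaningful.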

Bonus: Project Overview — CT Scan Image Classification

In addition to analyzing tabular patient data, this project includes an image-based approach to lung cancer detection using a Convolutional Neural Network (CNN).

The notebook classification_model.ipynb (in ct_scan_prediction/) walks through the full image classification pipeline using CT scan data.

Objectives

Load and preprocess the IQ-OTHNCCD lung cancer dataset
Apply image augmentation (flipping, brightness/contrast, rotation)
Build and train a CNN model using DenseNet121 via Keras and TensorFlow
Evaluate the model on validation data using accuracy, precision, and recall
Save the trained model to .h5 and store final accuracy in a log file
Visualize predictions and training metrics directly in the notebook
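
The augmentation objective above (flipping, brightness/contrast, rotation) can be illustrated with plain NumPy. This is a stand-in sketch, not the project's code: the project lists Augmentor among its requirements, and the helper name augment, the parameter ranges, the restriction to 90-degree rotations, and the random test image are assumptions.

```python
import numpy as np


def augment(image, rng):
    """Return a randomly flipped, rotated, and brightness/contrast-adjusted copy."""
    out = image.astype(np.float32)

    # Horizontal flip with probability 0.5.
    if rng.random() < 0.5:
        out = np.fliplr(out)

    # Rotate by a random multiple of 90 degrees (0, 90, 180, or 270).
    out = np.rot90(out, k=int(rng.integers(0, 4)))

    # Contrast scales the spread around the mean; brightness shifts all pixels.
    contrast = rng.uniform(0.8, 1.2)
    brightness = rng.uniform(-20, 20)
    mean = out.mean()
    out = (out - mean) * contrast + mean + brightness

    # Back to valid 8-bit pixel values.
    return np.clip(out, 0, 255).astype(np.uint8)


rng = np.random.default_rng(42)
scan = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)  # fake grayscale image
augmented = [augment(scan, rng) for _ in range(4)]
print([a.shape for a in augmented])
```

Each call produces a different variant of the same image, which is how augmentation enlarges a small training set.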

Requirements

Make sure the following libraries are installed:

tensorflow
keras
numpy
matplotlib
scikit-learn
Pillow
Augmentor

Also ensure:

The IQ-OTHNCCD image dataset is placed inside datasets/image_dataset/
You have mounted Google Drive if running in Colab

Setup/Execution

Ensure you're working in a virtual environment or a Conda environment, then install all dependencies and necessary libraries with:

pip install -r requirements.txt

If using Google Colab, mount Google Drive with from google.colab import drive followed by drive.mount('/content/drive').

Then, place the IQ-OTHNCCD lung cancer dataset in your Drive under

/MyDrive/image_dataset/The IQ-OTHNCCD lung cancer dataset/

with the following subfolders: Benign cases/, Malignant cases/, and Normal cases/. The directory layout should look like:

└── MyDrive/
    └── image_dataset/
        └── The IQ-OTHNCCD lung cancer dataset/
            ├── Benign cases/
            ├── Malignant cases/
            └── Normal cases/

If running the project outside of Colab, be sure to update all file paths in the scripts (e.g., paths starting with /content/ or /MyDrive/) to match your local environment.

Afterward, run the training script:

python classification_model.py

This performs data augmentation, renames and balances the image datasets, splits the data into training and validation sets, and trains a CNN model based on DenseNet121. The trained model is saved to

/MyDrive/lung_cancer_classification_model.h5
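
The balancing and splitting steps described above can be sketched with the standard library alone. The file names, class sizes, oversampling strategy, 20% validation fraction, and both helper functions are assumptions for illustration; classification_model.py may balance and split differently.

```python
import random


def balance_by_oversampling(files_by_class, rng):
    """Repeat randomly chosen files so every class matches the largest one."""
    target = max(len(files) for files in files_by_class.values())
    balanced = {}
    for label, files in files_by_class.items():
        extra = [rng.choice(files) for _ in range(target - len(files))]
        balanced[label] = list(files) + extra
    return balanced


def stratified_split(files_by_class, val_fraction, rng):
    """Shuffle each class separately and hold out val_fraction for validation."""
    train, val = [], []
    for label, files in files_by_class.items():
        shuffled = list(files)
        rng.shuffle(shuffled)
        n_val = int(len(shuffled) * val_fraction)
        val += [(path, label) for path in shuffled[:n_val]]
        train += [(path, label) for path in shuffled[n_val:]]
    return train, val


rng = random.Random(0)

# Hypothetical file names and class sizes, chosen only to show an imbalance.
files_by_class = {
    "benign": [f"benign_{i}.jpg" for i in range(12)],
    "malignant": [f"malignant_{i}.jpg" for i in range(50)],
    "normal": [f"normal_{i}.jpg" for i in range(40)],
}

balanced = balance_by_oversampling(files_by_class, rng)
train_set, val_set = stratified_split(balanced, val_fraction=0.2, rng=rng)
print(len(train_set), len(val_set))
```

One caveat with this order of operations: when copies are created before the split, duplicates of the same image can land in both sets, which can inflate validation scores. Splitting first and balancing only the training portion avoids that.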

To classify a CT scan image with the trained model, run

python run_lung_cancer_model.py

Make sure the model file exists at the specified path and that test_img_path in the script points to the CT scan image you want to classify.

Output

Trained CNN model saved to: /MyDrive/lung_cancer_classification_model.h5
Final accuracy log: ct_scan_prediction/final_trained_model_accuracy.txt
Prediction tool: run_lung_cancer_model.py classifies an individual CT scan image with the trained model and returns one of three classes: "benign", "normal", or "malignant".
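
A small sketch of how a three-class probability vector can be turned into one of the labels above. The class order in CLASS_NAMES, the helper decode_prediction, and the probability values are assumptions; the real order depends on how the training script encoded the labels.

```python
import numpy as np

# Assumed index-to-label mapping; it must match the encoding used in training.
CLASS_NAMES = ["benign", "malignant", "normal"]


def decode_prediction(probabilities, class_names=CLASS_NAMES):
    """Map one row of class probabilities to (label, confidence)."""
    index = int(np.argmax(probabilities))
    return class_names[index], float(probabilities[index])


# Made-up output shaped like a prediction for a batch of one image: (1, 3).
batch_probabilities = np.array([[0.08, 0.85, 0.07]])

label, confidence = decode_prediction(batch_probabilities[0])
print(f"Predicted class: {label} (confidence {confidence:.2f})")
```

Keeping the label list next to the decoding step makes a mismatch between training and prediction easier to spot.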

Example Prediction

python run_lung_cancer_model.py
