This project aims to build a machine learning pipeline to predict the early onset of lung cancer based on publicly available medical datasets. The workflow covers the entire pipeline: data collection, preprocessing, feature selection, architecture, training, and evaluation.
Timesheet | Slack channel | Project report |
---|
Youtube: | Demo |
- File Directory Infrastructure
- Project Setup
- Methods Overview
- Bonus: Project Overview – CT Scan Image Classification (Notebook)
The repository is organized as follows:
2025_1_project_02/
├── LICENSE
├── README.md
├── requirements.txt # Full list of required packages for the whole project
├── driver.py
├── ct_scan_prediction/
│ ├── classification_model.ipynb # Jupyter notebook to train and evaluate the CNN model
│ ├── classification_model.py # Python script version of the notebook
│ ├── run_lung_cancer_model.py # Loads trained model and predicts lung cancer class from input image
│ ├── final_trained_model_accuracy.txt # Output log showing final test accuracy after model training
│ ├── requirements.txt # Auto-generated using pipreqs for only ct_scan_prediction dependencies
│ ├── test_input_images/ # Folder containing test CT scan images
│ │ ├── test_benign.jpg
│ │ ├── test_malignant.jpg
│ │ └── test_normal.jpg
│ └── readme.md
├── data_analysis/
│ ├── __init__.py
│ ├── __pycache__/
│ ├── visualize.py # Plots histograms and heatmaps from optimized_lung_cancer_data.csv
│ └── visualize_feature_distribution.py # Plots histograms for original Kaggle datasets
├── data_manipulation/
│ ├── __init__.py
│ ├── __pycache__/
│ ├── combine_datasets.py # Merges and cleans original datasets
│ ├── optimized_lung_cancer_data.csv # Fully cleaned and optimized dataset
│ ├── questionaire.py # CLI-based user questionnaire and prediction demo
│ ├── readme.md
│ └── training_model.py # Logistic regression model training and evaluation
├── datasets/
│ ├── patientdata1_kaggle.csv # Tabular dataset 1
│ ├── patientdata2_kaggle.csv # Tabular dataset 2
│ └── image_dataset/
│ ├── Test cases/ # Images for training object detection model (not used)
│ │ └── (multiple CT scan images)
│ └── The IQ-OTHNCCD lung cancer dataset/
│ └── (multiple CT scan images) # Images for CNN model training
├── distribution_of_original_dataset_features/
│ └── (multiple histogram .png files) # Histogram of feature distributions for imputation insights
└── optimized_csv_plots/
├── combined_histogram.png
└── correlation_heatmap.png
This section provides the steps needed to set up and run the project environment on CSIL workstations or any local machine with Anaconda installed. The instructions aim to keep package versions consistent and compatible across platforms.
git clone https://github.com/sfu-cmpt340/2025_1_project_02.git
cd 2025_1_project_02
# Optional: Creating virtual environment
python3 -m venv venv
source venv/bin/activate
# Install Project Dependencies (Does not include dependencies and model needed for python notebook in ct_scan_prediction):
pip install -r requirements.txt
# Run the project (comment out function calls in driver.py to run the steps one at a time):
python driver.py
Libraries used in this project:
scikit-learn, matplotlib, numpy, pandas, seaborn
The study relied on datasets obtained from Kaggle, which may not represent global patient populations or diverse demographic groups, so the model may not generalize to broader populations.
Several features were dropped during preprocessing (e.g., Occupational Hazards, Anxiety, Peer Pressure) because they were deemed vague or redundant. This reduction in dimensionality may exclude relevant predictors and limit the model's comprehensiveness.
This section summarizes the key functions used throughout the project along with their responsibilities and output.
📍 data_analysis/visualize_feature_distribution.py
Description:
- Loads the original Kaggle datasets
- Standardizes and merges overlapping features
- Scales numeric features
- Generates histograms showing the distribution of each feature
Output:
- Saves individual histograms to distribution_of_original_dataset_features/
📍 data_manipulation/combine_datasets.py
Description:
- Merges patientdata1_kaggle.csv and patientdata2_kaggle.csv
- Cleans and standardizes column names
- Handles missing values with a custom ThresholdImputer (based on skewness)
- Normalizes all numeric values with MinMaxScaler
Output:
- Saves the cleaned and processed dataset as data_manipulation/optimized_lung_cancer_data.csv
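The imputation and scaling steps above could look roughly like the sketch below. The actual ThresholdImputer implementation is not shown in this README, so the rule used here (median for strongly skewed columns, mean otherwise), the threshold of 1.0, the function name impute_by_skewness, and the toy column names are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler


def impute_by_skewness(df: pd.DataFrame, skew_threshold: float = 1.0) -> pd.DataFrame:
    """Fill missing numeric values: median for skewed columns, mean otherwise.

    Hypothetical stand-in for the project's ThresholdImputer.
    """
    out = df.copy()
    for col in out.select_dtypes(include="number").columns:
        skew = out[col].skew()  # missing values are skipped by default
        if abs(skew) > skew_threshold:
            fill = out[col].median()  # robust to long tails
        else:
            fill = out[col].mean()
        out[col] = out[col].fillna(fill)
    return out


# Toy frame standing in for the merged Kaggle data (values are made up).
raw = pd.DataFrame({
    "age": [45.0, 52.0, np.nan, 61.0, 58.0],
    "smoking": [1.0, 1.0, 2.0, np.nan, 40.0],  # heavily right-skewed
})

imputed = impute_by_skewness(raw)

# Rescale every numeric column to the [0, 1] range.
scaled = pd.DataFrame(MinMaxScaler().fit_transform(imputed), columns=imputed.columns)
print(scaled)
```

Using the median for skewed columns keeps a few extreme values from dragging the fill value away from the bulk of the data; whether the project makes the same choice should be checked against combine_datasets.py.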
📍 data_analysis/visualize.py
Description:
- Loads the optimized dataset
- Splits data by lung cancer diagnosis (Yes/No)
- Creates side-by-side grouped histograms for all features
- Visualizes feature correlation with a heatmap
Output:
- Combined histogram: optimized_csv_plots/combined_histogram.png
- Correlation heatmap: optimized_csv_plots/correlation_heatmap.png
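As a rough illustration of the two plots described above, the sketch below draws grouped histograms per diagnosis and a correlation heatmap. It uses plain matplotlib rather than seaborn to stay dependency-light, and the column names (age, smoking, lung_cancer), the random data, and the output file names are assumptions, not the project's actual values.

```python
import matplotlib
matplotlib.use("Agg")  # render to files without needing a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Made-up stand-in for optimized_lung_cancer_data.csv.
df = pd.DataFrame({
    "age": rng.random(200),
    "smoking": rng.random(200),
    "lung_cancer": rng.integers(0, 2, 200),
})

features = [c for c in df.columns if c != "lung_cancer"]
positive = df[df["lung_cancer"] == 1]
negative = df[df["lung_cancer"] == 0]

# One subplot per feature, with the two diagnosis groups drawn side by side.
fig, axes = plt.subplots(1, len(features), figsize=(5 * len(features), 4), squeeze=False)
for ax, col in zip(axes[0], features):
    ax.hist([negative[col].to_numpy(), positive[col].to_numpy()], bins=10, label=["No", "Yes"])
    ax.set_title(col)
    ax.legend(title="Lung cancer")
fig.savefig("combined_histogram_sketch.png")
plt.close(fig)

# Pearson correlation between all columns, drawn as a colour-coded matrix.
corr = df.corr()
fig, ax = plt.subplots()
im = ax.imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(corr.columns)
fig.colorbar(im)
fig.savefig("correlation_heatmap_sketch.png")
plt.close(fig)
```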
📍 data_manipulation/training_model.py
Description:
- Loads the optimized dataset
- Constructs a pipeline for preprocessing and classification using Logistic Regression
- Trains the model and evaluates it using accuracy, confusion matrix, and classification report
- Performs 5-fold cross-validation
Output:
- Prints training/testing scores and model evaluation in terminal
- Returns trained model pipeline clf
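A minimal sketch of this kind of training flow with scikit-learn is shown below. It runs on synthetic data from make_classification instead of the optimized dataset, and the StandardScaler step, the 80/20 split, and the hyperparameters are assumptions; the project's actual preprocessing may differ.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the optimized dataset (features X, binary label y).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Preprocessing and classifier chained so both are fitted on training data only.
clf = Pipeline([
    ("scale", StandardScaler()),  # assumed preprocessing step
    ("model", LogisticRegression(max_iter=1000)),
])
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
print("Train accuracy:", clf.score(X_train, y_train))
print("Test accuracy:", test_accuracy)
print(cm)
print(classification_report(y_test, y_pred))

# 5-fold cross-validation refits the whole pipeline on each fold.
cv_scores = cross_val_score(clf, X, y, cv=5)
print("5-fold CV accuracy:", cv_scores.mean())
```

Wrapping the scaler and the classifier in one Pipeline means cross-validation never sees statistics computed from held-out folds.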
📍 data_manipulation/questionaire.py
Description:
- Launches a CLI-based questionnaire to collect user health information
- Normalizes and formats inputs for prediction
- Uses trained model (create_classifier()) to predict lung cancer risk
Output:
- Prints a human-readable diagnosis and probability estimate in terminal
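The normalize-then-predict step might be sketched as follows. The question set (age, smoking, coughing), the age bounds, the helper names, and the tiny training set are hypothetical and exist only so the example runs end to end; the real script uses the model returned by create_classifier().

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

AGE_MIN, AGE_MAX = 18, 90  # assumed bounds for min-max scaling


def yes_no_to_float(value) -> float:
    """Map a yes/no style answer to 1.0 or 0.0."""
    return 1.0 if str(value).strip().lower() in {"y", "yes"} else 0.0


def answers_to_features(answers: dict) -> np.ndarray:
    """Turn raw questionnaire answers into one normalized feature row."""
    age = (float(answers["age"]) - AGE_MIN) / (AGE_MAX - AGE_MIN)
    age = min(max(age, 0.0), 1.0)  # clamp out-of-range ages
    return np.array([[
        age,
        yes_no_to_float(answers["smoking"]),
        yes_no_to_float(answers["coughing"]),
    ]])


# Tiny made-up training set standing in for the trained pipeline.
X = np.array([[0.1, 0, 0], [0.2, 0, 1], [0.3, 0, 0], [0.6, 1, 1], [0.8, 1, 1], [0.9, 1, 0]])
y = np.array([0, 0, 0, 1, 1, 1])
model = LogisticRegression().fit(X, y)

row = answers_to_features({"age": "67", "smoking": "Yes", "coughing": "no"})
risk = model.predict_proba(row)[0, 1]  # probability of the positive class
label = "elevated" if risk >= 0.5 else "low"
print(f"Estimated lung cancer risk: {risk:.0%} ({label})")
```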
In addition to analyzing tabular patient data, this project includes an image-based approach to lung cancer detection using a Convolutional Neural Network (CNN).
The notebook classification_model.ipynb (in ct_scan_prediction/) walks through the full image classification pipeline using CT scan data.
- Load and preprocess the IQ-OTHNCCD lung cancer dataset
- Apply image augmentation (flipping, brightness/contrast, rotation)
- Build and train a CNN model using DenseNet121 via Keras and TensorFlow
- Evaluate the model on validation data using accuracy, precision, and recall
- Save the trained model to .h5 and store final accuracy in a log file
- Visualize predictions and training metrics directly in the notebook
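For orientation, a three-class classifier on top of DenseNet121 can be assembled in Keras roughly as below. This is a sketch under stated assumptions rather than the notebook's code: the 224x224 input size, the pooling and dense head, and the optimizer are guesses, augmentation is omitted, and weights=None is used so the example runs without downloading pretrained weights.

```python
import tensorflow as tf

NUM_CLASSES = 3  # benign, malignant, normal
IMG_SHAPE = (224, 224, 3)  # assumed input size

# weights=None keeps the sketch offline; weights="imagenet" would load
# pretrained weights for transfer learning instead.
base = tf.keras.applications.DenseNet121(
    include_top=False, weights=None, input_shape=IMG_SHAPE
)

# Small classification head on top of the DenseNet feature extractor.
inputs = tf.keras.Input(shape=IMG_SHAPE)
x = base(inputs)
x = tf.keras.layers.GlobalAveragePooling2D()(x)
outputs = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(x)
model = tf.keras.Model(inputs, outputs)

model.compile(
    optimizer="adam",
    loss="categorical_crossentropy",  # expects one-hot encoded labels
    metrics=["accuracy", tf.keras.metrics.Precision(), tf.keras.metrics.Recall()],
)
print(model.output_shape)
```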
Make sure the following libraries are installed:
tensorflow
keras
numpy
matplotlib
scikit-learn
Pillow
Augmentor
Also ensure:
- The IQ-OTHNCCD image dataset is placed inside datasets/image_dataset/
- You have mounted Google Drive if running in Colab
Ensure you're working in a virtual environment or a Conda environment, then install all dependencies and necessary libraries with:
pip install -r requirements.txt
If using Google Colab, mount Google Drive by running from google.colab import drive followed by drive.mount('/content/drive').
Then, place the IQ-OTHNCCD lung cancer dataset in your Drive under /MyDrive/image_dataset/The IQ-OTHNCCD lung cancer dataset/ with the following subfolders: Benign cases/, Malignant cases/, and Normal cases/. The directory layout should look like:
└──MyDrive/
└── image_dataset/
└── The IQ-OTHNCCD lung cancer dataset/
├── Benign cases
├── Malignant cases
└── Normal cases
If running the project outside of Colab, be sure to update all file paths in the scripts (e.g., paths starting with /content/ or /MyDrive/) to match your local environment.
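Before training, it can help to confirm that the dataset folder has the expected layout. The helper below is a hypothetical convenience check, not part of the project scripts, and the local dataset_root path is an assumption based on the repository tree.

```python
from pathlib import Path

EXPECTED_CLASSES = ("Benign cases", "Malignant cases", "Normal cases")


def missing_class_folders(dataset_root: Path) -> list[str]:
    """Return the expected class subfolders that are absent under dataset_root."""
    return [name for name in EXPECTED_CLASSES if not (dataset_root / name).is_dir()]


# Assumed local location; in Colab this would sit under the mounted Drive.
dataset_root = Path("datasets/image_dataset/The IQ-OTHNCCD lung cancer dataset")
missing = missing_class_folders(dataset_root)
if missing:
    print(f"Missing under {dataset_root}: {', '.join(missing)}")
else:
    print("Dataset layout looks complete.")
```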
Afterward, train the model by running:
python classification_model.py
This performs data augmentation, renames and balances the image datasets, splits the data into training and validation sets, and trains a CNN model based on DenseNet121. The trained model will be saved to /MyDrive/lung_cancer_classification_model.h5.
To classify a CT scan image with the trained model, run:
python run_lung_cancer_model.py
Make sure the model file exists at the specified path and that test_img_path in the script points to the CT scan image you want to classify.
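The final step of such a script, turning the model's output probabilities into a class label, might look like the sketch below. The class order in CLASS_NAMES and the function name decode_prediction are assumptions; the order must match whatever label encoding was used during training.

```python
import numpy as np

# Assumed class order; it must match the order used when the model was trained.
CLASS_NAMES = ("benign", "malignant", "normal")


def decode_prediction(probabilities) -> tuple[str, float]:
    """Map one softmax output vector to (class name, probability)."""
    probs = np.asarray(probabilities, dtype=float).ravel()
    if probs.shape[0] != len(CLASS_NAMES):
        raise ValueError(f"expected {len(CLASS_NAMES)} probabilities, got {probs.shape[0]}")
    idx = int(np.argmax(probs))
    return CLASS_NAMES[idx], float(probs[idx])


# Example with a made-up model output of shape (1, 3).
label, confidence = decode_prediction([[0.05, 0.85, 0.10]])
print(f"Predicted class: {label} ({confidence:.0%})")
```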
- Trained CNN model saved to: /MyDrive/lung_cancer_classification_model.h5
- Final accuracy log: ct_scan_prediction/final_trained_model_accuracy.txt
- Prediction tool: Use run_lung_cancer_model.py to test individual CT scan images with the trained model. It returns a class prediction of "benign", "normal", or "malignant".
python run_lung_cancer_model.py