This project leverages machine learning algorithms to predict whether a breast tumor is benign or malignant using the Wisconsin Diagnostic Breast Cancer (WDBC) dataset. The goal is to compare various models and identify the most accurate and reliable one for diagnosis.
Breast cancer is a major health challenge, and early detection is crucial for improving patient outcomes. In this project, we implement and evaluate four machine learning algorithms:
- Feedforward Neural Network (FNN)
- Support Vector Machine (SVM)
- Extreme Gradient Boosting (XGBoost)
- Logistic Regression (LR)
Each model's performance is evaluated using key metrics: accuracy, precision, recall (sensitivity), specificity, F1-score, and AUC (Area Under the Curve).
Name: Wisconsin Diagnostic Breast Cancer (WDBC)
Source: UCI Machine Learning Repository
- Instances: 569 samples
- Features: 30 numeric tumor characteristics (e.g., radius, perimeter, concavity)
- Target Variable:
  - Benign (B) → 0
  - Malignant (M) → 1
The dataset is split into 80% training and 20% testing for model evaluation.
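For illustration, a minimal sketch of this split using scikit-learn's built-in copy of the WDBC data (the notebook may instead load the CSV from the UCI repository; the stratified split and `random_state=42` are assumptions):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# WDBC: 569 samples, 30 numeric features.
# scikit-learn encodes malignant = 0 and benign = 1, so flip the labels
# to match the mapping used in this project (M -> 1, B -> 0).
data = load_breast_cancer()
X, y = data.data, 1 - data.target

# 80% train / 20% test; stratify to preserve the benign/malignant ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)  # (455, 30) (114, 30)
```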
- Feedforward Neural Network (FNN): Captures non-linear patterns through hidden layers and backpropagation.
- Support Vector Machine (SVM): Trains on high-dimensional data with various kernels (linear, RBF).
- XGBoost: A fast, efficient gradient boosting algorithm ideal for structured data.
- Logistic Regression (LR): Serves as a baseline model with straightforward interpretability.
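For reference, here is a minimal sketch of how these four models might be defined; the architectures, kernels, and hyperparameters below are illustrative assumptions, and the notebook's tuned models (e.g., `best_svm_model`, `best_xgb_model`) may differ:

```python
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from tensorflow import keras

# Feedforward Neural Network: two hidden ReLU layers, sigmoid output
fnn_model = keras.Sequential([
    keras.layers.Input(shape=(30,)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
fnn_model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# SVM with an RBF kernel; probability=True is needed to compute AUC
svm_model = SVC(kernel="rbf", C=1.0, probability=True)

# XGBoost: gradient-boosted trees for structured/tabular data
xgb_model = XGBClassifier(n_estimators=200, learning_rate=0.1, eval_metric="logloss")

# Logistic Regression: simple, interpretable linear baseline
logistic_model = LogisticRegression(max_iter=1000)
```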
- Install Required Libraries:
  `pip install numpy pandas scikit-learn matplotlib seaborn tensorflow keras xgboost`
- Clone or Download the Repository:
  `git clone <your-repository-url>`
  `cd <your-repository-folder>`
- Launch Jupyter Notebook:
  `jupyter notebook breast_cancer_final.ipynb`
- Preprocess the Data (a sketch of these steps follows this list):
  - Missing values are handled.
  - The target labels are mapped: 'M' → 1, 'B' → 0.
  - Features are scaled using StandardScaler.
- Train and Evaluate the Models: Execute the cells to train each model:
  `evaluate_model(fnn_model, "Feedforward Neural Network", X_train, y_train, X_test, y_test)`
  `evaluate_model(best_svm_model, "SVM", X_train, y_train, X_test, y_test)`
  `evaluate_model(best_xgb_model, "XGBoost", X_train, y_train, X_test, y_test)`
  `evaluate_model(logistic_model, "Logistic Regression", X_train, y_train, X_test, y_test)`
- Performance Metrics: Each model's accuracy, precision, recall (sensitivity), specificity, F1-score, and AUC will be printed for both the training and testing datasets (a sketch of such a metrics helper follows this list).
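Continuing from the sketches above (and reusing the `X_train`/`X_test` names), here is a minimal, assumed version of the preprocessing and of what an `evaluate_model`-style helper might compute with scikit-learn metrics. The helper name and call signature match the calls shown above, but its internals here are illustrative, and the models are assumed to be already fitted:

```python
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)

# Label mapping, if working from the raw CSV (hypothetical dataframe `df`):
# df["diagnosis"] = df["diagnosis"].map({"M": 1, "B": 0})

# Feature scaling: fit on the training split only, reuse for the test split
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

def evaluate_model(model, name, X_train, y_train, X_test, y_test):
    """Print the metrics listed above for both the training and test splits."""
    for split, X, y in [("train", X_train, y_train), ("test", X_test, y_test)]:
        # sklearn/XGBoost expose predict_proba; a Keras model returns probabilities directly
        if hasattr(model, "predict_proba"):
            proba = model.predict_proba(X)[:, 1]
        else:
            proba = model.predict(X).ravel()
        pred = (proba >= 0.5).astype(int)

        tn, fp, fn, tp = confusion_matrix(y, pred).ravel()
        print(f"{name} ({split}): "
              f"acc={accuracy_score(y, pred):.4f}  "
              f"prec={precision_score(y, pred):.4f}  "
              f"recall/sens={recall_score(y, pred):.4f}  "
              f"f1={f1_score(y, pred):.4f}  "
              f"spec={tn / (tn + fp):.4f}  "
              f"auc={roc_auc_score(y, proba):.4f}")
```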
| Model | Accuracy (Test) | Precision | Recall | AUC |
|---|---|---|---|---|
| Feedforward Neural Network | 98.25% | 100.00% | 95.34% | 99.14% |
| Support Vector Machine (SVM) | 98.25% | 100.00% | 95.34% | 97.67% |
| XGBoost | 97.36% | 97.61% | 95.34% | 96.97% |
| Logistic Regression | 97.36% | 97.61% | 95.34% | 96.97% |
- FNN and SVM achieved the highest test accuracy (98.25%), making them excellent candidates for breast cancer prediction.
- XGBoost and Logistic Regression also performed strongly, showing that both ensemble methods and simple linear classifiers are competitive on this task.
- Hyperparameter tuning: Further refinement to optimize model performance.
- Advanced Ensemble Models: Explore stacking multiple algorithms.
- Model Explainability: Integrate SHAP or LIME for better interpretability (see the sketch after this list).
- Deployment: Deploy the models in a web-based interface for real-time diagnosis.
- Scalability: Apply the models to larger, more complex datasets to test generalizability.
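As an illustration of the explainability direction only (not part of the current notebook), a minimal SHAP sketch for the trained XGBoost model, reusing names from the sketches above and assuming the `shap` package is installed:

```python
import shap

# TreeExplainer works directly with gradient-boosted tree models such as XGBoost
explainer = shap.TreeExplainer(xgb_model)
shap_values = explainer.shap_values(X_test)

# Global view of which tumor characteristics drive malignancy predictions
shap.summary_plot(shap_values, X_test, feature_names=data.feature_names)
```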
- Sanika Mulik
- Nikita Kedari
- Juee Shinde
This project was developed for the TY BTech CSE course Artificial Intelligence and Expert Systems, Semester V (July-December 2024).
Supervisor: Prof. Pramod Mali
Institution: MIT World Peace University, Pune