Breast cancer is the most prevalent cancer among women globally, accounting for nearly 25% of all cancer cases in women. In 2015 alone, it affected over 2.1 million individuals. The disease originates when cells in the breast begin to grow uncontrollably, forming tumors that can be detected on an X-ray or felt as lumps.
The dataset used in this project is the Breast Cancer Dataset, sourced from Kaggle.
The dataset contains 569 rows and 32 columns, of which 30 are used as features to train the model. These features are statistical measurements of a breast tumor's shape, size, texture, and boundary characteristics, captured through imaging, and are used to classify the tumor as benign or malignant.
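As a rough sketch of how such data can be prepared (the file name `data.csv` and the `id`/`diagnosis` column names are assumptions based on the usual Kaggle export, not taken from the notebooks):

```python
import pandas as pd

# Hypothetical file name -- adjust to wherever the Kaggle CSV is saved.
df = pd.read_csv("data.csv")

# Some exports include a trailing, entirely empty column; drop any such columns.
df = df.dropna(axis=1, how="all")

# 'diagnosis' is M (malignant) or B (benign); map it to 1/0 for training.
y = (df["diagnosis"] == "M").astype(int).to_numpy()

# Drop the id and label columns, keeping the 30 numeric feature columns.
X = df.drop(columns=["id", "diagnosis"]).to_numpy()

print(X.shape, y.shape)  # expected: (569, 30) (569,)
```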
The model is trained using Logistic Regression and optimized using Gradient Descent. A cost function is computed to evaluate how well the model predicts the classification, and this value is minimized iteratively.
A plot of the cost function against the number of iterations shows convergence around 1000 iterations, indicating proper learning during training.
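The notebooks implement this themselves; the following is only a minimal, illustrative sketch of a vectorized logistic regression trained with gradient descent (the function names and hyperparameters here are arbitrary, not those used in the project). Plotting `costs` against the iteration index gives the kind of convergence curve described above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, learning_rate=0.01, iterations=1000):
    """Gradient descent on the logistic (cross-entropy) cost, fully vectorized."""
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    costs = []

    for _ in range(iterations):
        # Forward pass: predicted probability of the positive (malignant) class.
        p = sigmoid(X @ w + b)

        # Cross-entropy cost; the small epsilon guards against log(0).
        eps = 1e-15
        cost = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
        costs.append(cost)

        # Vectorized gradients of the cost with respect to w and b.
        dw = X.T @ (p - y) / m
        db = np.mean(p - y)

        # Gradient descent update.
        w -= learning_rate * dw
        b -= learning_rate * db

    return w, b, costs
```

In practice, the raw measurements vary over very different scales, so standardizing the features (zero mean, unit variance) before training helps gradient descent converge reliably.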
- Initial version: implemented without vectorization (see `breastcancerpredictionwithoutvectorisation.ipynb`)
  - Training time: 446 seconds
  - Model accuracy: 98.24%
  - Improvement suggestions: add cross-validation, more data, or quadratic features to improve accuracy.
- Optimized version: vectorized implementation (see `breastcancerprediction.ipynb`)
  - Training time: 2 seconds
  - Result: identical accuracy with drastically reduced computation time
  - This highlights the power of NumPy-based vectorized operations for scalable model training (see the sketch below).
Both notebooks are provided for educational comparison — demonstrating how vectorization can lead to massive improvements in performance without changing model logic.
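To illustrate why vectorization makes such a difference, here is a toy comparison (synthetic data, not the project's actual training loop) of the same gradient computed with explicit Python loops versus a single NumPy matrix product:

```python
import time
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(569, 30))
errors = rng.normal(size=569)  # stands in for (predictions - labels)

# Loop-based gradient: one Python-level multiply-add per sample and feature.
start = time.time()
dw_loop = np.zeros(30)
for i in range(X.shape[0]):
    for j in range(X.shape[1]):
        dw_loop[j] += X[i, j] * errors[i]
dw_loop /= X.shape[0]
loop_time = time.time() - start

# Vectorized gradient: the same computation as a single matrix product.
start = time.time()
dw_vec = X.T @ errors / X.shape[0]
vec_time = time.time() - start

print(np.allclose(dw_loop, dw_vec))              # True -- identical result
print(f"loop: {loop_time:.4f}s, vectorized: {vec_time:.6f}s")
```

The results are numerically identical; only the amount of Python-level interpreter overhead changes, which is exactly the gap between the 446-second and 2-second training times reported above.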
Made with ❤️ for machine learning and performance optimization.