A machine learning project predicting lung cancer severity based on patient lifestyle and health factors. The project includes data cleaning, visualization, and classification using Decision Tree, Random Forest, and Logistic Regression.
- `lung_cancer_dataset.csv`: Source dataset from Kaggle
- `lung_cancer_prediction.ipynb`: Main Jupyter notebook with full analysis
Kaggle: Cancer Patients and Air Pollution
- Data cleaning and preprocessing with `pandas`
- Data visualization using `matplotlib` and `seaborn`
- Feature engineering and correlation analysis
- Classification models:
  - Decision Tree
  - Random Forest
  - Logistic Regression
- Explores the relationship between air pollution, smoking, and lung disease severity
- Visualizes gender distribution, disease severity levels, and risk factors
- Compares model performance through accuracy, confusion matrices, and classification reports
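
The model comparison described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: it uses synthetic data from `make_classification` as a stand-in for the Kaggle dataset, and the three classes, the 80/20 split, and all hyperparameters are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real dataset (assumption): three classes
# loosely mimic low / medium / high severity levels.
X, y = make_classification(
    n_samples=600,
    n_features=10,
    n_informative=6,
    n_redundant=0,
    n_classes=3,
    random_state=42,
)

# Hold out 20% of the rows for evaluation, keeping class proportions equal.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# The three model families named in this README, with assumed settings.
models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "confusion_matrix": confusion_matrix(y_test, y_pred),
        "report": classification_report(y_test, y_pred, zero_division=0),
    }
    print(f"{name}: accuracy = {results[name]['accuracy']:.3f}")
    print(results[name]["confusion_matrix"])
    print(results[name]["report"])
```

Scores printed by this sketch come from synthetic data and say nothing about the project's reported results. To run it on the real data, replace `X` and `y` with the feature columns and severity label loaded from the CSV.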
- The Decision Tree model reached an accuracy of up to 87%
- Decision Tree outperformed Random Forest and Logistic Regression overall
- Identified strong correlations between chronic lung disease and factors like smoking, air pollution, and occupational hazards
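
A correlation check of this kind might look like the sketch below. The column names (`smoking`, `air_pollution`, `occupational_hazards`, `chronic_lung_disease`) and the generated values are hypothetical placeholders, not the dataset's real schema or findings; only the `pandas` calls carry over.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Hypothetical columns with synthetic values (assumption): the real
# dataset's column names and scales may differ.
smoking = rng.integers(1, 9, size=n)
air_pollution = rng.integers(1, 9, size=n)
occupational_hazards = rng.integers(1, 9, size=n)

# Synthetic target built from the factors plus noise, purely so the
# example has some structure to display.
chronic_lung_disease = (
    0.5 * smoking
    + 0.3 * air_pollution
    + 0.2 * occupational_hazards
    + rng.normal(0, 1, size=n)
)

df = pd.DataFrame(
    {
        "smoking": smoking,
        "air_pollution": air_pollution,
        "occupational_hazards": occupational_hazards,
        "chronic_lung_disease": chronic_lung_disease,
    }
)

# Pairwise Pearson correlation between all numeric columns.
corr = df.corr()

# Rank the risk factors by their correlation with the target column.
ranked = (
    corr["chronic_lung_disease"]
    .drop("chronic_lung_disease")
    .sort_values(ascending=False)
)
print(ranked)
```

The full `corr` matrix can also be passed to `seaborn.heatmap` to view all pairwise relationships at once.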
- Python
- Pandas, NumPy
- Matplotlib, Seaborn
- Scikit-learn
- Jupyter Notebook