A complete data cleaning and preprocessing project using the Titanic dataset from Kaggle.
This project focuses on preparing the Titanic dataset for machine learning.
We aim to clean, transform, and engineer features using best practices.
- Missing value handling (`Age`, `Cabin`, `Embarked`)
- Outlier detection and treatment
- Feature extraction (e.g. extracting titles from names)
- Encoding categorical variables (`Sex`, `Embarked`)
- Feature scaling (Standardization and Normalization)
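
The techniques listed above can be sketched in a few lines of pandas and scikit-learn. This is only an illustration, not the notebook's actual code: the function name `clean_titanic`, the `HasCabin` flag, the median/mode imputation, the 1.5 × IQR clipping of `Fare`, and the choice of which column gets which scaler are all assumptions made for the example. The tiny inline frame stands in for `data/train.csv` and uses the standard Kaggle column names.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler


def clean_titanic(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative cleaning steps; every concrete choice here is an assumption."""
    out = df.copy()

    # Missing values: median for Age, most frequent port for Embarked,
    # and a binary flag in place of the sparse Cabin column.
    out["Age"] = out["Age"].fillna(out["Age"].median())
    out["Embarked"] = out["Embarked"].fillna(out["Embarked"].mode()[0])
    out["HasCabin"] = out["Cabin"].notna().astype(int)
    out = out.drop(columns=["Cabin"])

    # Outliers: clip Fare to the 1.5 * IQR fences.
    q1, q3 = out["Fare"].quantile([0.25, 0.75])
    iqr = q3 - q1
    out["Fare"] = out["Fare"].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

    # Feature extraction: the title sits between the comma and the first period.
    out["Title"] = out["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

    # Encoding: binary map for Sex, one-hot columns for Embarked.
    out["Sex"] = out["Sex"].map({"male": 0, "female": 1})
    out = pd.get_dummies(out, columns=["Embarked"], prefix="Embarked", dtype=int)

    # Scaling: standardize Age (zero mean, unit variance),
    # normalize Fare to the [0, 1] range.
    out["Age"] = StandardScaler().fit_transform(out[["Age"]]).ravel()
    out["Fare"] = MinMaxScaler().fit_transform(out[["Fare"]]).ravel()
    return out


# Small stand-in for data/train.csv with the usual Kaggle column names.
sample = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley",
        "Heikkinen, Miss. Laina",
        "Allen, Mr. William Henry",
    ],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Fare": [7.25, 71.2833, 7.925, 8.05],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", np.nan, "S"],
})

cleaned = clean_titanic(sample)
print(cleaned.head())
```

When a held-out test set is involved, it is usually safer to compute the imputation values and fit the scalers on the training data only, then reuse them on `test.csv`, so that no information leaks from the test set.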
Libraries used:
- pandas
- numpy
- matplotlib
- seaborn
- sklearn
titanic-data-cleaning/
├── data/
│ ├── train.csv
│ └── test.csv
├── notebooks/
│ └── 01_data_cleaning.ipynb
├── README.md
└── requirements.txt
You can follow the full data cleaning process in the notebook:
01_data_cleaning.ipynb
- Exploratory Data Analysis (EDA)
- Handling null and duplicate values
- Visualizing distributions and outliers
- Preparing a dataset for machine learning
- Classification models on cleaned data
- Model evaluation and tuning
- Submission to Kaggle!
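
As a small illustration of the null and duplicate handling mentioned in the list above, the following sketch counts missing values per column and drops fully duplicated rows. The helper name `summarize_quality` and the toy frame are hypothetical and exist only for this example; the real Titanic data would be loaded from `data/train.csv` instead.

```python
import numpy as np
import pandas as pd


def summarize_quality(df: pd.DataFrame):
    """Return per-column null counts and the number of fully duplicated rows."""
    return df.isna().sum(), int(df.duplicated().sum())


# Toy frame (hypothetical values): rows 1 and 2 are exact duplicates.
toy = pd.DataFrame({
    "PassengerId": [1, 2, 2, 3],
    "Age": [22.0, 38.0, 38.0, np.nan],
    "Embarked": ["S", "C", "C", None],
})

null_counts, n_duplicates = summarize_quality(toy)
print(null_counts)
print("duplicated rows:", n_duplicates)

# Keep the first occurrence of each duplicated row.
deduplicated = toy.drop_duplicates().reset_index(drop=True)
```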
This project is licensed under the MIT License.