This repository contains a comprehensive solution for the Titanic - Machine Learning from Disaster Kaggle competition. The goal is to predict which passengers survived the Titanic shipwreck.
The solution includes:
- Data Exploration - Analyzing the training data to understand patterns
- Feature Engineering - Creating new features to improve model performance
- Model Training - Training multiple ML models and creating an ensemble
- Prediction - Generating predictions for submission
- Comprehensive data preprocessing
- Feature extraction from passenger names (titles)
- Family size and deck information extraction
- Missing value imputation based on passenger characteristics
- Model ensemble combining Random Forest, Gradient Boosting, and SVM
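The feature-engineering steps above can be sketched as follows. This is a minimal illustration, not the exact code in `titanic_solution.py`; column names follow the Kaggle Titanic schema (`Name`, `SibSp`, `Parch`, `Cabin`, `Age`), and the `add_features` helper is hypothetical:

```python
import pandas as pd

def add_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Title: the honorific between the comma and the period in "Last, Title. First"
    out["Title"] = out["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
    # Family size: siblings/spouses + parents/children + the passenger themself
    out["FamilySize"] = out["SibSp"] + out["Parch"] + 1
    # Deck: first letter of the cabin number, "U" (unknown) when missing
    out["Deck"] = out["Cabin"].str[0].fillna("U")
    # Age imputation based on passenger characteristics: median age per Title,
    # falling back to the global median when a whole group is missing
    out["Age"] = out.groupby("Title")["Age"].transform(lambda s: s.fillna(s.median()))
    out["Age"] = out["Age"].fillna(out["Age"].median())
    return out

demo = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley"],
    "SibSp": [1, 1], "Parch": [0, 0],
    "Cabin": [None, "C85"], "Age": [22.0, None],
})
print(add_features(demo)[["Title", "FamilySize", "Deck", "Age"]])
```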
- `titanic_solution.py` - Complete Python script solution
- `titanic_solution.ipynb` - Jupyter notebook version with visualizations
- `requirements.txt` - Required Python packages
- `submission.csv` - Generated predictions for competition submission
The dataset should be placed in the `data/` directory with the following files:
- `train.csv` - Training data
- `test.csv` - Test data for predictions
- `gender_submission.csv` - Example submission file
1. Clone this repository
2. Install required packages:
   ```bash
   pip install -r requirements.txt
   ```
3. Run the solution.
   For the Python script:
   ```bash
   python titanic_solution.py
   ```
   For the Jupyter notebook:
   ```bash
   jupyter notebook titanic_solution.ipynb
   ```
4. Submit predictions.
   The generated `submission.csv` file can be submitted to Kaggle.
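For reference, a Kaggle Titanic submission file has exactly two columns, `PassengerId` and `Survived` (0 or 1). The sketch below shows the expected shape; `test_ids` and `preds` are placeholders for the real test-set IDs and model predictions:

```python
import pandas as pd

# Placeholder values standing in for real test IDs and model predictions
test_ids = [892, 893, 894]
preds = [0, 1, 0]

# Kaggle expects exactly these two columns, without a row index
submission = pd.DataFrame({"PassengerId": test_ids, "Survived": preds})
submission.to_csv("submission.csv", index=False)
```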
The solution achieves approximately 80-82% accuracy on cross-validation.
Kaggle Competition Score: 0.77990
This score places the solution in a competitive position on the Kaggle leaderboard. It was achieved with the ensemble combining Random Forest, Gradient Boosting, and SVM classifiers.
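A sketch of such a three-model ensemble, assuming scikit-learn's `VotingClassifier` with soft voting; the hyperparameters and synthetic data below are illustrative, not those actually used:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the preprocessed Titanic features
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
        ("gb", GradientBoostingClassifier(random_state=42)),
        # probability=True is required by SVC for soft voting
        ("svm", SVC(probability=True, random_state=42)),
    ],
    voting="soft",  # average predicted probabilities across the three models
)
scores = cross_val_score(ensemble, X, y, cv=5)
print(f"CV accuracy: {scores.mean():.3f}")
```

Soft voting averages class probabilities, which usually smooths out the individual models' errors better than hard majority voting.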
- Gender was a crucial factor in survival (females had much higher survival rates)
- Passenger class strongly correlated with survival (1st class passengers had better chances)
- Age played an important role (children were prioritized)
- Family size affected survival chances
The top features that contributed most to prediction accuracy were:
- Sex (gender)
- Title extracted from name
- Fare
- Age
- Passenger class
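Feature importances like these can be read off a fitted tree ensemble via `feature_importances_`. A minimal sketch on synthetic data, with the feature names above used as illustrative labels:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Illustrative feature names matching the list above
features = ["Sex", "Title", "Fare", "Age", "Pclass"]
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X, y)
# Importances sum to 1.0; higher means the feature split more impurity away
importances = pd.Series(model.feature_importances_, index=features)
print(importances.sort_values(ascending=False))
```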
The solution followed a systematic approach:
- Initial data cleaning and exploration
- Feature engineering to create new predictive variables
- Testing multiple models independently
- Hyperparameter tuning for best performing models
- Creating an ensemble of the top models
- Final prediction on the test dataset
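The hyperparameter-tuning step can be sketched with scikit-learn's `GridSearchCV`; the grid below is a small illustrative example, not the one actually searched:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data for demonstration
X, y = make_classification(n_samples=200, n_features=6, random_state=1)

# Exhaustively evaluates every parameter combination with 3-fold CV
grid = GridSearchCV(
    GradientBoostingClassifier(random_state=1),
    param_grid={"n_estimators": [50, 100], "max_depth": [2, 3]},
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_)
```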
Potential ways to improve the model:
- More advanced feature engineering
- Additional models in the ensemble
- More extensive hyperparameter tuning
- Neural network implementation
- Additional external data sources
- Advanced imputation techniques for missing values
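As one example of the "advanced imputation" idea, model-based imputation regresses each feature on the others instead of filling a group median. A minimal sketch using scikit-learn's experimental `IterativeImputer`:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy data: the second column is roughly twice the first, with one gap
X = np.array([[1.0, 2.0], [2.0, np.nan], [3.0, 6.0], [4.0, 8.0]])

# Iteratively models each feature as a function of the others
X_filled = IterativeImputer(random_state=0).fit_transform(X)
print(X_filled)
```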