Riciokzz/Spaceship-Titanic

SpaceShip Titanic

Introduction

The Spaceship Titanic was an interstellar passenger liner launched a month ago. With almost 13,000 passengers on board, the vessel set out on its maiden voyage transporting emigrants from our solar system to three newly habitable exoplanets orbiting nearby stars.

The aim of this work is to predict which passengers are transported to an alternate dimension. I will also try to identify the main factors that contribute to the risk of being transported while traveling on the Spaceship Titanic.

The project is divided into the following parts:

  • Introduction
  • Notebook Preparation
  • Data Cleaning
  • Feature Engineering
  • Exploratory Data Analysis
  • Missing Values - Feature Engineering
  • Modelling
  • Final Model
  • AutoML Model
  • Summary
  • Suggestion For Improvement

Project summary

  • PassengerId was split into GroupId and GroupSize.
  • The VIP column was removed because it carried no useful information.
  • Filled missing values in both the train and test datasets using various methods chosen from the EDA.
  • CryoSleep and SpentMoney show medium correlation with the target.
  • Numeric features show medium to low correlation.
  • The target classes are well balanced.
  • One Home Planet per Family.
  • Most cabins are on decks E, F, and G, while deck T is an outlier.
  • Distribution over both decks is similar for Home Planet and Destination.
  • All people in groups stay on one side of the ship.
  • About 91% of the cabins have only one destination planet.
  • 35.8% of people on the ship are in cryo sleep, 64.2% are awake.
  • Filled missing values of Cabin Number using a linear regression model.
  • Spending features contain high outliers, so their values were capped at the 0.95 quantile.
  • Passengers aged 12 and under spent no money at all.
  • Most spending outliers belong to older passengers, who also spend the most.
  • The age distribution is skewed to the right, indicating a longer tail on the right side.
  • Based on the overlapping distributions, younger passengers appear to be transported more often than older ones.
  • Created additional bins for age groups.
  • All groups have only one unique Home planet.
  • The null hypothesis was rejected: there is a significant association between ship side and transported status.
  • Created a dummy classifier as a baseline model.
  • Used Boruta to check feature importance.
  • Trained multiple ML models and compared them by F1 score and accuracy.
  • Tuned the best model further, evaluated it with the ROC curve, and used it to generate predictions.
  • Created an AutoML model and ran it for 10 minutes to check its performance.
  • Submitted the best predictions to the Kaggle competition, reaching a best accuracy of 0.80313.
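
The PassengerId split from the summary can be sketched in a few lines. This is a minimal illustration, not the notebook's actual code: it assumes ids follow the Kaggle gggg_pp format (group number, then the passenger's number within the group), and the function name add_group_features and the sample ids are made up for the example. It uses plain Python so it runs without extra dependencies.

```python
from collections import Counter


def add_group_features(passenger_ids):
    """Return one dict per passenger with GroupId and GroupSize.

    Assumes every id has the form 'gggg_pp' (hypothetical helper,
    not taken from the project code).
    """
    # The part before the underscore identifies the travel group
    group_ids = [pid.split("_")[0] for pid in passenger_ids]
    # Number of passengers sharing each group id
    group_sizes = Counter(group_ids)
    return [
        {"PassengerId": pid, "GroupId": gid, "GroupSize": group_sizes[gid]}
        for pid, gid in zip(passenger_ids, group_ids)
    ]


# Hypothetical ids, made up for illustration
rows = add_group_features(["0001_01", "0002_01", "0003_01", "0003_02"])
for row in rows:
    print(row)
```

GroupSize here counts only the ids passed in, so the result depends on which rows are included when the function is called.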
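
Capping the spending features at the 0.95 quantile can be illustrated as follows. The sketch is an assumption about how such a cap might be applied, not the project's code: the helper cap_at_percentile and the spending values are hypothetical, and the quantile is computed with linear interpolation between data points.

```python
import statistics


def cap_at_percentile(values, percentile=95):
    """Clip values above the given percentile; return (capped, cap).

    Hypothetical helper for illustration only.
    """
    # quantiles(n=100) returns the 99 cut points for percentiles 1..99;
    # method="inclusive" interpolates linearly between data points
    cut_points = statistics.quantiles(values, n=100, method="inclusive")
    cap = cut_points[percentile - 1]
    return [min(v, cap) for v in values], cap


# Made-up spending amounts: twenty ordinary values and one extreme outlier
spending = list(range(0, 200, 10)) + [9000]
capped, cap = cap_at_percentile(spending)
print("cap:", cap)
print("max before:", max(spending), "max after:", max(capped))
```

Only the values above the cap change; everything at or below it is left as it was. With pandas, the same idea is usually written with Series.quantile and Series.clip.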
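
The summary reports rejecting the null hypothesis of no association between ship side and transported status, but it does not name the test. A chi-square test of independence is one common choice for two categorical variables, so the sketch below assumes that test. The counts are invented for illustration and are not the project's data, and chi_square_2x2 is a hypothetical helper.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table.

    Returns (statistic, p_value). No continuity correction is applied.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if the two variables were independent
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected

    # With 1 degree of freedom the chi-square tail probability
    # reduces to erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Made-up counts: rows = side P / side S,
# columns = not transported / transported
counts = [[60, 40], [40, 60]]
statistic, p_value = chi_square_2x2(counts)
print(f"chi2 = {statistic:.2f}, p = {p_value:.4f}")  # chi2 = 8.00, p = 0.0047
```

A p-value below the chosen significance level (0.05 is a common threshold) leads to rejecting the null hypothesis of independence.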

Requirements for the project

To install all necessary libraries, run: pip install -r requirements.txt

Launch ML model locally

Install Docker and Java.

Build the Docker image: docker build -t spaceship-titanic-app .

Run the container: docker run -p 5000:5000 spaceship-titanic-app

Main page: http://localhost:5000

For predictions, run: python predict.py

Predictions are saved in the same folder as test.csv: data/api_predictions.csv

To check running containers: docker ps

To stop the Docker container: docker stop 'container_id'

SpaceShip Titanic Dataset

The dataset can be downloaded from Kaggle.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contact Information

Email LinkedIn GitHub