This notebook is a solution for the "Spaceship Titanic" machine learning competition on Kaggle. The task is to predict whether a passenger was transported to an alternate dimension based on various characteristics.
The dataset is structured like a typical tabular classification problem:
- `train.csv`: includes ~8,700 passengers with the labeled target (`Transported`)
- `test.csv`: ~4,300 passengers without labels
- Categorical: `HomePlanet`, `CryoSleep`, `Destination`, `VIP`, `Cabin`
- Numerical: `Age`, `RoomService`, `FoodCourt`, `ShoppingMall`, `Spa`, `VRDeck`
- Target: `Transported` (Boolean: True/False)
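A minimal loading-and-inspection sketch, assuming the two competition CSVs sit in the working directory (file names follow the standard Kaggle layout):

```python
import pandas as pd

# Load the competition files (standard Kaggle "Spaceship Titanic" layout)
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

print(train.shape, test.shape)      # roughly (8700, 14) and (4300, 13)
print(train.dtypes)                 # inspect categorical vs. numerical columns
print(train["Transported"].head())  # Boolean target: True/False
```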
- Checked class balance: the dataset is roughly balanced between transported and non-transported passengers.
- Visualized distributions of numerical features such as `Age` and `Spa` spending.
- Plotted categorical features using `countplot()` and pie charts to spot trends.
- Observed that features like `CryoSleep`, `Destination`, and `VIP` showed visible correlations with the target.
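A sketch of the kinds of plots described above, using `seaborn` and continuing from the loading sketch (the exact figures and styling in the notebook may differ):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Class balance: counts of True vs. False in the target
sns.countplot(x="Transported", data=train)
plt.title("Class balance")
plt.show()

# Distribution of a numerical feature
sns.histplot(train["Age"].dropna(), bins=40)
plt.title("Age distribution")
plt.show()

# A categorical feature against the target
sns.countplot(x="CryoSleep", hue="Transported", data=train)
plt.title("CryoSleep vs. Transported")
plt.show()
```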
- Dropped:
  - `Name`: mostly irrelevant for prediction.
  - `Cabin`: many missing values, complex to parse initially.
- Handled missing values:
  - For categoricals like `HomePlanet`, filled with `"unknown"`.
  - For numerical spending columns (`RoomService`, `FoodCourt`, etc.), filled with `0`.
  - For `Age`, used median imputation.
- Converted categorical features to numerical codes using `LabelEncoder`.
- Simplified data types (e.g., from `float64` to `float32`).
- Created no new features, keeping the baseline model clean and interpretable. A condensed sketch of these preprocessing steps follows this list.
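The sketch below mirrors the cleaning steps above, continuing from the loading sketch; the column lists come from the dataset description, and the exact order in the notebook may differ:

```python
from sklearn.preprocessing import LabelEncoder

CATEGORICALS = ["HomePlanet", "CryoSleep", "Destination", "VIP"]
SPEND_COLS = ["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]

def preprocess(df):
    df = df.copy()
    # Drop columns the baseline ignores
    df = df.drop(columns=["Name", "Cabin"])

    # Categorical gaps -> explicit "unknown" category
    for col in CATEGORICALS:
        df[col] = df[col].fillna("unknown")

    # Missing spending -> assume zero spending
    df[SPEND_COLS] = df[SPEND_COLS].fillna(0)

    # Age -> median imputation
    df["Age"] = df["Age"].fillna(df["Age"].median())

    # Encode categoricals as integer codes
    # (caveat: fitting separately on train and test can yield inconsistent
    # codes if the value sets differ; fitting one encoder on the combined
    # column avoids this)
    for col in CATEGORICALS:
        df[col] = LabelEncoder().fit_transform(df[col].astype(str))

    # Downcast floats, e.g. float64 -> float32
    float_cols = df.select_dtypes(include="float64").columns
    df[float_cols] = df[float_cols].astype("float32")
    return df

train_clean = preprocess(train)
test_clean = preprocess(test)
```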
- Trained and evaluated the following classifiers:
- Logistic Regression
- Decision Tree
- Random Forest
- Extra Trees Classifier
- LightGBM (LGBMClassifier)
- XGBoost (XGBClassifier)
- Used `train_test_split` for validation.
- Evaluated using:
  - `accuracy_score`
  - `classification_report`
  - `confusion_matrix`
- Compared each model’s performance; tree-based models (especially XGBoost and LightGBM) outperformed the others in both accuracy and generalization (sketched below).
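A sketch of the comparison loop, continuing from the preprocessing sketch; `test_size`, `random_state`, and the model settings are illustrative choices, not taken from the notebook:

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier

# Features: everything except the ID and the target
X = train_clean.drop(columns=["PassengerId", "Transported"])
y = train_clean["Transported"].astype(int)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Extra Trees": ExtraTreesClassifier(),
    "LightGBM": LGBMClassifier(),
    "XGBoost": XGBClassifier(),
}

for name, model in models.items():
    model.fit(X_tr, y_tr)
    preds = model.predict(X_val)
    print(name, accuracy_score(y_val, preds))
    print(classification_report(y_val, preds))
    print(confusion_matrix(y_val, preds))
```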
- Applied `GridSearchCV` (a sketch follows this list) to:
  - `LGBMClassifier`
  - `RandomForestClassifier`
- Tuned parameters such as:
  - `n_estimators`
  - `max_depth`
  - `learning_rate` (for LGBM)
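A sketch of the search on `LGBMClassifier`, reusing the training split from the comparison sketch; the grid values are illustrative, not the notebook's actual grid:

```python
from sklearn.model_selection import GridSearchCV
from lightgbm import LGBMClassifier

param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [-1, 5, 10],           # -1 means "no limit" in LightGBM
    "learning_rate": [0.01, 0.05, 0.1],
}

search = GridSearchCV(
    LGBMClassifier(),
    param_grid,
    scoring="accuracy",
    cv=5,
    n_jobs=-1,
)
search.fit(X_tr, y_tr)

print(search.best_params_, search.best_score_)
best_model = search.best_estimator_
```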
- Best model: LightGBM with tuned hyperparameters.
- Achieved over 0.80 accuracy on the validation set.
- Saved predictions to a CSV file for Kaggle submission (sketched below).
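The submission step, sketched; the competition expects a `PassengerId`/`Transported` CSV, and the output file name here is arbitrary:

```python
# Predict on the cleaned test set and write the Kaggle submission file
test_preds = best_model.predict(test_clean.drop(columns=["PassengerId"]))

submission = pd.DataFrame({
    "PassengerId": test_clean["PassengerId"],
    "Transported": test_preds.astype(bool),  # competition expects True/False
})
submission.to_csv("submission.csv", index=False)
```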
- `pandas`
- `numpy`
- `matplotlib`
- `seaborn`
- `scikit-learn`
- `xgboost`
- `lightgbm`
To build a solid baseline ML pipeline and test the performance of various models on a real-world-style dataset. This project is useful for practicing:
- End-to-end ML workflows
- Data cleaning strategies
- Comparison of classic and modern classifiers
- Working with structured data from competitions