This project was developed for participation in the Kaggle Playground Series - Season 5, Episode 7 machine learning competition.
The main objective is to predict whether an individual is an Introvert or Extrovert based on their behavioral, social, and activity-related traits.
The dataset contains survey-based features reflecting user behavior, including:
- `Time_spent_Alone`
- `Stage_fear`
- `Social_event_attendance`
- `Going_outside`
- `Drained_after_socializing`
- `Friends_circle_size`
- `Post_frequency`
- `Personality` (target: Extrovert or Introvert)
The project mainly consists of:
- `main.py`: A Python file containing all stages from data loading to submission, including EDA, feature engineering, model selection, and prediction.
- `preprocessing.py`: A reusable module that includes the data transformation pipeline for both train and test datasets. It ensures consistency between model training and final predictions.
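One common way for a module like `preprocessing.py` to keep train and test transformations consistent is to fit every transformer on the training data only and reuse the fitted object on the test data. The sketch below illustrates that pattern with scikit-learn; it is not the project's actual code. The function name `build_preprocessor`, the column split, `n_neighbors=5`, and the toy rows are assumptions, and for brevity every categorical column is one-hot encoded.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Assumed column split, for illustration only.
NUMERIC_COLS = ["Time_spent_Alone", "Social_event_attendance"]
CATEGORICAL_COLS = ["Stage_fear", "Drained_after_socializing"]


def build_preprocessor() -> ColumnTransformer:
    """Hypothetical helper: impute and encode in one fit/transform object."""
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),  # mode imputation
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ])
    return ColumnTransformer(
        [
            ("num", KNNImputer(n_neighbors=5), NUMERIC_COLS),  # n_neighbors is an assumption
            ("cat", categorical, CATEGORICAL_COLS),
        ],
        sparse_threshold=0,  # always return a dense array
    )


# Toy rows standing in for train.csv and test.csv.
train = pd.DataFrame({
    "Time_spent_Alone": [4.0, 9.0, np.nan, 1.0, 7.0, 2.0],
    "Social_event_attendance": [6.0, 1.0, 2.0, np.nan, 0.0, 8.0],
    "Stage_fear": ["No", "Yes", "Yes", np.nan, "Yes", "No"],
    "Drained_after_socializing": ["No", "Yes", np.nan, "No", "Yes", "No"],
})
test = pd.DataFrame({
    "Time_spent_Alone": [3.0, np.nan],
    "Social_event_attendance": [np.nan, 5.0],
    "Stage_fear": ["Yes", "Maybe"],  # "Maybe" never appears in train
    "Drained_after_socializing": ["No", "Yes"],
})

preprocessor = build_preprocessor()
X_train = preprocessor.fit_transform(train)  # statistics learned from train only
X_test = preprocessor.transform(test)        # same fitted statistics reused on test

print(X_train.shape, X_test.shape)
```

Because the encoder is created with `handle_unknown="ignore"`, a category that appears only in the test data (the made-up `"Maybe"` here) is encoded as all zeros instead of raising an error.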
The pipeline includes the following steps:
- Data Loading: Train and test datasets are loaded from Kaggle inputs.
- Exploratory Data Analysis (EDA): Summary statistics, missing value inspection, and visualization of feature distributions.
- Missing Value Imputation:
  - KNNImputer is used for numeric columns.
  - SimpleImputer with mode strategy is used for categorical variables.
- Outlier Handling:
  - Outliers are capped using the IQR method (adjusted thresholds).
- Feature Engineering:
  - New features like `NEW_Alone_Level`, `NEW_Social_Score`, and categorical bins were created to better capture personality traits.
- Encoding:
  - Binary features were label encoded.
  - Multiclass categorical variables were one-hot encoded using `sklearn.preprocessing.OneHotEncoder` with `handle_unknown='ignore'`.
- Model Comparison:
  - Several classifiers (CatBoost, XGBoost, LightGBM, SVC, RandomForest, etc.) were compared.
  - CatBoost (GPU) performed the best on the validation set.
- Hyperparameter Optimization:
  - `Optuna` was used to tune CatBoost hyperparameters with cross-validation, using `f1_macro` as the objective.
- Final Prediction and Submission:
  - The final model was trained on the full training dataset using the best parameters.
  - Predictions were made on the test dataset and saved as `submission.csv`.
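The outlier handling step can be sketched as follows. This is a minimal illustration of IQR capping, not the project's implementation: the function name `cap_outliers_iqr` is hypothetical, and the defaults (quartiles at 0.25 and 0.75, factor 1.5) are the textbook values, whereas the project reports using adjusted thresholds that are not stated here.

```python
import pandas as pd


def cap_outliers_iqr(df, columns, q_low=0.25, q_high=0.75, factor=1.5):
    """Hypothetical helper: clip each column to [Q1 - factor*IQR, Q3 + factor*IQR]."""
    capped = df.copy()  # leave the caller's frame untouched
    for col in columns:
        q1 = capped[col].quantile(q_low)
        q3 = capped[col].quantile(q_high)
        iqr = q3 - q1
        lower, upper = q1 - factor * iqr, q3 + factor * iqr
        # Values outside the fences are replaced by the fence itself.
        capped[col] = capped[col].clip(lower=lower, upper=upper)
    return capped


# Toy column with one extreme value (60).
demo = pd.DataFrame({"Friends_circle_size": [3, 5, 6, 7, 8, 9, 10, 60]})
result = cap_outliers_iqr(demo, ["Friends_circle_size"])
print(result["Friends_circle_size"].tolist())
```

Capping keeps every row, which matters for the test set because each test row still needs a prediction. To keep train and test consistent, the fences would normally be computed on the training data and then applied to both.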
To run this project on your local machine or cloud:
- Clone the Repository:

      git clone https://github.com/BahriDogru/Personality_Type_Classification.git
      cd Personality_Type_Classification
- Install Dependencies: Using the provided `environment.yaml` file:

      conda env create -f environment.yaml
      conda activate personality_prediction_env
- Prepare the Dataset: Download `train.csv` and `test.csv` from the competition page and place them in a `dataset/` folder:

      .
      ├── dataset/
      │   ├── train.csv
      │   └── test.csv
      ├── main.ipynb
      ├── environment.yaml
      ├── .gitignore
      ├── preprocessing.py
      └── README.md
- Run the Script:

      python main.py
After model comparison and tuning, the CatBoostClassifier (GPU) model provided the best results.
📈 Private Leaderboard Score: 0.974089 (F1 Score)
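A model comparison of the kind described above can be sketched with scikit-learn's `cross_val_score`. This is only an illustration under assumptions: the data is synthetic (`make_classification`), only two of the listed model families are included to keep the dependencies to scikit-learn, the fold count and hyperparameters are arbitrary, and it assumes that `f1_macro` was also the metric used for the comparison. It does not reproduce the reported score.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the preprocessed training data (assumption).
X, y = make_classification(
    n_samples=300, n_features=7, n_informative=4, random_state=42
)

# Two of the model families named in this README; boosted models are omitted here.
candidates = {
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=42),
    "SVC": make_pipeline(StandardScaler(), SVC()),
}

# Stratified folds keep the class ratio similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring="f1_macro").mean()
    for name, model in candidates.items()
}
best_name = max(scores, key=scores.get)

for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: mean f1_macro = {score:.4f}")
print("Best candidate:", best_name)
```

Using the same `cv` object for every candidate means each model is scored on identical splits, so differences in the mean come from the models and not from the fold assignment.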