BahriDogru/Personality_Type_Classification

Personality Type Prediction Project (Kaggle Playground Series - S5E7)

Project Overview

This project was developed for participation in the Kaggle Playground Series - Season 5, Episode 7 machine learning competition.
The main objective is to predict whether an individual is an Introvert or Extrovert based on their behavioral, social, and activity-related traits.

Dataset

The dataset contains survey-based features reflecting user behavior, including:

  • Time_spent_Alone
  • Stage_fear
  • Social_event_attendance
  • Going_outside
  • Drained_after_socializing
  • Friends_circle_size
  • Post_frequency
  • Personality (Target: Extrovert or Introvert)

Project Structure

The project mainly consists of:

  • main.py: A Python script that covers every stage from data loading to submission, including EDA, feature engineering, model selection, and prediction.
  • preprocessing.py: A reusable module containing the data transformation pipeline. The same pipeline is applied to both the train and test datasets, which keeps model training and final predictions consistent.

Workflow

The pipeline includes the following steps:

  1. Data Loading: Train and test datasets are loaded from Kaggle inputs.
  2. Exploratory Data Analysis (EDA): Summary statistics, missing value inspection, and visualization of feature distributions.
  3. Missing Value Imputation:
    • KNNImputer is used for numeric columns.
    • SimpleImputer with the most_frequent (mode) strategy is used for categorical variables.
  4. Outlier Handling:
    • Outliers are capped using the IQR method (adjusted thresholds).
  5. Feature Engineering:
    • New features like NEW_Alone_Level, NEW_Social_Score, and categorical bins were created to better capture personality traits.
  6. Encoding:
    • Binary features were label encoded.
    • Multiclass categorical variables were one-hot encoded using sklearn.preprocessing.OneHotEncoder with handle_unknown='ignore'.
  7. Model Comparison:
    • Several classifiers (CatBoost, XGBoost, LightGBM, SVC, RandomForest, etc.) were compared.
    • CatBoost (GPU) performed the best on the validation set.
  8. Hyperparameter Optimization:
    • Optuna was used to tune CatBoost hyperparameters with cross-validation using f1_macro as the objective.
  9. Final Prediction and Submission:
    • The final model was trained on the full training dataset using the best parameters.
    • Predictions were made on the test dataset and saved as submission.csv.
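
The imputation, outlier capping, and encoding steps above can be sketched roughly as follows. This is a minimal illustration and not the contents of preprocessing.py: the toy values, n_neighbors=2, the standard 1.5 × IQR fences (the project uses adjusted thresholds that are not stated here), the Yes/No mapping, and the alone_bin column are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Toy frame reusing column names from the dataset; the values are made up.
df = pd.DataFrame({
    "Time_spent_Alone": [1.0, 2.0, np.nan, 3.0, 40.0],
    "Friends_circle_size": [10.0, 8.0, 9.0, np.nan, 7.0],
    "Stage_fear": pd.Series(["No", "Yes", np.nan, "No", "No"], dtype=object),
})
num_cols = ["Time_spent_Alone", "Friends_circle_size"]
cat_cols = ["Stage_fear"]

# Numeric gaps: KNNImputer (n_neighbors=2 is an assumed value).
df[num_cols] = KNNImputer(n_neighbors=2).fit_transform(df[num_cols])

# Categorical gaps: fill with the most frequent value (the mode).
df[cat_cols] = SimpleImputer(strategy="most_frequent").fit_transform(df[cat_cols])


def cap_outliers_iqr(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR]; k=1.5 is an assumption."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


for col in num_cols:
    df[col] = cap_outliers_iqr(df[col])

# Binary feature: simple label encoding (assumes the values are Yes/No).
df["Stage_fear"] = df["Stage_fear"].map({"No": 0, "Yes": 1})

# Multiclass feature: one-hot encoding. "alone_bin" is a hypothetical binned
# column standing in for the engineered categorical bins.
train_bins = pd.DataFrame({"alone_bin": pd.Series(["low", "mid", "high"], dtype=object)})
enc = OneHotEncoder(handle_unknown="ignore").fit(train_bins)

# A category never seen during fit becomes an all-zero row instead of an error.
unseen = pd.DataFrame({"alone_bin": pd.Series(["extreme"], dtype=object)})
unseen_row = enc.transform(unseen).toarray()
```

Fitting the imputers and the encoder on the training data and only calling transform on the test data is what keeps the two datasets consistent; handle_unknown='ignore' matters for exactly that case, where the test set contains a category the encoder did not see.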

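A model comparison of the kind described in step 7 might look like the sketch below. It is an illustration under assumptions: the data is synthetic, only two of the listed classifiers (RandomForest and SVC) are included to keep the example dependency-free, and f1_macro is borrowed from the tuning step as the comparison metric, which the project may or may not have used at this stage. CatBoost, XGBoost, or LightGBM estimators could be added to the same dictionary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the preprocessed training matrix (not competition data).
X, y = make_classification(n_samples=300, n_features=7, random_state=42)

# Candidate models; the hyperparameters here are library defaults or assumptions.
candidates = {
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=42),
    "SVC": SVC(),
}

# Stratified folds keep the class ratio similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Mean macro-averaged F1 across folds for each candidate.
scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring="f1_macro").mean()
    for name, model in candidates.items()
}
best_name = max(scores, key=scores.get)
print(scores, "->", best_name)
```

Using the same cv object for every candidate means each model is scored on identical splits, so differences in the mean score reflect the models and not the fold assignment.
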
Setup and Running

To run this project on your local machine or cloud:

  1. Clone the Repository:

    git clone https://github.com/BahriDogru/Personality_Type_Classification.git
    cd Personality_Type_Classification
  2. Install Dependencies:

    Using the provided environment.yaml file:

    conda env create -f environment.yaml
    conda activate personality_prediction_env
  3. Prepare the Dataset: Download train.csv and test.csv from the competition page and place them in a dataset/ folder:

    .
    ├── dataset/
    │   ├── train.csv
    │   └── test.csv
    ├── main.py
    ├── environment.yaml
    ├── .gitignore
    ├── preprocessing.py
    └── README.md
    
  4. Run the Script:

    python main.py

Results

After model comparison and tuning, the CatBoostClassifier (GPU) model provided the best results.
📈 Private Leaderboard Score: 0.974089 (F1 Score)