Majdi21926/2025-Kaggle-Competition-March-Machine-Learning-Mania-2025-NCAA-Tournament-Prediction

# 🏀 March Machine Learning Mania 2025 – NCAA Tournament Prediction

📌 Overview

This project is my submission for Kaggle's March Machine Learning Mania 2025 competition. The goal is to predict the probability that one NCAA basketball team will beat another in the tournament, using historical data and advanced machine learning techniques.

🛠️ Project Workflow

1. Data Exploration

  • The dataset consists of 35 CSV files, including detailed and compact results of regular seasons and tournaments, team seeds, and sample submission files.
  • Key datasets used:
    • Regular Season Detailed Results (Men & Women)
    • NCAA Tournament Detailed Results (Men & Women)
    • NCAA Tournament Seeds
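A minimal sketch of loading the key tables with pandas. The file-name patterns (`M`/`W` prefix plus table name) reflect the competition's naming as I understand it and are assumptions here, as are the helper names `load_tables` and `summarize`; adjust them to match the downloaded files.

```python
from pathlib import Path

import pandas as pd

# Assumed file-name patterns (M = men, W = women); verify against your download.
FILES = {
    "regular": "{prefix}RegularSeasonDetailedResults.csv",
    "tourney": "{prefix}NCAATourneyDetailedResults.csv",
    "seeds": "{prefix}NCAATourneySeeds.csv",
}


def load_tables(data_dir, prefix="M"):
    """Load the three key tables for one gender into a dict of DataFrames."""
    data_dir = Path(data_dir)
    return {
        name: pd.read_csv(data_dir / pattern.format(prefix=prefix))
        for name, pattern in FILES.items()
    }


def summarize(tables):
    """Return the (rows, columns) shape of each table as a quick sanity check."""
    return {name: df.shape for name, df in tables.items()}
```

Calling `summarize(load_tables("data", "W"))` would give the same overview for the women's files, assuming they sit in a `data` directory.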

2. Feature Engineering

I engineered matchup-level features, each defined as a difference between the two teams, including:

  • SeedNumDiff: Difference in tournament seed numbers between teams.
  • WinRateDiff: Difference in win rates between teams.
  • AvgPointDifferentialDiff: Difference in average point differential.
  • AvgPointsScoredDiff & AvgPointsAllowedDiff: Differences in offensive & defensive performance.
  • EloRatingDiff: Difference in team Elo ratings.
  • TourneyWinPctDiff: Difference in past tournament win percentages.
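The difference features above can be sketched roughly as follows. This is an illustration, not the notebook's actual code: the sign convention (team A minus team B), the Elo K-factor of 20, the starting rating of 1500, and all function names are assumptions. Seed strings are assumed to look like `W01` or `X16a`.

```python
import re
from collections import defaultdict


def seed_to_int(seed):
    """Extract the numeric part of a seed string such as 'W01' or 'X16a'."""
    return int(re.search(r"\d+", seed).group())


def win_rates(games):
    """games: iterable of (winner_id, loser_id). Returns {team: win rate}."""
    wins, played = defaultdict(int), defaultdict(int)
    for winner, loser in games:
        wins[winner] += 1
        played[winner] += 1
        played[loser] += 1
    return {team: wins[team] / played[team] for team in played}


def elo_ratings(games, k=20.0, base=1500.0):
    """Apply the standard Elo update to games given in chronological order."""
    rating = defaultdict(lambda: base)
    for winner, loser in games:
        # Expected score of the winner before the game was played.
        expected = 1.0 / (1.0 + 10 ** ((rating[loser] - rating[winner]) / 400.0))
        delta = k * (1.0 - expected)
        rating[winner] += delta
        rating[loser] -= delta
    return dict(rating)


def matchup_features(team_a, team_b, seeds, games):
    """Build a few 'A minus B' difference features for one matchup."""
    rates = win_rates(games)
    elo = elo_ratings(games)
    return {
        "SeedNumDiff": seed_to_int(seeds[team_a]) - seed_to_int(seeds[team_b]),
        "WinRateDiff": rates[team_a] - rates[team_b],
        "EloRatingDiff": elo[team_a] - elo[team_b],
    }
```

The remaining features (point differential, points scored and allowed, tournament win percentage) follow the same pattern: compute a per-team statistic, then subtract.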

3. Feature Importance Analysis

I used permutation importance to rank the features by their influence on the model:

  • EloRatingDiff and SeedNumDiff emerged as the most important predictors.
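For reference, a small sketch of permutation importance using scikit-learn's `permutation_importance`. The data is synthetic (the outcome depends on the first two columns only), and the model, scoring choice, and sizes are assumptions for illustration rather than the settings used in this project.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 600
feature_names = ["EloRatingDiff", "SeedNumDiff", "Noise"]

# Synthetic stand-in data: the third column carries no signal.
X = rng.normal(size=(n, 3))
logit = 3.0 * X[:, 0] - 1.0 * X[:, 1]
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0
)
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Shuffle one column at a time on held-out data and measure the score drop.
result = permutation_importance(
    model, X_val, y_val, scoring="neg_brier_score", n_repeats=10, random_state=0
)
ranking = sorted(
    zip(feature_names, result.importances_mean),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranking:
    print(f"{name}: {score:.4f}")
```

Computing the importances on a validation split rather than the training data keeps the ranking from rewarding features the model merely memorized.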

4. Model Training & Evaluation

Approach 1: Baseline Models

  • Logistic Regression
    • Brier Score: 0.2498
    • ROC AUC Score: 0.5355
  • Random Forest
    • Brier Score: 0.2088
    • ROC AUC Score: 0.7386
  • XGBoost Classifier
    • Brier Score: 0.1813
    • ROC AUC Score: 0.8024
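As a reference point for the scores above, predicting 0.5 for every game gives a Brier score of exactly 0.25, and a ROC AUC of 0.5 corresponds to random ranking. The sketch below shows how both metrics can be computed for several baselines with scikit-learn. It uses synthetic data, and `evaluate` is a hypothetical helper; XGBoost is left out to keep the example dependency-free, but `xgboost.XGBClassifier` could be added to the `models` dict in the same way.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss, roc_auc_score
from sklearn.model_selection import train_test_split


def evaluate(model, X_train, y_train, X_val, y_val):
    """Fit a classifier and score its predicted win probabilities."""
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_val)[:, 1]
    return {
        "brier": brier_score_loss(y_val, proba),
        "roc_auc": roc_auc_score(y_val, proba),
    }


# Synthetic stand-in for the engineered difference features.
rng = np.random.default_rng(0)
X = rng.normal(size=(800, 3))
logit = 1.5 * X[:, 0] - 1.0 * X[:, 1]
y = (rng.random(800) < 1.0 / (1.0 + np.exp(-logit))).astype(int)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "LogisticRegression": LogisticRegression(),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=0),
}
results = {
    name: evaluate(model, X_train, y_train, X_val, y_val)
    for name, model in models.items()
}
for name, scores in results.items():
    print(name, scores)
```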

Approach 2: Fine-Tuning XGBoost

  • Hyperparameter tuning was applied to XGBoost, but the tuned model scored worse than the XGBoost baseline (Brier 0.1813, ROC AUC 0.8024) on both metrics.
  • Achieved:
    • Brier Score: 0.3304
    • ROC AUC Score: 0.7313

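Approach 3: Fine-Tuning RandomForest (Best Model by ROC AUC)

  • Hyperparameter tuning was applied to optimize RandomForest.
  • Achieved the highest ROC AUC of all runs; the Brier score is close to the untuned RandomForest (0.2088) and higher than the baseline XGBoost (0.1813):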
    • Brier Score: 0.2098
    • ROC AUC Score: 0.8763
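A minimal sketch of how RandomForest tuning could be set up with scikit-learn's `GridSearchCV`. The parameter grid, the 3-fold cross-validation, and the choice of `neg_brier_score` as the selection metric are all assumptions; the README does not state which search strategy or objective was actually used.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the engineered difference features.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
logit = 1.5 * X[:, 0] - 1.0 * X[:, 1]
y = (rng.random(400) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

# Hypothetical grid; real searches usually cover more values.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, None],
    "min_samples_leaf": [1, 5],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="neg_brier_score",  # select on probability quality, not accuracy
    cv=3,
)
search.fit(X, y)

best_brier = -search.best_score_  # scikit-learn negates losses for scoring
print(search.best_params_, round(best_brier, 4))
```

Selecting on the Brier score keeps the search aligned with a probability-based leaderboard metric; selecting on ROC AUC instead can improve ranking without improving calibration.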

5. Final Submission Preparation

  • Used SampleSubmissionStage2.csv to enumerate the matchups to predict.
  • Applied the best trained model to predict win probabilities.
  • Ensured submission format matches Kaggle's requirements (ID format: 2025_TeamA_TeamB).
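The submission steps above can be sketched as follows. The `ID` and `Pred` column names are assumed to match the sample submission file, and `parse_ids`, `build_submission`, and `predict_fn` are hypothetical names; `predict_fn` stands in for whatever wraps the trained model and its feature lookup.

```python
import pandas as pd


def parse_ids(sample):
    """Split IDs like '2025_1101_1102' into Season, TeamA, TeamB columns."""
    parts = sample["ID"].str.split("_", expand=True).astype(int)
    parts.columns = ["Season", "TeamA", "TeamB"]
    return parts


def build_submission(sample, predict_fn):
    """Predict P(TeamA beats TeamB) per row and return an ID/Pred frame."""
    matchups = parse_ids(sample)
    preds = [
        predict_fn(team_a, team_b)
        for team_a, team_b in zip(matchups["TeamA"], matchups["TeamB"])
    ]
    submission = pd.DataFrame({"ID": sample["ID"], "Pred": preds})
    # Keep every prediction a valid probability.
    submission["Pred"] = submission["Pred"].clip(0.0, 1.0)
    return submission
```

With a real file this would be used as `build_submission(pd.read_csv("SampleSubmissionStage2.csv"), predict_fn).to_csv("submission.csv", index=False)`, where the output file name is arbitrary.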

📈 Key Learnings & Challenges

✅ Feature engineering matters a great deal in sports analytics.

✅ The Season feature was problematic for predictions.

✅ Classification models output class labels by default, but the competition scores predicted probabilities, so submissions need probability outputs.
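A small worked example of why probabilities matter under the Brier score (the mean squared difference between the predicted probability and the 0/1 outcome). The four games are made up for illustration, and `brier` is a hypothetical helper.

```python
def brier(y_true, y_prob):
    """Mean squared error between predicted probabilities and outcomes."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


# Made-up example: the favourite wins three of four similar games.
y_true = [1, 1, 1, 0]
hard_labels = [1, 1, 1, 1]   # classifier output used directly as a prediction
probabilities = [0.75] * 4   # calibrated win probability for the favourite

print(brier(y_true, hard_labels))    # 0.25
print(brier(y_true, probabilities))  # 0.1875
```

The hard labels are right three times out of four yet score no better than predicting 0.5 everywhere, because the single confident miss costs a full point of squared error.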

🚀 Next Steps

  • Try LightGBM & CatBoost for comparison.
  • Experiment with deep learning approaches.
  • Improve generalization to avoid overfitting.

๐Ÿ† Conclusion

By focusing on feature engineering, feature selection, and model optimization, I raised ROC AUC from 0.5355 (baseline logistic regression) to 0.8763 (tuned RandomForest). This project highlights how data-driven insights can be used in sports analytics and tournament predictions.

📂 Full code and notebooks are available in this repo! 🔥
