Authors: Anna‐Maria Fiederling & Louis Brammer
Institution: Católica Lisbon School of Business and Economics (M.Sc. Business Analytics)
Course: AI Fairness and Interpretability (May 2025)
Overview
Student dropout poses a significant challenge in higher education, with individual and societal consequences that extend far beyond graduation rates. In this project, we develop a comprehensive, fairness‐aware machine learning pipeline to predict dropout risk under an assistive‐intervention paradigm. Our goal is twofold:
• Maximize recall of at-risk students
• Ensure equitable treatment across demographic groups
We combine statistical testing, multiple model baselines, bias‐mitigation techniques (pre‐processing, in‐processing, post‐processing), and explainability (SHAP) to produce an early‐warning tool that is both accurate and fair.
Data
We use the publicly available “Predict Students Dropout and Academic Success” dataset from the UCI Machine Learning Repository, containing 4,424 entries and 36 features, including:
Demographics & Background: Gender, Age at enrollment, Nationality, Scholarship holder
Academic Records: Application mode, admission grade, 1st/2nd semester approved units, grades, evaluations
Parental & Socioeconomic Indicators: Mother’s/Father’s qualification & occupation, regional unemployment/inflation/GDP
Other Factors: Educational special needs, displaced status, debtor status, tuition fees up to date
Target: the three-class outcome (Dropout / Enrolled / Graduate) is binarized for fairness-aware modeling, with Dropout = 1 and Enrolled/Graduated = 0
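A minimal sketch of this binarization step, assuming the UCI CSV has been downloaded locally; the file name, separator, and "Target" column label are illustrative assumptions rather than the project's exact paths:

```python
# Minimal sketch: load the UCI dataset and binarize the target.
# "students.csv", the ";" separator, and the "Target" column name are assumptions.
import pandas as pd

df = pd.read_csv("students.csv", sep=";")

# Collapse the three-class outcome into the binary fairness-aware target:
# Dropout -> 1, Enrolled or Graduate -> 0.
df["Dropout"] = (df["Target"] == "Dropout").astype(int)

print(df["Dropout"].value_counts())
```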
1. Exploratory Data Analysis (EDA)
Chi‐Square Tests of Independence
• Hypothesis: Dropout is not independent of gender; descriptively, male students drop out roughly twice as often as female students.
• Results: Gender (χ² = 183.16, p < 0.001), Scholarship (χ² = 265.10, p < 0.001), parental education & occupation all significant (p < 0.001); Nationality & special needs non‐significant.
Cramér’s V Effect Sizes
• Scholarship status: V ≈ 0.24 (strongest association)
• Gender: V ≈ 0.20 (moderate)
• Mother’s Occupation: V ≈ 0.20; Father’s Occupation: V ≈ 0.17
• These effect sizes indicate which predictors carry the most signal (chi‐square and Cramér's V code in 01_EDA.ipynb; a sketch of the computation follows below).
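As a sketch of that computation (the canonical version lives in 01_EDA.ipynb; the column names "Gender" and "Dropout" are assumptions):

```python
# Illustrative chi-square test of independence and Cramér's V effect size.
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(contingency: pd.DataFrame) -> float:
    """Cramér's V effect size for a two-way contingency table."""
    chi2, _, _, _ = chi2_contingency(contingency)
    n = contingency.to_numpy().sum()
    r, k = contingency.shape
    return np.sqrt(chi2 / (n * (min(r, k) - 1)))

table = pd.crosstab(df["Gender"], df["Dropout"])
chi2, p_value, dof, _ = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4g}, Cramér's V = {cramers_v(table):.2f}")
```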
2. Baseline Models
Logistic Regression
• Focused on maximizing recall (to minimize false negatives, i.e., missed dropouts).
• Metrics: Accuracy = 0.88, Dropout Recall = 0.75, Precision = 0.88, F₁ = 0.81, AUC = 0.911.
Random Forest
• Improved AUC = 0.926 but lower dropout recall = 0.70 (more false negatives).
XGBoost
• Highest absolute dropout catch: 214/284 → Recall = 0.75 (ties LR), Precision = 0.84, F₁ = 0.79, Accuracy ≈ 0.87.
• Selected as the best “assistive” baseline prior to fairness interventions (a minimal training sketch appears at the end of this section).
Keras Neural Network
• Accuracy = 0.88, Dropout Recall = 0.71, Precision = 0.88, F₁ = 0.79
• Did not outperform XGBoost or LR on recall; increased complexity without clear gain.
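As referenced above, a minimal sketch of the recall-focused XGBoost baseline; the 80/20 stratified split and the hyperparameters are illustrative assumptions, not the tuned values used in the project (df comes from the data-loading sketch):

```python
# Recall-focused XGBoost baseline sketch; split and hyperparameters are illustrative.
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_auc_score
from xgboost import XGBClassifier

X = df.drop(columns=["Dropout", "Target"])
y = df["Dropout"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = XGBClassifier(
    n_estimators=300,
    max_depth=4,
    learning_rate=0.1,
    eval_metric="logloss",
)
model.fit(X_train, y_train)

proba = model.predict_proba(X_test)[:, 1]
print(classification_report(y_test, model.predict(X_test)))  # recall on class 1 is the focus
print("AUC:", roc_auc_score(y_test, proba))
```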
3. Fairness Audit & Pre‐Processing
IBM AI Fairness 360 Toolkit (Bellamy et al., 2019)
• Demographic Parity (DP): Difference in “non‐dropout” prediction rates (male vs female).
• Equal Opportunity (EO): Difference in true‐positive rates (i.e., recall) across gender.
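The project computes these gaps with AIF360's metric classes; the plain-pandas version below is an equivalent sketch of what the two numbers measure, assuming y_test, the baseline model, and a 0/1 Gender column aligned with X_test from the sketches above:

```python
# Hand-rolled check of the two fairness gaps; the project itself used AIF360,
# so treat this as an equivalent illustration, not the exact code.
import numpy as np

def group_rates(y_true, y_pred, group):
    """Per-group favorable ('non-dropout') prediction rate and dropout recall."""
    fav_rate, tpr = {}, {}
    for g in np.unique(group):
        mask = group == g
        fav_rate[g] = np.mean(y_pred[mask] == 0)        # non-dropout prediction rate
        dropouts = mask & (y_true == 1)
        tpr[g] = np.mean(y_pred[dropouts] == 1)         # true-positive rate (recall)
    return fav_rate, tpr

gender = X_test["Gender"].to_numpy()                    # assumes Gender coded 0/1
fav, tpr = group_rates(y_test.to_numpy(), model.predict(X_test), gender)
print("Demographic parity gap:", abs(fav[1] - fav[0]))
print("Equal opportunity gap :", abs(tpr[1] - tpr[0]))
```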
Pre‐Processing Mitigation
Reweighing (Kamiran & Calders, 2012)
• Adjusts instance weights so that gender groups have equal statistical weight.
• Post‐reweighing metrics: ΔDP = 0.178, ΔEO = 0.022, Accuracy = 0.88, Recall = 0.76.
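A minimal sketch of the reweighing step with AIF360, continuing from the baseline split above; the group encoding (Gender: 1 = privileged) and the favorable label (0 = non-dropout) are assumptions about the project's coding:

```python
# Reweighing sketch with AIF360 (Kamiran & Calders, 2012); encodings are assumed.
import pandas as pd
from aif360.datasets import BinaryLabelDataset
from aif360.algorithms.preprocessing import Reweighing

train_bld = BinaryLabelDataset(
    df=pd.concat([X_train, y_train], axis=1),
    label_names=["Dropout"],
    protected_attribute_names=["Gender"],
    favorable_label=0.0,      # staying enrolled / graduating is the favorable outcome
    unfavorable_label=1.0,
)

rw = Reweighing(
    unprivileged_groups=[{"Gender": 0}],
    privileged_groups=[{"Gender": 1}],
)
train_reweighed = rw.fit_transform(train_bld)

# The resulting instance weights can be passed to any classifier that accepts them,
# e.g. model.fit(X_train, y_train, sample_weight=train_reweighed.instance_weights)
```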
Disparate Impact Remover (DI Remover) (Feldman et al., 2015)
• Repairs feature distributions so they look similar across gender groups (rank-preserving transformation), pushing downstream predictions toward parity.
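Correspondingly, a sketch of the DI Remover applied to the same AIF360 dataset object; repair_level = 1.0 enforces full repair, and the level actually used in the project may differ:

```python
# Disparate Impact Remover sketch (Feldman et al., 2015) with AIF360.
from aif360.algorithms.preprocessing import DisparateImpactRemover

di_remover = DisparateImpactRemover(repair_level=1.0, sensitive_attribute="Gender")
train_repaired = di_remover.fit_transform(train_bld)

# The repaired feature matrix replaces the originals before retraining the classifier.
X_train_repaired = train_repaired.features
```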
4. In‐Processing & Post‐Processing Mitigation
In‐Processing (Exponentiated Gradient) (Agarwal et al., 2018)
• Trains a fairness‐constrained classifier to minimize loss subject to DP or EO constraints.
• Results: ΔDP ≈ 0.02, ΔEO ≈ 0.35 (unsatisfactory EO), Accuracy = 0.826.
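A sketch of what this in-processing step looks like with fairlearn's ExponentiatedGradient reduction; the logistic-regression base learner and the demographic-parity constraint shown here are illustrative choices, not necessarily the exact configuration used:

```python
# Fairness-constrained training sketch with fairlearn's ExponentiatedGradient.
from fairlearn.reductions import ExponentiatedGradient, DemographicParity
from sklearn.linear_model import LogisticRegression

mitigator = ExponentiatedGradient(
    estimator=LogisticRegression(max_iter=1000),
    constraints=DemographicParity(),       # swap in a TPR-parity constraint for EO
)
mitigator.fit(X_train, y_train, sensitive_features=X_train["Gender"])
y_pred_fair = mitigator.predict(X_test)
```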
Post‐Processing (Threshold Optimizer) (Hardt, Price & Srebro, 2016; Pleiss et al., 2017)
• Uses a trained model and adjusts decision thresholds per group to satisfy EO.
• Combined with the DI Remover: before thresholding, ΔDP = +0.120 and ΔEO = +0.072; after thresholding, ΔDP = +0.095 and ΔEO = +0.079, with Accuracy = 0.866.
• With the decision threshold tuned to 0.20, the model reaches dropout recall = 0.835 and precision = 0.714, with ΔDP = 0.125 and ΔEO = 0.012.
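A sketch of the post-processing step with fairlearn's ThresholdOptimizer, constrained on true-positive-rate parity (equal opportunity); wrapping the already-fitted XGBoost baseline with prefit=True is an assumption about the setup:

```python
# Group-specific threshold adjustment sketch with fairlearn's ThresholdOptimizer.
from fairlearn.postprocessing import ThresholdOptimizer

postproc = ThresholdOptimizer(
    estimator=model,
    constraints="true_positive_rate_parity",   # equal opportunity
    objective="balanced_accuracy_score",
    prefit=True,
    predict_method="predict_proba",
)
postproc.fit(X_train, y_train, sensitive_features=X_train["Gender"])
y_pred_post = postproc.predict(X_test, sensitive_features=X_test["Gender"])
```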
5. SHAP Explainability
SHAP (SHapley Additive exPlanations) (Lundberg & Lee, 2017)
• Global Importance (Beeswarm & Bar Charts) (Fig. 15–19)
– Top features by mean |SHAP|:
1. Approved 2nd‐semester credits (mean |SHAP| ≈ 0.67)
2. Tuition fees up to date (≈ 0.20)
3. Approved 1st‐semester credits (≈ 0.15)
4. 2nd‐semester grade (≈ 0.10)
5. Course of study (≈ 0.10)
– Shows that the demographic features' |SHAP| values (Gender_1 ≈ 0.01, Scholarship_holder_1 ≈ 0.01) are marginal by comparison.
• Individual Force & Waterfall Plots (Fig. 24–25)
– Example A (high‐risk): predicted risk = 0.879; key drivers: no approved 2nd‐semester units (+0.38), enrollment in a high‐risk course (+0.10), weak 1st semester (only 3 approved units, +0.08).
– Example B (low‐risk): predicted risk = 0.020; protective factors: low‐risk course (–0.28), high admission grade (–0.12).
• Decision Paths & Cluster Analysis (Fig. 20–21, 26)
– Three risk cohorts emerge: a primary high‐risk cluster (SHAP contribution > +1.0), a medium‐risk cluster (≈ [–0.2, +0.2]), and a secondary high‐risk cluster.
– Suggests tiered intervention: urgent outreach, routine monitoring, targeted follow‐up.
• Interactions (Fig. 22–23)
– Age: Older students derive larger negative shifts per approved credit.
– Grade: Higher grades amplify credit’s protective effect.
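The plots above can be reproduced with a few lines of SHAP's tree explainer; this is a generic sketch for the XGBoost baseline (model and X_test come from the earlier sketches), not the project's exact plotting code:

```python
# SHAP analysis sketch for the XGBoost baseline.
import shap

explainer = shap.TreeExplainer(model)
shap_values = explainer(X_test)          # Explanation object with per-feature attributions

shap.plots.beeswarm(shap_values)         # global importance (cf. Fig. 15–19)
shap.plots.bar(shap_values)              # mean |SHAP| ranking
shap.plots.waterfall(shap_values[0])     # single-student explanation (cf. Fig. 24–25)
```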
Key Takeaways
Statistical Rigor: Chi‐square tests of association and Cramér's V effect sizes confirmed which sensitive features truly matter.
Model Diversity: We compared linear (LR), ensemble (RF), gradient boosting (XGB), and neural architectures—identifying the optimal trade‐off between recall and overall performance.
Fairness Engineering: Hands‐on implementation of pre‐processing (Reweighing, Disparate Impact Remover), in‐processing (Exponentiated Gradient), and post‐processing (Equalized Odds) methods, illustrating real‐world trade‐offs between Demographic Parity and Equal Opportunity (Hardt et al., 2016; Pleiss et al., 2017).
Explainability Focus: Extensive SHAP‐based analysis shows how features—especially academic progress—drive predictions and how to interpret individual risk scores for targeted interventions (Lundberg & Lee, 2017).
By integrating accuracy, fairness, and interpretability, this project demonstrates an end‐to‐end pipeline for developing trustworthy, equitable predictive models in high‐stakes educational settings.