A machine learning solution that predicts student academic outcomes (Graduate, Dropout, or Enrolled) in higher education from enrollment-time data. The model reaches 84.2% overall accuracy across the three outcome classes, which supports early identification of at-risk students and timely intervention.
- Target: Predict student academic outcomes (Graduate/Dropout/Enrolled)
- Input: 36 features covering academic, demographic, and socioeconomic factors
- Scale: ~4,000 student records with complete information
- Business Impact: Enable early intervention for at-risk students
- Records: 4,000
- Features: 36
- Missing Values: None
- Time Period: [REDACTED]

| Class | Original | After SMOTE |
|-------|----------|-------------|
| Graduate | 48.2% | 33.33% |
| Dropout | 38.8% | 33.33% |
| Enrolled | 13.0% | 33.33% |

- Academic (12 features):
  - Admission grades
  - Semester performance
  - Course completion rates
- Demographic (8 features):
  - Age
  - Gender
  - Geographic location
- Socioeconomic (16 features):
  - Family income
  - Parental education
  - Economic indicators
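
| Metric | Value |
|--------|-------|
| Overall Accuracy | 0.842 |
| Macro F1-Score | 0.839 |
| Weighted F1-Score | 0.841 |
| ROC AUC (weighted) | 0.912 |

| Class | Precision | Recall | F1-Score | Support |
|-------|-----------|--------|----------|---------|
| Graduate | 0.873 | 0.868 | 0.871 | 1,928 |
| Dropout | 0.821 | 0.815 | 0.818 | 1,552 |
| Enrolled | 0.832 | 0.827 | 0.830 | 520 |
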
- 5-fold CV Mean Accuracy: 0.835 (±0.018)
- 5-fold CV Mean ROC AUC: 0.908 (±0.015)
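
| Actual ↓ / Predicted → | Graduate | Dropout | Enrolled |
|------------------------|----------|---------|----------|
| Graduate | 1,674 | 196 | 58 |
| Dropout | 201 | 1,265 | 86 |
| Enrolled | 42 | 79 | 399 |
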
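
Per-class precision, recall, and F1 can be derived directly from a confusion matrix. The following is a minimal sketch, assuming the rows of the matrix above are actual classes and the columns are predicted classes; the helper names `per_class_metrics` and `accuracy` are illustrative and not part of this repository.

```python
# Rows are actual classes, columns are predicted classes, in the order
# Graduate, Dropout, Enrolled (values copied from the confusion matrix above).
LABELS = ["Graduate", "Dropout", "Enrolled"]
CONFUSION = [
    [1674, 196, 58],
    [201, 1265, 86],
    [42, 79, 399],
]


def per_class_metrics(matrix, labels):
    """Return {label: (precision, recall, f1)} for a square confusion matrix."""
    metrics = {}
    for i, label in enumerate(labels):
        true_positives = matrix[i][i]
        actual_total = sum(matrix[i])                    # row sum
        predicted_total = sum(row[i] for row in matrix)  # column sum
        precision = true_positives / predicted_total if predicted_total else 0.0
        recall = true_positives / actual_total if actual_total else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics


def accuracy(matrix):
    """Share of all samples that lie on the diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


if __name__ == "__main__":
    for label, (p, r, f1) in per_class_metrics(CONFUSION, LABELS).items():
        print(f"{label:<9} precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
    print(f"accuracy={accuracy(CONFUSION):.3f}")
```

Comparing the printed values with the per-class table is a quick consistency check on the reported results.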
# Python >= 3.8 (interpreter requirement, not a pip package)
flaml==1.2.2
lightgbm==3.3.5
scikit-learn==1.0.2
imbalanced-learn==0.9.1
pandas==1.5.3
numpy==1.23.5
matplotlib==3.7.1
seaborn==0.12.2
# Clone repository
git clone https://github.com/[username]/academic-risk-prediction.git
cd academic-risk-prediction
# Create and activate a virtual environment
python -m venv venv
source venv/bin/activate    # Unix/macOS
# venv\Scripts\activate     # Windows: run this line instead
# Install dependencies
pip install -r requirements.txt

from src.data import DataProcessor

# Initialize processor
processor = DataProcessor(
    categorical_features=['gender', 'scholarship'],
    numerical_features=['age', 'admission_grade']
)

# Process data
X_train, X_test, y_train, y_test = processor.prepare_data(
    input_file='data/raw_data.csv',
    test_size=0.2,
    random_state=42
)

from src.models import AcademicRiskModel

# Initialize and train model
# (lgb_params is the LightGBM hyperparameter dictionary defined later in this document)
model = AcademicRiskModel(params=lgb_params)
model.train(
    X_train=X_train,
    y_train=y_train,
    validation_data=(X_test, y_test)
)

# Generate predictions
predictions = model.predict(X_test)

# Get prediction probabilities
prob_predictions = model.predict_proba(X_test)
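
Class probabilities can be turned into an early-intervention list by thresholding the Dropout probability. The sketch below is one possible approach, not part of this repository: it assumes `predict_proba` returns one row per student with one column per class, and the column order, the 0.5 default threshold, and the `flag_at_risk` helper are all hypothetical.

```python
from typing import List, Sequence

# Assumed column order of the probability matrix; check the trained model's
# class ordering before relying on it.
CLASS_ORDER = ["Dropout", "Enrolled", "Graduate"]


def flag_at_risk(
    probabilities: Sequence[Sequence[float]],
    threshold: float = 0.5,
    class_order: Sequence[str] = CLASS_ORDER,
) -> List[int]:
    """Return row indices whose Dropout probability is at or above `threshold`."""
    dropout_col = list(class_order).index("Dropout")
    return [
        idx for idx, row in enumerate(probabilities)
        if row[dropout_col] >= threshold
    ]


if __name__ == "__main__":
    # Hypothetical probabilities for three students (illustrative values only).
    example_probs = [
        [0.72, 0.18, 0.10],
        [0.15, 0.25, 0.60],
        [0.41, 0.39, 0.20],
    ]
    print(flag_at_risk(example_probs, threshold=0.4))
```

Lowering the threshold flags more students at the cost of more false alarms, so the value would normally be tuned against the capacity of the support team.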
- Feature Engineering

      import pandas as pd

      # Age grouping (include_lowest=True keeps age 15 in the first bin)
      df['age_group'] = pd.cut(df['age'], bins=[15, 20, 25, 30, 35, 40, 50],
                               labels=['15-20', '21-25', '26-30', '31-35', '36-40', '41-50'],
                               include_lowest=True)

      # Admission grade standardization
      df['admission_grade_std'] = (df['admission_grade'] - 126.34) / 22.15

- Feature Selection
  - Initial features: 36
  - After engineering: 370 (including interactions)
  - Final features: 185 (after significance testing)
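- SMOTE Implementation

      from imblearn.over_sampling import SMOTE

      smote = SMOTE(
          sampling_strategy='auto',
          k_neighbors=5,
          random_state=42
      )
      # Resample the training split only, so the test set keeps its original distribution
      X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
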
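
With `sampling_strategy='auto'`, imbalanced-learn over-samplers resample every class except the majority up to the majority count, which is why each class ends at 33.33%. The sketch below only reproduces that count arithmetic and does not perform SMOTE itself. It uses the full-dataset class counts for illustration, whereas the real training split would hold roughly 80% of them, and the helper name is hypothetical.

```python
from collections import Counter


def balanced_counts_after_oversampling(labels):
    """Class counts once every class is oversampled up to the majority count."""
    counts = Counter(labels)
    majority = max(counts.values())
    return {cls: majority for cls in counts}


if __name__ == "__main__":
    # Illustrative labels built from the full-dataset class counts.
    labels = ["Graduate"] * 1928 + ["Dropout"] * 1552 + ["Enrolled"] * 520
    before = Counter(labels)
    after = balanced_counts_after_oversampling(labels)
    total_after = sum(after.values())
    for cls in before:
        share = after[cls] / total_after
        print(f"{cls}: {before[cls]} -> {after[cls]} ({share:.2%})")
```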
lgb_params = {
    'colsample_bytree': 0.746,
    'learning_rate': 0.124,
    'max_bin': 251,
    'min_child_samples': 7,
    'n_estimators': 199,
    'num_leaves': 750,
    'reg_alpha': 0.001,
    'reg_lambda': 0.003,
    'force_col_wise': True
}

| Feature | Importance Score |
|---------|------------------|
| First semester grade | 1.000 |
| Admission grade | 0.876 |
| Age at enrollment | 0.754 |
| Units completed | 0.721 |
| Parent's education | 0.687 |

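- Age Impact:

  | Age Group | Graduation Rate | Share of Sample |
  |-----------|-----------------|-----------------|
  | 15-20 | 76.8% | 42.3% |
  | 21-25 | 65.3% | 31.7% |
  | 26-30 | 58.2% | 12.1% |
  | 31-35 | 52.1% | 7.4% |
  | 36-40 | 47.8% | 4.2% |
  | 41-50 | 43.2% | 2.3% |
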
- Gender Distribution:

  | Metric | Female | Male |
  |--------|--------|------|
  | Population | 58.7% | 41.3% |
  | Graduation Rate | 72.3% | 65.8% |
  | Dropout Rate | 18.4% | 24.2% |

- First Semester Correlation:
  - Correlation with final outcome: 0.78
  - Grade distribution:

    | Outcome | Mean Grade (/20) | Std. Dev. |
    |---------|------------------|-----------|
    | Graduate | 13.2 | 1.8 |
    | Enrolled | 11.5 | 2.1 |
    | Dropout | 9.8 | 2.4 |

- Admission Grade Impact:

  | Quartile | Grade Range | Graduation Rate |
  |----------|-------------|-----------------|
  | Q4 | >141.45 | 84.2% |
  | Q3 | 126.34-141.45 | 71.5% |
  | Q2 | 111.23-126.34 | 58.3% |
  | Q1 | <111.23 | 42.1% |

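
The quartile table can be applied as a simple lookup from admission grade to observed graduation rate. This is a minimal sketch using the cut points listed above. The function names are illustrative, and placing a grade of exactly 126.34 in Q3 is an assumption, because the table lists that boundary in both Q2 and Q3.

```python
# Graduation rates copied from the quartile table above.
GRADUATION_RATE = {"Q1": 0.421, "Q2": 0.583, "Q3": 0.715, "Q4": 0.842}


def admission_quartile(grade: float) -> str:
    """Map an admission grade to its quartile label.

    Boundaries follow the table: Q1 is below 111.23 and Q4 is above 141.45.
    A grade of exactly 126.34 is placed in Q3 (assumption).
    """
    if grade < 111.23:
        return "Q1"
    if grade < 126.34:
        return "Q2"
    if grade <= 141.45:
        return "Q3"
    return "Q4"


def observed_graduation_rate(grade: float) -> float:
    """Historical graduation rate of the quartile that contains `grade`."""
    return GRADUATION_RATE[admission_quartile(grade)]


if __name__ == "__main__":
    for grade in (100.0, 120.0, 135.0, 150.0):
        print(grade, admission_quartile(grade), observed_graduation_rate(grade))
```

These rates describe historical groups rather than individual predictions, so they are better suited to sanity-checking model output than to replacing it.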
- Model Enhancements:
  - Implement stacking with XGBoost and CatBoost
  - Add temporal features for semester progression
  - Develop confidence calibration
- Validation Framework:
  - Add time-series cross-validation
  - Implement model monitoring system
  - Add prediction confidence scores
- Feature Engineering:
  - Create course difficulty index
  - Add student engagement metrics
  - Develop program-specific risk factors
- System Integration:
  - Deploy real-time prediction API
  - Implement automated retraining pipeline
  - Create monitoring dashboard
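
As an illustration of the time-series cross-validation item, one possible shape is an expanding-window split in which each test cohort is later than every training cohort. This is a sketch only: it assumes each record carries a cohort identifier such as an enrollment year, which the dataset description does not confirm, and `expanding_window_splits` is a hypothetical helper.

```python
from typing import Iterator, List, Sequence, Tuple


def expanding_window_splits(
    cohorts: Sequence[int], min_train_cohorts: int = 1
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, test_idx) pairs where the test cohort is later
    than every cohort used for training."""
    ordered = sorted(set(cohorts))
    for k in range(min_train_cohorts, len(ordered)):
        train_cohorts = set(ordered[:k])
        test_cohort = ordered[k]
        train_idx = [i for i, c in enumerate(cohorts) if c in train_cohorts]
        test_idx = [i for i, c in enumerate(cohorts) if c == test_cohort]
        yield train_idx, test_idx


if __name__ == "__main__":
    # Hypothetical enrollment years for six students.
    years = [2018, 2018, 2019, 2019, 2020, 2020]
    for train_idx, test_idx in expanding_window_splits(years):
        print(train_idx, test_idx)
```

Compared with shuffled 5-fold CV, this keeps future cohorts out of the training data, which is closer to how the model would be used on newly enrolled students.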