A two-layered, data-driven strategy for protein function classification that balances interpretability, non-linear representation, and ensemble robustness.
Goal: Reduce 4,922 original features into compact, complementary representations, then learn how to weight them per sample.
Pathway | Method | Output | Strength |
---|---|---|---|
Linear Compression | PCA (retain 99% variance) | Principal components | Noise reduction, preserves global variance |
Sparse Selection | Lasso logistic (data-driven C) | Subset of original features | Hard feature pruning, biological interpretability |
Bayesian Selection | MCMC feature selection | Probabilistic sparse subset | Uncertainty quantification, preserves original axes |
Nonlinear Compression | VAE (Variational Autoencoder) | Learned latent factors | Captures complex manifolds |
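A minimal sketch of how the two scikit-learn pathways might be fit. The MCMC selector and VAE are project-specific components, so they appear here only as hypothetical stand-ins (`mcmc_selector`, `vae`); `X_train` and `y_train` are assumed to be the preprocessed training arrays.

```python
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler

# X_train: (n_samples, 4922) raw feature matrix; y_train: protein class labels.
X_scaled = StandardScaler().fit_transform(X_train)

# 1) Linear compression: keep the principal components explaining 99% of variance.
pca = PCA(n_components=0.99, svd_solver="full")
X_pca = pca.fit_transform(X_scaled)

# 2) Sparse selection: L1-penalized logistic regression with C chosen by
#    internal cross-validation, then hard-prune the zero-weight features.
lasso = LogisticRegressionCV(penalty="l1", solver="saga", Cs=10, cv=5, max_iter=5000)
lasso.fit(X_scaled, y_train)
X_lasso = SelectFromModel(lasso, prefit=True).transform(X_scaled)

# 3) Bayesian selection and 4) VAE latents: project-specific, shown here only
#    as hypothetical stand-ins with transformer-style interfaces.
X_mcmc = mcmc_selector.fit_transform(X_scaled, y_train)  # probabilistic sparse subset
X_vae = vae.fit_transform(X_scaled)                      # learned latent factors
```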
All four compressed representations are concatenated into a single meta-feature matrix.
A TabNet meta-learner is trained on these meta-features to soft-select and attend to the best feature space for each protein.
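A sketch of the fusion step, assuming the pytorch-tabnet package and the variables from the sketch above. The quantile cutoff used to derive a hard "TabNet-selected" subset from the learned importances is illustrative, not the project's exact rule, and `X_meta_val`/`y_val` are a hypothetical held-out split.

```python
import numpy as np
from pytorch_tabnet.tab_model import TabNetClassifier

# Concatenate the four compressed views into one meta-feature matrix.
X_meta = np.hstack([X_pca, X_lasso, X_mcmc, X_vae])

meta = TabNetClassifier(n_d=16, n_a=16, n_steps=4)  # illustrative sizes
meta.fit(
    X_meta, y_train,
    eval_set=[(X_meta_val, y_val)],  # hypothetical validation split
    max_epochs=200, patience=20,
)

# TabNet's sparse attention masks yield per-feature importances; one simple way
# to obtain a hard "TabNet-selected" subset is to keep the most-attended columns.
keep = meta.feature_importances_ > np.quantile(meta.feature_importances_, 0.75)
X_selected = X_meta[:, keep]
```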
Base models: each is trained on the TabNet-selected features from Layer 1.
Model | Pipeline |
---|---|
Random Forest | Baseline → RandomizedSearchCV → GridSearchCV → Optuna → Save → Predict |
XGBoost | Baseline → RandomizedSearchCV → GridSearchCV → Optuna → Save → Predict |
Logistic Regression (Lasso) | L1-penalized → Feature selection → Retrain → Predict |
MLP Neural Network (GPU) | Baseline → Early stopping → Optuna → Save → Predict |
TabNet (GPU) | Baseline → Early stopping → Optuna → Save → Predict |
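The coarse-to-fine tuning cascade, sketched for Random Forest with illustrative search spaces; the same RandomizedSearchCV → GridSearchCV → Optuna pattern applies to XGBoost. `X_selected` and `y_train` carry over from the sketches above.

```python
import joblib
import optuna
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, cross_val_score

# Stage 1: broad randomized sweep over an illustrative space.
coarse = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    {"n_estimators": [200, 400, 800], "max_depth": [None, 10, 20, 40]},
    n_iter=8, cv=5, n_jobs=-1,
).fit(X_selected, y_train)

# Stage 2: grid refinement around the randomized winner.
fine = GridSearchCV(
    coarse.best_estimator_, {"min_samples_leaf": [1, 2, 4]}, cv=5, n_jobs=-1,
).fit(X_selected, y_train)

# Stage 3: Optuna polish over a remaining continuous knob.
def objective(trial):
    params = {**coarse.best_params_, **fine.best_params_,
              "max_features": trial.suggest_float("max_features", 0.1, 1.0)}
    return cross_val_score(
        RandomForestClassifier(**params, random_state=0),
        X_selected, y_train, cv=5,
    ).mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=30)

# Refit with the merged best parameters, then Save → Predict.
best = RandomForestClassifier(
    **{**coarse.best_params_, **fine.best_params_, **study.best_params},
    random_state=0,
).fit(X_selected, y_train)
joblib.dump(best, "rf_tuned.joblib")
```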
Final Ensembles:
- Soft Voting: average predicted probabilities of all base models
- Stacking (LightGBM meta-learner): learn optimal combination of base predictions
Why both?
Soft voting provides a stable baseline; stacking can squeeze extra accuracy by learning when to trust each model.
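A sketch of both ensembles using scikit-learn and LightGBM. Here `rf`, `xgb`, `lr`, and `mlp` stand in for the tuned base estimators (TabNet is omitted because its fit signature is not fully scikit-learn compatible), and `X_test_selected` is the hypothetical test-side feature matrix.

```python
from lightgbm import LGBMClassifier
from sklearn.ensemble import StackingClassifier, VotingClassifier

estimators = [("rf", rf), ("xgb", xgb), ("lr", lr), ("mlp", mlp)]

# Soft voting: average the base models' predicted class probabilities.
voter = VotingClassifier(estimators=estimators, voting="soft")
voter.fit(X_selected, y_train)

# Stacking: a LightGBM meta-learner is trained on out-of-fold base-model
# probabilities, learning when to trust each model.
stacker = StackingClassifier(
    estimators=estimators,
    final_estimator=LGBMClassifier(),
    stack_method="predict_proba",
    cv=5,
)
stacker.fit(X_selected, y_train)

proba_vote = voter.predict_proba(X_test_selected)
proba_stack = stacker.predict_proba(X_test_selected)
```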
Workflow:
- Data Loading & Preprocessing
- Layer 1 – Dimensionality Reduction & Meta-Learning
  - Fit PCA, Lasso, MCMC selector, VAE
  - Concatenate outputs → TabNet meta-learner → TabNet-selected features
- Layer 2 – Model Training
  - Train RF, XGB, LR, MLP, TabNet on TabNet-selected features
  - Hyperparameter tuning via RandomizedSearchCV → GridSearchCV → Optuna
- Ensembling
  - Generate soft-voting and stacking predictions
- Validation & Explainability
  - 5-fold CV monitoring, fold-variance analysis, McNemar's test, bootstrap confidence intervals (statistical checks sketched after the saved-artifacts list below)
  - SHAP and feature importance plots
- Submission
  - Save single-model and ensemble CSVs with `Entry, ProteinClass` columns
Saved artifacts:
- Models: `*.joblib` files for each tuned model and its Optuna study
- Compressed datasets: PCA, Lasso, MCMC, VAE outputs, and the TabNet-selected features
- Predictions: `y_pred_*.npy` files and formatted CSVs
- Reports: confusion matrices, classification reports, SHAP plots
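For the statistical checks in the validation step, a minimal sketch assuming statsmodels; `pred_a`, `pred_b`, and `y_val` are hypothetical held-out predictions and labels for two base models.

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# McNemar's test on the 2x2 agreement/disagreement table of two classifiers.
a_ok = pred_a == y_val
b_ok = pred_b == y_val
table = [[np.sum(a_ok & b_ok),  np.sum(a_ok & ~b_ok)],
         [np.sum(~a_ok & b_ok), np.sum(~a_ok & ~b_ok)]]
print("McNemar p-value:", mcnemar(table, exact=True).pvalue)

# Percentile bootstrap: 95% confidence interval for held-out accuracy.
rng = np.random.default_rng(0)
n = len(y_val)
accs = np.array([
    np.mean(pred_a[idx] == y_val[idx])
    for idx in (rng.integers(0, n, size=n) for _ in range(2000))
])
print("95% CI:", np.percentile(accs, [2.5, 97.5]))
```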
This README captures our updated, layered strategy—leveraging both linear/sparse and nonlinear/manifold views, dynamically fused by TabNet, then ensembled across multiple model paradigms to maximize accuracy and robustness.