Machine learning approaches for predicting signal peptide secretion efficiency in *Bacillus subtilis*. This repository contains two complementary methodologies:
- Random Forest Regression: Reproduction of physicochemical feature-based prediction
- Neural Network Classifier: Deep learning approach using protein embeddings
Performance Summary:
- Random Forest: Test MSE 1.191 WA² (R² = 0.750)
- Neural Network: Test MSE 0.950 WA² (R² = 0.800)
Repository structure:

grasso-reproduction-study/
├── README.md # Project documentation
├── requirements.txt # Python dependencies
├── quick_run.py # Interactive execution interface
│
├── Random Forest Components/
│ ├── grasso_reproduction_tool.py # Main reproduction analysis
│ ├── 01_grasso_reproduction_complete.py # Single-file implementation
│ ├── data_verification_tool.py # Data integrity verification
│ ├── config.py # Configuration parameters
│ └── sb2c00328_si_011.csv # Signal peptide dataset
│
├── Neural Network Components/
│ ├── classifier.py # Neural network classifier
│ ├── train_classifier.py # Training pipeline
│ ├── test_classifier.py # Model evaluation
│ ├── protein_embeddings_experiment.py # Embedding data handling
│ └── experiment_framework.py # Experiment management
│
├── data/ # Protein embeddings
│ ├── trainAA_*.parquet # Training embeddings
│ └── testAA_*.parquet # Test embeddings
│
└── results/ # Generated outputs
├── classifier_results/ # Neural network results
└── grasso_reproduction_results.png # Random forest visualization
Requirements:

- Python 3.8+
- 8GB RAM recommended
# Clone repository
git clone <repository-url>
cd grasso-reproduction-study
# Install dependencies
pip install -r requirements.txt
Data files:

Random Forest Analysis:
- File: `sb2c00328_si_011.csv` (~14 MB)
- Content: 11,643 signal peptide variants with 156 physicochemical features
Neural Network Classifier:
- Files: `data/trainAA_*.parquet`, `data/testAA_*.parquet`
- Content: Pre-computed protein embeddings (ESM2, Ginkgo-AA0)
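
For a quick sanity check of the data layout, the files above can be inspected with pandas. The snippet below is a minimal sketch that assumes only the paths listed here (reading Parquet additionally requires an engine such as pyarrow); column names are not assumed.

```python
# Minimal sketch: inspect the bundled data files with pandas.
# Only the file paths listed above are assumed; column names are not.
from glob import glob

import pandas as pd

# Random Forest dataset: signal peptide variants + physicochemical features
sp_df = pd.read_csv("sb2c00328_si_011.csv")
print(sp_df.shape)  # expected ~11,643 rows

# Neural network embeddings: one Parquet file per embedding model
# (pd.read_parquet needs a Parquet engine such as pyarrow or fastparquet)
train_files = sorted(glob("data/trainAA_*.parquet"))
test_files = sorted(glob("data/testAA_*.parquet"))
print(train_files, test_files)

train_emb = pd.read_parquet(train_files[0])  # inspect one embedding set
print(train_emb.shape)
```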
Quick start:

python quick_run.py

The interactive menu provides access to all functionality:
- Random Forest reproduction analysis
- Neural network training/testing
- System verification and tests
Random Forest Analysis:
# Complete reproduction
python grasso_reproduction_tool.py
# Data verification
python data_verification_tool.py
Neural Network Classifier:
# Train new model
python train_classifier.py
# Evaluate trained model
python test_classifier.py
Programmatic usage (Python):

# Random Forest
from grasso_reproduction_tool import execute_grasso_reproduction
results = execute_grasso_reproduction()
# Neural Network
from classifier import ProteinClassifierNN
from protein_embeddings_experiment import ProteinEmbeddingsExperiment
# Load embeddings
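# exp_manager is an experiment-manager object provided by experiment_framework.py; see that module for how to construct it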
emb_exp = ProteinEmbeddingsExperiment(exp_manager, "data")
train_emb, test_emb, train_labels, test_labels = emb_exp.load_model_data('ginkgo_aa0')
# Train classifier
classifier = ProteinClassifierNN(n_bins=10)
model = classifier.train_model(train_emb, train_labels)
Random Forest features: 156 physicochemical descriptors
- N-region: Amino-terminal properties (24)
- H-region: Hydrophobic core (14)
- C-region: Carboxy-terminal (29)
- Ac-region: Post-cleavage (25)
- SP-region: Global properties (24)
- Cleavage sites: Position-specific (40)
Model Configuration:
- Algorithm: Random Forest Regressor
- Trees: 75
- Max depth: 25
- Feature sampling: All features per split
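
For orientation, here is a minimal scikit-learn sketch matching the configuration above (75 trees, maximum depth 25, all features considered at each split). It is not the repository's `grasso_reproduction_tool.py` implementation, and `X_train`/`y_train` are placeholder arrays.

```python
# Minimal sketch of the Random Forest configuration described above.
# X_train / y_train are placeholders for the 156 physicochemical features
# and the measured secretion-efficiency values.
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(
    n_estimators=75,    # Trees: 75
    max_depth=25,       # Max depth: 25
    max_features=None,  # all features evaluated at every split
    random_state=0,     # assumed for reproducibility; not specified above
    n_jobs=-1,          # use all available cores
)
rf.fit(X_train, y_train)
```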
Neural network architecture options:
- Single layer: 256 units
- Two layer: 512 → 256 units
- Three layer: 512 → 256 → 128 units
Training Configuration:
- Loss functions: Categorical crossentropy, Focal loss
- Label encoding: Hard/soft targets
- Regularization: Dropout (0.3), L2 (0.001)
- Optimization: Adam (lr=0.0005)
Embeddings:
- ESM2-650M: 1280 dimensions
- ESM2-3B: 2560 dimensions
- Ginkgo-AA0: 1280 dimensions
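
For orientation, here is a minimal Keras sketch of the two-layer (512 → 256) configuration with the regularization and optimizer settings above. It is not the repository's `classifier.py` implementation; the 1280-dimensional input assumes ESM2-650M or Ginkgo-AA0 embeddings, and the 10 output classes assume `n_bins=10`.

```python
# Minimal sketch of the two-layer classifier configuration described above
# (not the repository's classifier.py). Input: 1280-dim embedding vectors;
# output: 10 secretion-efficiency bins (n_bins=10).
import tensorflow as tf
from tensorflow.keras import layers, regularizers

model = tf.keras.Sequential([
    tf.keras.Input(shape=(1280,)),  # ESM2-650M or Ginkgo-AA0 embedding
    layers.Dense(512, activation="relu",
                 kernel_regularizer=regularizers.l2(0.001)),
    layers.Dropout(0.3),
    layers.Dense(256, activation="relu",
                 kernel_regularizer=regularizers.l2(0.001)),
    layers.Dropout(0.3),
    layers.Dense(10, activation="softmax"),
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.0005),
    # label_smoothing > 0 is one way to approximate the "soft target" option
    loss=tf.keras.losses.CategoricalCrossentropy(label_smoothing=0.0),
    metrics=["accuracy"],
)
model.summary()
```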
| Model | Test MSE | Test R² | MAE |
|---|---|---|---|
| Random Forest | 1.191 | 0.750 | 0.850 |
| Neural Network | 0.950 | 0.800 | 0.744 |
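
The table values can be recomputed from held-out predictions with scikit-learn; `y_test` and `y_pred` below are placeholders for the measured and predicted efficiencies.

```python
# Recompute the reported metrics from held-out predictions (placeholder arrays).
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
print(f"MSE={mse:.3f}  R2={r2:.3f}  MAE={mae:.3f}")
```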
Top Random Forest features by importance:

- `gravy_SP` (0.159) - Hydrophobicity index
- `A_C` (0.062) - Alanine in C-region
- `-1_A` (0.043) - Alanine at cleavage position -1
- `flexibility_N` (0.036) - N-region flexibility
- `pI_C` (0.033) - C-region isoelectric point
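
If these rankings are the Random Forest's built-in (impurity-based) importances, they can be reproduced roughly as follows, assuming a fitted `RandomForestRegressor` named `rf` and a `feature_names` list of the 156 descriptor names.

```python
# Rank descriptors by impurity-based importance from a fitted RandomForestRegressor.
# `rf` (fitted model) and `feature_names` (156 descriptor names) are placeholders.
import pandas as pd

importances = pd.Series(rf.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False).head(5))
```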
Output files:

Random Forest:
- `grasso_reproduction_results.png`: 4-panel analysis figure
- Console output: Performance metrics and feature rankings

Neural Network:
- `classifier_results/best_model_*.keras`: Trained models
- `classifier_results/all_results_*.csv`: Training history
- `classifier_results/performance_plot_*.png`: Evaluation plots
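
Saved classifiers can be reloaded for further analysis; the sketch below assumes at least one `best_model_*.keras` file has been written and that results live under `results/classifier_results/` (adjust the path if needed).

```python
# Reload the most recently saved classifier (adjust the path if results are
# written elsewhere); assumes at least one best_model_*.keras file exists.
import glob
import os

import tensorflow as tf

model_paths = glob.glob("results/classifier_results/best_model_*.keras")
latest = max(model_paths, key=os.path.getmtime)
model = tf.keras.models.load_model(latest)
model.summary()
```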
Dependencies (requirements.txt):

# Core dependencies
pandas>=1.5.0
numpy>=1.21.0
scikit-learn>=1.1.0
scipy>=1.9.0
matplotlib>=3.5.0
# Neural network
tensorflow>=2.10.0
Troubleshooting:

- Missing data files: Ensure all CSV and Parquet files are in the correct directories
- Memory errors: Reduce the batch size or use a subset of the data
- Import errors: Verify that all dependencies are installed at the correct versions
# Run comprehensive tests
python quick_run.py
# Select option 3: Run Comprehensive Tests
Citation:

Wadhwa, M. (2025). Signal Peptide Efficiency Prediction: Random Forest and Neural Network Approaches.

Author:

Mehak Wadhwa
Fordham University
Research Mentor: Dr. Joshua Schrier