Mehak-W/grasso-reproduction-study


Signal Peptide Efficiency Prediction

Overview

This repository contains machine learning approaches for predicting signal peptide secretion efficiency in Bacillus subtilis. It implements two complementary methodologies:

  1. Random Forest Regression: Reproduction of the physicochemical feature-based prediction of Grasso et al. (2023)
  2. Neural Network Classifier: Deep learning approach using pre-computed protein embeddings

Performance Summary:

  • Random Forest: Test MSE 1.191 WA² (R² = 0.750)
  • Neural Network: Test MSE 0.950 WA² (R² = 0.800)

Repository Structure

grasso-reproduction-study/
├── README.md                          # Project documentation
├── requirements.txt                   # Python dependencies
├── quick_run.py                       # Interactive execution interface
│
├── Random Forest Components/
│   ├── grasso_reproduction_tool.py    # Main reproduction analysis
│   ├── 01_grasso_reproduction_complete.py  # Single-file implementation
│   ├── data_verification_tool.py      # Data integrity verification
│   ├── config.py                      # Configuration parameters
│   └── sb2c00328_si_011.csv          # Signal peptide dataset
│
├── Neural Network Components/
│   ├── classifier.py                  # Neural network classifier
│   ├── train_classifier.py            # Training pipeline
│   ├── test_classifier.py             # Model evaluation
│   ├── protein_embeddings_experiment.py  # Embedding data handling
│   └── experiment_framework.py        # Experiment management
│
├── data/                              # Protein embeddings
│   ├── trainAA_*.parquet             # Training embeddings
│   └── testAA_*.parquet              # Test embeddings
│
└── results/                           # Generated outputs
    ├── classifier_results/            # Neural network results
    └── grasso_reproduction_results.png # Random forest visualization

Installation

Prerequisites

  • Python 3.8+
  • 8GB RAM recommended

Setup

# Clone repository
git clone <repository-url>
cd grasso-reproduction-study

# Install dependencies
pip install -r requirements.txt

Data Requirements

Random Forest Analysis:

  • File: sb2c00328_si_011.csv (~14 MB)
  • Content: 11,643 signal peptide variants with 156 physicochemical features

Neural Network Classifier:

  • Files: data/trainAA_*.parquet, data/testAA_*.parquet
  • Content: Pre-computed protein embeddings (ESM2, Ginkgo-AA0)
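A quick pre-flight check for the inputs listed above could look like the following. This is a minimal sketch, not part of the repository: the helper name `find_missing_inputs` is hypothetical, and it assumes the CSV sits in the directory the scripts are run from and the parquet files sit under `data/`.

```python
from pathlib import Path

# File patterns taken from the Data Requirements list above.
# ASSUMPTION: paths are relative to the directory the scripts are run from.
REQUIRED_PATTERNS = {
    "random_forest": ["sb2c00328_si_011.csv"],
    "neural_network": ["data/trainAA_*.parquet", "data/testAA_*.parquet"],
}


def find_missing_inputs(root="."):
    """Return {component: [patterns with no matching file]} under root."""
    root = Path(root)
    missing = {}
    for component, patterns in REQUIRED_PATTERNS.items():
        # A pattern counts as satisfied if at least one file matches it.
        unmatched = [p for p in patterns if not any(root.glob(p))]
        if unmatched:
            missing[component] = unmatched
    return missing


if __name__ == "__main__":
    problems = find_missing_inputs()
    if problems:
        for component, patterns in problems.items():
            print(f"{component}: missing {', '.join(patterns)}")
    else:
        print("All expected input files were found.")
```

An empty result means every pattern matched at least one file; it does not validate the file contents.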

Usage

Quick Start

python quick_run.py

The interactive menu provides access to all functionality:

  • Random Forest reproduction analysis
  • Neural network training/testing
  • System verification and tests

Direct Execution

Random Forest Analysis:

# Complete reproduction
python grasso_reproduction_tool.py

# Data verification
python data_verification_tool.py

Neural Network Classifier:

# Train new model
python train_classifier.py

# Evaluate trained model
python test_classifier.py

Programmatic Usage

# Random Forest
from grasso_reproduction_tool import execute_grasso_reproduction
results = execute_grasso_reproduction()

# Neural Network
from classifier import ProteinClassifierNN
from protein_embeddings_experiment import ProteinEmbeddingsExperiment

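# Load embeddings
# NOTE: exp_manager is not defined in this snippet; create the experiment
# manager object first (presumably via experiment_framework.py).
emb_exp = ProteinEmbeddingsExperiment(exp_manager, "data")
train_emb, test_emb, train_labels, test_labels = emb_exp.load_model_data('ginkgo_aa0')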

# Train classifier
classifier = ProteinClassifierNN(n_bins=10)
model = classifier.train_model(train_emb, train_labels)

Methodologies

Random Forest Approach

Features: 156 physicochemical descriptors

  • N-region: Amino-terminal properties (24)
  • H-region: Hydrophobic core (14)
  • C-region: Carboxy-terminal (29)
  • Ac-region: Post-cleavage (25)
  • SP-region: Global properties (24)
  • Cleavage sites: Position-specific (40)
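The group sizes above add up to the 156 features, which is easy to confirm in code. The sketch below also maps a feature name to its group. It assumes, based only on the feature names quoted elsewhere in this README (`gravy_SP`, `A_C`, `-1_A`, `flexibility_N`, `pI_C`), that region features end in a region suffix and cleavage-site features look like `<position>_<residue>`. The real column names in the CSV may differ, and `feature_group` is a hypothetical helper.

```python
import re

# Feature counts per group, as listed above.
FEATURE_GROUPS = {"N": 24, "H": 14, "C": 29, "Ac": 25, "SP": 24, "cleavage": 40}

# ASSUMPTION: cleavage-site features are named "<signed position>_<residue>",
# e.g. "-1_A"; region features end in "_<region>", e.g. "gravy_SP".
_CLEAVAGE_NAME = re.compile(r"^[+-]?\d+_[A-Z]$")


def feature_group(name):
    """Guess the feature group of a column name; 'unknown' if nothing fits."""
    if _CLEAVAGE_NAME.match(name):
        return "cleavage"
    suffix = name.rsplit("_", 1)[-1]
    return suffix if suffix in FEATURE_GROUPS else "unknown"


total_features = sum(FEATURE_GROUPS.values())  # 24 + 14 + 29 + 25 + 24 + 40
print(total_features)
for example in ["gravy_SP", "A_C", "-1_A", "flexibility_N", "pI_C"]:
    print(example, "->", feature_group(example))
```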

Model Configuration:

  • Algorithm: Random Forest Regressor
  • Trees: 75
  • Max depth: 25
  • Feature sampling: All features per split
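In scikit-learn terms, the configuration above corresponds roughly to the estimator below. This is a sketch under assumptions rather than the repository's exact code: `build_random_forest` is a hypothetical name, `random_state` is not specified in this README, and the demonstration data is synthetic (random numbers with 156 columns), not the signal peptide dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def build_random_forest(random_state=0):
    """Random forest with the hyperparameters listed above."""
    return RandomForestRegressor(
        n_estimators=75,      # Trees: 75
        max_depth=25,         # Max depth: 25
        max_features=1.0,     # consider all features at every split
        random_state=random_state,  # ASSUMPTION: seed chosen for this sketch
    )


# Synthetic demonstration only: 200 samples, 156 features, and a target
# that depends almost entirely on feature 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 156))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

model = build_random_forest().fit(X, y)
top_feature = int(np.argmax(model.feature_importances_))
print("most important synthetic feature index:", top_feature)
```

The `feature_importances_` attribute is the impurity-based importance that scikit-learn exposes after fitting; the values sum to 1 across features.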

Neural Network Approach

Architecture Options:

  • Single layer: 256 units
  • Two layer: 512 → 256 units
  • Three layer: 512 → 256 → 128 units

Training Configuration:

  • Loss functions: Categorical crossentropy, Focal loss
  • Label encoding: Hard/soft targets
  • Regularization: Dropout (0.3), L2 (0.001)
  • Optimization: Adam (lr=0.0005)
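To make the loss and label-encoding options concrete, here is a dependency-free sketch of categorical cross-entropy, focal loss, and hard versus soft targets for a single example. The focusing parameter `gamma`, the Gaussian smoothing used for the soft target, and its width `sigma` are illustrative assumptions; this README does not state which values or smoothing scheme the classifier uses.

```python
import math

EPS = 1e-12  # guards against log(0)


def cross_entropy(probs, target):
    """Categorical cross-entropy for one example (hard or soft target)."""
    return -sum(t * math.log(max(p, EPS)) for p, t in zip(probs, target))


def focal_loss(probs, target, gamma=2.0):
    """Focal loss: cross-entropy terms down-weighted by (1 - p) ** gamma.

    ASSUMPTION: gamma=2.0 is a common default, not a value from this README.
    With gamma=0 this reduces to plain cross-entropy.
    """
    return -sum(
        t * (1.0 - p) ** gamma * math.log(max(p, EPS))
        for p, t in zip(probs, target)
    )


def hard_target(bin_index, n_bins):
    """One-hot encoding of the true bin."""
    return [1.0 if i == bin_index else 0.0 for i in range(n_bins)]


def soft_target(bin_index, n_bins, sigma=1.0):
    """ASSUMPTION: Gaussian-smoothed label centred on the true bin."""
    weights = [math.exp(-0.5 * ((i - bin_index) / sigma) ** 2) for i in range(n_bins)]
    total = sum(weights)
    return [w / total for w in weights]


probs = [0.05, 0.10, 0.60, 0.15, 0.10]  # made-up softmax output over 5 bins
hard = hard_target(2, 5)
soft = soft_target(2, 5)
print("cross-entropy (hard):", cross_entropy(probs, hard))
print("focal loss (hard):   ", focal_loss(probs, hard))
print("cross-entropy (soft):", cross_entropy(probs, soft))
```

Soft targets spread some probability mass onto neighbouring bins, so a prediction that misses by one bin is penalised less than one that misses by several.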

Embeddings:

  • ESM2-650M: 1280 dimensions
  • ESM2-3B: 2560 dimensions
  • Ginkgo-AA0: 1280 dimensions
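The embedding width largely determines model size, which the following arithmetic sketch illustrates. It assumes plain fully connected layers with biases and a 10-way output (matching `n_bins=10` in the usage example); any extra layers in the real classifier, such as normalization, are not counted. `dense_parameter_count` is a hypothetical helper.

```python
# Hidden-layer widths and embedding sizes as listed above.
ARCHITECTURES = {
    "single_layer": [256],
    "two_layer": [512, 256],
    "three_layer": [512, 256, 128],
}
EMBEDDING_DIMS = {"ESM2-650M": 1280, "ESM2-3B": 2560, "Ginkgo-AA0": 1280}
N_BINS = 10  # ASSUMPTION: output classes, taken from n_bins=10 in the usage example


def dense_parameter_count(input_dim, hidden_units, n_outputs):
    """Weights plus biases for a stack of fully connected layers."""
    sizes = [input_dim, *hidden_units, n_outputs]
    return sum(fan_in * fan_out + fan_out for fan_in, fan_out in zip(sizes, sizes[1:]))


for emb_name, dim in EMBEDDING_DIMS.items():
    for arch_name, hidden in ARCHITECTURES.items():
        count = dense_parameter_count(dim, hidden, N_BINS)
        print(f"{emb_name:>10} {arch_name:<12} {count:>9,} parameters")
```

Under these assumptions the first layer dominates: doubling the embedding width from 1280 to 2560 roughly doubles the parameter count of the 512-unit first layer.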

Results

Performance Metrics

Model            Test MSE (WA²)   Test R²   MAE (WA)
Random Forest    1.191            0.750     0.850
Neural Network   0.950            0.800     0.744
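For reference, the three metrics in the table can be computed as in the sketch below, using the standard definitions. The function names are hypothetical and the repository may compute these through scikit-learn instead. Because R² = 1 - MSE / Var(y), each MSE and R² pair also implies a target variance, which gives a simple consistency check on the reported numbers.

```python
def regression_metrics(y_true, y_pred):
    """Return MSE, MAE and R² for two equal-length sequences of numbers."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(r * r for r in residuals) / n
    mae = sum(abs(r) for r in residuals) / n
    variance = sum((t - mean_true) ** 2 for t in y_true) / n
    return {"mse": mse, "mae": mae, "r2": 1.0 - mse / variance}


def implied_target_variance(mse, r2):
    """Rearranged R² definition: Var(y) = MSE / (1 - R²)."""
    return mse / (1.0 - r2)


# Values from the table above.
rf_variance = implied_target_variance(1.191, 0.750)
nn_variance = implied_target_variance(0.950, 0.800)
print(f"Random Forest implies Var(y) = {rf_variance:.3f} WA²")
print(f"Neural Network implies Var(y) = {nn_variance:.3f} WA²")
```

The two implied variances come out close to each other (about 4.76 and 4.75 WA²), which is what one would expect if both models were evaluated on the same or a very similar test set.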

Feature Importance (Random Forest)

  1. gravy_SP (0.159) - Hydrophobicity index
  2. A_C (0.062) - Alanine in C-region
  3. -1_A (0.043) - Alanine at cleavage -1
  4. flexibility_N (0.036) - N-region flexibility
  5. pI_C (0.033) - C-region isoelectric point

Generated Outputs

Random Forest:

  • grasso_reproduction_results.png: 4-panel analysis figure
  • Console output: Performance metrics and feature rankings

Neural Network:

  • classifier_results/best_model_*.keras: Trained models
  • classifier_results/all_results_*.csv: Training history
  • classifier_results/performance_plot_*.png: Evaluation plots

Requirements

# Core dependencies
pandas>=1.5.0
numpy>=1.21.0
scikit-learn>=1.1.0
scipy>=1.9.0
matplotlib>=3.5.0

# Neural network
tensorflow>=2.10.0

Troubleshooting

Common Issues

  1. Missing data files: Ensure the CSV and parquet files are in the directories listed under Data Requirements
  2. Memory errors: Reduce the batch size or train on a subset of the data
  3. Import errors: Verify that all dependencies in requirements.txt are installed at the required versions

Verification

# Run comprehensive tests
python quick_run.py
# Select option 3: Run Comprehensive Tests

Citation

Wadhwa, M. (2025). Signal Peptide Efficiency Prediction: 
Random Forest and Neural Network Approaches.

Author

Mehak Wadhwa
Fordham University
Research Mentor: Dr. Joshua Schrier

About

Computational reproduction of Grasso et al. (2023) signal peptide efficiency prediction methodology
