A full-stack bioinformatics application that combines Python-based molecular analysis with a modern React/Next.js web interface for AI-powered drug discovery and compound evaluation.
- Node.js 18+ and npm
- Python 3.8+ with conda/pip
- Git for version control
- Clone the repository
git clone https://github.com/williamhuang3/ml-based-drug-identifier.git
cd ml-based-drug-identifier
- Set up Python environment
# Install Python dependencies
pip install -r requirements.txt
# Or using conda
conda install -c rdkit rdkit -y
conda install -c conda-forge bash
- Set up Node.js environment
# Install frontend dependencies
npm install
- Start the development servers
# Option 1: Start both servers together
npm run dev-full
# Option 2: Start servers separately (two terminals)
npm run flask-dev # Terminal 1: Flask backend
npm run dev # Terminal 2: Next.js frontend
- Open your browser and navigate to http://localhost:3000 (frontend) or http://localhost:5001 (API)
- Start the application:
npm run dev
- Enter a target name (e.g., "Coronavirus", "EGFR") or ChEMBL ID
- Click "Search & Analyze" to run the analysis pipeline
- View results in the organized tabs:
  - Overview: Summary statistics and target information
  - Compounds: Detailed compound data table
  - Statistics: Mann-Whitney U test results
  - Visualizations: Molecular descriptor plots
  - ML Predictions: Random Forest regression results
# Run the Python analysis directly
python main.py
Follow the prompts to:
- Enter a biological target for analysis
- Wait for ChEMBL data retrieval and processing
- Run PaDEL descriptor calculation:
bash padel.sh
- View generated plots and statistical results
- Target Query: Search the ChEMBL database for compounds targeting specific proteins
- Data Preprocessing: Filter and clean compound data, remove duplicates
- Bioactivity Classification: Label compounds based on IC50 thresholds
- Molecular Descriptors: Calculate Lipinski descriptors using RDKit
- Statistical Testing: Perform Mann-Whitney U tests between active/inactive groups
- Visualization: Generate box plots, scatter plots, and distribution charts
- Machine Learning: Train Random Forest model using PaDEL descriptors
- Prediction: Generate IC50 predictions and evaluate model performance
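The statistical testing step can be illustrated with a small, dependency-free sketch of the Mann-Whitney U test. This is not the project's own code: the function name `mann_whitney_u` and the sample values are hypothetical, and the p-value uses a plain normal approximation with no tie or continuity correction. In practice a library routine such as `scipy.stats.mannwhitneyu` is the more robust choice.

```python
import math


def mann_whitney_u(active, inactive):
    """Return (U_active, U_inactive, two-sided p) via a normal approximation.

    Simplified sketch: tied values receive average ranks, but the variance
    has no tie correction and no continuity correction is applied.
    """
    n1, n2 = len(active), len(inactive)
    if n1 == 0 or n2 == 0:
        raise ValueError("both groups must be non-empty")

    # Pool the values, remembering which group each one came from.
    pooled = sorted([(v, 0) for v in active] + [(v, 1) for v in inactive])

    # Assign 1-based ranks, averaging over runs of tied values.
    rank_sum_active = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of ranks i+1 .. j
        for k in range(i, j):
            if pooled[k][1] == 0:
                rank_sum_active += avg_rank
        i = j

    u_active = rank_sum_active - n1 * (n1 + 1) / 2.0
    u_inactive = n1 * n2 - u_active

    # Normal approximation to the null distribution of U.
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u_active - mean_u) / sd_u
    p_two_sided = math.erfc(abs(z) / math.sqrt(2.0))
    return u_active, u_inactive, p_two_sided


# Made-up molecular weights for two bioactivity classes (illustration only).
active_mw = [300.0, 350.0, 420.0]
inactive_mw = [450.0, 480.0, 510.0, 600.0]
u_act, u_inact, p_value = mann_whitney_u(active_mw, inactive_mw)
print(f"U(active)={u_act}, U(inactive)={u_inact}, p={p_value:.4f}")
```

A small p-value suggests that the descriptor is distributed differently in the active and inactive groups. With samples this small, an exact test would be more appropriate than the normal approximation used here.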
IC50 Classification Thresholds:
- Active: IC50 ≤ 1,000 nM
- Intermediate: 1,000 nM < IC50 < 10,000 nM
- Inactive: IC50 ≥ 10,000 nM
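As a minimal sketch of how these thresholds translate into labels, the function below maps an IC50 value in nM to a class. The name `classify_bioactivity` is hypothetical, and the sketch assumes the boundary values 1,000 nM and 10,000 nM belong to the active and inactive classes respectively.

```python
def classify_bioactivity(ic50_nm):
    """Label a compound by its IC50 in nM using the thresholds listed above."""
    # Reject negative values and NaN (NaN fails every comparison).
    if not ic50_nm >= 0:
        raise ValueError("IC50 must be a non-negative number")
    if ic50_nm <= 1000:
        return "active"
    if ic50_nm >= 10000:
        return "inactive"
    return "intermediate"


# Illustrative values only.
for value in (250, 1000, 5000, 10000):
    print(value, classify_bioactivity(value))
```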
Lipinski Descriptors:
- Molecular Weight (MW)
- Lipophilicity (LogP)
- Hydrogen Bond Donors
- Hydrogen Bond Acceptors
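These four descriptors are the inputs to Lipinski's Rule of Five (MW ≤ 500 Da, LogP ≤ 5, at most 5 hydrogen bond donors, at most 10 hydrogen bond acceptors). The sketch below checks a set of precomputed values against those limits. The source does not say whether the project applies this rule, so treat it as an illustration. The class and function names are hypothetical and the descriptor values are made up. In the pipeline, the values would come from RDKit, for example `Descriptors.MolWt`, `Descriptors.MolLogP`, `Lipinski.NumHDonors`, and `Lipinski.NumHAcceptors`.

```python
from dataclasses import dataclass


@dataclass
class LipinskiDescriptors:
    mol_weight: float   # molecular weight in daltons
    logp: float         # octanol-water partition coefficient (log scale)
    h_donors: int       # number of hydrogen bond donors
    h_acceptors: int    # number of hydrogen bond acceptors


def rule_of_five_violations(d):
    """Return the list of Rule of Five limits that the descriptors exceed."""
    violations = []
    if d.mol_weight > 500:
        violations.append("MW > 500")
    if d.logp > 5:
        violations.append("LogP > 5")
    if d.h_donors > 5:
        violations.append("HBD > 5")
    if d.h_acceptors > 10:
        violations.append("HBA > 10")
    return violations


# Made-up descriptor values for a hypothetical compound.
compound = LipinskiDescriptors(mol_weight=612.7, logp=5.8, h_donors=3, h_acceptors=9)
print(rule_of_five_violations(compound))
```

A common reading of the rule treats a compound with at most one violation as likely to be orally bioavailable.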
Model Performance Metrics:
- R² Score (coefficient of determination)
- RMSE (Root Mean Square Error)
- MAE (Mean Absolute Error)
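For reference, the three metrics can be computed from their definitions as in the sketch below. The function name `regression_metrics` and the sample numbers are hypothetical. A real pipeline would more likely call `r2_score`, `mean_squared_error`, and `mean_absolute_error` from `sklearn.metrics`.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (r2, rmse, mae) for paired observed and predicted values."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]

    mae = sum(abs(r) for r in residuals) / n
    ss_res = sum(r * r for r in residuals)
    rmse = math.sqrt(ss_res / n)

    # R² compares the residual error with the variance of the observed values.
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return r2, rmse, mae


# Made-up observed and predicted values (illustration only).
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]
r2, rmse, mae = regression_metrics(observed, predicted)
print(f"R2={r2:.3f}  RMSE={rmse:.3f}  MAE={mae:.3f}")
```

RMSE penalizes large errors more heavily than MAE, so a wide gap between the two points to a few badly predicted compounds.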
This project is licensed under the MIT License - see the LICENSE file for details.
- William Huang - Project Creator
- Data Professor (YouTube) - Inspiration and tutorials
- ChEMBL Database - Compound and bioactivity data