A machine learning model to predict which chemical compounds can fight HIV effectively, helping researchers focus on the most promising candidates.
This project uses QSAR (Quantitative Structure-Activity Relationship) modeling to predict HIV drug compound efficacy. It helps pharmaceutical researchers identify promising compounds before expensive lab testing.
Key Benefits:
- Reduce screening time from months to hours
- Focus resources on high-probability compounds
- Improve success rates in drug discovery
- Complete ML pipeline from data preprocessing to deployment
- REST API for real-time predictions
- Model monitoring and performance tracking
- Batch processing for large compound libraries
- Docker containerization for easy deployment
Source: NCI AIDS Antiviral Screen Data
- 40,000+ HIV-tested compounds
- Activity classes: CA (Confirmed Active), CM (Confirmed Moderately Active), CI (Confirmed Inactive)
- EC50/IC50 measurements
- Molecular structure data
- Language: Python 3.8+
- ML Libraries: scikit-learn, XGBoost, RDKit
- API: FastAPI
- Database: PostgreSQL
- MLOps: MLflow
- Deployment: Docker
| Metric | Value |
|---|---|
| Accuracy | 87.3% |
| F1-Score | 0.84 |
| Cohen's Kappa | 0.79 |
| AUC-ROC | 0.91 |
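For readers who want to see how the label-based metrics in this table are defined, here is a minimal standard-library sketch. It is not the project's evaluation code: the function name `classification_metrics` is hypothetical, and macro averaging for F1 is an assumption, since the README does not say which averaging was used. AUC-ROC is left out because it needs predicted scores rather than hard labels.

```python
from collections import Counter


def classification_metrics(y_true, y_pred):
    """Return accuracy, macro-averaged F1, and Cohen's kappa for two label lists."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and the same length")

    n = len(y_true)
    labels = sorted(set(y_true) | set(y_pred))
    pairs = Counter(zip(y_true, y_pred))  # (true, predicted) -> count
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)

    # Accuracy: fraction of compounds whose predicted class matches the true class.
    accuracy = sum(pairs[(c, c)] for c in labels) / n

    # Per-class F1 = 2*TP / (2*TP + FP + FN); the denominator equals
    # (number of true c) + (number of predicted c).
    f1_scores = []
    for c in labels:
        denom = true_counts[c] + pred_counts[c]
        f1_scores.append(2 * pairs[(c, c)] / denom if denom else 0.0)
    macro_f1 = sum(f1_scores) / len(f1_scores)

    # Cohen's kappa: agreement corrected for the agreement expected by chance.
    expected = sum(true_counts[c] * pred_counts[c] for c in labels) / (n * n)
    kappa = (accuracy - expected) / (1 - expected) if expected < 1 else float("nan")

    return {"accuracy": accuracy, "macro_f1": macro_f1, "kappa": kappa}


if __name__ == "__main__":
    # Toy labels for illustration only; these are not results from the project.
    truth = ["CA", "CA", "CM", "CI", "CI", "CI"]
    preds = ["CA", "CM", "CM", "CI", "CI", "CA"]
    print(classification_metrics(truth, preds))
```

Kappa is worth reporting next to accuracy on a screening set like this one, because a model that always predicts the majority class can score high accuracy while its kappa stays near zero.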
# Clone repository
git clone https://github.com/pari1jay/Prediction-Model-HC.git
cd Prediction-Model-HC
# Create virtual environment
python -m venv venv
source venv/bin/activate # Linux/Mac
# venv\Scripts\activate # Windows
# Install dependencies
pip install -r requirements.txt
# Development mode
python src/main.py
# Production with Docker
docker-compose up -d
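import requests

# Predict compound activity for a single SMILES string (ibuprofen here)
response = requests.post(
    "http://localhost:8000/predict",
    json={"smiles": "CC(C)CC1=CC=C(C=C1)C(C)C(=O)O"},
    timeout=30,
)
response.raise_for_status()  # fail loudly on HTTP errors instead of parsing an error body
print(response.json())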
from src.training.pipeline import TrainingPipeline
pipeline = TrainingPipeline()
pipeline.load_data("data/hiv_compounds.csv")
pipeline.preprocess()
pipeline.train()
pipeline.evaluate()
from src.prediction.predictor import CompoundPredictor
predictor = CompoundPredictor.load("models/best_model.pkl")
result = predictor.predict_smiles("CCO")
print(f"Activity: {result['activity']}, Confidence: {result['confidence']:.3f}")
python scripts/batch_predict.py --input compounds.csv --output predictions.csv
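The contents of `scripts/batch_predict.py` are not shown in this README, so the following is only a sketch of how chunked batch prediction over a large CSV could look. It assumes the input file has a `smiles` column (mirroring the `smiles` key used by the API) and uses a hypothetical placeholder, `predict_batch`, where the trained model would be called.

```python
import csv
from itertools import islice


def predict_batch(smiles_list):
    """Hypothetical placeholder: a real script would call the trained model here."""
    return [{"activity": "unknown", "confidence": 0.0} for _ in smiles_list]


def batch_predict(input_path, output_path, chunk_size=1000, predict_fn=predict_batch):
    """Stream a CSV with a 'smiles' column through predict_fn in fixed-size chunks.

    Returns the number of compounds written to output_path.
    """
    total = 0
    with open(input_path, newline="") as src, open(output_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=["smiles", "activity", "confidence"])
        writer.writeheader()
        while True:
            # Read at most chunk_size rows so memory use stays bounded.
            chunk = list(islice(reader, chunk_size))
            if not chunk:
                break
            smiles = [row["smiles"] for row in chunk]
            for s, result in zip(smiles, predict_fn(smiles)):
                writer.writerow(
                    {
                        "smiles": s,
                        "activity": result["activity"],
                        "confidence": result["confidence"],
                    }
                )
            total += len(chunk)
    return total
```

Processing in chunks keeps memory flat for libraries with tens of thousands of compounds and lets the model score many molecules per call.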
Prediction-Model-HC/
├── src/
│ ├── data/ # Data processing
│ ├── features/ # Feature engineering
│ ├── models/ # ML models
│ ├── training/ # Training pipeline
│ ├── prediction/ # Prediction service
│ └── api/ # REST API
├── data/ # Datasets
├── models/ # Trained models
├── scripts/ # Utility scripts
├── tests/ # Test files
└── docs/ # Documentation
- Data Integration: Merge screening results, EC50/IC50 values, and molecular structures
- Quality Control: Handle duplicates, conflicts, and missing data
- Feature Engineering: Calculate molecular descriptors and fingerprints
- Model Training: Train and validate multiple ML models
- Evaluation: Assess model performance using relevant metrics
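As an illustration of the quality-control step above, here is a minimal sketch of one way to handle duplicates, conflicts, and missing labels. It is an assumption about the approach, not the project's code: `resolve_duplicates` is a hypothetical helper, and dropping conflicting compounds (rather than, say, keeping the most frequent label) is one possible policy among several.

```python
from collections import defaultdict


def resolve_duplicates(records):
    """Collapse repeated compound IDs to one label and set conflicting ones aside.

    records: iterable of (compound_id, activity_class) pairs; the class may be
    None or "" when the screening result is missing.
    Returns (clean, conflicts), where clean maps ID -> class and conflicts maps
    ID -> sorted list of the disagreeing classes.
    """
    seen = defaultdict(set)
    for compound_id, activity in records:
        if not activity:
            continue  # missing label: skip the row
        seen[compound_id].add(activity)

    clean, conflicts = {}, {}
    for compound_id, classes in seen.items():
        if len(classes) == 1:
            clean[compound_id] = next(iter(classes))  # exact duplicates agree
        else:
            conflicts[compound_id] = sorted(classes)  # disagreeing labels
    return clean, conflicts


if __name__ == "__main__":
    # Made-up compound IDs for illustration only.
    rows = [("1", "CA"), ("1", "CA"), ("2", "CI"), ("2", "CM"), ("3", None), ("4", "CI")]
    clean, conflicts = resolve_duplicates(rows)
    print(clean)
    print(conflicts)
```

Returning the conflicts separately, instead of silently discarding them, makes it possible to review them by hand or apply a different resolution rule later.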
- Fork the repository
- Create a feature branch (`git checkout -b feature/new-feature`)
- Commit your changes (`git commit -m 'Add new feature'`)
- Push to the branch (`git push origin feature/new-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- National Cancer Institute for the AIDS Antiviral Screen Data
- RDKit community for cheminformatics tools
- Open source contributors
Pari Jay - GitHub
Project Link: https://github.com/pari1jay/Prediction-Model-HC