This repository contains the Python backend for the ChemXploreML desktop application, which implements the machine learning framework described in the paper "Machine Learning Pipeline for Molecular Property Prediction Using ChemXploreML".
Please visit the Documentation to download the desktop application. To access the desktop application source code, please visit the ChemXploreML repository.
ChemXploreML is a machine learning framework for chemical space exploration and molecular property prediction. This Python backend provides the core functionality for:
- Molecular feature generation and representation
- Machine learning model training and evaluation
- Chemical space visualization
- Property prediction and uncertainty estimation
- Model interpretation and explainability
Key features:
- Advanced ML Algorithms: Support for XGBoost, LightGBM, CatBoost, and scikit-learn models
- Chemical Space Analysis: Integration with PCA, UMAP, t-SNE, KernelPCA, PHATE, ISOMAP, LaplacianEigenmaps, TriMap, and FactorAnalysis for dimensionality reduction
- Model Optimization: Hyperparameter tuning with Optuna
- Task Queue: Asynchronous processing with Redis and RQ
- Data Quality: Integration with CleanLab for data quality assessment
- Deep Learning: Support for transformer-based models and custom neural networks (soon to be added)
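The Task Queue entry above presumably means that slow work, such as model training, runs in the background while the server keeps responding. The following is a minimal sketch of that submit-then-collect pattern only. It deliberately uses Python's standard `concurrent.futures` in place of Redis and RQ so that it runs without a Redis server, and `train_model` and `run_in_background` are hypothetical names, not part of cxml_py.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def train_model(n_steps):
    """Hypothetical stand-in for a slow job such as model training."""
    total = 0
    for step in range(n_steps):
        time.sleep(0.01)  # simulate a unit of work
        total += step
    return total


def run_in_background(executor, func, *args):
    """Submit a job and return immediately with a handle to its result."""
    return executor.submit(func, *args)


executor = ThreadPoolExecutor(max_workers=1)
job = run_in_background(executor, train_model, 5)
# The caller is free to do other work here while the job runs.
result = job.result(timeout=5)  # block only when the result is needed
executor.shutdown()
print(result)  # prints 10 (0 + 1 + 2 + 3 + 4)
```

A Redis-backed queue follows the same shape but keeps jobs in a separate process, so they can outlive a single request and be picked up by dedicated workers.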
Prerequisites:
- Rye package manager
Installation:
- Clone the repository:
    git clone https://github.com/aravindhnivas/cxml_py.git
    cd cxml_py
- Make sure Rye is installed, then install the dependencies and activate the virtual environment that Rye creates:
    rye sync
    # Unix/macOS
    source .venv/bin/activate
    # Windows
    .venv\Scripts\activate
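If the commands above fail, it can help to check the prerequisites one at a time. The following is a minimal sketch, not part of cxml_py: the function name `check_setup` is hypothetical, and it assumes the virtual environment lives in `.venv` in the repository root, as the activation commands above imply.

```python
import os
import shutil
from pathlib import Path


def check_setup(repo_dir="."):
    """Return a list of problems found; an empty list means setup looks ready.

    Hypothetical helper for illustration, not part of cxml_py.
    """
    problems = []
    repo = Path(repo_dir)

    # `rye` must be on PATH for `rye sync` to work.
    if shutil.which("rye") is None:
        problems.append("rye executable not found on PATH")

    # pyproject.toml marks the project root.
    if not (repo / "pyproject.toml").is_file():
        problems.append(f"no pyproject.toml in {repo.resolve()}")

    # The activation script sits in Scripts/ on Windows and bin/ elsewhere.
    script_dir = "Scripts" if os.name == "nt" else "bin"
    if not (repo / ".venv" / script_dir).is_dir():
        problems.append("virtual environment missing; run `rye sync` first")

    return problems
```

Calling `check_setup()` from the repository root returns an empty list when everything is in place; otherwise each entry names one thing to fix.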
Usage:
- Start the ChemXploreML desktop application.
- Open the 'Settings' tab and start the server from there.
Project structure:
cxml_py/
├── src/
│   └── cxml_lib/      # Core library code
├── pyproject.toml     # Project configuration
├── requirements.lock  # Locked dependencies
└── README.md          # This file
To contribute:
- Fork the repository
- Create a feature branch
- Commit your changes
- Push to the branch
- Create a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
If you use this software in your research, please cite:
Marimuthu, A. N.; McGuire, B. A. Machine Learning Pipeline for Molecular Property Prediction Using ChemXploreML. J. Chem. Inf. Model. 2025. https://doi.org/10.1021/acs.jcim.5c00516.
For support, please open an issue in the GitHub repository or contact aravindhnivas28@gmail.com.
Acknowledgments:
- Kelvin Lee's UMDA repository for the mol2vec implementation.
- Kelvin Lee's astrochem_embedding repository for the VICGAE implementation.
- The ML pipeline is inspired by K. Lee's paper "Machine Learning of Interstellar Chemical Inventories".
I would like to thank the authors and maintainers of the following libraries for their invaluable contributions:
- NumPy - Array computing and linear algebra
- SciPy - Scientific computing and optimization
- Pandas - Data manipulation and analysis
- Dask - Parallel computing and task scheduling
- Scikit-learn - Machine learning algorithms
- XGBoost - Gradient boosting framework
- LightGBM - Light gradient boosting machine
- CatBoost - Gradient boosting on decision trees
- Optuna - Hyperparameter optimization
- SHAP - Model interpretability
- CleanLab - Data quality and label error detection
- PyTorch - Deep learning framework
- PyTorch Lightning - Deep learning training framework
- Transformers - State-of-the-art NLP
- Matplotlib - Plotting library
- Seaborn - Statistical data visualization
- PHATE - Dimensionality reduction
- UMAP - Uniform Manifold Approximation and Projection
- TriMap - Dimensionality reduction
- Flask - Web framework
- Redis - In-memory data store
- RQ - Task queue
- Flask-SocketIO - WebSocket support
- Rye - Python package manager
- PyInstaller - Application packaging