# Code for Generalized ECG (gECG) for thiophene polymers paper
Electronic Coarse Graining (ECG) is a machine learning approach capable of predicting quantum mechanical (QM) optoelectronic properties of molecules (and materials) directly from molecular configurations at the coarse-grained (CG) resolution.

This project develops generalized ECG (gECG) models for thiophene polymers and provides the necessary code, datasets, and trained models. More information can be found in our preprint, "Generalized Electronic Coarse Graining for Thiophene Polymers."
## Installation

Clone the repository and set up the required environment:

```bash
git clone https://github.com/TheJacksonLab/gECG_thiophene.git
cd gECG_thiophene/

# Create and activate a Conda environment
conda create --name your_env_name python=3.11
conda activate your_env_name

# Install required Python packages
pip install -r requirements.txt
pip install -e .
```
## Data Generation

The `data_generation/` directory provides the basic code for generating polymer datasets. The workflow is as follows:

1. **Create polymer sequences.**
   - Use `generate_polymer.py` to create polymer sequences with `rdkit`.
   - Modify `lengths_list` in the script to control the degree of polymerization.
2. **(Optional) Sample conformations with molecular dynamics.**
   - Prepare the inputs for LAMMPS with the OPLS-AA force field. This can be done with the LigParGen server, or locally with `generate_lmp.py` (which requires a local BOSS installation).
   - Run MD simulations. Sample inputs are provided in `./data_generation/tp_gen/in.nvt*`.
   - Collect conformations from the MD trajectories and convert them into ORCA input format. Refer to `./data_generation/run_MD.py`.
3. **Perform QM calculations with ORCA.**
   - Sample inputs for ZINDO/S and DFT are provided in `./data_generation/tp_gen/*.inp`.
4. **Collect the ORCA outputs and prepare datasets.**
   - AA resolution: `./data_generation/collect_data*.py`
   - CG resolutions: `./data_generation/CG/CG*.py`
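As a minimal sketch of step 1 (the actual logic lives in `generate_polymer.py`), oligothiophene SMILES strings for a list of chain lengths can be assembled from a 2,5-linked thiophene repeat unit. The function name and the plain string-concatenation approach below are illustrative assumptions, not the repository's implementation, which builds the molecules with `rdkit`:

```python
def oligothiophene_smiles(n):
    """Assemble an illustrative SMILES string for a 2,5-linked
    thiophene n-mer by chaining a repeat unit (n >= 1)."""
    # "c1ccc(s1)" is a thiophene ring with an open valence at an
    # alpha carbon for the next unit; "c1cccs1" terminates the chain.
    units = ["c1ccc(s1)"] * (n - 1) + ["c1cccs1"]
    return "-".join(units)

# Mimic the role of `lengths_list` in generate_polymer.py:
# one sequence per degree of polymerization.
lengths_list = [1, 2, 6]
sequences = [oligothiophene_smiles(n) for n in lengths_list]
for n, smi in zip(lengths_list, sequences):
    print(n, smi)
```

In the repository, `rdkit` would then parse and sanitize these strings into molecule objects; plain string assembly keeps this sketch dependency-free.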
## Model Training and Evaluation

Once the dataset is generated or downloaded (see the Data section below), the gECG model can be trained. The model architecture is based on ComENet.

For a quick test of model training and evaluation:

```bash
cd scripts/
python train_CG3R.py
```

To fully train the gECG models at the various CG resolutions, change the datasets accordingly.

For a sample model inference and evaluation:

```bash
cd scripts/
python evaluate.py
```

The outputs are saved to `output.txt`, and the MAE and R2 are printed.
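The two reported metrics can be reproduced from any set of paired targets and predictions with plain Python; the exact output format of `evaluate.py` is not assumed here, and the numbers below are fabricated for illustration:

```python
def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination R^2."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical orbital-energy targets and predictions (eV)
targets = [-5.1, -4.8, -5.3, -4.9]
preds = [-5.0, -4.9, -5.2, -5.0]
print(f"MAE = {mae(targets, preds):.3f} eV, R2 = {r2(targets, preds):.3f}")
```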
## Fine-Tuning

An example of fine-tuning a gECG model is provided. It shows how to improve the model from ZINDO/S precision to DFT precision with a small DFT dataset:

```bash
cd scripts/
python fine_tuning.py
```

It performs a two-step (partial-then-full) fine-tuning, which in total takes around 5-10 minutes on a laptop CPU.
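The two-step idea can be sketched with PyTorch: first freeze the pretrained (ZINDO/S-level) backbone and train only the output head on the small DFT dataset, then unfreeze everything for a short full fine-tune. The tiny model, the random stand-in data, and the step counts below are placeholders, not what `fine_tuning.py` actually uses:

```python
import torch
import torch.nn as nn

# Placeholder stand-in for a pretrained gECG model: a "backbone"
# that would carry the ZINDO/S-level knowledge, plus a small "head".
model = nn.Sequential(
    nn.Linear(16, 32), nn.ReLU(),   # backbone (pretrained)
    nn.Linear(32, 1),               # head (adapted to DFT labels)
)
backbone, head = model[:2], model[2]

def train(model, params, steps, lr):
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(steps):
        x = torch.randn(8, 16)      # stand-in for CG configurations
        y = torch.randn(8, 1)       # stand-in for small DFT dataset
        loss = nn.functional.mse_loss(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()

# Step 1: partial fine-tuning -- freeze the backbone, train only the head.
for p in backbone.parameters():
    p.requires_grad = False
train(model, head.parameters(), steps=20, lr=1e-3)

# Step 2: full fine-tuning -- unfreeze and train all parameters
# at a lower learning rate.
for p in model.parameters():
    p.requires_grad = True
train(model, model.parameters(), steps=20, lr=1e-4)
```

Freezing the backbone first lets the head adapt to the new label scale without disturbing the pretrained features; the full fine-tune then makes small corrections everywhere.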
## Data

Due to their large size, the datasets need to be downloaded from Zenodo. They include:

- Datasets of CG polymers at various resolutions. Details of the CG representations are provided in the paper.
- Output files directly from the QM calculations performed with ORCA.

To load a dataset:

```python
import torch

dataset = torch.load(dataset_to_load)
```
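As an illustration of what can be extracted from the ORCA output files, the snippet below parses an `ORBITAL ENERGIES` table to recover the HOMO and LUMO energies. The embedded excerpt is fabricated for demonstration, and the parsing is a hedged sketch, since the column layout may differ between ORCA versions:

```python
# Minimal fabricated excerpt in the shape of ORCA's orbital-energy table.
sample_output = """\
ORBITAL ENERGIES
----------------
  NO   OCC          E(Eh)            E(eV)
   0   2.0000      -0.350000        -9.5240
   1   2.0000      -0.200000        -5.4422
   2   0.0000      -0.050000        -1.3606
   3   0.0000       0.100000         2.7211
"""

def homo_lumo_ev(text):
    """Return (HOMO, LUMO) energies in eV from an ORCA-style
    orbital-energy table; the HOMO is the last occupied orbital."""
    homo = lumo = None
    for line in text.splitlines():
        parts = line.split()
        # Data rows have four columns: index, occupation, E(Eh), E(eV).
        if len(parts) == 4 and parts[0].isdigit():
            occ, e_ev = float(parts[1]), float(parts[3])
            if occ > 0.0:
                homo = e_ev
            elif lumo is None:
                lumo = e_ev
    return homo, lumo

print(homo_lumo_ev(sample_output))
```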