GP-MoLFormer

arXiv: https://arxiv.org/abs/2405.04912

Code and data for Generative Pre-trained Molecular Language transFormers (GP-MoLFormer) & pair-tuning.

GP-MoLFormer

GP-MoLFormer is a large-scale autoregressive chemical language model intended for molecule generation tasks. GP-MoLFormer employs the same architecture as MoLFormer-XL, including linear attention and rotary position embeddings, but uses decoder-only Transformer blocks trained with a causal language modeling objective. It is trained on up to 1.1B molecules in SMILES representation.
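
As a quick orientation, the following is a minimal sketch of sampling SMILES from the model as an ordinary causal language model through the Hugging Face `transformers` API. The model id, the `trust_remote_code` flag, and the prompt convention are assumptions; see the model card on HF (linked in the Models table below) for the exact usage.

```python
# Minimal sketch: de novo SMILES sampling with a causal LM via Hugging Face transformers.
# The model id, trust_remote_code flag, and BOS-prompt convention are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ibm-research/GP-MoLFormer-Uniq"  # placeholder id; check the HF model card
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
model.eval()

# Start from the beginning-of-sequence token and sample autoregressively.
inputs = tokenizer(tokenizer.bos_token, return_tensors="pt", add_special_tokens=False)
with torch.no_grad():
    out = model.generate(
        **inputs,
        do_sample=True,
        top_k=50,
        max_new_tokens=128,
        num_return_sequences=4,
    )
print(tokenizer.batch_decode(out, skip_special_tokens=True))
```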

GP-MoLFormer was evaluated on de novo generation (at scale), scaffold-constrained decoration, and molecular property optimization tasks. Unconstrained property optimization was performed using a novel parameter-efficient fine-tuning method we call "pair-tuning". Pair-tuning is a soft prompt learning method which uses only ordered pairs of inputs to steer the model's generations in the direction implied by the data.
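
For intuition only, here is a heavily simplified sketch of what a soft-prompt update on one ordered pair could look like: trainable prompt embeddings are prepended to the frozen model's inputs, and the prompt is optimized so the model, conditioned on one molecule of the pair, assigns higher likelihood to the other. This is not the repository's implementation (see scripts/pairtune_training.py for that); the names, prompt length, and loss masking are illustrative assumptions.

```python
# Illustrative sketch of soft-prompt ("pair-tuning"-style) training on ordered pairs.
# The real implementation lives in scripts/pairtune_training.py; names and the exact
# loss here are assumptions, not the repository's API. The base LM stays frozen and
# only the soft prompt parameters receive gradients.
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """A handful of trainable embeddings prepended to the (frozen) LM input."""
    def __init__(self, n_tokens: int, d_model: int):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(n_tokens, d_model) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        batch = input_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, input_embeds], dim=1)

def pair_step(model, embed_layer, soft_prompt, src_ids, tgt_ids):
    """One hypothetical step: condition on the 'worse' molecule of an ordered pair
    and maximize the likelihood of the 'better' one under the soft prompt."""
    # Embed source and target tokens with the frozen embedding layer.
    src_emb = embed_layer(src_ids)
    tgt_emb = embed_layer(tgt_ids)
    inputs = soft_prompt(torch.cat([src_emb, tgt_emb], dim=1))
    logits = model(inputs_embeds=inputs).logits
    # Only the target tokens contribute to the loss; shift by one for next-token prediction.
    n_ctx = soft_prompt.prompt.size(0) + src_ids.size(1)
    tgt_logits = logits[:, n_ctx - 1:-1, :]
    loss = nn.functional.cross_entropy(
        tgt_logits.reshape(-1, tgt_logits.size(-1)), tgt_ids.reshape(-1)
    )
    return loss
```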

Models

| Model | Parameters | Training size (molecules) | Link |
| --- | --- | --- | --- |
| GP-MoLFormer | 46.8M | 1.1B | |
| GP-MoLFormer-Uniq | 46.8M | 650M | Model on HF |

Installation

We recommend using mamba for virtual environment management (although this can be substituted with conda).

mamba env create -f environment.yml

To use the LAMB optimizer, you need to install NVIDIA APEX. Run the following from inside a clone of the APEX repository:

pip install --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./

Optionally, for pre-training, you should also install fast_transformers:

pip install pytorch-fast-transformers==0.4.0

Examples

For unconditional (de novo) generation, run:

python scripts/unconditional_generation.py --num_batches 1 uncond.csv

For conditional (e.g., scaffold-constrained) generation, run:

python scripts/conditional_generation.py c1cccc

For pair-tuning on QED, run:

python scripts/pairtune_training.py qed --lamb

Pre-training

GP-MoLFormer can be pre-trained using HuggingFace. See data/README.md for instructions on downloading the pre-training data.
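
As a rough illustration, a causal-LM pre-training loop with the Hugging Face Trainer could look like the sketch below. The data file name, model/tokenizer id, sequence length, and hyperparameters are placeholders; the repository's own pre-training configuration should be treated as authoritative.

```python
# Minimal sketch of causal-LM pre-training on SMILES with the Hugging Face Trainer.
# File paths, the model/tokenizer id, and hyperparameters are placeholders/assumptions.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM, AutoTokenizer,
    DataCollatorForLanguageModeling, Trainer, TrainingArguments,
)

model_id = "ibm-research/GP-MoLFormer-Uniq"  # or a freshly initialized config
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # needed for batching/collation

# One SMILES string per line (see data/README.md for the actual pre-training data).
dataset = load_dataset("text", data_files={"train": "smiles_train.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=202)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="gp-molformer-pretrain",
        per_device_train_batch_size=64,
        learning_rate=1e-4,
        num_train_epochs=1,
        fp16=True,
    ),
    train_dataset=tokenized["train"],
    # mlm=False yields the causal (next-token) objective with shifted labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```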

Citation

@misc{ross2024gpmolformerfoundationmodelmolecular,
      title={GP-MoLFormer: A Foundation Model For Molecular Generation},
      author={Jerret Ross and Brian Belgodere and Samuel C. Hoffman and Vijil Chenthamarakshan and Youssef Mroueh and Payel Das},
      year={2024},
      eprint={2405.04912},
      archivePrefix={arXiv},
      primaryClass={q-bio.BM},
      url={https://arxiv.org/abs/2405.04912},
}

Disclosure

All content in these repositories including code has been provided by IBM under the associated open source software license and IBM is under no obligation to provide enhancements, updates, or support. IBM developers produced this code as an open source project (not as an IBM product), and IBM makes no assertions as to the level of quality nor security, and will not be maintaining this code going forward.
