GP-MoLFormer

arXiv: https://arxiv.org/abs/2405.04912

Code and data for Generative Pre-trained Molecular Language transFormers (GP-MoLFormer) & pair-tuning.

GP-MoLFormer

GP-MoLFormer is a large-scale autoregressive chemical language model intended for molecule generation tasks. GP-MoLFormer employs the same architecture as MoLFormer-XL, including linear attention and rotary position embeddings, but uses decoder-only Transformer blocks trained with a causal language modeling objective. It is trained on up to 1.1B molecules in SMILES representation.
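
As a quick orientation, the following is a minimal sketch of sampling SMILES from the model as an ordinary causal language model through the Hugging Face `transformers` API. The model id, the `trust_remote_code` flag, and the prompt convention are assumptions; see the model card on HF (linked in the Models table below) for the exact usage.

```python
# Minimal sketch: de novo SMILES sampling with a causal LM via Hugging Face transformers.
# The model id, trust_remote_code flag, and BOS-prompt convention are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ibm-research/GP-MoLFormer-Uniq"  # placeholder id; check the HF model card
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
model.eval()

# Start from the beginning-of-sequence token and sample autoregressively.
inputs = tokenizer(tokenizer.bos_token, return_tensors="pt", add_special_tokens=False)
with torch.no_grad():
    out = model.generate(
        **inputs,
        do_sample=True,
        top_k=50,
        max_new_tokens=128,
        num_return_sequences=4,
    )
print(tokenizer.batch_decode(out, skip_special_tokens=True))
```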

GP-MoLFormer was evaluated on de novo generation (at scale), scaffold-constrained decoration, and molecular property optimization tasks. Unconstrained property optimization was performed using a novel parameter-efficient fine-tuning method we call "pair-tuning". Pair-tuning is a soft prompt learning method which uses only ordered pairs of inputs to steer the model's generations in the direction implied by the data.
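
For intuition only, here is a heavily simplified sketch of what a soft-prompt update on one ordered pair could look like: trainable prompt embeddings are prepended to the frozen model's inputs, and the prompt is optimized so the model, conditioned on one molecule of the pair, assigns higher likelihood to the other. This is not the repository's implementation (see scripts/pairtune_training.py for that); the names, prompt length, and loss masking are illustrative assumptions.

```python
# Illustrative sketch of soft-prompt ("pair-tuning"-style) training on ordered pairs.
# The real implementation lives in scripts/pairtune_training.py; names and the exact
# loss here are assumptions, not the repository's API. The base LM stays frozen and
# only the soft prompt parameters receive gradients.
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """A handful of trainable embeddings prepended to the (frozen) LM input."""
    def __init__(self, n_tokens: int, d_model: int):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(n_tokens, d_model) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        batch = input_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, input_embeds], dim=1)

def pair_step(model, embed_layer, soft_prompt, src_ids, tgt_ids):
    """One hypothetical step: condition on the 'worse' molecule of an ordered pair
    and maximize the likelihood of the 'better' one under the soft prompt."""
    # Embed source and target tokens with the frozen embedding layer.
    src_emb = embed_layer(src_ids)
    tgt_emb = embed_layer(tgt_ids)
    inputs = soft_prompt(torch.cat([src_emb, tgt_emb], dim=1))
    logits = model(inputs_embeds=inputs).logits
    # Only the target tokens contribute to the loss; shift by one for next-token prediction.
    n_ctx = soft_prompt.prompt.size(0) + src_ids.size(1)
    tgt_logits = logits[:, n_ctx - 1:-1, :]
    loss = nn.functional.cross_entropy(
        tgt_logits.reshape(-1, tgt_logits.size(-1)), tgt_ids.reshape(-1)
    )
    return loss
```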

Models

| Model | Parameters | Training size (molecules) | Link |
| --- | --- | --- | --- |
| GP-MoLFormer | 46.8M | 1.1B | |
| GP-MoLFormer-Uniq | 46.8M | 650M | Model on HF |

Installation

We recommend using mamba for virtual environment management (although this can be substituted with conda).

mamba env create -f environment.yml

To use the LAMB optimizer, you need to install NVIDIA APEX. Run the following from inside a clone of the APEX repository:

pip install --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./

Optionally, for pre-training, you should also install fast_transformers:

pip install pytorch-fast-transformers==0.4.0

Examples

For unconditional (de novo) generation, run:

python scripts/unconditional_generation.py --num_batches 1 uncond.csv

For conditional (e.g., scaffold-constrained) generation, run:

python scripts/conditional_generation.py c1cccc

For pair-tuning on QED, run:

python scripts/pairtune_training.py qed --lamb

Pre-training

GP-MoLFormer can be pre-trained using HuggingFace. See data/README.md for instructions on downloading the pre-training data.
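
As a rough illustration, a causal-LM pre-training loop with the Hugging Face Trainer could look like the sketch below. The data file name, model/tokenizer id, sequence length, and hyperparameters are placeholders; the repository's own pre-training configuration should be treated as authoritative.

```python
# Minimal sketch of causal-LM pre-training on SMILES with the Hugging Face Trainer.
# File paths, the model/tokenizer id, and hyperparameters are placeholders/assumptions.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM, AutoTokenizer,
    DataCollatorForLanguageModeling, Trainer, TrainingArguments,
)

model_id = "ibm-research/GP-MoLFormer-Uniq"  # or a freshly initialized config
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # needed for batching/collation

# One SMILES string per line (see data/README.md for the actual pre-training data).
dataset = load_dataset("text", data_files={"train": "smiles_train.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=202)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="gp-molformer-pretrain",
        per_device_train_batch_size=64,
        learning_rate=1e-4,
        num_train_epochs=1,
        fp16=True,
    ),
    train_dataset=tokenized["train"],
    # mlm=False yields the causal (next-token) objective with shifted labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```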

Citation

@misc{ross2024gpmolformerfoundationmodelmolecular,
      title={GP-MoLFormer: A Foundation Model For Molecular Generation},
      author={Jerret Ross and Brian Belgodere and Samuel C. Hoffman and Vijil Chenthamarakshan and Youssef Mroueh and Payel Das},
      year={2024},
      eprint={2405.04912},
      archivePrefix={arXiv},
      primaryClass={q-bio.BM},
      url={https://arxiv.org/abs/2405.04912},
}

Disclosure

All content in these repositories including code has been provided by IBM under the associated open source software license and IBM is under no obligation to provide enhancements, updates, or support. IBM developers produced this code as an open source project (not as an IBM product), and IBM makes no assertions as to the level of quality nor security, and will not be maintaining this code going forward.
