
NovoMolGen

Foundation models for de novo small-molecule generation.

(Figure: NovoMolGen overview)


Abstract

Designing de novo molecules with desired property profiles requires efficient exploration of the vast chemical space ranging from $10^{23}$ to $10^{60}$ possible synthesizable candidates. While various deep generative models have been developed to design small molecules using diverse input representations, Molecular Large Language Models (Mol-LLMs) based on string representations have emerged as a scalable approach capable of exploring billions of molecules. However, there remains limited understanding regarding how standard language modeling practices such as textual representations, tokenization strategies, model size, and dataset scale impact molecular generation performance. In this work, we systematically investigate these critical aspects by introducing NovoMolGen, a family of transformer-based foundation models pretrained on 1.5 billion molecules for de novo molecule generation. Through extensive empirical analyses, we identify a weak correlation between performance metrics measured during pretraining and actual downstream performance, revealing important distinctions between molecular and general NLP training dynamics. NovoMolGen establishes new state-of-the-art results, substantially outperforming prior Mol-LLMs and specialized generative models in both unconstrained and goal-directed molecular generation tasks, thus providing a robust foundation for advancing efficient and effective molecular modeling strategies.

🤗 Checkpoints, tokenizers, and datasets: https://huggingface.co/collections/chandar-lab/novomolgen
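
A checkpoint from the collection can be loaded with the standard transformers API. The snippet below is a minimal sketch: the repository id is a placeholder (substitute a real checkpoint from the collection above), and trust_remote_code is included in case the checkpoint ships a custom tokenizer or model class.

# Minimal sketch of loading a checkpoint from the Hugging Face Hub.
# The repo id below is a placeholder -- pick an actual checkpoint from the
# collection linked above; trust_remote_code may be unnecessary for your checkpoint.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "chandar-lab/NovoMolGen-32M-SMILES-atomwise"  # placeholder id

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)
model.eval()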


Installation

# 1. Conda env with chemistry tool-chain
conda create -n NovoMol \
  -c conda-forge -c rdkit \
  python=3.10 rdkit openbabel openmm pdbfixer syba xtb xtb-python crest \
  lightgbm=4.3.0 deepsmiles=1.0.1
conda activate NovoMol

# 2. Python deps
pip install -r requirements.txt

# 3. (Optional) Flash-Attention for faster training
bash scripts/install_requirements_mila_cluster.sh

Prerequisites: Python 3.10+, CUDA 11.8
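
A quick sanity check of the environment before training (a minimal sketch; it assumes PyTorch is installed via requirements.txt alongside the conda packages above):

# Verify the chemistry toolchain and GPU visibility.
from rdkit import Chem
import torch

assert Chem.MolFromSmiles("c1ccccc1") is not None, "RDKit failed to parse benzene"
print("CUDA available:", torch.cuda.is_available())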

Quick Start

1 · Tokenise a dataset (one-off)

python src/main.py tokenize_dataset \
    --config_name=ZINC_1B_smiles_atomwise 
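
The config name refers to an atom-wise tokenizer, i.e. SMILES strings are split at atom and bond symbols. The regex below is the commonly used atom-level SMILES pattern and is shown only to illustrate the scheme; it is not necessarily the exact tokenizer produced by this command.

import re

# Commonly used atom-level SMILES pattern (illustrative only).
SMILES_ATOM_PATTERN = (
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|"
    r"\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def atomwise_tokenize(smiles: str) -> list[str]:
    return re.findall(SMILES_ATOM_PATTERN, smiles)

print(atomwise_tokenize("CC(=O)Oc1ccccc1C(=O)O"))
# ['C', 'C', '(', '=', 'O', ')', 'O', 'c', '1', 'c', ...]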

2 · Pre-train

python src/main.py train \
    --config_name=train_ZINC_270M_atomwise
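
Once a checkpoint exists, unconditional sampling follows the usual causal-LM recipe. The snippet below is a sketch that assumes `model` and `tokenizer` were loaded as in the Hub example above; special-token handling, decoding, and generation hyperparameters are illustrative.

import torch
from rdkit import Chem

# Assumes `model` and `tokenizer` were loaded as in the Hub snippet above.
input_ids = torch.tensor([[tokenizer.bos_token_id]])
with torch.no_grad():
    outputs = model.generate(
        input_ids,
        do_sample=True,
        top_k=50,
        max_new_tokens=128,
        num_return_sequences=8,
    )

# Some molecular tokenizers join tokens with spaces; strip them before parsing.
smiles = [tokenizer.decode(o, skip_special_tokens=True).replace(" ", "") for o in outputs]
valid = [s for s in smiles if Chem.MolFromSmiles(s) is not None]
print(f"{len(valid)}/{len(smiles)} valid molecules")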

3 · Fine-tune on a PMO task (REINVENT)

python src/main.py finetune \
    --config_name=finetune_PMO_ZINC_1B_atomwise_smiles_llama-32M 
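
For orientation, the REINVENT objective named in the config pulls the fine-tuned (agent) model toward a prior likelihood augmented by the oracle score. The function below is a minimal sketch of that loss, not the repository's exact fine-tuning loop (sigma, baselines, and experience replay may differ):

import torch

def reinvent_loss(agent_logp: torch.Tensor,
                  prior_logp: torch.Tensor,
                  scores: torch.Tensor,
                  sigma: float = 120.0) -> torch.Tensor:
    """REINVENT-style loss: match the agent's per-molecule log-likelihood to the
    prior's log-likelihood augmented by the scaled oracle score (sigma is illustrative)."""
    augmented_logp = prior_logp + sigma * scores  # higher-scoring molecules get a larger target
    return torch.mean((augmented_logp - agent_logp) ** 2)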

Code Structure

(Jump into each sub-README for API docs, examples, and design notes.)

Citation

@misc{chitsaz2025novomolgenrethinkingmolecularlanguage,
      title={NovoMolGen: Rethinking Molecular Language Model Pretraining}, 
      author={Kamran Chitsaz and Roshan Balaji and Quentin Fournier and Nirav Pravinbhai Bhatt and Sarath Chandar},
      year={2025},
      eprint={2508.13408},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2508.13408}, 
}
