Foundation models for de novo small-molecule generation.
Designing *de novo* molecules with desired property profiles requires efficient exploration of the vast chemical space ranging from $10^{23}$ to $10^{60}$ possible synthesizable candidates. While various deep generative models have been developed to design small molecules using diverse input representations, Molecular Large Language Models (Mol-LLMs) based on string representations have emerged as a scalable approach capable of exploring billions of molecules. However, there remains limited understanding of how standard language-modeling practices such as textual representations, tokenization strategies, model size, and dataset scale impact molecular generation performance. In this work, we systematically investigate these critical aspects by introducing NovoMolGen, a family of transformer-based foundation models pretrained on 1.5 billion molecules for *de novo* molecule generation. Through extensive empirical analyses, we identify a weak correlation between performance metrics measured during pretraining and actual downstream performance, revealing important distinctions between molecular and general NLP training dynamics. NovoMolGen establishes new state-of-the-art results, substantially outperforming prior Mol-LLMs and specialized generative models in both unconstrained and goal-directed molecular generation tasks, thus providing a robust foundation for advancing efficient and effective molecular modeling strategies.
🤗 Checkpoints, tokenizers, and datasets: https://huggingface.co/collections/chandar-lab/novomolgen
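A minimal way to load a released checkpoint and sample molecules. This is a sketch that assumes the standard 🤗 `transformers` loading path; the repo id below is a placeholder (pick a checkpoint from the collection above), and `notebooks/checkpoint_quickstart.ipynb` shows the supported workflow.

```python
# Sketch only: the repo id, trust_remote_code flag, and sampling settings are
# assumptions, not the repo's documented API; see notebooks/checkpoint_quickstart.ipynb.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "chandar-lab/NovoMolGen-32M"  # placeholder id; pick a checkpoint from the collection
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True).eval()

# Unconditional sampling from the BOS token (assumes the tokenizer defines one).
input_ids = torch.tensor([[tokenizer.bos_token_id]])
with torch.no_grad():
    out = model.generate(
        input_ids,
        do_sample=True,
        max_new_tokens=64,
        num_return_sequences=8,
        pad_token_id=tokenizer.eos_token_id,
    )
smiles = [tokenizer.decode(seq, skip_special_tokens=True) for seq in out]
print(smiles)
```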
Prerequisites: Python 3.10+, CUDA 11.8.

```bash
# 1. Conda env with chemistry tool-chain
conda create -n NovoMol \
    -c conda-forge -c rdkit \
    python=3.10 rdkit openbabel openmm pdbfixer syba xtb xtb-python crest \
    lightgbm=4.3.0 deepsmiles=1.0.1
conda activate NovoMol

# 2. Python deps
pip install -r requirements.txt

# 3. (Optional) Flash-Attention for faster training
bash scripts/install_requirements_mila_cluster.sh
```
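A quick, purely illustrative sanity check that the core chemistry and deep-learning dependencies resolved correctly:

```python
# Illustrative environment check; adjust to your setup.
import rdkit
import torch
from rdkit import Chem

print("RDKit:", rdkit.__version__)
print("PyTorch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
# Round-trip a simple molecule (phenol) to confirm RDKit parsing and canonicalization work.
print(Chem.MolToSmiles(Chem.MolFromSmiles("c1ccccc1O")))
```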
```bash
# Tokenize the pretraining corpus
python src/main.py tokenize_dataset \
    --config_name=ZINC_1B_smiles_atomwise

# Pretrain a model
python src/main.py train \
    --config_name=train_ZINC_270M_atomwise

# Fine-tune for goal-directed generation
python src/main.py finetune \
    --config_name=finetune_PMO_ZINC_1B_atomwise_smiles_llama-32M
```
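These subcommands are dispatched by `src/main.py`, a single Fire-based CLI. A minimal sketch of that dispatch pattern (illustrative only, not the actual contents of `main.py`):

```python
# Minimal sketch of a Fire-based CLI like src/main.py (illustrative only).
import fire


def tokenize_dataset(config_name: str):
    print(f"tokenizing with config {config_name}")


def train(config_name: str):
    print(f"training with config {config_name}")


def finetune(config_name: str):
    print(f"fine-tuning with config {config_name}")


if __name__ == "__main__":
    # `python main.py train --config_name=...` calls train(config_name=...).
    fire.Fire({"tokenize_dataset": tokenize_dataset, "train": train, "finetune": finetune})
```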
Repository layout:

- `configs`: YAML files for configuring datasets, models, trainers, and fine-tuning.
- `scripts`: Bash helpers (environment setup, etc.).
- `notebooks`
  - `checkpoint_quickstart.ipynb` – Quickstart: loads the NovoMolGen-32M checkpoint + tokenizer from 🤗 Hub, samples 3k SMILES, and evaluates the six unconstrained metrics. Renders a tidy dataframe mirroring Table 1.
  - `goal_directed_optimization.ipynb` – Goal-directed demo: runs a short AugmentedHC/REINVENT loop (e.g., `Perindopril_MPO`), logs rewards, and plots the training curve.
- `src`: all Python source code
  - `data_loader`: dataset & tokenization (see `src/data_loader/README.md`).
  - `models`: all model classes and helpers used in NovoMolGen:
    - `modeling_novomolgen.py`: Flash-Attention Llama variant used as the NovoMolGen backbone.
    - `model_with_value_head.py`: adds an MLP value-head for RL tasks (PPO).
    - `modeling_utils.py`: inference helpers (`generate_valid_smiles`, etc.) that post-process raw generations into valid, canonical SMILES.
  - `trainer`: training & RL loops (see `src/trainer/README.md`):
    - `hf_trainer.py`: thin wrapper around `transformers.Trainer` with chemistry-specific callbacks.
    - `reinvent_trainer.py`: REINVENT implementation on top of NovoMolGen (a minimal sketch of the REINVENT-style update appears below this tree).
    - `augment_hc_trainer.py`: Augmented-Hill-Climb trainer.
  - `eval`: goal-directed & unconstrained molecule metrics (see `src/eval/README.md`):
    - `molecule_evaluation.py`: computes validity, uniqueness, novelty, MPO tasks, docking wrappers, etc. (an illustrative metric sketch also appears below).
  - `callbacks`: logging & evaluation callbacks (e.g., WandB).
  - `REINVENT`: verbatim fork of the original https://github.com/MarcusOlivecrona/REINVENT, kept only for result replication; our custom trainer lives in `src/trainer/reinvent_trainer.py`.
  - `main.py`: single Fire-based CLI (`train`, `finetune`, `tokenize_dataset`).
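As referenced above, here is a stripped-down sketch of the REINVENT-style policy update that `reinvent_trainer.py` builds on (the algorithm of Olivecrona et al.); the function and variable names are illustrative, not the trainer's actual API:

```python
# Illustrative REINVENT-style update: pull the agent's likelihood of sampled
# SMILES toward an "augmented" likelihood = prior likelihood + sigma * reward.
# Names and signatures are illustrative, not those of reinvent_trainer.py.
import torch


def reinvent_loss(agent_logp: torch.Tensor,
                  prior_logp: torch.Tensor,
                  scores: torch.Tensor,
                  sigma: float = 60.0) -> torch.Tensor:
    """agent_logp / prior_logp: summed log-likelihoods of each sampled SMILES
    under the agent and the frozen prior; scores: task rewards in [0, 1]."""
    augmented_logp = prior_logp + sigma * scores
    return torch.mean((augmented_logp - agent_logp) ** 2)


# One conceptual optimization step (pseudocode in comments):
# smiles, agent_logp = agent.sample(batch_size)   # sample molecules + log-probs
# prior_logp = prior.log_likelihood(smiles)       # score them under the frozen prior
# scores = oracle(smiles)                         # e.g., a Perindopril_MPO oracle
# loss = reinvent_loss(agent_logp, prior_logp, scores)
# loss.backward(); optimizer.step(); optimizer.zero_grad()
```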
(Jump into each sub-README for API docs, examples, and design notes.)
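Similarly, an illustrative (not the repository's) computation of the basic unconstrained metrics reported by `molecule_evaluation.py`, using plain RDKit:

```python
# Illustrative validity / uniqueness / novelty computation with RDKit;
# molecule_evaluation.py implements the full metric suite (MPO, docking, ...).
from rdkit import Chem


def basic_metrics(generated: list[str], training_set: set[str]) -> dict:
    # Validity: fraction of strings RDKit can parse into a molecule.
    canonical = []
    for smi in generated:
        mol = Chem.MolFromSmiles(smi)
        if mol is not None:
            canonical.append(Chem.MolToSmiles(mol))  # canonical form for fair comparison
    validity = len(canonical) / max(len(generated), 1)

    # Uniqueness: fraction of valid molecules that are distinct.
    unique = set(canonical)
    uniqueness = len(unique) / max(len(canonical), 1)

    # Novelty: fraction of unique molecules absent from the (canonicalized) training set.
    novelty = len(unique - training_set) / max(len(unique), 1)
    return {"validity": validity, "uniqueness": uniqueness, "novelty": novelty}


# Example usage:
# metrics = basic_metrics(sampled_smiles, training_smiles_canonical)
```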
```bibtex
@misc{chitsaz2025novomolgenrethinkingmolecularlanguage,
  title={NovoMolGen: Rethinking Molecular Language Model Pretraining},
  author={Kamran Chitsaz and Roshan Balaji and Quentin Fournier and Nirav Pravinbhai Bhatt and Sarath Chandar},
  year={2025},
  eprint={2508.13408},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2508.13408},
}
```