This repository contains the code for VitroBERT, a BERT-based model that incorporates biological and chemical data during the pretraining stage.
- Installation
- Data Preparation
- Model Architecture
- Training the Model
- Reproducing Results
- Making Predictions
- Results
- Citation
- Clone the repository:
git clone https://github.com/aidd-msca/VitroBERT.git
cd VitroBERT
- Create and activate a conda environment:
conda create -y -q -n VitroBERT -c rdkit rdkit=2019.03.1.0 python=3.7.3
conda activate VitroBERT
If you are on Aalto University's Triton cluster:
module load mamba  # specific to Aalto University (Triton)
source activate VitroBERT
- Install dependencies:
pip install -e .
- Download the pretrained MolBERT model from here and store it locally
The model requires two types of data:
- Pretraining data: In-vitro data with binary labels
- Fine-tuning data: Preclinical and Clinical data with binary labels
Download the downstream and pretraining data from here and place them in the data/ directory.
The input data should have the following structure:
{
'SMILES': [...], # List of SMILES strings
'target1': [...], # Binary labels (0 or 1)
'target2': [...],
# ... additional properties
}
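For reference, the sketch below (not part of the repository) shows one way to assemble a toy table with this structure in pandas; the file name and path are placeholders.

```python
# Toy example of the expected input structure: one SMILES column plus one or
# more binary target columns. File name and path are placeholders.
import pandas as pd

df = pd.DataFrame({
    "SMILES": ["CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O"],
    "target1": [0, 1, 1],
    "target2": [1, 0, 1],
})
df.to_parquet("data/rawdata/example_input.parquet", index=False)
```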
This step is performed to:
- Filter assays and molecules
- Filter incompatible molecules (metals, salts, and molecules exceeding the 128-length limit); see the sketch after this list
- Split the data into training and validation sets
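For illustration only, the sketch below gives a rough version of these filters using RDKit; it is not the repository's featurizer.py, and the exact rules (whether the 128 limit counts characters or tokens, and which elements count as metals) are assumptions.

```python
# Rough illustration of the filtering criteria above (not the repository's
# featurizer.py). Assumes the length limit applies to the SMILES string and
# that any element outside a small organic subset counts as a metal.
from rdkit import Chem

ORGANIC_SUBSET = {"H", "B", "C", "N", "O", "F", "Si", "P", "S", "Cl", "Se", "Br", "I"}

def keep_molecule(smiles, max_len=128):
    if "." in smiles:                      # salt / mixture
        return False
    if len(smiles) > max_len:              # longer than the 128 limit
        return False
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:                        # unparsable SMILES
        return False
    return all(atom.GetSymbol() in ORGANIC_SUBSET for atom in mol.GetAtoms())
```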
# Run the complete preprocessing pipeline
sbatch scripts/preprocess_invitro_data.sh \
/path/to/data \
/path/to/conda/env \
/path/to/pretrained/MolBERT/weights
# 1. Normalize SMILES, filter assays and remove downstream molecules
python scripts/preprocess_invitro_data.py \
--invitro_input_path /path/to/input/chembl20.parquet \
--invivo_input_path /path/to/input/TG_GATES_SMILES.csv \
--output_path /path/to/output/pretraining_data/invitro_selected_assays.parquet \
--invitro_smiles_column smiles \
--invivo_smiles_column SMILES \
--min_pos_neg_per_assay 10 \
--save_plots \
--plot_path /path/to/output/pretraining_data/distribution_plots.png
# 2. Filter metals, salts, and molecules > 128, and compute MolBERT features (Baseline)
python scripts/featurizer.py \
--input_path /path/to/output/pretraining_data/invitro_selected_assays.parquet \
--output_dir /path/to/output/pretraining_data \
--pretrained_MolBERT_weights /path/to/pretrained/weights
# 3. Split invitro data into train and validation sets
python scripts/split_data.py \
--input_path /path/to/output/pretraining_data/invitro_filtered.pkl \
--output_dir /path/to/output/pretraining_data \
--split_type Random \
--test_size 0.05
# Alternative split types available:
# --split_type Stratified # Maintains label distribution across splits
# --split_type Scaffold # Groups molecules by Murcko scaffolds (recommended for chemical data)
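To make the split options concrete, here is a simplified scaffold-split sketch using RDKit Murcko scaffolds; it is an illustration, not the repository's split_data.py, and the SMILES column name is an assumption.

```python
# Simplified scaffold split illustration (not the repository's split_data.py):
# molecules sharing a Murcko scaffold are kept in the same split.
from collections import defaultdict
import pandas as pd
from rdkit.Chem.Scaffolds import MurckoScaffold

def scaffold_split(df, smiles_col="SMILES", test_size=0.05):
    groups = defaultdict(list)
    for idx, smi in zip(df.index, df[smiles_col]):
        groups[MurckoScaffold.MurckoScaffoldSmiles(smiles=smi)].append(idx)

    # Fill the training set with whole scaffold groups (largest first);
    # whatever is left over becomes the validation set.
    n_train_target = int(len(df) * (1.0 - test_size))
    train_idx, val_idx = [], []
    for idx_list in sorted(groups.values(), key=len, reverse=True):
        if len(train_idx) + len(idx_list) <= n_train_target:
            train_idx.extend(idx_list)
        else:
            val_idx.extend(idx_list)
    return df.loc[train_idx], df.loc[val_idx]
```

A random or stratified split would typically be done with sklearn.model_selection.train_test_split instead.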
The pipeline will generate the following files:
- invitro_selected_assays.parquet: Preprocessed invitro data
- invitro_filtered.pkl: Filtered data with MolBERT features
- invitro_train.pkl: Training set
- invitro_val.pkl: Validation set
- split_ratio_Random.csv: Statistics about the data split
- distribution_plots.png: Distribution plots of the data
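Assuming the .pkl files are pickled pandas DataFrames (an assumption; check the pipeline output if unsure), the splits can be inspected like this:

```python
# Quick sanity check of the generated splits; assumes pickled pandas DataFrames.
import pandas as pd

train = pd.read_pickle("data/pretraining_data/invitro_train.pkl")
val = pd.read_pickle("data/pretraining_data/invitro_val.pkl")
print("train:", train.shape, "val:", val.shape)
print(train.columns[:10].tolist())
```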
Place your prepared data files in the data/ directory:
data/
├── rawdata
│   ├── TG_GATES_SMILES.csv
│   └── chembl20.parquet
├── pretraining_data
│   ├── invitro_train.pkl
│   ├── invitro_val.pkl
│   └── invitro_pos_weight_distribution.csv
└── downstream_data
For detailed instructions on preprocessing the downstream data (TG-GATES histopathology and biochemistry data), see the comprehensive guide in src/datasets/README.md.
The model consists of three main components:
- BERT encoder for molecular representation
- Masked language modeling head for pre-training
- Task-specific heads for:
- ADME property prediction
- Physicochemical property prediction
Architecture details:
- BERT output dimension: 768
- Maximum sequence length: 128
- Hidden layer size: 2048
- Number of attention heads: 12
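To make these numbers concrete, the sketch below shows a plausible task head on top of the 768-dimensional BERT output with a 2048-unit hidden layer; it is only an illustration of the listed dimensions, not the actual VitroBERT module.

```python
# Illustration of the dimensions listed above (not the actual VitroBERT code):
# a pooled 768-d BERT embedding feeding a 2048-unit hidden layer and a
# task-specific output layer with one logit per binary endpoint.
import torch
import torch.nn as nn

class TaskHead(nn.Module):
    def __init__(self, bert_dim=768, hidden_dim=2048, n_tasks=1, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(bert_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, n_tasks),
        )

    def forward(self, pooled_bert_output):      # (batch, 768)
        return self.net(pooled_bert_output)     # (batch, n_tasks) logits

head = TaskHead(n_tasks=3)
logits = head(torch.randn(8, 768))              # e.g. a batch of 8 molecules
```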
Use the following script to pretrain VitroBERT on the in-vitro data:
sbatch scripts/BERT_invitro_ADME_pretraining.sh \
/path/to/invitro_data \
/path/to/conda/env \
/scripts/config/BERT_init_masking_physchem_invitro_head.yaml \
/molbert_100epochs
After pretraining, train MLP heads for downstream tasks (TG-GATES data) using the pretrained BERT features.
First, perform hyperparameter search using cross-validation to find optimal parameters:
sbatch scripts/MLP_with_BERT_hp_search.sh
This script runs a comprehensive hyperparameter search across:
- Alpha values: [0.0, 0.25, 0.5, 0.75, 1.0] (focal loss parameter)
- Gamma values: [1.0, 2.0, 3.0] (focal loss parameter; see the focal loss sketch after this list)
- L2 regularization: [0.01, 0.05, 0.1, 0.25, 0.5, 1.0]
- Dropout rates: [0.5]
- All split types: Structure, ATC, Time, RandomPick
- Different pretrained epochs: init, 0, 4, 9
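Alpha and gamma here refer to the standard binary focal loss parameters; a minimal sketch of that loss is given below for reference (the repository's implementation may differ).

```python
# Minimal binary focal loss sketch (standard formulation; the repository's
# implementation may differ). alpha weights the positive class and gamma
# down-weights already well-classified examples.
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = targets * p + (1 - targets) * (1 - p)              # prob. of true class
    alpha_t = targets * alpha + (1 - targets) * (1 - alpha)  # class weighting
    return (alpha_t * (1.0 - p_t) ** gamma * bce).mean()
```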
After hyperparameter search, train the final models using the best parameters:
sbatch scripts/MLP_with_BERT_best_model.sh
This script:
- Finds the best hyperparameters from the search results
- Trains final models with multiple random seeds (0-4)
- Evaluates on test sets for each split type and pretrained epoch
- Saves comprehensive results including metrics for all seeds
The downstream training generates:
- Hyperparameter search results: hp_search_{split_type}_epoch_{epoch}/hp_*.json
- Final model results: best_model_results_{split_type}_pretrained_epoch_{epoch}.json
- Model checkpoints and training logs via Weights & Biases
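For convenience, the result files can be collected with a small script like the one below; the glob pattern follows the naming above, but the metric key is a placeholder and should be replaced after inspecting an actual results file.

```python
# Collect the final-model result files. The glob pattern follows the naming
# scheme above; "test_auc" is a placeholder key -- inspect a file to find the
# real metric names before relying on this.
import glob
import json

for path in sorted(glob.glob("best_model_results_*_pretrained_epoch_*.json")):
    with open(path) as f:
        results = json.load(f)
    print(path, results.get("test_auc", "<see file for available keys>"))
```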
If you use this code in your research, please cite:
@article{VitroBERT,
  title={VitroBERT - Modeling DILI by pretraining BERT on invitro data},
  author={Muhammad Arslan Masood and Samuel Kaski and Anamya Ajjolli Nagaraja and Katia Belaid and Natalie Mesens and Hugo Ceulemans and Dorota Herman and Markus Heinonen},
  journal={under review},
  year={2025}
}