Madrigal is an open-source model for predicting drug combination outcomes from multimodal preclinical data. This repository provides the implementation of the model as described in our project page and our paper.
- Clone this GitHub repository and install following the section below.
- Set up data directories and create a `.env` file (see below).
- [Optional] Download datasets from our data repo in Harvard Dataverse and reorganize according to your `.env` setup.
- [Optional] Download pretrained checkpoints from our checkpoint repo in Huggingface and reorganize according to your `.env` setup.
We provide sample model pretraining (second-stage modality alignment) and training scripts in `scripts/`. Specifically, the second-stage pretraining scripts are provided in `./scripts/cl_pretrain/`, and the fine-tuning scripts are provided in `./scripts/ddi_finetune/`. The scripts will need to be adapted according to your machine.
The first-stage modality adaptation training scripts (or notebooks) and checkpoints can be found in `modality_pretraining/`.
You can also run inference with model checkpoints using sample Jupyter notebooks:
- generate_embeddings: Generate embeddings and raw scores. Also contains scripts to normalize prediction scores so that they can be used for direct comparisons.
- quick_predictions: Get raw scores and normalized ranks for specific queries of [outcome, drug A, drug B].
Notebook | Content | Requirement |
---|---|---|
fig1_pretrained_embeds | Plot UMAP of pretrained modality embeddings (Fig. 1d) | Harvard Dataverse, Huggingface |
fig2_model_analyses | Performance change with drug similarity (Fig. 2c,d) | Harvard Dataverse, Huggingface |
fig2_modality_ablations | Outcome-specific AUPRC of Madrigal and ablation models (Fig. 2e) | Harvard Dataverse, Huggingface |
fig3_self_combo | External validation with FDA safety rankings (Fig. 3a-c) | Harvard Dataverse, Huggingface |
fig3_transporter_mediated_ddis | Transporter-mediated DDIs (Fig. 3d-f) | Harvard Dataverse, Huggingface, Normalized rank |
fig4_clinical_trials_combos | Evaluation with clinical trials data (Fig. 4b) | Harvard Dataverse, Huggingface, ToolUniverse |
fig4_parpi | Evaluation with PARPi combinations (Fig. 4c) | Harvard Dataverse, Huggingface, Normalized rank |
fig5_t2d_mash | Evaluation with combinations in metabolic disorders (Fig. 5) | Harvard Dataverse, Huggingface, Normalized rank |
fig6_PDX | Individualized predictions with PDXE (Fig. 6c-f) | Harvard Dataverse, Huggingface |
fig6_clinical_validation_dfci | Analyses with DFCI cohort (Fig. 6j) | Harvard Dataverse, Huggingface (access to patient data is restricted) |
discussions_proteomics_analysis | Correlation with proteomics data | Harvard Dataverse, Huggingface |
discussions_combomatch | Inference on ComboMATCH drug pairs | Harvard Dataverse, Huggingface, Normalized rank |
Requirements:
- Harvard Dataverse: Required for running all notebooks.
- Huggingface: Required for running all notebooks.
- Normalized rank: The full normalized rank tensor (80GB) used in some notebooks.
- ToolUniverse: Used to extract clinical trials adverse events data.
Currently, modifications of the codebase are required to enable adaptation of the model to your own dataset. Below is an outline of possible preparations.
- Certain arguments require modification (see `./madrigal/parse_args.py`) if you are incorporating a new dataset.
  - `data_source`: Affects the path from which data are loaded, as well as the training and evaluation strategy.
  - `split_method`: Affects the path from which data are loaded and the evaluation strategy.
  - `task`: Depending on the nature of your dataset, you might want to change this.
  - `loss_fn_name`: Depending on `task`, you might want to change or reimplement this.
- Preparing data: Please refer to our provided data files for the exact formatting of each file.
  - Drugs
    - Metadata: Key to all other files.
    - Modality data
      - Structure: Use `torchdrug` to generate molecular graphs, in the same order as the molecules appear in the metadata.
      - KG: Use `PyG` to generate `HeteroData` objects, making sure drug node indices are ordered in the same way as in the metadata.
      - Cell viability: Mainly tables.
      - Transcriptomics: Mainly tables.
        - Note that you will need to regenerate a file (hard-coded as `rdkit2D_embeddings_combined_all_normalized.parquet`) for chemCPA usage/pretraining.
  - Drug combination outcomes
    - Tables of (label_indexed, head (drug 1), tail (drug 2), negs*). The meaning of the negative columns depends on the dataset splitting strategy.
    - Mapping between outcome label index and outcome information.
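The ordering requirement above (modality data must follow the metadata order) can be checked before training. The following is a minimal sketch and not code from this repository: `align_to_metadata`, the drug identifiers, and the feature values are all hypothetical, and it assumes every drug in the metadata has an entry in the modality table.

```python
from typing import Dict, List, Sequence


def align_to_metadata(
    metadata_ids: Sequence[str],
    modality_rows: Dict[str, List[float]],
) -> List[List[float]]:
    """Return modality feature rows in the same order as the metadata.

    Hypothetical helper: `metadata_ids` stands in for the drug identifier
    column of the metadata table, and `modality_rows` maps each drug
    identifier to its feature vector.
    """
    missing = [drug_id for drug_id in metadata_ids if drug_id not in modality_rows]
    if missing:
        # Assumption: a missing drug is treated as an error in this sketch.
        raise KeyError(f"Drugs missing from modality data: {missing}")
    return [modality_rows[drug_id] for drug_id in metadata_ids]


# Toy example with made-up identifiers and features.
metadata_ids = ["drug_a", "drug_b", "drug_c"]
modality_rows = {
    "drug_c": [0.3, 0.4],
    "drug_a": [0.1, 0.2],
    "drug_b": [0.5, 0.6],
}
aligned = align_to_metadata(metadata_ids, modality_rows)
```

After alignment, row `i` of `aligned` belongs to the drug at position `i` of the metadata, which is the invariant the structure and KG inputs are described as needing.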
Before installing `madrigal`, please set up a new conda environment through `mamba env create -f env_new.yaml` (this process should take less than an hour; see `mamba` installation guidelines here). By default, the environment targets CUDA 11.7 and gcc 9.2. Please edit `env_new.yaml` accordingly if you are installing with a different CUDA version. We welcome contributions of instructions on setting up the environment with other package managers such as `uv`.
Then, activate this environment with `mamba activate primekg`. To install the `madrigal` package in editable mode so that it can be imported from your interpreter (e.g. `from madrigal.X import Y`), run the following:
cd /path/to/Madrigal
python -m pip install -e .
Then, test the install by trying to import `madrigal` in the interpreter:
python -c "import madrigal; print('Imported')"
Now you should be able to use `import madrigal` from anywhere on your system, as long as you use the same Python interpreter.
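If the import fails from another location, the usual cause is a different interpreter. The sketch below uses only the standard library to report which interpreter is running and whether it can find the package; `is_importable` is a hypothetical helper, not part of `madrigal`.

```python
import importlib.util
import sys


def is_importable(package_name: str) -> bool:
    """Return True if the current interpreter can locate `package_name`."""
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    # Shows which Python is in use, so it can be compared with the one
    # from the environment where `pip install -e .` was run.
    print(f"interpreter: {sys.executable}")
    status = "found" if is_importable("madrigal") else "not found"
    print(f"madrigal: {status}")
```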
We organize our data and model output folders in the following way:
Madrigal_Data (BASE_DIR)
|-- processed_data
| |-- polypharmacy_new (in Harvard Dataverse)
| | |-- DrugBank
| | | |-- split_by_*
| | | | |-- data tables
| |-- views_features_new (in Harvard Dataverse)
| | |-- metadata tables
| | |-- str
| | | |-- torchdrug-generated molecular graphs
| | |-- kg
| | | |-- PyG-generated KGs
| | |-- cv
| | | |-- cell viability tables
| | |-- tx
| | | |-- transcriptomics tables
|-- model_output
| |-- pretrain (in Huggingface)
| | |-- DrugBank
| | | |-- split_by_*
| |-- DrugBank (in Huggingface)
| | |-- split_by_*
|-- raw_data (data used in analyses, in Harvard Dataverse)
This structure is reflected in the model code. Please make necessary edits if you are using a different organization.
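For a fresh setup, the folder layout above can be created before downloading anything. This is a minimal sketch rather than a script shipped with the repository: `create_skeleton` is a hypothetical helper, and the `split_by_*` folders are omitted because their names depend on the split you use.

```python
from pathlib import Path
from typing import List

# Sub-directories copied from the tree above (relative to BASE_DIR).
SUBDIRS = [
    "processed_data/polypharmacy_new/DrugBank",
    "processed_data/views_features_new/str",
    "processed_data/views_features_new/kg",
    "processed_data/views_features_new/cv",
    "processed_data/views_features_new/tx",
    "model_output/pretrain/DrugBank",
    "model_output/DrugBank",
    "raw_data",
]


def create_skeleton(base_dir: str) -> List[Path]:
    """Create the directory skeleton under `base_dir` and return the paths."""
    base = Path(base_dir)
    created = []
    for sub in SUBDIRS:
        path = base / sub
        # exist_ok=True makes the call safe to repeat on an existing layout.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling `create_skeleton("/path/to/Madrigal_Data")` would then give a `BASE_DIR` into which the Harvard Dataverse and Huggingface downloads can be placed.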
Then, please add a file `.env` to the project directory (root of this project) and specify the following paths (with `/` at the end):
PROJECT_DIR=/path/to/Madrigal/
BASE_DIR=/path/to/Madrigal_Data/
DATA_DIR=/path/to/Madrigal_Data/processed_data/
ENCODER_CKPT_DIR=/path/to/Madrigal/modality_pretraining/
CL_CKPT_DIR=/path/to/Madrigal_Data/model_output/pretrain/
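Since every path is expected to end with `/`, a quick check of the file can catch mistakes early. The sketch below is an illustration only: it parses `KEY=VALUE` lines with the standard library and does not reflect how the repository itself reads `.env`; `parse_env_text` and `find_problems` are hypothetical helpers.

```python
from typing import Dict, List

# Keys listed in the `.env` template above.
REQUIRED_KEYS = [
    "PROJECT_DIR",
    "BASE_DIR",
    "DATA_DIR",
    "ENCODER_CKPT_DIR",
    "CL_CKPT_DIR",
]


def parse_env_text(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and `#` comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def find_problems(values: Dict[str, str]) -> List[str]:
    """List missing keys and paths that lack the trailing slash."""
    problems = []
    for key in REQUIRED_KEYS:
        if key not in values:
            problems.append(f"{key} is missing")
        elif not values[key].endswith("/"):
            problems.append(f"{key} should end with '/'")
    return problems


# Example: one path lacks the trailing slash and three keys are absent.
example = """PROJECT_DIR=/path/to/Madrigal/
BASE_DIR=/path/to/Madrigal_Data
"""
problems = find_problems(parse_env_text(example))
```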
The code in this package is licensed under the MIT License.
- The `torchdrug` module needs to be imported after importing `torch_geometric` modules.
- `torchdrug>=0.2.0.post1` is required, as earlier versions cause an issue in the LR scheduler.
- We use `pytorch=1.13.1`, which requires `cuda<12.0`.
  - If only `cuda>12.0` is available, our limited testing suggests that `pytorch=2.1.0` with `pytorch-geometric<2.4.0` might be compatible (tested with CUDA 12.8 + gcc 14.2 + PyTorch 2.1.0).
- (Updated `env_new.yaml` to resolve this issue.) If you encounter `TypeError: canonicalize_version() got an unexpected keyword argument 'strip_trailing_zero'` while installing, please check out this post. In summary, either `setuptools<71` or `packaging>=22` is required.
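For environments built from an older environment file, the `setuptools`/`packaging` condition in the last item can be checked directly. This is a rough sketch using only the standard library: it compares major version numbers only, and `major_version`, `satisfies_workaround`, and `installed_version` are hypothetical helpers.

```python
from importlib import metadata
from typing import Optional


def major_version(version: str) -> int:
    """Return the leading integer of a version string such as '70.3.0'."""
    return int(version.split(".")[0])


def satisfies_workaround(setuptools_version: str, packaging_version: str) -> bool:
    """Check the stated condition: setuptools<71 or packaging>=22.

    Only major versions are compared, so pre-release tags are ignored.
    """
    return (
        major_version(setuptools_version) < 71
        or major_version(packaging_version) >= 22
    )


def installed_version(name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None
```

With both packages installed, `satisfies_workaround(installed_version("setuptools"), installed_version("packaging"))` indicates whether the combination falls inside the range described above.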
Please find our preprint at https://arxiv.org/abs/2503.02781.
@article{Huang2025.arXiv:2503.02781,
author = {Huang, Yepeng and Su, Xiaorui and Ullanat, Varun and Liang, Ivy and Clegg, Lindsay and Olabode, Damilola and Ho, Nicholas and John, Bino and Gibbs, Megan and Zitnik, Marinka},
title = {Multimodal AI predicts clinical outcomes of drug combinations from preclinical data},
journal = {arXiv preprint arXiv:2503.02781},
year = {2025},
doi = {10.48550/arXiv.2503.02781},
URL = {https://arxiv.org/abs/2503.02781},
}