- Table of contents
- Introduction
- Installation
- Usage examples
- Running in-silico mutagenesis
- Plotting results of in-silico mutagenesis
- Citation
PARM (Promoter Activity Regulatory Model) is a deep learning model that predicts promoter activity from the DNA sequence itself. As a convolutional neural network trained on MPRA data, PARM is very lightweight and produces predictions in a cell-type-specific manner.
With the PARM predict
tool, you can get predictions for any sequence that you want for AGS, HAP1, HCT116, HEK116, HepG2, K562, LNCaP, MCF7, and U2OS cells.
With PARM mutagenesis
, in addition to simple promoter activity scores, PARM can also produce the so-called in-silico mutagenesis plot.
This is useful for predicting which TFs are regulating (activating or repressing) your sequence. (read more on Running in-silico mutagenesis).
PARM can be easily installed with conda
:
conda install -c anaconda -c conda-forge -c bioconda -c pytorch parm
To predict the promoter activity in K562 of every sequence in a fasta file, run:
parm predict \
--input example_data/input.fasta \
--output output_K562.txt \
--model pre_trained_models/K562/
Note that you should replace
pre_trained_models/K562/
with the actual path to the pre-trained models available on this page. Also, note that a PARM model is composed of five different folds, as each model is trained five times. If you check the content ofpre_trained_models/K562/
, you will see the.parm
files there, one for each fold. Do not rename or change the files there unless you know what you are doing.
The output is a tab-separated file. The first and second columns contain information about the sequence (the sequence and its header). The following column contains the predicted promoter activity for the model you have selected.
For the command line above, you should expect the following result:
sequence | header | prediction_K562 |
---|---|---|
CTGGGAGG... | CXCR4_chr2:136875708:136875939:- | 2.287095785140991 |
GCAACTAA... | MED16_chr19:893131:893362:- | 2.22406268119812 |
ACGCCCAG... | TERT_chr5:1295135:1295366:- | 1.993780255317688 |
To compute the in-silico mutagenesis for every sequence in a fasta file, run:
parm mutagenesis \
--input example_data/input.fasta \
--output in_silico_mutagenesis_K562 \
--model pre_trained_models/K562/
For every sequence in the input fasta, PARM will predict the effect of every possible mutation of every single base pair.
This result is stored as a matrix, where every row is a nucleotide of the original sequence and the columns are A, C, G, and T; the values in the matrix are the predicted mutation effect.
PARM uses the mutagenesis matrix to scan for known transcription factor (TF) binding sites.
(As default, PARM uses the core human database from HOCOOMOCOv11 as the motif dataset, which can be changed with the --motif_database
parameter.)
The output of PARM mutagenesis
is a directory where, for every sequence, both the mutagenesis matrix (mutagenesis_*.txt.gz
) and the scanned TF motifs (hits_*txt.gz
) are stored.
Results of in-silico mutagenesis are more insightful when visualised in the following format:
You can easily see the mutagenesis matrix and all the scanned TF motifs.
To produce such a visualisation, you can run:
parm plot \
--input in_silico_mutagenesis_K562/CXCR4_chr2:136875708:136875939:-/
This will read the mutagenesis matrix and the hits for the sequence sequence_of_interest
and generate the plot.
By default, PARM stores the result plot as a PDF inside the input dir.
This can be changed using optional arguments.
Run parm plot --help
for additional help on that.
If you want to train a PARM model with your MPRA data, you must pre-process the raw MPRA counts using our pre-processing pipeline.
This will produce, mainly, one-hot encoded files with the promoter activity per fragment, per cell.
In the example_data/training_data
directory, we provide an example of this for the AGS and HAP1 cells.
We always train PARM models five independent times, using different folds (splits of the data) for validation.
To train the PARM models for the AGS cell, you can run:
# Fold 0 model
parm train \
--input example_data/training_data/fold[1234].* \
--validation example_data/training_data/fold0.hdf5 \
--output AGS_fold0 \
--cell_type AGS
This will create the AGS_fold0
folder, with the following structure:
AGS_fold0/
├── AGS_fold0.parm
├── performance_stats/
| ├── loss_per_epoch.png
| ├── loss_per_epoch.txt
│ ├── validation_scatter_0.svg
│ ├── validation_scatter_1.svg
│ |-- ...one per epoch
│ └── validation_scatter_6.svg
└── temp_models/
├── model_epoch_0.pth
├── model_epoch_1.pth
|-- ...one per epoch
└── model_epoch_6.pth
In this case, AGS_fold0.parm
is the main result.
In the performance_stats
dir, you will find the training and validation loss of each training epoch ( loss_per_epoch.*
files), as well as the scatter plot of the measured vs. predicted Log2RPM values for each epoch (validation_scatter_*.svg
files).
In the temp_models
dir, you can find the intermediate models for each training epoch (temp_models
dir).
Similarly, for the other folds, you can run:
# Fold 1 model
parm train \
--input example_data/training_data/fold[0234].* \
--validation example_data/training_data/fold1.hdf5 \
--output AGS_fold1 \
--cell_type AGS
# Fold 2 model
parm train \
--input example_data/training_data/fold[0134].* \
--validation example_data/training_data/fold2.hdf5 \
--output AGS_fold2 \
--cell_type AGS
# Fold 3 model
parm train \
--input example_data/training_data/fold[0124].* \
--validation example_data/training_data/fold3.hdf5 \
--output AGS_fold3 \
--cell_type AGS
# Fold 4 model
parm train \
--input example_data/training_data/fold[0123].* \
--validation example_data/training_data/fold4.hdf5 \
--output AGS_fold4 \
--cell_type AGS
After training all the folds, you should place all the folds in a single directory:
mkdir my_AGS_model
cp AGS_fold0/AGS_fold0.parm \
AGS_fold1/AGS_fold1.parm \
AGS_fold2/AGS_fold2.parm \
AGS_fold3/AGS_fold3.parm \
AGS_fold4/AGS_fold4.parm \
my_AGS_model/
Now, you can evaluate the model using the test fold. This is part of your dataset that was excluded from the training. Therefore, a standard evaluation of the model is to compare the measured and predicted promoter activity of the fragments in this fold.
For this, you can make use of the --predict_test_fold
flag of the PARM predict
, as follows:
parm predict \
--predict_test_fold \
--input example_data/training_data/test.hdf5 \
--output my_AGS_model_test \
--model my_AGS_model/
This will create my_AGS_model_test
directory containing the scatter plots showing the correlation between measured and predicted activity, both at the fragment and feature levels (averaging fragments of the same regulatory features).
- The provided data in the
example_data/training_data
is not enough to train a good PARM model. We provide it here solely for this tutorial. - Always run the
PARM train
function from a GPU server. A normal CPU machine will take a long time to train a model, even with the provided example data. At the start of the training, PARM will print on the screen if a GPU is detected. Make sure that you seeGPU detected? True
. You can also runparm train --check_cuda
; this will check if any GPU is detected and exit. - Even if your input data contains measurements for more than one cell (as the provided example, which contains data for AGS and HAP1), you can only train a model for one cell at a time.
If you make use of PARM and/or this pipeline, please cite: