README.md: 26 additions & 28 deletions
@@ -1,50 +1,49 @@
 # Summary

-This repository contains the code to reproduce the results of MRPA-LegNet, a variant of LegNet ([Paper](https://doi.org/10.1093/bioinformatics/btad457),
-[Repo](https://github.com/autosome-ru/LegNet/)) that was specifically modified and optimized for predicting gene expression from human massive parallel reporter assays
-performed with human K562, HepG2, and WTC11 cell lines.
+This repository contains the code and instructions for MPRA-LegNet, a variant of LegNet ([Paper](https://doi.org/10.1093/bioinformatics/btad457),
+[Repo](https://github.com/autosome-ru/LegNet/)) specifically optimized for predicting gene expression from human lentiMPRAs (massively parallel reporter assays).
+The model is built and tested using the data obtained with human K562, HepG2, and WTC11 cell lines.

 # Installation

 ## Setting up environment

-Please install all requirements with`environment.txt` or `environment.yml` with `conda` (see [Docs](https://conda.io/projects/conda/en/latest/user-guide/install/index.html)) `mamba` (see [Docs](https://mamba.readthedocs.io/en/latest/mamba-installation.html)):
+Please start by installing the requirements from `environment.txt` or `environment.yml` with `conda` (see [Docs](https://conda.io/projects/conda/en/latest/user-guide/install/index.html)) or `mamba` (see [Docs](https://mamba.readthedocs.io/en/latest/mamba-installation.html)):
 ```
 cd human_legnet
 mamba env create -f envs/environment.yml
 mamba activate legnet
 ```

-Please refer to the envs/environment.yml for technical details such as installed packages and version numbers.
+Please refer to the envs/environment.yml for technical details regarding preinstalled packages and package version numbers.

 ## Software dependencies and operating systems

-MPRA-LegNet was successfully tested on various Ubuntu LTS releases (including but not limited to 20.04.3).
+MPRA-LegNet was successfully tested on multiple Ubuntu Linux releases (including but not limited to 20.04.3).

 ## Software and hardware requirements

 To train the model from the publication, you'll need 7 GB of GPU memory and 32 GB of CPU memory.

-
-
 # Model training and cross-validation

-10-fold cross-validation was used for the model training and testing. The respective script is python vikram/human_legnet/core.py which be used as follows:
+10-fold cross-validation was used for the model training and testing. The respective script is `vikram/human_legnet/core.py`, which should be used as follows:

 ```
 python core.py --model_dir <target dir to save the models checkpoint files> --data_path <experiment data tables used in study> --epoch_num <epochnum> --use_shift --reverse_augment
 ```

-Please use --help for more details regarding the command line parameters.
+Please use `--help` for more details regarding the command line parameters.

-This script was also used to test the impact of data augmentation on the model performance, e.g, adding --use_reverse_channel will add the respective channel to the input.
+This script was also used to test the impact of data augmentation on the model performance, e.g.,
+adding `--use_reverse_channel` will add the respective channel to the input.

 In the study, the shift was set to the size of the forward adapter (21 bp).

-The data used to train model is available at `datasets/original` dir.
-Files with name format `<cell_line>.tsv` were used for training and files with name format `<cell_line>_averaged.tsv` were used for accessing model final performance.
+For convenience, the data used to train the model is available in the `datasets/original` dir.
+Files with name format `<cell_line>.tsv` were used for training and files with name format `<cell_line>_averaged.tsv` were used for assessing the model's final performance.

-Trained models can be dowloaded[here](https://disk.yandex.ru/d/ABO-qfuYuuqCww)
+Pre-trained models as well as the respective experimental data can be downloaded [here](https://zenodo.org/records/8219231).
 This command will take about 5 minutes to complete on 1 NVIDIA GeForce RTX 3090 using 8 CPUs.

-It will produce dir demo with:
+It will produce a `demo` dir containing:

 1. `config.json` -- model config required to run predictions
-2.`model_2_1` containing model weights `lightning_logs/version_0/checkpoints/last_model-epoch=24.ckpt` and predictions for test fold in file `predictions_new_format.tsv`
-
-File `predictions_new_format.tsv` contains predictions for forward and reverse sequence orientation in columns `forw_pred` and `rev_pred` respectively
+2. `model_2_1` containing model weights `lightning_logs/version_0/checkpoints/last_model-epoch=24.ckpt` and predictions for the test fold in the `predictions_new_format.tsv` file

+The `predictions_new_format.tsv` file will contain predictions for forward and reverse sequence orientation in columns `forw_pred` and `rev_pred` respectively.

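As an illustration of working with the `forw_pred` and `rev_pred` columns, here is a minimal Python sketch that reads a tab-separated table and combines the two orientations into a single value per sequence. Averaging the two orientations is an assumption made for this example, not something the README prescribes, and the function name `mean_orientation_predictions` is hypothetical. The sketch also assumes the file has a header row; only the two column names come from the README.

```python
import csv
import io


def mean_orientation_predictions(tsv_text):
    """Return the per-row mean of the `forw_pred` and `rev_pred` columns.

    Assumes a tab-separated table with a header row. Averaging the two
    orientations is an illustrative choice, not the repository's method.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [
        (float(row["forw_pred"]) + float(row["rev_pred"])) / 2.0
        for row in reader
    ]


# Toy table with made-up values; a real predictions file may hold more columns.
example = "forw_pred\trev_pred\n1.0\t3.0\n-0.5\t0.5\n"
print(mean_orientation_predictions(example))  # [2.0, 0.0]
```

For a file on disk, the same function can be fed with the file's contents, e.g. `mean_orientation_predictions(open(path).read())`.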
-# Assessing LegNet performance in predicting the allele-specific events
+# Assessing LegNet performance in predicting the allele-specific variants

-To get predictions of all cross-validation models we used asb_predict.py for [ADASTRA](https://adastra.autosome.org) (allele-specific transcription factor binding) and coverage_predict.py for [UDACHA](https://udacha.autosome.org) (allele-specific chromatin accessibility) datasets.
+To get predictions from all cross-validation models we used asb_predict.py for [ADASTRA](https://adastra.autosome.org) (allele-specific transcription factor binding) and coverage_predict.py for [UDACHA](https://udacha.autosome.org) (allele-specific chromatin accessibility) datasets.
 python coverage_predict.py --config <model config> --model <models dir> --cov_path <path to the UDACHA dataset> --genome <path to human genome> --out_path <out file> --device 0 --max_shift 0
 ```
-The datasets are available in the repository, see the datasets/asb and datasets/coverage folders.
+The datasets are available in the repository, see the `datasets/asb` and `datasets/coverage` folders.

-The scripts compare the predicted and the real effect direction of single-nucleotide variants for the alternative against the reference allele at allele-specific regulatory sites. That is whether the reference or the alternative allele is preferable for transcription factor binding or for chromatin accessibility.
+The scripts compare the predicted and the real effect direction of single-nucleotide variants for the alternative against the reference allele at allele-specific regulatory sites. That is, whether the reference or the alternative allele is preferable for transcription factor binding or chromatin accessibility.

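To make the idea of comparing effect directions concrete, here is a small Python sketch that computes the fraction of variants for which the predicted and the observed effect point the same way. It is only an illustration under stated assumptions: effects are taken to be signed numbers where a positive value means the alternative allele is preferred and a negative value means the reference allele is preferred, and the function name `direction_concordance` is hypothetical. The actual analysis in this repository is done with the notebook and R script mentioned in this README, and may differ.

```python
def direction_concordance(predicted, observed):
    """Fraction of variants whose predicted and observed effects share a sign.

    Sign convention (assumed for this sketch): positive means the alternative
    allele is preferred, negative means the reference allele is preferred.
    Variants with a zero effect on either side are skipped.
    """
    pairs = [(p, o) for p, o in zip(predicted, observed) if p != 0 and o != 0]
    if not pairs:
        raise ValueError("no variants with a defined effect direction")
    agree = sum((p > 0) == (o > 0) for p, o in pairs)
    return agree / len(pairs)


# Made-up effect sizes for four variants; three of the four agree in direction.
print(direction_concordance([0.4, -0.2, 0.1, -0.3], [1.2, -0.5, -0.7, -0.1]))  # 0.75
```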
-The post-processing of predictions is performed with Jupyter-Notebook variant_annotation.ipynb.
+The post-processing of predictions is performed with the Jupyter notebook `variant_annotation.ipynb`.

-Estimating the performance metrics (Fisher's exact test, R-squared calculation) for the processed predictions is performed using R script analyze_concordance.R
+Estimating the performance metrics (Fisher's exact test, R-squared calculation) for the processed predictions is performed using the R script `analyze_concordance.R`.

 # Performing LegNet predictions for a user-supplied fasta

-It is possible to run a LegNet model against a user-supplied fasta to obtain predictions for each sequence in it. This can be achieved with fasta_predict.py.
+It is possible to run the LegNet model against a user-supplied fasta to obtain predictions for each sequence. This can be achieved with `fasta_predict.py`.