# Summary

This repository contains the code and instructions for MPRA-LegNet, a variant of LegNet ([Paper](https://doi.org/10.1093/bioinformatics/btad457), [Repo](https://github.com/autosome-ru/LegNet/)) specifically optimized for predicting gene expression from human lentiMPRAs (massively parallel reporter assays). The model was built and tested using data obtained with the human K562, HepG2, and WTC11 cell lines.

# Installation

## Setting up environment

Please start by installing the requirements from `environment.txt` or `environment.yml` with `conda` (see [Docs](https://conda.io/projects/conda/en/latest/user-guide/install/index.html)) or `mamba` (see [Docs](https://mamba.readthedocs.io/en/latest/mamba-installation.html)):
```
cd human_legnet
mamba env create -f envs/environment.yml
mamba activate legnet
```

Please refer to `envs/environment.yml` for technical details regarding the preinstalled packages and their version numbers.

## Software dependencies and operating systems

MPRA-LegNet was successfully tested on multiple Ubuntu Linux releases (including but not limited to 20.04.3).

## Software and hardware requirements

To train the model from the publication, you will need 7 GB of GPU memory and 32 GB of RAM.

# Model training and cross-validation

10-fold cross-validation was used for model training and testing. The respective script is `vikram/human_legnet/core.py`, which should be used as follows:

```
python core.py --model_dir <target dir to save the model checkpoint files> --data_path <experiment data tables used in the study> --epoch_num <epochnum> --use_shift --reverse_augment
```

Please use `--help` for more details regarding the command-line parameters.

This script was also used to test the impact of data augmentation on the model performance; e.g., adding `--use_reverse_channel` will add the respective channel to the input.

In the study, the shift was set to the size of the forward adapter (21 bp).
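
As a toy illustration (our own sketch, not the repository's actual implementation), the shift and reverse-complement augmentations, together with the optional reverse channel, can be pictured as follows:
```
import random

ADAPTER_LEN = 21  # forward adapter length, used as the maximum shift in the study
SEQ_LEN = 230     # sequence length the model is optimized for

BASES = {"A": 0, "C": 1, "G": 2, "T": 3}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def one_hot(seq):
    # 4 x L one-hot encoding of a DNA sequence
    enc = [[0.0] * len(seq) for _ in range(4)]
    for i, base in enumerate(seq):
        enc[BASES[base]][i] = 1.0
    return enc

def augment(seq, max_shift=ADAPTER_LEN, use_reverse_channel=True):
    # Shift augmentation: take the 230 bp window at a random offset of up to max_shift
    offset = random.randint(0, max_shift)
    window = seq[offset:offset + SEQ_LEN]
    # Reverse-complement augmentation: flip the sequence half of the time
    is_reverse = random.random() < 0.5
    if is_reverse:
        window = window.translate(COMPLEMENT)[::-1]
    enc = one_hot(window)
    if use_reverse_channel:
        # The optional fifth channel flags the orientation of the input
        enc.append([float(is_reverse)] * len(window))
    return enc
```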

For convenience, the data used to train the model is available in the `datasets/original` dir.
Files with the name format `<cell_line>.tsv` were used for training, and files with the name format `<cell_line>_averaged.tsv` were used for assessing the model's final performance.

Pre-trained models as well as the respective experimental data can be downloaded [here](https://zenodo.org/records/8219231).

## Demo example

To train a demo model on the K562 data, run:
```
python core.py --model_dir demo_K562 --data_path datasets/original/K562.tsv --epoch_num 25
```

This command will take about 5 minutes to complete on a single NVIDIA GeForce RTX 3090 using 8 CPU cores.

It will produce the `demo_K562` dir containing:

1. `config.json` -- the model config required to run predictions
2. `model_2_1` containing the model weights (`lightning_logs/version_0/checkpoints/last_model-epoch=24.ckpt`) and the predictions for the test fold in the `predictions_new_format.tsv` file

The `predictions_new_format.tsv` file will contain predictions for the forward and reverse sequence orientations in the `forw_pred` and `rev_pred` columns, respectively.
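
For downstream analyses, the two orientations can be combined into a single per-sequence score, e.g. with `pandas` (a minimal sketch assuming the demo paths above; averaging is just one reasonable choice):
```
import pandas as pd

preds = pd.read_csv("demo_K562/model_2_1/predictions_new_format.tsv", sep="\t")
# Average the forward and reverse predictions into one strand-symmetric score
preds["mean_pred"] = (preds["forw_pred"] + preds["rev_pred"]) / 2
print(preds.head())
```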

# Assessing LegNet performance in predicting allele-specific variants

To get predictions from all cross-validation models, we used `asb_predict.py` for the [ADASTRA](https://adastra.autosome.org) (allele-specific transcription factor binding) dataset and `coverage_predict.py` for the [UDACHA](https://udacha.autosome.org) (allele-specific chromatin accessibility) dataset.

Command-line format:
```
python asb_predict.py --config <model config> --model <models dir> --asb_path <path to the ADASTRA dataset> --genome <path to human genome> --out_path <out file> --device 0 --max_shift 0
```
```
python coverage_predict.py --config <model config> --model <models dir> --cov_path <path to the UDACHA dataset> --genome <path to human genome> --out_path <out file> --device 0 --max_shift 0
```

The datasets are available in the repository; see the `datasets/asb` and `datasets/coverage` folders.

The scripts compare the predicted and the real effect direction of single-nucleotide variants for the alternative against the reference allele at allele-specific regulatory sites, that is, whether the reference or the alternative allele is preferable for transcription factor binding or chromatin accessibility.
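
In essence, the comparison checks whether the sign of the predicted allelic difference matches the observed allelic preference; a sketch with hypothetical file and column names (the actual output schema is defined by the prediction scripts):
```
import pandas as pd

# Hypothetical file and column names, for illustration only
df = pd.read_csv("asb_predictions.tsv", sep="\t")
# Predicted effect direction: does the model score the alternative allele higher?
pred_alt_preferred = df["alt_pred"] > df["ref_pred"]
# Real effect direction observed in the allele-specific data
real_alt_preferred = df["real_effect"] > 0
concordance = (pred_alt_preferred == real_alt_preferred).mean()
print(f"Directional concordance: {concordance:.2%}")
```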

The post-processing of predictions is performed with the Jupyter notebook `variant_annotation.ipynb`.

Estimating the performance metrics (Fisher's exact test, R-squared calculation) for the processed predictions is performed using the R script `analyze_concordance.R`.

# Performing LegNet predictions for a user-supplied fasta

It is possible to run the LegNet model against a user-supplied fasta file to obtain predictions for each sequence. This can be achieved with `fasta_predict.py`.

Example command:
```
python fasta_predict.py --config <model config> --model <model checkpoint> --fasta <path to fasta> --out_path <out path> --device 0
```

Please note that this is a fairly basic wrapper, and this version of LegNet is optimized to handle 230 bp long sequences.
LegNet will technically work for different sequence lengths, as it uses global average pooling.
However, if the sequence length differs significantly from 230 bp, the resulting performance will likely be reduced.
Also, due to the per-batch prediction, it is impossible to predict scores for sequences of different lengths unless the batch size is set to 1.
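
To see why global average pooling makes the architecture length-agnostic, consider a toy PyTorch model (unrelated to the actual LegNet weights): the pooled representation has a fixed size for any input length, although all sequences within a single batch tensor must still share one length.
```
import torch
import torch.nn as nn

# Toy convolutional trunk with a global-average-pooling head
model = nn.Sequential(
    nn.Conv1d(4, 8, kernel_size=5, padding=2),
    nn.ReLU(),
    nn.AdaptiveAvgPool1d(1),  # collapses the length dimension to 1
    nn.Flatten(),
    nn.Linear(8, 1),
)

for length in (180, 230, 300):
    x = torch.randn(1, 4, length)  # (batch, channels, length) one-hot-like input
    print(length, model(x).shape)  # always torch.Size([1, 1])
```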

# License
