* changed path to which `protein_coding_genes.parquet` is saved, when no path is configured in config to make file more visible.
* Made `gene_id_file` mandatory. Added instructions to create it manually in docs
* fixup! Format Python code with psf/black pull_request
* added gene id file to example use case
* spell check
* spelling again
---------
Co-authored-by: PMBio <PMBio@users.noreply.github.com>
docs/annotations.md: 26 additions & 5 deletions
@@ -9,7 +9,7 @@ This pipeline is based on [snakemake](https://snakemake.readthedocs.io/en/stable
9
9
## Output
10
10
This pipeline outputs a parquet file including all annotations as well as a file containing IDs to all protein coding genes needed to run DeepRVAT.
11
11
Apart from this the pipeline outputs a PCA transformation matrix for deepSEA as well as means and standard deviations used to standardize deepSEA scores before PCA analysis. This is helpful to recreate results using a different dataset.
12
-
Furthermore, the pipeline outputs one annotation file for VEP, CADD, DeepRiPe, DeepSea and Absplice for each input vcf-file. The tool then creates concatenates the files, performs PCA on the deepSEA scores and merges the result into a single file.
12
+
Furthermore, the pipeline outputs one annotation file for VEP, CADD, DeepRiPe, DeepSea and AbSplice for each input vcf-file. The tool then concatenates the files, performs PCA on the deepSEA scores and merges the result into a single file.
Also a reference GTF file containing transcript annotations is required, this can be downloaded from [here](https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/gencode.v44.annotation.gtf.gz)
30
+
Further requirements:
31
+
- A reference GTF file containing transcript annotations. It can be downloaded from [here](https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/gencode.v44.annotation.gtf.gz).
32
+
- A file containing all genes that DeepRVAT should consider, together with a unique integer id for each gene. This file may be created manually by the user or generated automatically from the GTF file, in which case it contains all protein coding genes. See [here](#geneid) for more details.
33
+
31
34
32
35
33
36
## Configure the annotation pipeline
@@ -38,6 +41,7 @@ The config above would use the following directory structure:
38
41
|--reference
39
42
||-- fasta file
40
43
||-- GTF file
44
+
||-- gene id file
41
45
42
46
|-- preprocessing_workdir
43
47
||-- norm
@@ -80,6 +84,7 @@ A GTF file as described in [requirements](#requirements) and the FASTA file used
80
84
The output is stored in the `output_dir/annotations` folder and any temporary files in the `tmp` subfolder. All repositories used including VEP with its corresponding cache as well as plugins are stored in `repo_dir`.
81
85
Data for VEP plugins and the CADD cache are stored in `annotation_data`.
82
86
87
+
(running)=
83
88
## Running the annotation pipeline on example data
84
89
85
90
@@ -113,14 +118,14 @@ Modify the path in the [config file](https://github.com/PMBio/deeprvat/blob/main
113
118
114
119
## Configuring the annotation pipeline
115
120
116
-
You can add/remove VEP plugins in the `additional_vep_plugin_cmds` part of the config by adding /removing plugin commands to be added to the vep run command. You can omit absplice/deepSea by setting `include_absplice`/ `include_deepSEA` to `False`in the config. When you add/remove annotations you have to alter the values in`example/config/annotation_colnames_filling_values.yaml`. This file consist of the names of the columns of the tool used, the name to be used in the output data frame, the default value replacing all `NA` values as well as the data type, for example:
121
+
You can add or remove VEP plugins in the `additional_vep_plugin_cmds` part of the config by adding or removing plugin commands to be added to the vep run command. You can omit AbSplice and/or deepSEA by setting `include_absplice`/`include_deepSEA` to `False` in the config. When you add or remove annotations, you have to alter the values in `example/config/annotation_colnames_filling_values.yaml`. This file consists of the names of the columns of the tool used, the name to be used in the output data frame, the default value replacing all `NA` values, and the data type, for example:
117
122
```shell
'CADD_RAW':
  - 'CADD_raw'
  - 0
  - float
```
123
-
Here `CADD_RAW` is the name of the column of the VEP output when the plugin is used, it is then renamed in the final annotation dataframe to `CADD_raw`, all `NA` values are set to `0` and the values are of type`float`.
128
+
Here `CADD_RAW` is the name of the column in the VEP output when the plugin is used. It is renamed to `CADD_raw` in the final annotation data frame, all `NA` values are set to `0`, and the values are of type `float`.
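
To illustrate what such an entry means, here is a minimal Python sketch of the rename, fill, and cast steps. It is not the pipeline's actual implementation: the names `COLUMN_SPEC` and `apply_column_spec` are hypothetical, and the rows are made-up values standing in for VEP output.

```python
import math

# Mirrors one entry of annotation_colnames_filling_values.yaml:
# source column -> [output column, fill value, data type]
# (hypothetical in-memory form; the pipeline reads this from YAML)
COLUMN_SPEC = {"CADD_RAW": ["CADD_raw", 0, float]}


def apply_column_spec(rows, spec):
    """Rename columns, replace missing values, and cast to the configured type."""
    result = []
    for row in rows:
        new_row = {}
        for source_name, (output_name, fill_value, dtype) in spec.items():
            value = row.get(source_name)
            # Treat both None and NaN as "NA"
            is_missing = value is None or (
                isinstance(value, float) and math.isnan(value)
            )
            new_row[output_name] = dtype(fill_value if is_missing else value)
        result.append(new_row)
    return result


# Made-up example rows: one real score and two missing values
vep_rows = [{"CADD_RAW": 2.5}, {"CADD_RAW": None}, {"CADD_RAW": float("nan")}]
filled = apply_column_spec(vep_rows, COLUMN_SPEC)
```

In this sketch, both missing values end up as `0.0` in the `CADD_raw` column, while the existing score is kept unchanged.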
124
129
125
130
You can also modify the `example/config/annotation_colnames_filling_values.yaml` file to choose custom filling values for each of the annotations.
126
131
For each of the annotations, the second value represents the value used to fill in `NA` values; in the example above, `NA` values in the `CADD_raw` column are filled with `0`.
@@ -130,7 +135,7 @@ keep_unfilled: True
130
135
```
131
136
to the [config file](https://github.com/PMBio/deeprvat/blob/main/example/config/deeprvat_annotation_config.yaml).
132
137
133
-
You can also change the way the allele frequencies are calculated by adding `af_mode` key to the [config file](https://github.com/PMBio/deeprvat/blob/main/example/config/deeprvat_annotation_config.yaml). By default, the allele frequencies are calculated from the data the annotation pipeline is run with. To use gnomade or gnomadg allele frequncies (from VEP ) instead, add
138
+
You can also change the way the allele frequencies are calculated by adding the `af_mode` key to the [config file](https://github.com/PMBio/deeprvat/blob/main/example/config/deeprvat_annotation_config.yaml). By default, the allele frequencies are calculated from the data the annotation pipeline is run with. To use gnomade or gnomadg allele frequencies (from VEP) instead, add
134
139
```shell
af_mode : 'af_gnomade'
```
@@ -140,6 +145,22 @@ af_mode : 'af_gnomadg'
140
145
```
141
146
to the config file.
142
147
148
+
(geneid)=
149
+
## Gene id file
150
+
As mentioned in the [requirements](#requirements) section, the pipeline expects a parquet file containing all genes that deeprvat should consider, together with a unique integer id for each gene.
151
+
This file can be created automatically using a GTF file as input. The output is then a parquet file in the expected format containing all protein coding genes of the provided GTF file.
152
+
To automatically create the gene id file, make sure the annotation environment (mentioned [here](#running)) is active and run
with `deeprvat/example/annotations/reference/gencode.v44.annotation.gtf.gz` pointing to any downloaded GTF file and `deeprvat/example/annotations/reference/protein_coding_genes.parquet` pointing to the desired output path, which has to be specified in the config file.
157
+
158
+
Alternatively, when the user wants to select a specific set of genes to consider, the gene id file may be created manually. The file is expected to have two columns:
159
+
- column `gene`: `str` name for each gene
160
+
- column `id`: `int` unique id for each gene
161
+
Each row represents a gene the user wants to include in the analysis.
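
A manually created gene id file could be produced along the following lines. This is a sketch under assumptions rather than the pipeline's own tooling: the helper names `build_gene_id_rows` and `write_gene_id_file` are hypothetical, the gene names are placeholders, starting the ids at `0` is an arbitrary choice (the description above only requires them to be unique integers), and writing parquet assumes `pandas` with a parquet engine such as `pyarrow` is installed.

```python
from pathlib import Path


def build_gene_id_rows(gene_names):
    """Assign a unique integer id to each distinct gene name, in input order."""
    unique_names = list(dict.fromkeys(gene_names))  # drop duplicates, keep order
    return [{"gene": name, "id": i} for i, name in enumerate(unique_names)]


def write_gene_id_file(rows, out_path):
    """Write the rows to a parquet file with the columns `gene` and `id`."""
    # Imported lazily so the helper above works without pandas installed
    import pandas as pd

    df = pd.DataFrame(rows).astype({"gene": str, "id": int})
    df.to_parquet(Path(out_path), index=False)


# Placeholder gene selection (one duplicate on purpose); replace with your genes
rows = build_gene_id_rows(
    ["ENSG00000139618", "ENSG00000141510", "ENSG00000139618"]
)
```

Calling `write_gene_id_file(rows, "protein_coding_genes.parquet")` would then write the file; the output path should match the one specified in the config file.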