Commit 39ed102

Marcel-MueckPMBio and PMBio authored
changed path of `protein_coding_genes.parquet` (#157)
* changed path to which `protein_coding_genes.parquet` is saved, when no path is configured in config to make file more visible.
* Made `gene_id_file` mandatory. Added instructions to create it manually in docs
* fixup! Format Python code with psf/black pull_request
* added gene id file to example use case
* spell check
* spelling again

---------

Co-authored-by: PMBio <PMBio@users.noreply.github.com>
1 parent 0d91fc9 commit 39ed102

6 files changed: +31 -21 lines changed

deeprvat/annotations/annotations.py

Lines changed: 1 addition & 1 deletion
@@ -1988,7 +1988,7 @@ def add_gene_ids(gene_id_file: str, annotations_path: str, out_file: str):
     """
     genes = pd.read_parquet(gene_id_file)
     genes[["gene_base", "feature"]] = genes["gene"].str.split(".", expand=True)
-    genes.drop(columns=["feature", "gene", "gene_name", "gene_type"], inplace=True)
+    genes = genes[["id", "gene_base"]]
     genes.rename(columns={"id": "gene_id"}, inplace=True)
     annotations = pd.read_parquet(annotations_path)
     len_anno = len(annotations)
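
A minimal sketch of why the column selection in this hunk is more tolerant than the previous `drop` call, assuming a hand-made gene id table that has only the `gene` and `id` columns described in the docs change of this commit. The helper name `to_gene_lookup` and the example gene IDs are illustrative and not part of the repository.

```python
import pandas as pd


def to_gene_lookup(genes: pd.DataFrame) -> pd.DataFrame:
    """Mirror the updated lines of add_gene_ids on an in-memory frame (hypothetical helper)."""
    genes = genes.copy()
    # Split "ENSG...<dot><version>" into the base ID and the version suffix.
    genes[["gene_base", "feature"]] = genes["gene"].str.split(".", expand=True)
    # Select only what is needed instead of dropping a fixed list of columns.
    genes = genes[["id", "gene_base"]]
    return genes.rename(columns={"id": "gene_id"})


# Assumed input: a manually created table with only the two documented columns.
manual = pd.DataFrame(
    {"gene": ["ENSG00000141510.18", "ENSG00000012048.23"], "id": [0, 1]}
)
lookup = to_gene_lookup(manual)

# The previous approach fails on this table because the dropped columns do not exist.
try:
    manual.drop(columns=["gene_name", "gene_type"])
    old_approach_failed = False
except KeyError:
    old_approach_failed = True
```

With the old `drop` call, a gene id file lacking `gene_name` or `gene_type` would raise a `KeyError`; selecting `id` and `gene_base` only requires the two documented columns.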

docs/annotations.md

Lines changed: 26 additions & 5 deletions
@@ -9,7 +9,7 @@ This pipeline is based on [snakemake](https://snakemake.readthedocs.io/en/stable
 ## Output
 This pipeline outputs a parquet file including all annotations as well as a file containing IDs to all protein coding genes needed to run DeepRVAT.
 Apart from this the pipeline outputs a PCA transformation matrix for deepSEA as well as means and standard deviations used to standardize deepSEA scores before PCA analysis. This is helpful to recreate results using a different dataset.
-Furthermore, the pipeline outputs one annotation file for VEP, CADD, DeepRiPe, DeepSea and Absplice for each input vcf-file. The tool then creates concatenates the files, performs PCA on the deepSEA scores and merges the result into a single file.
+Furthermore, the pipeline outputs one annotation file for VEP, CADD, DeepRiPe, DeepSea and AbSplice for each input vcf-file. The tool then creates concatenates the files, performs PCA on the deepSEA scores and merges the result into a single file.
 
 ## Input
 
@@ -27,7 +27,10 @@ Download paths:
 - [PrimateAI](https://basespace.illumina.com/s/yYGFdGih1rXL) PrimateAI supplementary data/"PrimateAI_scores_v0.2_GRCh38_sorted.tsv.bgz"
 - [AlphaMissense](https://storage.googleapis.com/dm_alphamissense/AlphaMissense_hg38.tsv.gz)
 
-Also a reference GTF file containing transcript annotations is required, this can be downloaded from [here](https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/gencode.v44.annotation.gtf.gz)
+Further requirements:
+- A reference GTF file containing transcript annotations is required, this can be downloaded from [here](https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/gencode.v44.annotation.gtf.gz).
+- A file containing all genes, which deeprvat should consider together with a unique integer id for each gene. This file may be created manually by the user or automatically using the gtf file as input to create a gene id file for all protein coding genes. See [here](#geneid) for more details.
+
 
 
 ## Configure the annotation pipeline
@@ -38,6 +41,7 @@ The config above would use the following directory structure:
 |--reference
 | |-- fasta file
 | |-- GTF file
+| |-- gene id file
 
 |-- preprocessing_workdir
 | |-- norm
@@ -80,6 +84,7 @@ A GTF file as described in [requirements](#requirements) and the FASTA file used
 The output is stored in the `output_dir/annotations` folder and any temporary files in the `tmp` subfolder. All repositories used including VEP with its corresponding cache as well as plugins are stored in `repo_dir`.
 Data for VEP plugins and the CADD cache are stored in `annotation_data`.
 
+(running)=
 ## Running the annotation pipeline on example data
 
 
@@ -113,14 +118,14 @@ Modify the path in the [config file](https://github.com/PMBio/deeprvat/blob/main
 
 ## Configuring the annotation pipeline
 
-You can add/remove VEP plugins in the `additional_vep_plugin_cmds` part of the config by adding /removing plugin commands to be added to the vep run command. You can omit absplice/deepSea by setting `include_absplice`/ `include_deepSEA` to `False`in the config. When you add/remove annotations you have to alter the values in `example/config/annotation_colnames_filling_values.yaml`. This file consist of the names of the columns of the tool used, the name to be used in the output data frame, the default value replacing all `NA` values as well as the data type, for example:
+You can add/remove VEP plugins in the `additional_vep_plugin_cmds` part of the config by adding /removing plugin commands to be added to the vep run command. You can omit abSplice and/or deepSea by setting `include_absplice`/ `include_deepSEA` to `False`in the config. When you add/remove annotations you have to alter the values in `example/config/annotation_colnames_filling_values.yaml`. This file consist of the names of the columns of the tool used, the name to be used in the output data frame, the default value replacing all `NA` values as well as the data type, for example:
 ```shell
 'CADD_RAW' :
 - 'CADD_raw'
 - 0
 - float
 ```
-Here `CADD_RAW` is the name of the column of the VEP output when the plugin is used, it is then renamed in the final annotation dataframe to `CADD_raw`, all `NA` values are set to `0` and the values are of type `float`.
+Here `CADD_RAW` is the name of the column of the VEP output when the plugin is used, it is then renamed in the final annotation data frame to `CADD_raw`, all `NA` values are set to `0` and the values are of type `float`.
 
 You can also modify the `example/config/annotation_colnames_filling_values.yaml` file to choose custom filling values for each of the annotations.
 For each of the annotations the second value represents the value to use to fill in `NA` values, i.e. in the example above, in the `CADD_raw` column `NA` values are filled using `0`.
@@ -130,7 +135,7 @@ keep_unfilled: True
 ```
 to the [config file](https://github.com/PMBio/deeprvat/blob/main/example/config/deeprvat_annotation_config.yaml).
 
-You can also change the way the allele frequencies are calculated by adding `af_mode` key to the [config file](https://github.com/PMBio/deeprvat/blob/main/example/config/deeprvat_annotation_config.yaml). By default, the allele frequencies are calculated from the data the annotation pipeline is run with. To use gnomade or gnomadg allele frequncies (from VEP ) instead, add
+You can also change the way the allele frequencies are calculated by adding `af_mode` key to the [config file](https://github.com/PMBio/deeprvat/blob/main/example/config/deeprvat_annotation_config.yaml). By default, the allele frequencies are calculated from the data the annotation pipeline is run with. To use gnomade or gnomadg allele frequencies (from VEP ) instead, add
 ```shell
 af_mode : 'af_gnomade'
 ```
@@ -140,6 +145,22 @@ af_mode : 'af_gnomadg'
 ```
 to the config file.
 
+(geneid)=
+## Gene id file
+As mentioned in the [requirements](#requirements) section, the pipeline expects a parquet file containing all genes that deeprvat should consider, together with a unique integer id for each gene.
+This file can be created automatically using a GTF file as input. The output is then a parquet file in the expected format containing all protein coding genes of the provided GTF file.
+To automatically create the gene id file, make sure the annotation environment (mentioned [here](#running) ) is active and run
+```
+deeprvat_annotations create-gene-id-file deeprvat/example/annotations/reference/gencode.v44.annotation.gtf.gz deeprvat/example/annotations/reference/protein_coding_genes.parquet
+```
+with `deeprvat/example/annotations/reference/gencode.v44.annotation.gtf.gz` pointing to any downloaded GTF file and `deeprvat/example/annotations/reference/protein_coding_genes.parquet` pointing to the desired output path, which has to be specified in the config file.
+
+Alternatively, when the user want to select a specific set of genes to consider, the gene id file may be created by the user. The file is expected to have two columns:
+- column`gene`:`str` name for each gene
+- column `id`:`int` unique id for each gene
+Each row represents a gene the user want to include in the analysis.
+
+
 ## References
 
 (reference-1-target)=
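
The manual route described in the added `Gene id file` section could be sketched as follows, assuming pandas with a parquet engine such as pyarrow is installed. The function names, the gene list, and the output path are illustrative; only the column names `gene` and `id` come from the docs text in this diff. Versioned IDs are used because `add_gene_ids` splits the `gene` column on `.`.

```python
import pandas as pd


def build_gene_id_table(gene_names: list[str]) -> pd.DataFrame:
    """Assign a unique integer id to each distinct gene name (hypothetical helper)."""
    unique_genes = sorted(set(gene_names))
    return pd.DataFrame({"gene": unique_genes, "id": range(len(unique_genes))})


def write_gene_id_file(table: pd.DataFrame, out_path: str) -> None:
    """Check the documented two-column layout, then save as parquet."""
    assert list(table.columns) == ["gene", "id"]
    assert table["id"].is_unique
    table.to_parquet(out_path, index=False)  # needs a parquet engine, e.g. pyarrow


# Illustrative selection of genes; duplicates are collapsed to one row each.
table = build_gene_id_table(
    ["ENSG00000141510.18", "ENSG00000012048.23", "ENSG00000141510.18"]
)
# Assumed output path; it should match `gene_id_parquet` in the config file.
# write_gene_id_file(table, "reference/protein_coding_genes.parquet")
```

The write call is left commented out so the sketch runs without touching the file system; uncomment it once the path matches your configuration.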
Binary file not shown.

example/config/deeprvat_annotation_config.yaml

Lines changed: 1 addition & 0 deletions
@@ -25,6 +25,7 @@ kipoiveff_repo_dir : repo_dir/kipoi-veff2
 faatpipe_repo_dir : repo_dir/faatpipe
 vep_repo_dir : repo_dir/ensembl-vep
 preprocessing_workdir : ../preprocess/workdir
+gene_id_parquet: reference/protein_coding_genes.parquet
 additional_vep_plugin_cmds:
   cadd : CADD,annotation_data/cadd/whole_genome_SNVs.tsv.gz,annotation_data/cadd/gnomad.genomes.r3.0.indel.tsv.gz
   spliceAI : SpliceAI,snv=annotation_data/spliceAI/spliceai_scores.raw.snv.hg38.vcf.gz,indel=annotation_data/spliceAI/spliceai_scores.raw.indel.hg38.vcf.gz

example/config/deeprvat_annotation_config_minimal.yaml

Lines changed: 2 additions & 0 deletions
@@ -6,6 +6,7 @@
 fasta_dir : reference
 fasta_file_name : GRCh38.primary_assembly.genome.fa
 gtf_file_name : gencode.v44.annotation.gtf.gz
+gene_id_parquet: reference/protein_coding_genes.parquet
 
 source_variant_file_pattern : chr{chr}test
 source_variant_file_type: 'bcf'
@@ -24,6 +25,7 @@ kipoiveff_repo_dir : repo_dir/kipoi-veff2
 faatpipe_repo_dir : repo_dir/faatpipe
 vep_repo_dir : repo_dir/ensembl-vep
 preprocessing_workdir : preprocessing_workdir
+
 include_absplice : False
 include_deepSEA : False
 vep_online: True

pipelines/annotations.snakefile

Lines changed: 1 addition & 15 deletions
@@ -44,7 +44,7 @@ genome_assembly = config.get("genome_assembly") or "GRCh38"
 fasta_dir = Path(config["fasta_dir"])
 fasta_file_name = config["fasta_file_name"]
 gtf_file = fasta_dir / config["gtf_file_name"]
-gene_id_file = config.get("gene_id_parquet")
+gene_id_file = config["gene_id_parquet"]
 
 deeprvat_parent_path = Path(config["deeprvat_repo_dir"])
 annotation_python_file = (
@@ -191,20 +191,6 @@ rule all:
         chckpt = anno_dir / 'chckpts' / 'select_rename_fill_columns.chckpt',
         annotations = anno_dir / 'annotations.parquet'
 
-if not gene_id_file:
-    gene_id_file = anno_tmp_dir / "protein_coding_genes.parquet"
-
-rule create_gene_id_file:
-    input:
-        gtf_file,
-    output:
-        gene_id_file,
-    resources:
-        mem_mb=lambda wildcards, attempt: 15_000 * (attempt + 1),
-    shell:
-        " ".join(
-            [f"deeprvat_annotations", "create-gene-id-file", "{input}", "{output}"]
-        )
 
 rule extract_with_header:
     input:
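
The switch from `config.get("gene_id_parquet")` to `config["gene_id_parquet"]` in this file makes the key mandatory. A small plain-Python sketch of the difference, using ordinary dicts as stand-ins for the Snakemake `config` object (the dict contents are illustrative):

```python
# Stand-ins for a loaded Snakemake config; contents are illustrative.
config_without_key = {"fasta_dir": "reference"}
config_with_key = {
    "fasta_dir": "reference",
    "gene_id_parquet": "reference/protein_coding_genes.parquet",
}

# Old behaviour: a missing key silently yields None, which triggered the
# fallback that created the file in the temporary annotation directory.
old_value = config_without_key.get("gene_id_parquet")

# New behaviour: a missing key fails immediately with a KeyError.
try:
    config_without_key["gene_id_parquet"]
    missing_key_error = None
except KeyError as err:
    missing_key_error = str(err)

# With the key present, the configured path is used as-is.
gene_id_file = config_with_key["gene_id_parquet"]
```

Failing early on a missing key matches the commit message's intent of making the gene id file an explicit, visible input rather than a file created implicitly in a temporary directory.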
