AbExp is a tool to predict aberrant gene expression in 49 human tissues based on DNA sequence variants. It was trained on aberrant gene expression calls from the GTEx dataset.
This repository contains a bioinformatics software pipeline for calculating AbExp variant effect predictions, taking VCF files as input. The method is described in a publication in Nature Communications.
- Linux
- Disk space: 1.0-1.4TB for cache files (hg19 + hg38)
  - LOFTEE: 25GB
  - VEP v108: 113GB
  - CADD v1.6: 854GB
  - SpliceAI-RocksDB (optional): 349GB
- RAM: 64GB
- GPU supporting CUDA for SpliceAI annotation
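The disk space range above follows from the individual cache sizes: roughly 1.0TB without the optional SpliceAI-RocksDB cache and roughly 1.4TB with it. The following is a minimal sketch for checking a target location before downloading anything. The helper names `required_gb` and `has_enough_space` are hypothetical and not part of the pipeline; the sizes are copied from the list above.

```python
import shutil

# Cache sizes in GB, copied from the requirements list above.
CACHE_SIZES_GB = {
    "LOFTEE": 25,
    "VEP v108": 113,
    "CADD v1.6": 854,
    "SpliceAI-RocksDB": 349,  # optional
}


def required_gb(with_spliceai_rocksdb=True):
    """Sum the cache sizes, optionally skipping the SpliceAI-RocksDB cache."""
    return sum(
        size
        for name, size in CACHE_SIZES_GB.items()
        if with_spliceai_rocksdb or name != "SpliceAI-RocksDB"
    )


def has_enough_space(path, needed_gb):
    """Return True if the file system holding `path` has at least needed_gb free."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= needed_gb


if __name__ == "__main__":
    for with_rocksdb in (False, True):
        needed = required_gb(with_rocksdb)
        ok = has_enough_space(".", needed)
        print(f"SpliceAI-RocksDB={with_rocksdb}: {needed} GB needed, enough space here: {ok}")
```

This only checks one location; if the caches are spread over several file systems, each path needs its own check.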
- Install conda and mamba on your system.
  Tip: Use the conda-libmamba-solver for an improved experience when installing environments with conda.
- Download the VEP cache (if not existing yet):
    VEP_CACHE_PATH="<your cache path here>"
    VEP_VERSION=108
    mamba env create -f scripts/veff/vep_env.v108.yaml --name vep_v108
    conda activate vep_v108
    bash misc/install_vep_cache/install_cache_for_version.sh $VEP_VERSION $VEP_CACHE_PATH
    conda deactivate
- Download the CADD cache (if not existing yet):
    CADD_CACHE_PATH="<your cache path here>"
    bash misc/download_CADD_v1.6.sh $CADD_CACHE_PATH
- Download LOFTEE data and scripts:
    LOFTEE_DIR="<your path here>"
    bash misc/install_vep_cache/download_loftee.sh $LOFTEE_DIR
- Configure system_config.yaml:
  - Specify paths to the VEP cache, CADD cache, LOFTEE data and LOFTEE source code as defined in steps 2-4
  - (optional) Disable downloading the SpliceAI-RocksDB cache for pre-computed SpliceAI annotations by setting absplice.use_spliceai_rocksdb to False
  - (optional) Change file paths of automatically downloaded annotations to a shared location
  - (optional) Any options defined in defaults.yaml can be overwritten in this file if necessary
- Create the pipeline environment:
    mamba env create -f envs/abexp-veff-py.yaml
- Activate the created environment:
    conda activate abexp-veff-py
- Edit the config.yaml and specify the following parameters:
  - vcf_input_dir: All .vcf, .vcf.gz, .vcf.bgz and .bcf files in this folder will be annotated. Genotypes are not required.
  - vcf_is_normalized: Set to True if all variants are left-normalized and biallelic (bcftools norm -cs -m). Otherwise, the pipeline will normalize the variants before annotation.
  - output_dir
  - fasta_file and gtf_file corresponding to the human_genome_version. The gtf_file needs to contain Ensembl gene and transcript identifiers. Therefore, it is highly recommended to use the Gencode genome annotations.
  An example is pre-configured and can be used to test the pipeline.
- Run the pipeline:
    snakemake --use-conda -c all --configfile=config.yaml
  All rules are annotated with resource requirements so that snakemake can submit jobs to HPC clusters or cloud environments. It is highly recommended to use snakemake with a batch submission system, e.g. SLURM. For further information, please visit the Snakemake documentation.
- The resulting variant effect predictions will be stored in <output_dir>/predict/abexp_v1.1/<input_vcf_file>.parquet. Each file contains the following columns:
  - 'chrom': chromosome of the variant
  - 'start': start position of the variant (0-based)
  - 'end': end position of the variant (1-based)
  - 'ref': reference allele
  - 'alt': alternate allele
  - 'gene': the gene affected by the variant
  - 'tissue': GTEx tissue, e.g. "Artery - Tibial"
  - 'tissue_type': GTEx tissue type, e.g. "Blood Vessel"
  - 'abexp_v1.1': the predicted AbExp score
  - a set of features used to predict the AbExp score
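Two conventions from the steps above are easy to get wrong: which files in vcf_input_dir are picked up, and how the 0-based 'start' column relates to the 1-based POS field of a VCF file. The sketch below illustrates both under stated assumptions. It is not the pipeline's own code: the helper names are hypothetical, file matching is assumed to be by file-name suffix (non-recursive, case-sensitive), and 'start' is assumed to point at the first base of 'ref'.

```python
from pathlib import Path

# Suffixes listed for vcf_input_dir in the configuration step.
VCF_SUFFIXES = (".vcf", ".vcf.gz", ".vcf.bgz", ".bcf")


def find_vcf_inputs(vcf_input_dir):
    """Return the files directly inside vcf_input_dir that end in a listed suffix.

    Assumption: matching is by file-name suffix, non-recursive and case-sensitive.
    """
    return sorted(
        p
        for p in Path(vcf_input_dir).iterdir()
        if p.is_file() and p.name.endswith(VCF_SUFFIXES)
    )


def to_vcf_pos(start):
    """Convert the 0-based 'start' column to a 1-based VCF POS."""
    return start + 1


def variant_key(row):
    """Format one output row as 'chrom:pos:ref>alt' with a 1-based position."""
    return f"{row['chrom']}:{to_vcf_pos(row['start'])}:{row['ref']}>{row['alt']}"


if __name__ == "__main__":
    # Made-up row for illustration; real rows come from the parquet output.
    example = {"chrom": "chr1", "start": 99, "end": 100, "ref": "A", "alt": "G"}
    print(variant_key(example))
```

In practice the rows would come from the parquet file, for example via a dataframe library with parquet support; a plain dictionary is used here to keep the sketch free of third-party dependencies.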
All source code and model weights in this repository are licensed under the MIT license.
Please note: AbExp relies on CADD and SpliceAI, both of which are free to use only in non-commercial settings. If you plan to use AbExp in a commercial context, please ensure that you have the appropriate permissions or licenses to use both tools.
Advanced users who want to edit this pipeline can use the following steps to convert the Python scripts back to Jupyter notebooks:
- Make sure that the jupytext command is available, e.g. via mamba install jupytext
- Run
    find scripts/ \( -iname "*.py.py" -o -iname "*.R.R" \) -exec jupytext --sync {} \;
  to convert all percent scripts to Jupyter notebooks. Jupyter will then automatically synchronize the percent scripts with the corresponding notebook files.