AbExp variant effect prediction pipeline

AbExp is a tool to predict aberrant gene expression in 49 human tissues based on DNA sequence variants. It was trained on aberrant gene expression calls from the GTEx dataset.

This repository contains a bioinformatics software pipeline that calculates AbExp variant effect predictions from VCF files. The method is described in a publication in Nature Communications.

Minimum resource requirements

Linux
Disk space: 1.0-1.4TB for cache files (hg19 + hg38)
- LOFTEE: 25GB
- VEP v108: 113GB
- CADD v1.6: 854GB
- SpliceAI-RocksDB (optional): 349GB
RAM: 64GB
GPU supporting CUDA for SpliceAI annotation
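
The disk space range above can be reproduced from the individual cache sizes. The following is a minimal sketch, assuming that the lower bound corresponds to skipping the optional SpliceAI-RocksDB cache and that 1 TB is counted as 1000 GB; neither assumption is stated explicitly in this README.

```python
# Cache sizes in GB, copied from the list above.
REQUIRED_GB = {"LOFTEE": 25, "VEP v108": 113, "CADD v1.6": 854}
OPTIONAL_GB = {"SpliceAI-RocksDB": 349}


def total_tb(*groups):
    """Sum the cache sizes of all given groups and convert GB to TB.

    Assumption: 1 TB = 1000 GB.
    """
    return sum(sum(group.values()) for group in groups) / 1000


# Assumption: the lower bound excludes the optional SpliceAI-RocksDB cache.
minimum_tb = total_tb(REQUIRED_GB)
maximum_tb = total_tb(REQUIRED_GB, OPTIONAL_GB)

print(f"{minimum_tb:.2f} TB to {maximum_tb:.2f} TB")
```

Under these assumptions the totals are 992 GB and 1341 GB, which agrees with the stated 1.0-1.4TB.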

Setup

Install conda and mamba on your system.

Tip: Use the conda-libmamba-solver for an improved experience when installing environments with conda

Download the VEP cache (if it does not exist yet):

VEP_CACHE_PATH="<your cache path here>"
VEP_VERSION=108

mamba env create -f scripts/veff/vep_env.v108.yaml --name vep_v108
conda activate vep_v108

bash misc/install_vep_cache/install_cache_for_version.sh $VEP_VERSION $VEP_CACHE_PATH

conda deactivate

Download the CADD cache (if it does not exist yet):

CADD_CACHE_PATH="<your cache path here>"

bash misc/download_CADD_v1.6.sh $CADD_CACHE_PATH

Download LOFTEE data and scripts:

LOFTEE_DIR="<your path here>"

bash misc/install_vep_cache/download_loftee.sh $LOFTEE_DIR

Configure system_config.yaml:
- Specify the paths to the VEP cache, CADD cache, LOFTEE data and LOFTEE source code as defined in the download steps above
- (optional) Disable downloading the SpliceAI-RocksDB cache for pre-computed SpliceAI annotations by setting absplice.use_spliceai_rocksdb to False
- (optional) Change file paths of automatically downloaded annotations to shared location
- (optional) Any options defined in defaults.yaml can be overwritten in this file if necessary
Run mamba env create -f envs/abexp-veff-py.yaml
Activate the created environment: conda activate abexp-veff-py
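
To illustrate what overwriting an option from defaults.yaml in system_config.yaml means, here is a minimal sketch of a recursive dictionary merge. How the pipeline actually combines the two files is not described in this README, so `merge_config` is a hypothetical helper, and the dictionaries below are invented stand-ins for the parsed YAML files. Only the key `absplice.use_spliceai_rocksdb` is taken from the text above.

```python
import copy


def merge_config(defaults, overrides):
    """Return a new dict in which values from `overrides` replace those in `defaults`.

    Nested dicts are merged key by key; any other value is replaced as a whole.
    Hypothetical helper, not part of the AbExp pipeline.
    """
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Invented stand-in for defaults.yaml; `other_option` is a made-up key.
defaults = {"absplice": {"use_spliceai_rocksdb": True, "other_option": "kept"}}
# Invented stand-in for system_config.yaml, disabling the SpliceAI-RocksDB cache.
system_config = {"absplice": {"use_spliceai_rocksdb": False}}

config = merge_config(defaults, system_config)
print(config)
```

In this sketch only the overridden key changes, while sibling options keep their default values.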

Usage

Edit the config.yaml and specify the following parameters:
- vcf_input_dir. All .vcf|.vcf.gz|.vcf.bgz|.bcf files in this folder will be annotated. Genotypes are not required.
- vcf_is_normalized: True if all variants are left-normalized and biallelic (bcftools norm -cs -m). Otherwise, the pipeline will normalize the variants before annotation.
- output_dir
- fasta_file and gtf_file corresponding to the human_genome_version. The gtf_file needs to contain Ensembl gene and transcript identifiers. Therefore, it is highly recommended to use the Gencode genome annotations.
An example is pre-configured and can be used to test the pipeline.
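
Before running the pipeline, it can be useful to check which files in vcf_input_dir match the listed extensions. The following is a minimal sketch; `list_vcf_inputs` is a hypothetical helper, and it assumes a non-recursive, case-sensitive match on the file name, which may differ from how the pipeline itself collects its inputs.

```python
from pathlib import Path

# File name endings listed for vcf_input_dir above.
VCF_SUFFIXES = (".vcf", ".vcf.gz", ".vcf.bgz", ".bcf")


def list_vcf_inputs(vcf_input_dir):
    """Return the sorted files directly inside `vcf_input_dir` with a listed ending.

    Assumption: subfolders are not searched and matching is case-sensitive.
    """
    folder = Path(vcf_input_dir)
    return sorted(
        path
        for path in folder.iterdir()
        if path.is_file() and path.name.endswith(VCF_SUFFIXES)
    )
```

Calling `list_vcf_inputs("path/to/vcfs")` returns a list of `Path` objects; index files such as `*.vcf.gz.tbi` are not matched because they do not end in one of the listed suffixes.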
Run snakemake --use-conda -c all --configfile=config.yaml. All rules are annotated with resource requirements so that snakemake can submit jobs to HPC clusters or cloud environments. It is highly recommended to use snakemake with a batch submission system such as SLURM. For further information, please visit the Snakemake documentation.
The resulting variant effect predictions will be stored in <output_dir>/predict/abexp_v1.1/<input_vcf_file>.parquet. It will contain the following columns:
- 'chrom': chromosome of the variant
- 'start': start position of the variant (0-based)
- 'end': end position of the variant (1-based)
- 'ref': reference allele
- 'alt': alternate allele
- 'gene': the gene affected by the variant
- 'tissue': GTEx tissue, e.g. "Artery - Tibial"
- 'tissue_type': GTEx tissue type, e.g. "Blood Vessel"
- 'abexp_v1.1': The predicted AbExp score
- a set of features used to predict the AbExp score
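
The 'start' and 'end' columns use different bases, which is easy to get wrong when comparing against 1-based VCF positions. The following is a minimal sketch of the conversion, assuming that 'end' is the 1-based position of the last covered base (equivalently, a 0-based exclusive end). Both function names are hypothetical.

```python
def first_base_1based(start):
    """Convert a 0-based 'start' to the 1-based position of the same base."""
    return start + 1


def span_length(start, end):
    """Number of bases covered by a 0-based 'start' and a 1-based 'end'.

    Assumption: 'end' is inclusive in 1-based coordinates, so no +1 is needed.
    """
    return end - start


# A single-base variant at 1-based position 100 would have start=99, end=100.
print(first_base_1based(99), span_length(99, 100))
```

The table itself can be loaded with any Parquet reader, for example `pandas.read_parquet`.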

License

All source code and model weights in this repository are licensed under the MIT license.

Please note: AbExp relies on CADD and SpliceAI, both of which are free to use only in non-commercial settings. If you plan to use AbExp in a commercial context, please ensure that you have the appropriate permissions or licenses to use both tools.

Development setup

Advanced users who want to edit this pipeline can use the following steps to convert the python scripts back to Jupyter notebooks:

Make sure that the jupytext command is available, e.g. via mamba install jupytext
Run find scripts/ -iname "*[.py.py|.R.R]" -exec jupytext --sync {} \; to convert all percent scripts to Jupyter notebooks. Jupyter will then automatically synchronize the percent scripts with the corresponding notebook files.

