This workflow is designed to compute quality metrics as required by BfArM for genome data centers (Genomrechenzentren, GRZs) and serves as a reference implementation for all Leistungserbringer (LEs).
Important
- Leistungserbringer are not required to use this specific workflow to calculate the metrics. Any method that produces reasonably matching results can be used; the GRZs will use this workflow to validate the reported metrics.
- Please note that we are neither permitted nor able to provide direct support for running this QC workflow in hospitals.
- Features such as running on pre-mapped reads are not part of the official requirements, but are offered as helpful additions for LEs when feasible.
- We greatly appreciate collaboration and encourage contributions to help improve the workflow.
This workflow is built using Nextflow and processes data roughly according to the following steps:
- Read QC and trimming (`FastQC` and `fastp`/`fastplong`)
  - The pipeline uses the default settings for `fastp`/`fastplong`: adapter trimming is enabled and reads in which more than 40% of the bases are below Q15 are dropped (see the sketch after this list).
- Alignment using `bwa-mem2` or `minimap2`
- MarkDuplicates using `Sambamba` (short reads only)
- Coverage calculation by `mosdepth`
- Present QC for raw reads (`MultiQC`)
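For illustration, the `fastp` defaults described above correspond roughly to the explicit flags below. This is only a sketch with placeholder file names, not the pipeline's actual invocation (the real command lines can be found in the execution report, see below):

```bash
# Illustrative only: fastp defaults spelled out (placeholder file names).
# Adapter trimming is enabled by default, so no extra flag is needed.
# --qualified_quality_phred 15   -> bases below Q15 count as unqualified
# --unqualified_percent_limit 40 -> drop reads with >40% unqualified bases
fastp \
    --in1 sample_R1.fastq.gz --in2 sample_R2.fastq.gz \
    --out1 trimmed_R1.fastq.gz --out2 trimmed_R2.fastq.gz \
    --qualified_quality_phred 15 \
    --unqualified_percent_limit 40
```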
Details on the coverage calculation for the different library types can be found in the documentation.
For the exact command lines executed by the pipeline, check the workflow reports automatically generated by the test pipeline on GitHub. To access these reports, click on the title of the latest successful pipeline run and download one of the `nextflow-pipeline-info` artifacts under the "Artifacts" section at the bottom of the page. The command lines are detailed in the "Tasks" table at the bottom of the `execution_report_*.html` file.
- Install Nextflow (and its dependencies)
- Make sure to have either Conda, Docker, or Singularity available.
- Clone the GitHub repository and create the output directory:

```bash
git clone https://github.com/BfArM-MVH/GRZ_QC_Workflow.git
output_basepath="path/to/analysis/dir"
mkdir -p "${output_basepath}/grzqc_output"
```
This pipeline needs one of the following two inputs:

1. A submission base directory whose folder structure follows the GRZ submission standard (you can also check the test datasets).
2. A CSV samplesheet. This gives more flexibility, providing optional starting points for the analysis from either reads or alignments, and it does not require a GRZ submission directory.

Below are the instructions for case (1); for case (2), please see the documentation.
You can run the pipeline with the flag `--submission_basepath`:

```bash
submission_basepath="path/to/submission/base/directory"

nextflow run main.nf \
    -profile docker \
    --outdir "${output_basepath}/grzqc_output/" \
    --submission_basepath "${submission_basepath}"
```
Depending on the resources of your machine and your task, it is recommended to create and run with your own config file; see the estimated resource requirements for WGS below and the Nextflow documentation.
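For example, a small config file can cap per-process resources and be passed to Nextflow via `-c`. A minimal sketch, assuming a recent Nextflow (`resourceLimits` requires 24.04+); the limits below are placeholders, not recommendations:

```bash
# Minimal sketch: cap pipeline resources to fit the local machine.
# The limits below are placeholders; adjust them to your hardware.
cat > grzqc_local.config <<'EOF'
process {
    resourceLimits = [ cpus: 16, memory: 64.GB, time: 48.h ]
}
EOF

nextflow run main.nf \
    -profile docker \
    -c grzqc_local.config \
    --outdir "${output_basepath}/grzqc_output/" \
    --submission_basepath "${submission_basepath}"
```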
When run as above, the pipeline automatically downloads the necessary reference genomes and creates indices from them. However, when running this pipeline multiple times on different submissions, the download and indexing steps create unnecessary overhead. Furthermore, some environments will not have internet access to download references for each run.
Therefore, it is recommended to run the `test_GRCh37` and `test_GRCh38` profiles first, both to set up the reference files and to make sure that all of the necessary images and containers are in place. Depending on the reference genome you will use, run:
```bash
nextflow run main.nf \
    -profile test_GRCh37,docker \
    --reference_path "your/reference/path"
```

and/or

```bash
nextflow run main.nf \
    -profile test_GRCh38,docker \
    --reference_path "your/reference/path"
```
Please replace `docker` with `singularity` or `conda` depending on your system; the pipeline can run with any of those three profiles.

Now all the necessary files are saved under `your/reference/path` and will be re-used by subsequent pipeline runs that specify the same `--reference_path`, as in the sketch below.
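For instance, a later run on a real submission can then combine `--submission_basepath` with the same `--reference_path` so that nothing is downloaded again (a sketch re-using the variables defined above):

```bash
# Re-use the previously prepared references for a real submission run:
nextflow run main.nf \
    -profile docker \
    --outdir "${output_basepath}/grzqc_output/" \
    --submission_basepath "${submission_basepath}" \
    --reference_path "your/reference/path"
```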
You can also safely remove the test pipeline results by running the following from the cloned repository root:

```bash
rm -rf tests/results
```
A more detailed description of reference files usage can be found here.
For more details about the output files and reports, please refer to the output documentation.
| Column | Description |
|---|---|
| `sampleId` | Sample ID |
| `donorPseudonym` | A unique identifier given by the Leistungserbringer for each donor. |
| `labDataName` | Lab data name |
| `libraryType` | Library type, e.g., `wes` for whole-exome sequencing |
| `sequenceSubtype` | Sequence subtype, e.g., `somatic` or `germline` |
| `genomicStudySubtype` | Genomic study subtype, e.g., `tumor+germline` |
| `qualityControlStatus` | Submission quality control status. Only reported if pre-computed metrics were provided. |
| `meanDepthOfCoverage` | Mean depth of coverage |
| `meanDepthOfCoverageProvided` | Mean depth of coverage from the metadata/samplesheet, if provided. |
| `meanDepthOfCoverageRequired` | Mean depth of coverage required to pass QC |
| `meanDepthOfCoverageDeviation` | Percent deviation of the computed coverage from the provided coverage. |
| `meanDepthOfCoverageQCStatus` | `PASS` or `TOO LOW`, depending on the percent deviation. |
| `percentBasesAboveQualityThreshold` | Percent of bases passing the quality threshold |
| `qualityThreshold` | The quality threshold to pass |
| `percentBasesAboveQualityThresholdProvided` | Percent of bases passing the quality threshold from the metadata/samplesheet, if provided. |
| `percentBasesAboveQualityThresholdRequired` | Percent of bases above the quality threshold required to pass QC |
| `percentBasesAboveQualityThresholdDeviation` | Percent deviation of the computed metric from the provided metric. |
| `percentBasesAboveQualityThresholdQCStatus` | `PASS` or `TOO LOW`, depending on the percent deviation. |
| `targetedRegionsAboveMinCoverage` | Fraction of targeted regions above the minimum coverage |
| `minCoverage` | Minimum coverage for target regions |
| `targetedRegionsAboveMinCoverageProvided` | Fraction of targeted regions above the minimum coverage from the metadata/samplesheet, if provided. |
| `targetedRegionsAboveMinCoverageRequired` | Fraction of targeted regions above the minimum coverage required to pass QC |
| `targetedRegionsAboveMinCoverageQCStatus` | `PASS` or `TOO LOW`, depending on the percent deviation. |
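For a quick look at the per-sample results, the table can be pretty-printed on the command line. A minimal sketch, assuming the report was written as a CSV file (the file name below is a placeholder; the actual name and location are described in the output documentation):

```bash
# Placeholder file name; see the output documentation for the real path.
column -s, -t < "${output_basepath}/grzqc_output/grzqc_results.csv" | less -S
```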
A MultiQC HTML report is also generated by the pipeline. Descriptions of the table columns can be found by hovering over the headers.
QC thresholds are read from `thresholds.json`, which uses the values defined by BfArM.
Nextflow can automatically retrieve almost everything necessary to execute a pipeline from the web, including pipeline code, software dependencies, reference genomes, and remote data sources.
However, if your analysis must run on a system without internet access, you will need to take a few additional steps to ensure all required components are available locally. First, download everything on an internet-connected system (such as your personal computer) and then transfer the files to the offline system using your preferred method.
To set up an offline environment, you will need three key components: a functioning Nextflow installation, the pipeline assets, and the required reference genomes.
On a computer with an internet connection, download the pipeline by running:

```bash
nf-core pipelines download BfArM-MVH/GRZ_QC_Workflow
```

Add the argument `--container-system singularity` to also fetch the Singularity container(s).
Then download the necessary plugins and place them under `${NXF_HOME}/plugins`:

```bash
nextflow plugin install nf-schema@2.1.1
```
The default reference files used in this pipeline are as follows:

- `GRCh37`: `s3://ngi-igenomes/igenomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa`
- `GRCh38`: `https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/references/GRCh38/GRCh38_GIABv3_no_alt_analysis_set_maskedGRC_decoys_MAP2K3_KMT2C_KCNJ18.fasta.gz`
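To stage these references for later transfer, they can be fetched on the internet-connected machine first; a sketch (the local target directories are placeholders, and the iGenomes S3 bucket allows anonymous access, so the AWS CLI can be used with `--no-sign-request`):

```bash
# Fetch the default references on an internet-connected machine.
# The local "references/" directories are placeholders.
mkdir -p references/GRCh37 references/GRCh38
aws s3 cp --no-sign-request \
    s3://ngi-igenomes/igenomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa \
    references/GRCh37/
wget -P references/GRCh38/ \
    "https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/references/GRCh38/GRCh38_GIABv3_no_alt_analysis_set_maskedGRC_decoys_MAP2K3_KMT2C_KCNJ18.fasta.gz"
```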
For more detailed information, please check "Running offline" in the nf-core documentation.
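Once everything has been transferred, a run on the offline system might look like the following sketch (all paths are placeholders; setting `NXF_OFFLINE` prevents Nextflow from attempting any downloads):

```bash
# On the offline system; all paths below are placeholders.
export NXF_OFFLINE=true    # forbid Nextflow from fetching anything remotely
nextflow run /path/to/downloaded/workflow/main.nf \
    -profile singularity \
    --reference_path "/path/to/transferred/references" \
    --submission_basepath "/path/to/submission/base/directory" \
    --outdir "/path/to/analysis/dir/grzqc_output/"
```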
Using the 466 GB `WGS_tumor+germline` test submission dataset from the example GRZ submissions, the pipeline used the following resources:
If the reference build is included:
- 618 CPU hours
- 72 GB maximum RAM (genome indexing)
- 2 TB storage (including the input files)
- around 3 days of wall-clock time
The biggest jobs were the two `bwa-mem2` alignments, which used 300 CPU hours each and a maximum of 32 GB of RAM.
Without the reference build:
- 600 to 620 CPU hours
- 56 GB maximum RAM
- 2 TB storage (saving alignment files)
- around 2.5 days of wall-clock time
BfArM-MVH/GRZ_QC_Workflow was originally written by Yun Wang, Kübra Narci, Travis Wrightsman, Shounak Chakraborty and Florian R. Hölzlwimmer.