
BfArM-MVH/GRZ_QC_Workflow


Introduction

This workflow is designed to compute quality metrics as required by BfArM for genome data centers (Genomrechenzentren, GRZs) and serves as a reference implementation for all Leistungserbringer (LEs).

Important

  • Leistungserbringer are not required to use this specific workflow to calculate metrics. Any method that produces reasonably matching results can be used. The GRZs will use this workflow to validate the reported metrics.
  • Please note that we are neither permitted nor able to provide direct support for running this QC workflow in hospitals.
  • Features such as running on pre-mapped reads are not part of the official requirements, but are offered as helpful additions for LEs when feasible.
  • We greatly appreciate collaboration and encourage contributions to help improve the workflow.

This workflow is built using Nextflow and processes data roughly according to the following steps:

  1. Read QC and trimming (FastQC and fastp/fastplong)
    • The pipeline uses the default settings for fastp/fastplong: adapter trimming is enabled and reads with >40% Q15 or lower bases are dropped.
  2. Alignment using bwa-mem2 or minimap2
  3. MarkDuplicates using Sambamba (short reads only)
  4. Coverage calculation by mosdepth
  5. Present QC for raw reads (MultiQC)
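The fastp default quality filter described in step 1 can be sketched in a few lines of shell. This is our reading of the rule (a read is dropped when more than 40% of its bases are below Q15), illustrated with awk; it is not fastp's actual code, and the example quality string is made up:

```shell
# Illustrative only: drop a read when >40% of its bases are below Q15.
qual='IIIII###'   # Phred+33 quality string: 5 bases at Q40, 3 at Q2
verdict=$(awk -v q="$qual" 'BEGIN {
  n = length(q); bad = 0
  for (i = 1; i <= n; i++)
    if (substr(q, i, 1) < "0") bad++   # "0" encodes Q15 in Phred+33
  print ((bad / n > 0.40) ? "DROP" : "KEEP")
}')
echo "$verdict"
```

Here 3 of 8 bases (37.5%) are below Q15, which is under the 40% limit, so the read is kept.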

[Figure: a high-level diagram of the GRZ QC workflow in a metro-map style]

Details on the coverage calculation of different library types can be found in the documentation.

For the exact command lines executed by the pipeline, you can check the workflow reports automatically generated by the test pipeline on GitHub. To access these reports, click on the title of the latest successful pipeline run and download one of the nextflow-pipeline-info artifacts under the "Artifacts" section at the bottom of the page. The command lines are detailed in the "Tasks" table at the bottom of the execution_report_*.html file.

Setup

  • Install Nextflow (and its dependencies)
  • Make sure either conda, Docker, or Singularity is available.
  • Clone the GitHub repository:
git clone https://github.com/BfArM-MVH/GRZ_QC_Workflow.git
output_basepath="path/to/analysis/dir"
mkdir -p ${output_basepath}/grzqc_output

Usage

This pipeline needs one of the following two inputs:

  1. A submission base directory path with a folder structure following the GRZ submission standard. You can also check the test datasets.

  2. A CSV samplesheet. This gives more flexibility by allowing optional starting points for the analysis from either reads or alignments, and does not require a GRZ submission directory.

The instructions below cover case (1); for case (2), please see the documentation.

You can then run the pipeline with the --submission_basepath flag:

submission_basepath="path/to/submission/base/directory"
nextflow run main.nf \
    -profile docker \
    --outdir "${output_basepath}/grzqc_output/" \
    --submission_basepath "${submission_basepath}"

Depending on the resources of your machine and your task, it is recommended to create and run with your own config file; see the estimated resource requirements for WGS and the Nextflow documentation.
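Such a custom config file could look like the sketch below. The limits are placeholders, not recommendations, and the `resourceLimits` option requires a reasonably recent Nextflow version; adjust to your machine and pass the file with `-c`:

```groovy
// custom.config -- hypothetical example; the numbers below are placeholders.
// Use it with: nextflow run main.nf -c custom.config ...
process {
    // Cap what any single task may request:
    resourceLimits = [ cpus: 16, memory: 64.GB, time: 48.h ]
}
```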

Caching reference files

With the above command, the pipeline automatically downloads the necessary reference genomes and creates indices from them. However, when running this pipeline multiple times on different submissions, the download and indexing steps create unnecessary overhead. Furthermore, some environments have no internet access to download references for each run.

Therefore, it is recommended to run the test_GRCh37 and test_GRCh38 profiles to set up the reference files and to verify that all necessary images and containers are set up correctly. Depending on the reference genome you will use, run:

nextflow run main.nf \
    -profile test_GRCh37,docker \
    --reference_path "your/reference/path"

and/or

nextflow run main.nf \
    -profile test_GRCh38,docker \
    --reference_path "your/reference/path"

Please replace docker with singularity or conda depending on your system. This pipeline can run with any of these three profiles.

Now, all the necessary files are saved into your/reference/path and will be re-used for subsequent pipeline runs that specify the same --reference_path.

You can also safely remove the test pipeline results by running the following from the cloned repository root.

rm -rf tests/results

A more detailed description of reference files usage can be found here.

Pipeline output

For more details about the output files and reports, please refer to the output documentation.

report.csv

| Column | Description |
|---|---|
| sampleId | Sample ID |
| donorPseudonym | A unique identifier given by the Leistungserbringer for each donor |
| labDataName | Lab data name |
| libraryType | Library type, e.g., wes for whole-exome sequencing |
| sequenceSubtype | Sequence subtype, e.g., somatic or germline |
| genomicStudySubtype | Genomic study subtype, e.g., tumor+germline |
| qualityControlStatus | Submission quality control status; only reported if pre-computed metrics are provided |
| meanDepthOfCoverage | Mean depth of coverage |
| meanDepthOfCoverageProvided | Mean depth of coverage from metadata/samplesheet, if provided |
| meanDepthOfCoverageRequired | Mean depth of coverage required to pass QC |
| meanDepthOfCoverageDeviation | Percent deviation of computed coverage from provided coverage |
| meanDepthOfCoverageQCStatus | PASS or TOO LOW, depending on percent deviation |
| percentBasesAboveQualityThreshold | Percent of bases passing the quality threshold |
| qualityThreshold | The quality threshold to pass |
| percentBasesAboveQualityThresholdProvided | Percent of bases passing the quality threshold from metadata/samplesheet, if provided |
| percentBasesAboveQualityThresholdRequired | Percent of bases above the quality threshold required to pass QC |
| percentBasesAboveQualityThresholdDeviation | Percent deviation of the computed metric from the provided metric |
| percentBasesAboveQualityThresholdQCStatus | PASS or TOO LOW, depending on percent deviation |
| targetedRegionsAboveMinCoverage | Fraction of targeted regions above minimum coverage |
| minCoverage | Minimum coverage for target regions |
| targetedRegionsAboveMinCoverageProvided | Fraction of targeted regions above minimum coverage from metadata/samplesheet, if provided |
| targetedRegionsAboveMinCoverageRequired | Fraction of targeted regions above minimum coverage required to pass QC |
| targetedRegionsAboveMinCoverageQCStatus | PASS or TOO LOW, depending on percent deviation |
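The deviation columns compare a computed metric against its provided counterpart. As an illustration of how such a percent deviation could be derived (the formula below is our assumption; the pipeline's exact computation may differ):

```shell
# Hypothetical percent-deviation calculation (formula assumed, not taken
# from the pipeline source): (computed - provided) / provided * 100
computed=98.5
provided=100.0
deviation=$(awk -v c="$computed" -v p="$provided" \
  'BEGIN { printf "%.2f", (c - p) / p * 100 }')
echo "$deviation"
```

With these example values, the computed coverage is 1.5% below the provided value.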

MultiQC

A MultiQC HTML report is also generated by the pipeline. Descriptions of the table columns can be found by hovering over the headers.

Thresholds

QC thresholds are read from thresholds.json, which uses the values defined by BfArM.

Running the pipeline offline

Nextflow can automatically retrieve almost everything necessary to execute a pipeline from the web, including pipeline code, software dependencies, reference genomes, and remote data sources.

However, if your analysis must run on a system without internet access, you will need to take a few additional steps to ensure all required components are available locally. First, download everything on an internet-connected system (such as your personal computer) and then transfer the files to the offline system using your preferred method.

To set up an offline environment, you will need three key components: a functioning Nextflow installation, the pipeline assets, and the required reference genomes.

To download the pipeline, run the following on a computer with an internet connection:

nf-core pipelines download BfArM-MVH/GRZ_QC_Workflow

Add the argument --container-system singularity to also fetch the Singularity container(s).

Then download the necessary plugins and place them under ${NXF_HOME}/plugins:

nextflow plugin install nf-schema@2.1.1

The default reference files used in this pipeline are as follows:

  • GRCh37: s3://ngi-igenomes/igenomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa
  • GRCh38: https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/references/GRCh38/GRCh38_GIABv3_no_alt_analysis_set_maskedGRC_decoys_MAP2K3_KMT2C_KCNJ18.fasta.gz

For more detailed information, please check "Running offline" in the nf-core documentation.
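Once everything is transferred, a run on the offline system could look like the sketch below. All paths are placeholders; `NXF_OFFLINE` tells Nextflow not to attempt any network access:

```shell
# Sketch of an offline invocation; adjust the placeholder paths to
# wherever you transferred the pipeline, plugins, and references.
export NXF_OFFLINE=true   # prevent Nextflow from attempting downloads
# nextflow run /path/to/GRZ_QC_Workflow/main.nf \
#     -profile singularity \
#     --reference_path /local/references \
#     --submission_basepath /local/submission \
#     --outdir /local/grzqc_output
echo "$NXF_OFFLINE"
```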

Estimated resource requirements

Using the 466 GB WGS_tumor+germline test submission dataset from the example GRZ submissions, the pipeline used the following resources:

With reference build:

  • 618 CPU hours
  • 72 GB maximum RAM (genome indexing)
  • 2 TB storage (including the input files)
  • Takes around 3 days

The biggest jobs were the two bwa-mem2 alignments which used 300 CPU hours each and a maximum of 32 GB of RAM.

Without reference build:

  • 600 to 620 CPU hours
  • 56 GB maximum RAM
  • 2 TB storage (saving alignment files)
  • Takes around two and a half days.

Contributions and Support

BfArM-MVH/GRZ_QC_Workflow was originally written by Yun Wang, Kübra Narci, Travis Wrightsman, Shounak Chakraborty and Florian R. Hölzlwimmer.
