This workflow is designed to compute quality metrics as required by BfArM for genome data centers (Genomrechenzentren, GRZs) and serves as a reference implementation for all Leistungserbringer (LEs).
Important
- Leistungserbringer are not required to use this specific workflow to calculate the metrics. Any method that produces reasonably matching results can be used; the GRZs will use this workflow to validate the reported metrics.
- Please note that we are neither permitted nor able to provide direct support for running this QC workflow in hospitals.
- Features such as running on pre-mapped reads are not part of the official requirements, but are offered as helpful additions for LEs when feasible.
- We greatly appreciate collaboration and encourage contributions to help improve the workflow.
This workflow is built using Nextflow and processes data roughly according to the following steps:
- Read QC and trimming (`FastQC` and `fastp`/`fastplong`)
  - The pipeline uses the default settings for `fastp`/`fastplong`: adapter trimming is enabled and reads in which more than 40% of the bases are below Q15 are dropped (see the sketch after this list).
- Alignment using `bwa-mem2` or `minimap2`
- MarkDuplicates using `Sambamba` (short reads only)
- Coverage calculation by `mosdepth`
- Present QC for raw reads (`MultiQC`)
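For illustration, the `fastp` defaults described above correspond roughly to the explicit flags below. This is only a sketch with placeholder file names, not the pipeline's actual invocation (the real command lines can be found in the execution report, see below):

```bash
# Illustrative only: fastp defaults spelled out (placeholder file names).
# Adapter trimming is enabled by default, so no extra flag is needed.
# --qualified_quality_phred 15   -> bases below Q15 count as unqualified
# --unqualified_percent_limit 40 -> drop reads with >40% unqualified bases
fastp \
    --in1 sample_R1.fastq.gz --in2 sample_R2.fastq.gz \
    --out1 trimmed_R1.fastq.gz --out2 trimmed_R2.fastq.gz \
    --qualified_quality_phred 15 \
    --unqualified_percent_limit 40
```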
Details on the coverage calculation for the different library types can be found in the documentation.
For the exact command lines executed by the pipeline, check the workflow reports automatically generated by the test pipeline on GitHub. To access these reports, click on the title of the latest successful pipeline run and download one of the `nextflow-pipeline-info` artifacts under the "Artifacts" section at the bottom of the page. The command lines are detailed in the "Tasks" table at the bottom of the `execution_report_*.html` file.
- Install Nextflow (and its dependencies)
- Make sure to have either Conda, Docker, or Singularity available.
- Clone the GitHub repository and create the output directory:

```bash
git clone https://github.com/BfArM-MVH/GRZ_QC_Workflow.git
output_basepath="path/to/analysis/dir"
mkdir -p "${output_basepath}/grzqc_output"
```
This pipeline needs one of the following two inputs:

1. A submission base directory whose folder structure follows the GRZ submission standard (you can also check the test datasets).
2. A CSV samplesheet. This gives more flexibility, providing optional starting points for the analysis from either reads or alignments, and it does not require a GRZ submission directory.

Below are the instructions for case (1); for case (2), please see the documentation.
You can run the pipeline with the flag `--submission_basepath`:

```bash
submission_basepath="path/to/submission/base/directory"

nextflow run main.nf \
    -profile docker \
    --outdir "${output_basepath}/grzqc_output/" \
    --submission_basepath "${submission_basepath}"
```
Depending on the resources of your machine and your task, it is recommended to create and run with your own config file; see the estimated resource requirements for WGS below and the Nextflow documentation.
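For example, a small config file can cap per-process resources and be passed to Nextflow via `-c`. A minimal sketch, assuming a recent Nextflow (`resourceLimits` requires 24.04+); the limits below are placeholders, not recommendations:

```bash
# Minimal sketch: cap pipeline resources to fit the local machine.
# The limits below are placeholders; adjust them to your hardware.
cat > grzqc_local.config <<'EOF'
process {
    resourceLimits = [ cpus: 16, memory: 64.GB, time: 48.h ]
}
EOF

nextflow run main.nf \
    -profile docker \
    -c grzqc_local.config \
    --outdir "${output_basepath}/grzqc_output/" \
    --submission_basepath "${submission_basepath}"
```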
When run as above, the pipeline automatically downloads the necessary reference genomes and creates indices from them. However, when running this pipeline multiple times on different submissions, the download and indexing steps create unnecessary overhead. Furthermore, some environments will not have internet access to download references for each run.
Therefore, it is recommended to run the `test_GRCh37` and `test_GRCh38` profiles first, both to set up the reference files and to make sure that all of the necessary images and containers are in place. Depending on the reference genome you will use, run:
```bash
nextflow run main.nf \
    -profile test_GRCh37,docker \
    --reference_path "your/reference/path"
```

and/or

```bash
nextflow run main.nf \
    -profile test_GRCh38,docker \
    --reference_path "your/reference/path"
```
Please replace `docker` with `singularity` or `conda` depending on your system; the pipeline can run with any of those three profiles.

Now all the necessary files are saved under `your/reference/path` and will be re-used by subsequent pipeline runs that specify the same `--reference_path`, as in the sketch below.
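For instance, a later run on a real submission can then combine `--submission_basepath` with the same `--reference_path` so that nothing is downloaded again (a sketch re-using the variables defined above):

```bash
# Re-use the previously prepared references for a real submission run:
nextflow run main.nf \
    -profile docker \
    --outdir "${output_basepath}/grzqc_output/" \
    --submission_basepath "${submission_basepath}" \
    --reference_path "your/reference/path"
```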
You can also safely remove the test pipeline results by running the following from the cloned repository root:

```bash
rm -rf tests/results
```
A more detailed description of reference files usage can be found here.
For more details about the output files and reports, please refer to the output documentation.
| Column | Description |
|---|---|
| `sampleId` | Sample ID |
| `donorPseudonym` | A unique identifier given by the Leistungserbringer for each donor. |
| `labDataName` | Lab data name |
| `libraryType` | Library type, e.g., `wes` for whole-exome sequencing |
| `sequenceSubtype` | Sequence subtype, e.g., `somatic` or `germline` |
| `genomicStudySubtype` | Genomic study subtype, e.g., `tumor+germline` |
| `qualityControlStatus` | Submission quality control status. Only reported if pre-computed metrics were provided. |
| `meanDepthOfCoverage` | Mean depth of coverage |
| `meanDepthOfCoverageProvided` | Mean depth of coverage from the metadata/samplesheet, if provided. |
| `meanDepthOfCoverageRequired` | Mean depth of coverage required to pass QC |
| `meanDepthOfCoverageDeviation` | Percent deviation of the computed coverage from the provided coverage. |
| `meanDepthOfCoverageQCStatus` | `PASS` or `TOO LOW`, depending on the percent deviation. |
| `percentBasesAboveQualityThreshold` | Percent of bases passing the quality threshold |
| `qualityThreshold` | The quality threshold to pass |
| `percentBasesAboveQualityThresholdProvided` | Percent of bases passing the quality threshold from the metadata/samplesheet, if provided. |
| `percentBasesAboveQualityThresholdRequired` | Percent of bases above the quality threshold required to pass QC |
| `percentBasesAboveQualityThresholdDeviation` | Percent deviation of the computed metric from the provided metric. |
| `percentBasesAboveQualityThresholdQCStatus` | `PASS` or `TOO LOW`, depending on the percent deviation. |
| `targetedRegionsAboveMinCoverage` | Fraction of targeted regions above the minimum coverage |
| `minCoverage` | Minimum coverage for target regions |
| `targetedRegionsAboveMinCoverageProvided` | Fraction of targeted regions above the minimum coverage from the metadata/samplesheet, if provided. |
| `targetedRegionsAboveMinCoverageRequired` | Fraction of targeted regions above the minimum coverage required to pass QC |
| `targetedRegionsAboveMinCoverageQCStatus` | `PASS` or `TOO LOW`, depending on the percent deviation. |
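For a quick look at the per-sample results, the table can be pretty-printed on the command line. A minimal sketch, assuming the report was written as a CSV file (the file name below is a placeholder; the actual name and location are described in the output documentation):

```bash
# Placeholder file name; see the output documentation for the real path.
column -s, -t < "${output_basepath}/grzqc_output/grzqc_results.csv" | less -S
```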
A MultiQC HTML report is also generated by the pipeline. Descriptions of the table columns can be found by hovering over the headers.
QC thresholds are read from `thresholds.json`, which uses the values defined by BfArM.
Nextflow can automatically retrieve almost everything necessary to execute a pipeline from the web, including pipeline code, software dependencies, reference genomes, and remote data sources.
However, if your analysis must run on a system without internet access, you will need to take a few additional steps to ensure all required components are available locally. First, download everything on an internet-connected system (such as your personal computer) and then transfer the files to the offline system using your preferred method.
To set up an offline environment, you will need three key components: a functioning Nextflow installation, the pipeline assets, and the required reference genomes.
On a computer with an internet connection, download the pipeline by running:

```bash
nf-core pipelines download BfArM-MVH/GRZ_QC_Workflow
```

Add the argument `--container-system singularity` to also fetch the Singularity container(s).
Then download the necessary plugins and place them under `${NXF_HOME}/plugins`:

```bash
nextflow plugin install nf-schema@2.1.1
```
The default reference files used in this pipeline are as follows:

- `GRCh37`: `s3://ngi-igenomes/igenomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa`
- `GRCh38`: `https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/references/GRCh38/GRCh38_GIABv3_no_alt_analysis_set_maskedGRC_decoys_MAP2K3_KMT2C_KCNJ18.fasta.gz`
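To stage these references for later transfer, they can be fetched on the internet-connected machine first; a sketch (the local target directories are placeholders, and the iGenomes S3 bucket allows anonymous access, so the AWS CLI can be used with `--no-sign-request`):

```bash
# Fetch the default references on an internet-connected machine.
# The local "references/" directories are placeholders.
mkdir -p references/GRCh37 references/GRCh38
aws s3 cp --no-sign-request \
    s3://ngi-igenomes/igenomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa \
    references/GRCh37/
wget -P references/GRCh38/ \
    "https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/references/GRCh38/GRCh38_GIABv3_no_alt_analysis_set_maskedGRC_decoys_MAP2K3_KMT2C_KCNJ18.fasta.gz"
```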
For more detailed information, please check "Running offline" in the nf-core documentation.
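Once everything has been transferred, a run on the offline system might look like the following sketch (all paths are placeholders; setting `NXF_OFFLINE` prevents Nextflow from attempting any downloads):

```bash
# On the offline system; all paths below are placeholders.
export NXF_OFFLINE=true    # forbid Nextflow from fetching anything remotely
nextflow run /path/to/downloaded/workflow/main.nf \
    -profile singularity \
    --reference_path "/path/to/transferred/references" \
    --submission_basepath "/path/to/submission/base/directory" \
    --outdir "/path/to/analysis/dir/grzqc_output/"
```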
Using the 466 GB `WGS_tumor+germline` test submission dataset from the example GRZ submissions, the pipeline used the following resources:
If the reference build is included:
- 618 CPU hours
- 72 GB maximum RAM (genome indexing)
- 2 TB storage (including the input files)
- around 3 days of wall-clock time
The biggest jobs were the two `bwa-mem2` alignments, which used 300 CPU hours each and a maximum of 32 GB of RAM.
Without the reference build:
- 600 to 620 CPU hours
- 56 GB maximum RAM
- 2 TB storage (saving alignment files)
- around 2.5 days of wall-clock time
BfArM-MVH/GRZ_QC_Workflow was originally written by Yun Wang, Kübra Narci, Travis Wrightsman, Shounak Chakraborty and Florian R. Hölzlwimmer.