Commit 1647aa9

Added QC metrics definitions
1 parent f766a56 commit 1647aa9

2 files changed

+214
-2
lines changed

Lines changed: 213 additions & 0 deletions
@@ -0,0 +1,213 @@
1+
# QC Metrics Summary Specification
2+
3+
## Description
4+
5+
The qc_metrics summary file is a comma-separated file, created after processing is complete, that
6+
summarizes the metadata and QC metrics for a dataset. [Metadata](#metadata-fields) for the
7+
sample and assay are pulled from the ISA.zip file. QC metrics, including
8+
[raw and trimmed read metrics](#read-qc-metrics), [alignment metrics](#alignment-metrics),
9+
[gene count metrics](#gene-count-metrics), and [RSeQC metrics](#rseqc-metrics), are pulled from the
10+
MultiQC reports generated during processing.
11+
12+
> **NOTE:** Eukaryotic and prokaryotic data use different tools for alignment and gene counting, so
13+
the metrics reported will differ for those data types. Similarly, paired-end data will include
14+
metrics for the reverse read, while single-end data will not. All data columns will be present in
15+
the output files regardless of data type. Any fields that are not relevant for a particular data
16+
type will be left empty.
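
Because every column is present for every data type and non-applicable fields are left empty, a consumer of the file needs to tell an empty field apart from a real value. The following is a minimal Python sketch of one way to do that, assuming the file has a header row and that an empty string marks a non-applicable field. The sample names and the column names `raw_total_sequences_f` and `raw_total_sequences_r` are illustrative assumptions, not taken from a real output file.

```python
import csv
import io

# Hypothetical two-row excerpt; a real qc_metrics file has many more columns.
SAMPLE_CSV = """sample,library_layout,raw_total_sequences_f,raw_total_sequences_r
Sample_A,PAIRED,82186286,82186286
Sample_B,SINGLE,15066949,
"""


def load_qc_metrics(handle):
    """Read a qc_metrics-style CSV, mapping empty fields to None."""
    reader = csv.DictReader(handle)
    return [
        {column: (value if value != "" else None) for column, value in row.items()}
        for row in reader
    ]


rows = load_qc_metrics(io.StringIO(SAMPLE_CSV))
for row in rows:
    # Columns that do not apply to this sample's data type come back as None.
    missing = [column for column, value in row.items() if value is None]
    print(row["sample"], "empty columns:", missing)
```

Mapping empty strings to `None` up front keeps later numeric conversions from silently treating a non-applicable field as zero.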
17+
18+
<br>
19+
<br>
20+
21+
-----------
22+
### Metadata fields
23+
*Source: ISA.zip*
24+
Lists selected metadata fields for each sample.
25+
26+
| Column Name | Data Source | Description | Example(s) |
27+
|:----------------------|:--------------------|:----------------------------------------------------------------------------------------------------------|:-----------------------------------------|
28+
| osd_num | investigation table | Study identifier | "OSD-515" |
29+
| sample                | sample table        | Sample identifier                                                                                         | "RR23_TMS_FLT_F1"                        |
30+
| organism              | sample table        | Genus and species of primary organism under study                                                         | "Mus musculus"                           |
31+
| tissue | sample table | 'Material Type' listed in sample table (e.g. organ, tissue, cell culture, whole organism) | "Cells, cultured", "thymus" |
32+
| sequencing_instrument | assay table | Sequencer used to sequence the samples | "Illumina NovaSeq 6000" |
33+
| library_selection | assay table | Biomolecule selection (rRNA depletion, mRNA enrichment, target amplification) | "Ribo-depletion", "polyA enrichment" |
34+
| library_layout | assay table | RNA library construction parameter indicating whether the library was sequenced single- or paired-end. | "SINGLE", "PAIRED" |
35+
| strandedness | assay table | RNAseq library construction parameter indicating whether the library preserves the transcript orientation | "STRANDED", "UNSTRANDED" |
36+
| read_depth | assay table | Total number of sequenced fragments | 82186286 |
37+
| read_length | assay table | Raw data read length in bases | 151 |
38+
| rrna_contamination | assay table | Percent rRNA found in the data | 2.74 |
39+
| rin | assay table | RNA integrity number | 7.4 |
40+
| organism_part | sample table | For plant samples, the part of the plant from which the sample was derived. | "Plant Roots" |
41+
| cell_line | sample table | Cell line identifier | "IMR90 hiPSC" |
42+
| cell_type | sample table | The morphological or functional form of the cells used. | "Myocytes, Cardiac" |
43+
| secondary_organism | sample table | Additional organism(s) which may confound assay measurements (e.g. food source, host, symbiote) | "Vibrio fischeri ES114" |
44+
| strain | sample table | Strain, breed, ecotype, etc. | "C57BL/6J" |
45+
| animal_source | sample table | The source of the animal(s) used in the study | "Jackson Laboratory" |
46+
| seed_source | sample table | The source of the seed(s) used in the study | "Arabidopsis Biological Resource Center" |
47+
| source_accession | sample table | The source accession number for the animal(s)/seed(s)/cell(s) used in the study | "SALK_027956C" |
48+
| mix | assay table | The ERCC spike-in mix number used | "Mix 1" |
49+
50+
<br>
51+
<br>
52+
53+
-----------
54+
### Read QC metrics
55+
*Source: Raw and Trimmed MultiQC*
56+
QC metrics describing the read quality.
57+
> *NOTE:* The same fields are extracted for both raw and trimmed reads; the prefixes "raw_" and
58+
"trimmed_" indicate the source of the metric. Similarly, the same fields are also extracted for both
59+
forward and reverse reads with the suffixes "_f" and "_r" denoting the read type. The table below
60+
lists each metric only once.
61+
62+
| Column Name | Type | Description | Example(s) |
63+
|:-----------------------|:------|:------------------------------------------------------------------------------------------------------------------|:------------|
64+
| total_sequences | int | Total number of sequences for this read type | 82186286 |
65+
| avg_sequence_length | float | Average sequence length for this read type | 138.7187431 |
66+
| median_sequence_length | float | Median sequence length for this read type | 151 |
67+
| quality_score_mean | float | Average quality score | 35.48372691 |
68+
| quality_score_median | float | Median quality score | 35.61967608 |
69+
| percent_duplicates | float | Percentage estimated sequence duplication, measured based on first 50bp of the first 100,000 reads in the dataset | 67.01577646 |
70+
| percent_gc | float | Overall %GC of all bases in all sequences for this read type | 48 |
71+
| gc_min_1pct | int | Minimum %GC value reached by at least 1% of the total reads of this read type | 31 |
72+
| gc_max_1pct | int | Maximum %GC value reached by at least 1% of the total reads of this read type | 65 |
73+
| gc_auc_25pct           | int   | %GC value at the 25th percentile of total reads of this read type                                                 | 42          |
74+
| gc_auc_50pct           | int   | %GC value at the 50th percentile of total reads of this read type                                                 | 49          |
75+
| gc_auc_75pct           | int   | %GC value at the 75th percentile of total reads of this read type                                                 | 57          |
76+
| n_content_sum | float | %N base calls summed across all base positions in the read for this read type | 1.510487284 |
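
Since the table lists each metric only once, the full set of read QC column names has to be derived by combining each metric with a source prefix and a read suffix. The sketch below shows one way to enumerate them in Python. The joining order (`<prefix><metric><suffix>`, for example `raw_total_sequences_f`) is an assumption based on the note above, and the helper name `expand_read_qc_columns` is hypothetical.

```python
from itertools import product

# Metric names as listed in the read QC table.
READ_QC_METRICS = [
    "total_sequences",
    "avg_sequence_length",
    "median_sequence_length",
    "quality_score_mean",
    "quality_score_median",
    "percent_duplicates",
    "percent_gc",
    "gc_min_1pct",
    "gc_max_1pct",
    "gc_auc_25pct",
    "gc_auc_50pct",
    "gc_auc_75pct",
    "n_content_sum",
]


def expand_read_qc_columns(metrics):
    """Combine each metric with the raw_/trimmed_ prefixes and _f/_r suffixes.

    The exact concatenation order is assumed, not confirmed by the spec.
    """
    prefixes = ["raw_", "trimmed_"]
    suffixes = ["_f", "_r"]
    return [
        f"{prefix}{metric}{suffix}"
        for prefix, metric, suffix in product(prefixes, metrics, suffixes)
    ]


columns = expand_read_qc_columns(READ_QC_METRICS)
print(len(columns), "read QC columns, e.g.", columns[0], "and", columns[-1])
```

For single-end data the `_r` columns would still be present but empty, as described in the note at the top of the specification.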
77+
78+
<br>
79+
<br>
80+
81+
-----------
82+
### Alignment metrics
83+
84+
#### STAR alignment metrics
85+
*Source: STAR MultiQC*
86+
Alignment metrics generated by STAR.
87+
> *Present only for data processed with the eukaryotic pipeline.*
88+
89+
| Column Name | Type | Description | Example |
90+
|:----------------------------|:------|:------------------------------------------------------------------------------------|:--------|
91+
| uniquely_mapped_percent | float | Percent of reads mapped uniquely in the genome | 79.95 |
92+
| multimapped_percent | float | Percent of reads mapped to multiple loci in the genome | 13.14 |
93+
| multimapped_toomany_percent | float | Percent of reads mapped to 20 or more loci in the genome | 0.13 |
94+
| unmapped_tooshort_percent | float | Percent of reads where the best alignment is shorter than the allowed mapped length | 6.53 |
95+
| unmapped_other_percent | float | Percent of reads that couldn't be mapped at all | 0.25 |
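
A quick sanity check on the STAR columns is to add up the percentages and see how much of the input they account for. The Python sketch below does this with the example values from the table, which sum to 100. This is only an illustration: STAR reports further categories that are not listed in this table (for example, reads unmapped because of too many mismatches), so a shortfall below 100 in real data is not necessarily an error. The tolerance value is an arbitrary assumption.

```python
import math

# Example values copied from the STAR alignment metrics table.
star_metrics = {
    "uniquely_mapped_percent": 79.95,
    "multimapped_percent": 13.14,
    "multimapped_toomany_percent": 0.13,
    "unmapped_tooshort_percent": 6.53,
    "unmapped_other_percent": 0.25,
}

total_pct = sum(star_metrics.values())

# Percentage of reads not covered by the five listed categories.
unaccounted_pct = 100.0 - total_pct

# Hypothetical tolerance to absorb rounding in the two-decimal inputs.
fully_accounted = math.isclose(total_pct, 100.0, abs_tol=0.05)

print(f"sum of listed categories: {total_pct:.2f}%")
print(f"fully accounted for: {fully_accounted}")
```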
96+
97+
#### Bowtie2 alignment metrics
98+
*Source: bowtie2 MultiQC*
99+
Alignment metrics generated by bowtie2.
100+
> *Present only for data processed with the prokaryotic pipeline.*
101+
102+
| Column Name | Type | Description | Example |
103+
|:-----------------------|:------|:-----------------------------------------------------------------------------------------------------------------|:---------|
104+
| total_reads | int | total input reads (or read-pairs for paired-end data) | 15066949 |
105+
| overall_alignment_rate | float | percentage of input reads that mapped to the genome (includes discordant and partial alignments). | 98.03 |
106+
| aligned_none | int | number of reads (or read-pairs) that aligned 0 times to the reference genome (concordantly if paired-end) | 516325 |
107+
| aligned_one | int | number of reads (or read-pairs) that aligned exactly 1 time to the reference genome (concordantly if paired-end) | 11294617 |
108+
| aligned_multi | int | number of reads (or read-pairs) that aligned > 1 times to the reference genome (concordantly if paired-end) | 3256007 |
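
The three `aligned_*` counts partition the input reads (or read-pairs), so they should add up to `total_reads`. The Python sketch below checks this with the example values from the table. Note that `overall_alignment_rate` cannot be recomputed from these counts alone, because it also includes discordant and partial alignments.

```python
# Example values copied from the Bowtie2 alignment metrics table.
bowtie2_metrics = {
    "total_reads": 15066949,
    "aligned_none": 516325,
    "aligned_one": 11294617,
    "aligned_multi": 3256007,
}

accounted = (
    bowtie2_metrics["aligned_none"]
    + bowtie2_metrics["aligned_one"]
    + bowtie2_metrics["aligned_multi"]
)

# The three categories should cover every input read or read-pair.
counts_consistent = accounted == bowtie2_metrics["total_reads"]

# Share of reads (or read-pairs) that aligned exactly once.
pct_aligned_one = 100.0 * bowtie2_metrics["aligned_one"] / bowtie2_metrics["total_reads"]

print("counts consistent:", counts_consistent)
print(f"aligned exactly once: {pct_aligned_one:.2f}%")
```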
109+
110+
<br>
111+
<br>
112+
113+
-----------
114+
### Gene count metrics
115+
116+
#### Common gene count metrics
117+
*Source: RSEM_Unnormalized_Counts_GLbulkRNAseq.csv or FeatureCounts_Unnormalized_Counts_GLbulkRNAseq.csv*
118+
Summarizes the data in the Unnormalized_Counts_GLbulkRNAseq.csv files.
119+
> *Present for data processed with either the eukaryotic or prokaryotic pipelines.*
120+
121+
| Column Name | Type | Description | Example |
122+
|:-----------------------|:------|:--------------------------------------------------|:-----------|
123+
| gene_total | int | total number of genes detected | 57186 |
124+
| gene_detected_gt10 | int | number of genes detected with > 10 read depth | 19207 |
125+
| gene_detected_gt10_pct | float | percentage of genes detected with > 10 read depth | 33.5868919 |
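
The example values in the table are consistent with `gene_detected_gt10_pct` being `gene_detected_gt10` expressed as a percentage of `gene_total`. The short Python sketch below reproduces the example under that assumption; the function name `percent_detected` is hypothetical.

```python
def percent_detected(gene_detected_gt10, gene_total):
    """Return genes detected with > 10 read depth as a percentage of gene_total."""
    if gene_total <= 0:
        raise ValueError("gene_total must be positive")
    return 100.0 * gene_detected_gt10 / gene_total


# Example values copied from the common gene count metrics table.
pct = percent_detected(gene_detected_gt10=19207, gene_total=57186)
print(f"gene_detected_gt10_pct = {pct:.7f}")
```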
126+
127+
128+
#### RSEM gene count metrics
129+
*Source: RSEM MultiQC*
130+
Summarizes the alignment rates generated by RSEM during gene expression estimation.
131+
> *Present only for data processed with the eukaryotic pipeline.*
132+
133+
| Column Name | Type | Description | Example |
134+
|:-----------------------|:------|:--------------------------------------------------------|:------------|
135+
| num_uniquely_aligned | int | number of reads aligned uniquely to a gene | 41152050 |
136+
| pct_uniquely_aligned   | float | percentage of reads aligned uniquely to a gene          | 74.48868329 |
137+
| pct_multi_aligned | float | percentage of reads aligned to multiple genes | 15.27982918 |
138+
| pct_filtered | float | percentage of reads filtered due to too many alignments | 0 |
139+
| pct_unalignable | float | percentage of reads unalignable to any gene | 10.23148753 |
140+
141+
#### FeatureCounts gene count metrics
142+
*Source: FeatureCounts MultiQC*
143+
Summarizes the feature counts assigned to genome features.
144+
> *Present only for data processed with the prokaryotic pipeline.*
145+
146+
| Column Name | Type | Description | Example |
147+
|:--------------------------|:------|:---------------------------------------------------------|:------------|
148+
| total_count | int | total number of input alignments | 496 |
149+
| num_assigned | int | number of alignments assigned to a feature | 431 |
150+
| pct_assigned | float | percentage of alignments assigned to a feature | 86.89516129 |
151+
| num_unassigned_nofeatures | int | number of alignments that overlap no features | 19 |
152+
| num_unassigned_ambiguity | int | number of alignments that overlap 2 or more features | 43 |
153+
| pct_unassigned_nofeatures | float | percentage of alignments that overlap no features | 3.830645161 |
154+
| pct_unassigned_ambiguity | float | percentage of alignments that overlap 2 or more features | 8.669354839 |
155+
156+
<br>
157+
<br>
158+
159+
-----------
160+
### RSeQC metrics
161+
162+
#### Genebody coverage metrics
163+
*Source: RSeQC Gene Body Coverage MultiQC*
164+
Summarizes the read coverage over gene bodies. Used to check if the coverage is uniform and if
165+
there is any 5' or 3' bias.
166+
167+
| Column Name | Type | Description | Example |
168+
|:--------------------------|:------|:-------------------------------------------------------------------------------------|:------------|
169+
| mean_genebody_cov_5_20 | float | average read coverage between the 5th and 20th gene body percentile from the 5' end | 70.60419705 |
170+
| mean_genebody_cov_40_60 | float | average read coverage between the 40th and 60th gene body percentile from the 5' end | 99.38086546 |
171+
| mean_genebody_cov_80_95 | float | average read coverage between the 80th and 95th gene body percentile from the 5' end | 87.1712315 |
172+
| ratio_genebody_cov_3_to_5 | float | ratio of the 3' (mean_genebody_cov_80_95) to 5' (mean_genebody_cov_5_20) gene body coverage | 1.234646595 |
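
As the last row states, `ratio_genebody_cov_3_to_5` is the 3' coverage (`mean_genebody_cov_80_95`) divided by the 5' coverage (`mean_genebody_cov_5_20`). A minimal Python sketch using the example values from the table follows; the function name is hypothetical.

```python
def genebody_3_to_5_ratio(mean_cov_80_95, mean_cov_5_20):
    """Ratio of 3' to 5' gene body coverage; values above 1 indicate more 3' coverage."""
    if mean_cov_5_20 == 0:
        raise ValueError("5' coverage is zero; ratio is undefined")
    return mean_cov_80_95 / mean_cov_5_20


# Example values copied from the genebody coverage metrics table.
ratio = genebody_3_to_5_ratio(mean_cov_80_95=87.1712315, mean_cov_5_20=70.60419705)
print(f"ratio_genebody_cov_3_to_5 = {ratio:.9f}")
```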
173+
174+
#### Infer experiment metrics
175+
*Source: RSeQC Infer Experiment MultiQC*
176+
Summarizes the percentage of reads and read pairs that match the strandedness of the overlapping
177+
transcripts. Can be used to infer whether the RNAseq library prep was stranded or unstranded.
178+
179+
| Column Name | Type | Description | Example |
180+
|:--------------------------|:------|:-------------------------------------------------------------|:--------|
181+
| pct_sense | float | percentage of reads aligned to the sense strand | 91.16 |
182+
| pct_antisense | float | percentage of reads aligned to the antisense strand | 1.76 |
183+
| pct_undetermined | float | percentage of reads where the strand could not be determined | 7.08 |
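
To illustrate how these percentages can be used to infer strandedness, the Python sketch below classifies a sample from `pct_sense` and `pct_antisense`. The 75% cutoff and the returned labels are hypothetical choices made for this example only; they are not thresholds defined by this specification or by RSeQC.

```python
def infer_strandedness(pct_sense, pct_antisense, stranded_cutoff=75.0):
    """Classify library strandedness from infer-experiment percentages.

    stranded_cutoff is an illustrative assumption, not an official threshold.
    """
    if pct_sense >= stranded_cutoff:
        return "sense"
    if pct_antisense >= stranded_cutoff:
        return "antisense"
    return "unstranded_or_ambiguous"


# Example values copied from the infer experiment metrics table.
call = infer_strandedness(pct_sense=91.16, pct_antisense=1.76)
print("inferred strandedness:", call)
```

A sample with roughly equal sense and antisense percentages would fall through to the last branch, which is what an unstranded library would be expected to produce.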
184+
185+
#### Inner distance metrics
186+
*Source: RSeQC Inner Distance MultiQC*
187+
Summarizes the inner distance (or insert size) between two paired RNA reads. Note that this can be
188+
negative if the two reads overlap.
189+
> *Present only for paired-end data.*
190+
191+
| Column Name | Type | Description | Example |
192+
|:--------------------------|:------|:----------------------------------------------------|:------------|
193+
| peak_inner_dist           | float | inner distance at the peak of the distribution      | -123.5      |
194+
| peak_inner_dist_pct_reads | float | percentage of reads at the peak of the distribution | 5.112384053 |
195+
196+
#### Read distribution metrics
197+
*Source: RSeQC Read Distribution MultiQC*
198+
Summarizes the distribution of reads over genome features.
199+
200+
| Column Name               | Type  | Description                                                                | Example     |
201+
|:--------------------------|:------|:---------------------------------------------------------------------------|:------------|
202+
| cds_exons_pct | float | percentage of reads aligned to CDS exons | 43.5637117 |
203+
| 5_utr_exons_pct | float | percentage of reads aligned to 5' UTRs | 8.886329966 |
204+
| 3_utr_exons_pct | float | percentage of reads aligned to 3' UTRs | 21.41150152 |
205+
| introns_pct | float | percentage of reads aligned to introns | 20.55254242 |
206+
| tss_up_1kb_pct | float | percentage of reads aligned 1 kb upstream of a transcription start site | 0.086612962 |
207+
| tss_up_1kb_5kb_pct | float | percentage of reads aligned 1-5 kb upstream of a transcription start site | 0.389163859 |
208+
| tss_up_5kb_10kb_pct | float | percentage of reads aligned 5-10 kb upstream of a transcription start site | 0.116168783 |
209+
| tes_down_1kb_pct | float | percentage of reads aligned 1 kb downstream of a transcription end site | 0.276327172 |
210+
| tss_down_1kb_5kb_pct | float | percentage of reads aligned 1-5 kb downstream of a transcription end site | 0.419487364 |
211+
| tss_down_5kb_10kb_pct | float | percentage of reads aligned 5-10 kb downstream of a transcription end site | 0.137565531 |
212+
| other_intergenic_pct | float | percentage of reads aligned to other intergenic regions | 4.160588723 |
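
The eleven read distribution columns can be collapsed into broader classes for a quick overview. The Python sketch below groups the example values from the table into exonic, intronic, and intergenic totals. Treating the TSS/TES flanking regions as intergenic is an assumption made for this illustration, not a grouping defined by the specification.

```python
# Example values copied from the read distribution metrics table.
read_distribution = {
    "cds_exons_pct": 43.5637117,
    "5_utr_exons_pct": 8.886329966,
    "3_utr_exons_pct": 21.41150152,
    "introns_pct": 20.55254242,
    "tss_up_1kb_pct": 0.086612962,
    "tss_up_1kb_5kb_pct": 0.389163859,
    "tss_up_5kb_10kb_pct": 0.116168783,
    "tes_down_1kb_pct": 0.276327172,
    "tss_down_1kb_5kb_pct": 0.419487364,
    "tss_down_5kb_10kb_pct": 0.137565531,
    "other_intergenic_pct": 4.160588723,
}

EXONIC = ("cds_exons_pct", "5_utr_exons_pct", "3_utr_exons_pct")
INTRONIC = ("introns_pct",)

exonic_pct = sum(read_distribution[name] for name in EXONIC)
intronic_pct = sum(read_distribution[name] for name in INTRONIC)
# Everything else (flanking regions and other intergenic) is grouped as intergenic.
intergenic_pct = sum(
    value
    for name, value in read_distribution.items()
    if name not in EXONIC and name not in INTRONIC
)

print(f"exonic: {exonic_pct:.2f}%")
print(f"intronic: {intronic_pct:.2f}%")
print(f"intergenic: {intergenic_pct:.2f}%")
```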
213+

RNAseq/Workflow_Documentation/NF_RCP/README.md

Lines changed: 1 addition & 2 deletions
@@ -357,8 +357,7 @@ The outputs from the Analysis Staging and V&V Pipeline Subworkflows are describe
357357
**QC metrics summary**
358358
359359
- Output:
360-
- GeneLab/qc_metrics_GLbulkRNAseq.csv (comma-separated text file listing a summary of qc metrics and metadata for the dataset)
361-
- GeneLab/qc_validation_GLbulkRNAseq.txt (a validation report for qc_metrics generation that lists any missing entries in the qc_metrics file)
360+
- GeneLab/qc_metrics_GLbulkRNAseq.csv (comma-separated text file containing a summary of QC metrics and metadata for the dataset; see the [QC metrics README](./QC_metrics_README.md) for a complete list of field definitions)
362361
<br>
363362
364363
Standard Nextflow resource usage logs are also produced as follows:
