| 1 | +# QC Metrics Summary Specification |
| 2 | + |
| 3 | +## Description |
| 4 | + |
| 5 | +The qc_metrics summary file is a comma-separated file that summarizes the metadata and QC |
| 6 | +metrics for a dataset. It is created after processing is complete. [Metadata](#metadata-fields) for the |
| 7 | +sample and assay is pulled from the ISA.zip file. QC metrics, including |
| 8 | +[raw and trimmed read metrics](#read-qc-metrics), [alignment metrics](#alignment-metrics), |
| 9 | +[gene count metrics](#gene-count-metrics), and [RSeQC metrics](#rseqc-metrics), are pulled from the |
| 10 | +MultiQC reports generated during processing. |
| 11 | + |
| 12 | +> **NOTE:** Eukaryotic and prokaryotic data use different tools for alignment and gene counting, so |
| 13 | +the metrics reported will differ for those data types. Similarly, paired-end data will include |
| 14 | +metrics for the reverse read, while single-end data will not. All data columns will be present in |
| 15 | +the output files regardless of data type. Any fields that are not relevant for a particular data |
| 16 | +type will be left empty. |
| 17 | + |
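A minimal sketch of reading the summary file with Python's standard `csv` module, assuming the file has a header row containing the column names listed in this specification. The function name `load_qc_metrics`, the sample name, and the in-memory example row are hypothetical; they only illustrate how fields left empty for a given data type can be mapped to `None`.

```python
import csv
import io

def load_qc_metrics(handle):
    """Read a qc_metrics summary CSV into a list of dicts, mapping empty fields to None."""
    rows = []
    for row in csv.DictReader(handle):
        rows.append({col: (val if val != "" else None) for col, val in row.items()})
    return rows

# Hypothetical stand-in for a real summary file: a single-end sample, so the
# reverse-read column is present but left empty.
example = io.StringIO(
    "sample,library_layout,raw_total_sequences_f,raw_total_sequences_r\n"
    "sample_1,SINGLE,82186286,\n"
)
rows = load_qc_metrics(example)
print(rows[0]["raw_total_sequences_r"])  # None
```

Values are returned as strings here; converting them to `int` or `float` according to the Type column of each table is left to the caller.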
| 18 | +<br> |
| 19 | +<br> |
| 20 | + |
| 21 | +----------- |
| 22 | +### Metadata fields |
| 23 | +*Source: ISA.zip* |
| 24 | +Lists selected metadata fields for each sample. |
| 25 | + |
| 26 | +| Column Name | Data Source | Description | Example(s) | |
| 27 | +|:----------------------|:--------------------|:----------------------------------------------------------------------------------------------------------|:-----------------------------------------| |
| 28 | +| osd_num | investigation table | Study identifier | "OSD-515" | |
| 29 | +| sample | sample table | Sample Identifier | "RR23_TMS_FLT_F1" | |
| 30 | +| organism | sample table | Genus and Species of primary organism under study | "Mus musculus" | |
| 31 | +| tissue | sample table | 'Material Type' listed in sample table (e.g. organ, tissue, cell culture, whole organism) | "Cells, cultured", "thymus" | |
| 32 | +| sequencing_instrument | assay table | Sequencer used to sequence the samples | "Illumina NovaSeq 6000" | |
| 33 | +| library_selection | assay table | Biomolecule selection (rRNA depletion, mRNA enrichment, target amplification) | "Ribo-depletion", "polyA enrichment" | |
| 34 | +| library_layout | assay table | RNA library construction parameter indicating whether the library was sequenced single- or paired-end. | "SINGLE", "PAIRED" | |
| 35 | +| strandedness | assay table | RNAseq library construction parameter indicating whether the library preserves the transcript orientation | "STRANDED", "UNSTRANDED" | |
| 36 | +| read_depth | assay table | Total number of sequenced fragments | 82186286 | |
| 37 | +| read_length | assay table | Raw data read length in bases | 151 | |
| 38 | +| rrna_contamination | assay table | Percent rRNA found in the data | 2.74 | |
| 39 | +| rin | assay table | RNA integrity number | 7.4 | |
| 40 | +| organism_part | sample table | For plant samples, the part of the plant from which the sample was derived. | "Plant Roots" | |
| 41 | +| cell_line | sample table | Cell line identifier | "IMR90 hiPSC" | |
| 42 | +| cell_type | sample table | The morphological or functional form of the cells used. | "Myocytes, Cardiac" | |
| 43 | +| secondary_organism | sample table | Additional organism(s) which may confound assay measurements (e.g. food source, host, symbiote) | "Vibrio fischeri ES114" | |
| 44 | +| strain | sample table | Strain, breed, ecotype, etc. | "C57BL/6J" | |
| 45 | +| animal_source | sample table | The source of the animal(s) used in the study | "Jackson Laboratory" | |
| 46 | +| seed_source | sample table | The source of the seed(s) used in the study | "Arabidopsis Biological Resource Center" | |
| 47 | +| source_accession | sample table | The source accession number for the animal(s)/seed(s)/cell(s) used in the study | "SALK_027956C" | |
| 48 | +| mix | assay table | The ERCC spike-in mix number used | "Mix 1" | |
| 49 | + |
| 50 | +<br> |
| 51 | +<br> |
| 52 | + |
| 53 | +----------- |
| 54 | +### Read QC metrics |
| 55 | +*Source: Raw and Trimmed MultiQC* |
| 56 | +QC metrics describing the read quality. |
| 57 | +> *NOTE:* The same fields are extracted for both raw and trimmed reads; the prefixes "raw_" and |
| 58 | +"trimmed_" indicate the source of the metric. Similarly, the same fields are also extracted for both |
| 59 | +forward and reverse reads with the suffixes "_f" and "_r" denoting the read type. The table below |
| 60 | +lists each metric only once. |
| 61 | + |
| 62 | +| Column Name | Type | Description | Example(s) | |
| 63 | +|:-----------------------|:------|:------------------------------------------------------------------------------------------------------------------|:------------| |
| 64 | +| total_sequences | int | Total number of sequences for this read type | 82186286 | |
| 65 | +| avg_sequence_length | float | Average sequence length for this read type | 138.7187431 | |
| 66 | +| median_sequence_length | float | Median sequence length for this read type | 151 | |
| 67 | +| quality_score_mean | float | Average quality score | 35.48372691 | |
| 68 | +| quality_score_median | float | Median quality score | 35.61967608 | |
| 69 | +| percent_duplicates | float | Percentage estimated sequence duplication, measured based on first 50bp of the first 100,000 reads in the dataset | 67.01577646 | |
| 70 | +| percent_gc | float | Overall %GC of all bases in all sequences for this read type | 48 | |
| 71 | +| gc_min_1pct | int | Minimum %GC value reached by at least 1% of the total reads of this read type | 31 | |
| 72 | +| gc_max_1pct | int | Maximum %GC value reached by at least 1% of the total reads of this read type | 65 | |
| 73 | +| gc_auc_25pct | int | %GC value at the 25th percentile of total reads of this read type | 42 | |
| 74 | +| gc_auc_50pct | int | %GC value at the 50th percentile of total reads of this read type | 49 | |
| 75 | +| gc_auc_75pct | int | %GC value at the 75th percentile of total reads of this read type | 57 | |
| 76 | +| n_content_sum | float | %N base calls summed across all base positions in the read for this read type | 1.510487284 | |
| 77 | + |
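The naming scheme described in the note above can be sketched as follows. This assumes each column name is built as prefix + metric + suffix (for example `raw_total_sequences_f`); the helper name `read_qc_columns` is hypothetical, and the order in which the names are generated is not necessarily the column order of the actual file.

```python
from itertools import product

# Metric names copied from the Read QC metrics table.
READ_QC_METRICS = [
    "total_sequences",
    "avg_sequence_length",
    "median_sequence_length",
    "quality_score_mean",
    "quality_score_median",
    "percent_duplicates",
    "percent_gc",
    "gc_min_1pct",
    "gc_max_1pct",
    "gc_auc_25pct",
    "gc_auc_50pct",
    "gc_auc_75pct",
    "n_content_sum",
]

def read_qc_columns():
    """Expand each metric into its raw/trimmed and forward/reverse column names."""
    return [
        f"{prefix}{metric}{suffix}"
        for prefix, suffix, metric in product(("raw_", "trimmed_"), ("_f", "_r"), READ_QC_METRICS)
    ]

columns = read_qc_columns()
print(len(columns))  # 52
print(columns[0])    # raw_total_sequences_f
```

For single-end data the `_r` columns would still be present but empty, as described in the Description section.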
| 78 | +<br> |
| 79 | +<br> |
| 80 | + |
| 81 | +----------- |
| 82 | +### Alignment metrics |
| 83 | + |
| 84 | +#### STAR alignment metrics |
| 85 | +*Source: STAR MultiQC* |
| 86 | +Alignment metrics generated by STAR. |
| 87 | +> *Present only for data processed with the eukaryotic pipeline.* |
| 88 | + |
| 89 | +| Column Name | Type | Description | Example | |
| 90 | +|:----------------------------|:------|:------------------------------------------------------------------------------------|:--------| |
| 91 | +| uniquely_mapped_percent | float | Percent of reads mapped uniquely in the genome | 79.95 | |
| 92 | +| multimapped_percent | float | Percent of reads mapped to multiple loci in the genome | 13.14 | |
| 93 | +| multimapped_toomany_percent | float | Percent of reads mapped to 20 or more loci in the genome | 0.13 | |
| 94 | +| unmapped_tooshort_percent | float | Percent of reads where the best alignment is shorter than the allowed mapped length | 6.53 | |
| 95 | +| unmapped_other_percent | float | Percent of reads that couldn't be mapped at all | 0.25 | |
| 96 | + |
| 97 | +#### Bowtie2 alignment metrics |
| 98 | +*Source: bowtie2 MultiQC* |
| 99 | +Alignment metrics generated by bowtie2. |
| 100 | +> *Present only for data processed with the prokaryotic pipeline.* |
| 101 | + |
| 102 | +| Column Name | Type | Description | Example | |
| 103 | +|:-----------------------|:------|:-----------------------------------------------------------------------------------------------------------------|:---------| |
| 104 | +| total_reads | int | total input reads (or read-pairs for paired-end data) | 15066949 | |
| 105 | +| overall_alignment_rate | float | percentage of input reads that mapped to the genome (includes discordant and partial alignments). | 98.03 | |
| 106 | +| aligned_none | int | number of reads (or read-pairs) that aligned 0 times to the reference genome (concordantly if paired-end) | 516325 | |
| 107 | +| aligned_one | int | number of reads (or read-pairs) that aligned exactly 1 time to the reference genome (concordantly if paired-end) | 11294617 | |
| 108 | +| aligned_multi | int | number of reads (or read-pairs) that aligned > 1 times to the reference genome (concordantly if paired-end) | 3256007 | |
| 109 | + |
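As a rough sanity check, the example values in the two alignment tables are internally consistent: the three Bowtie2 categories add up to `total_reads`, and the five STAR percentages add up to 100. The sketch below assumes that relationship holds for a given row; the function names are hypothetical, and STAR reports additional categories not listed here, so a small tolerance is used rather than an exact comparison.

```python
import math

def bowtie2_counts_add_up(row):
    """True if the three alignment categories account for every input read (or read pair)."""
    return row["aligned_none"] + row["aligned_one"] + row["aligned_multi"] == row["total_reads"]

def star_percent_total(row):
    """Sum of the five STAR percentage columns listed in the table."""
    keys = (
        "uniquely_mapped_percent",
        "multimapped_percent",
        "multimapped_toomany_percent",
        "unmapped_tooshort_percent",
        "unmapped_other_percent",
    )
    return sum(row[key] for key in keys)

# Example values copied from the tables above.
bowtie2_example = {
    "total_reads": 15066949,
    "aligned_none": 516325,
    "aligned_one": 11294617,
    "aligned_multi": 3256007,
}
star_example = {
    "uniquely_mapped_percent": 79.95,
    "multimapped_percent": 13.14,
    "multimapped_toomany_percent": 0.13,
    "unmapped_tooshort_percent": 6.53,
    "unmapped_other_percent": 0.25,
}

print(bowtie2_counts_add_up(bowtie2_example))  # True
print(math.isclose(star_percent_total(star_example), 100.0, abs_tol=0.01))  # True
```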
| 110 | +<br> |
| 111 | +<br> |
| 112 | + |
| 113 | +----------- |
| 114 | +### Gene count metrics |
| 115 | + |
| 116 | +#### Common gene count metrics |
| 117 | +*Source: RSEM_Unnormalized_Counts_GLbulkRNAseq.csv or FeatureCounts_Unnormalized_Counts_GLbulkRNAseq.csv* |
| 118 | +Summarizes the data in the Unnormalized_Counts_GLbulkRNAseq.csv files. |
| 119 | +> *Present for data processed with either the eukaryotic or prokaryotic pipelines.* |
| 120 | + |
| 121 | +| Column Name | Type | Description | Example | |
| 122 | +|:-----------------------|:------|:--------------------------------------------------|:-----------| |
| 123 | +| gene_total | int | total number of genes detected | 57186 | |
| 124 | +| gene_detected_gt10 | int | number of genes detected with > 10 read depth | 19207 | |
| 125 | +| gene_detected_gt10_pct | float | percentage of genes detected with > 10 read depth | 33.5868919 | |
| 126 | + |
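The percentage column in the table above appears to be the ratio of the two count columns. A minimal sketch of that arithmetic, assuming `gene_detected_gt10_pct` is defined relative to `gene_total` (the function name and the handling of a zero total are hypothetical choices):

```python
def gene_detected_gt10_pct(gene_detected_gt10, gene_total):
    """Percentage of genes with more than 10 reads, relative to gene_total."""
    if gene_total == 0:
        return None  # hypothetical choice: undefined when there are no genes
    return 100.0 * gene_detected_gt10 / gene_total

# Example values copied from the table above.
pct = gene_detected_gt10_pct(19207, 57186)
print(round(pct, 7))  # 33.5868919
```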
| 127 | + |
| 128 | +#### RSEM gene count metrics |
| 129 | +*Source: RSEM MultiQC* |
| 130 | +Summarizes the alignment rates generated by RSEM during gene expression estimation. |
| 131 | +> *Present only for data processed with the eukaryotic pipeline.* |
| 132 | + |
| 133 | +| Column Name | Type | Description | Example | |
| 134 | +|:-----------------------|:------|:--------------------------------------------------------|:------------| |
| 135 | +| num_uniquely_aligned | int | number of reads aligned uniquely to a gene | 41152050 | |
| 136 | +| pct_uniquely_aligned | float | percentage of reads aligned uniquely to a gene | 74.48868329 | |
| 137 | +| pct_multi_aligned | float | percentage of reads aligned to multiple genes | 15.27982918 | |
| 138 | +| pct_filtered | float | percentage of reads filtered due to too many alignments | 0 | |
| 139 | +| pct_unalignable | float | percentage of reads unalignable to any gene | 10.23148753 | |
| 140 | + |
| 141 | +#### FeatureCounts gene count metrics |
| 142 | +*Source: FeatureCounts MultiQC* |
| 143 | +Summarizes the feature counts assigned to genome features. |
| 144 | +> *Present only for data processed with the prokaryotic pipeline.* |
| 145 | +
|
| 146 | +| Column Name | Type | Description | Example | |
| 147 | +|:--------------------------|:------|:---------------------------------------------------------|:------------| |
| 148 | +| total_count | int | total number of input alignments | 496 | |
| 149 | +| num_assigned | int | number of alignments assigned to a feature | 431 | |
| 150 | +| pct_assigned | float | percentage of alignments assigned to a feature | 86.89516129 | |
| 151 | +| num_unassigned_nofeatures | int | number of alignments that overlap no features | 19 | |
| 152 | +| num_unassigned_ambiguity | int | number of alignments that overlap 2 or more features | 43 | |
| 153 | +| pct_unassigned_nofeatures | float | percentage of alignments that overlap no features | 3.830645161 | |
| 154 | +| pct_unassigned_ambiguity | float | percentage of alignments that overlap 2 or more features | 8.669354839 | |
| 155 | + |
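The `pct_*` columns in the table above are consistent with each count divided by `total_count`. The sketch below reproduces the example values under that assumption; the helper name `pct_of_total` is hypothetical.

```python
def pct_of_total(count, total_count):
    """Express an alignment count as a percentage of all input alignments."""
    return 100.0 * count / total_count

# Example values copied from the table above.
total_count = 496
pct_assigned = pct_of_total(431, total_count)
pct_unassigned_nofeatures = pct_of_total(19, total_count)
pct_unassigned_ambiguity = pct_of_total(43, total_count)

print(pct_assigned, pct_unassigned_nofeatures, pct_unassigned_ambiguity)
```

Note that the three listed counts (431 + 19 + 43 = 493) do not account for all 496 input alignments, so the listed percentages are not expected to sum to 100.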
| 156 | +<br> |
| 157 | +<br> |
| 158 | + |
| 159 | +----------- |
| 160 | +### RSeQC metrics |
| 161 | + |
| 162 | +#### Genebody coverage metrics |
| 163 | +*Source: RSeQC Gene Body Coverage MultiQC* |
| 164 | +Summarizes the read coverage over gene bodies. Used to check if the coverage is uniform and if |
| 165 | +there is any 5' or 3' bias. |
| 166 | + |
| 167 | +| Column Name | Type | Description | Example | |
| 168 | +|:--------------------------|:------|:-------------------------------------------------------------------------------------|:------------| |
| 169 | +| mean_genebody_cov_5_20 | float | average read coverage between the 5th and 20th gene body percentile from the 5' end | 70.60419705 | |
| 170 | +| mean_genebody_cov_40_60 | float | average read coverage between the 40th and 60th gene body percentile from the 5' end | 99.38086546 | |
| 171 | +| mean_genebody_cov_80_95 | float | average read coverage between the 80th and 95th gene body percentile from the 5' end | 87.1712315 | |
| 172 | +| ratio_genebody_cov_3_to_5 | float | ratio of the 3' (mean_genebody_cov_80_95) to 5' (mean_genebody_cov_5_20) gene body coverage | 1.234646595 | |
| 173 | + |
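The 3' to 5' ratio can be reproduced from the two mean coverage columns, as defined in the table above. A minimal sketch using the example values (the function name mirrors the column name and is otherwise hypothetical):

```python
def ratio_genebody_cov_3_to_5(mean_cov_80_95, mean_cov_5_20):
    """Ratio of 3' coverage (80th-95th percentile) to 5' coverage (5th-20th percentile)."""
    return mean_cov_80_95 / mean_cov_5_20

# Example values copied from the table above.
ratio = ratio_genebody_cov_3_to_5(87.1712315, 70.60419705)
print(ratio)
```

By this definition, a value above 1 means coverage is higher toward the 3' end of the gene body than toward the 5' end.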
| 174 | +#### Infer experiment metrics |
| 175 | +*Source: RSeQC Infer Experiment MultiQC* |
| 176 | +Summarizes the percentage of reads and read pairs that match the strandedness of the overlapping |
| 177 | +transcripts. Can be used to infer whether the RNAseq library prep was stranded or unstranded. |
| 178 | + |
| 179 | +| Column Name | Type | Description | Example | |
| 180 | +|:--------------------------|:------|:-------------------------------------------------------------|:--------| |
| 181 | +| pct_sense | float | percentage of reads aligned to the sense strand | 91.16 | |
| 182 | +| pct_antisense | float | percentage of reads aligned to the antisense strand | 1.76 | |
| 183 | +| pct_undetermined | float | percentage of reads where the strand could not be determined | 7.08 | |
| 184 | + |
| 185 | +#### Inner distance metrics |
| 186 | +*Source: RSeQC Inner Distance MultiQC* |
| 187 | +Summarizes the inner distance (or insert size) between two paired RNA reads. Note that this can be |
| 188 | +negative if the two reads overlap. |
| 189 | +> *Present only for paired-end data.* |
| 190 | + |
| 191 | +| Column Name | Type | Description | Example | |
| 192 | +|:--------------------------|:------|:----------------------------------------------------|:------------| |
| 193 | +| peak_inner_dist | float | inner distance at the peak of the distribution | -123.5 | |
| 194 | +| peak_inner_dist_pct_reads | float | percentage of reads at the peak of the distribution | 5.112384053 | |
| 195 | + |
| 196 | +#### Read distribution metrics |
| 197 | +*Source: RSeQC Read Distribution MultiQC* |
| 198 | +Summarizes the distribution of reads over genome features. |
| 199 | + |
| 200 | +| Column Name | Type | Description | Example | |
| 201 | +|:--------------------------|:------|:---------------------------------------------------------------------------|:------------| |
| 202 | +| cds_exons_pct | float | percentage of reads aligned to CDS exons | 43.5637117 | |
| 203 | +| 5_utr_exons_pct | float | percentage of reads aligned to 5' UTRs | 8.886329966 | |
| 204 | +| 3_utr_exons_pct | float | percentage of reads aligned to 3' UTRs | 21.41150152 | |
| 205 | +| introns_pct | float | percentage of reads aligned to introns | 20.55254242 | |
| 206 | +| tss_up_1kb_pct | float | percentage of reads aligned 1 kb upstream of a transcription start site | 0.086612962 | |
| 207 | +| tss_up_1kb_5kb_pct | float | percentage of reads aligned 1-5 kb upstream of a transcription start site | 0.389163859 | |
| 208 | +| tss_up_5kb_10kb_pct | float | percentage of reads aligned 5-10 kb upstream of a transcription start site | 0.116168783 | |
| 209 | +| tes_down_1kb_pct | float | percentage of reads aligned 1 kb downstream of a transcription end site | 0.276327172 | |
| 210 | +| tes_down_1kb_5kb_pct | float | percentage of reads aligned 1-5 kb downstream of a transcription end site | 0.419487364 | |
| 211 | +| tes_down_5kb_10kb_pct | float | percentage of reads aligned 5-10 kb downstream of a transcription end site | 0.137565531 | |
| 212 | +| other_intergenic_pct | float | percentage of reads aligned to other intergenic regions | 4.160588723 | |
| 213 | + |