Fix missed glossary links in scRNA sequencing and raw data processing (#361)

seohyonkim · web-flow · commit 76b904c84837 · 2025-04-24T23:04:57.000+02:00
* fix glossary links in raw data processing

* fix glossary links in scrna seq

* fix glossary links in glossary

* remove multiple entries in glossary
diff --git a/jupyter-book/glossary.md b/jupyter-book/glossary.md
@@ -2,34 +2,26 @@
 
 ```{glossary}
 Adapter sequences
-adapter sequences
     Short, synthetic DNA or RNA sequences that are ligated to the ends of DNA or RNA fragments during library preparation for sequencing.
     These adapters are essential for binding the fragments to the flowcell and enabling amplification and sequencing.
     However, if adapters are not trimmed after sequencing, they can appear in the reads, potentially interfering with alignment and downstream analyses.
 
 Algorithm
-Algorithms
     A pre-defined set of instructions to solve a problem.
 
 AnnData
-AnnDatas
     A Python package for handling annotated data matrices, commonly used in single-cell and other omics analyses.
     It provides an efficient way to store data as a matrix where rows (observations) and columns (features) can have associated metadata.
     [AnnData](https://anndata.readthedocs.io/en/latest/index.html) supports slicing, subsetting, and saving to disk in formats like H5AD and Zarr.
 
 BAM
-BAM files
     BAM files are binary, compressed versions of SAM (Sequence Alignment/Map) files that store sequencing read alignments to a reference genome.
     They contain the same information as {term}`SAM` files - including read sequences, quality scores, and alignment positions - but in a more space-efficient format that enables faster processing and reduced storage requirements.
 
 Amplification bias
     A distortion that occurs during DNA or RNA amplification (e.g., PCR), where certain sequences are copied more efficiently than others. This can lead to uneven or inaccurate representation of the original genetic material, affecting results in experiments like sequencing or gene expression analysis.
 
 Barcode
-Barcodes
-Bar code
-Bar code
-Cell barcode
     Short DNA barcode fragments ("tags") that are used to identify reads originating from the same cell.
     Reads are later grouped by their barcode during raw data processing steps.
 
@@ -41,25 +33,22 @@ Benchmark
     An (independent) comparison of performance of several tools with respect to pre-defined metrics.
 
 Bulk RNA sequencing
-bulk RNA-Seq
-bulk sequencing
     Contrary to single-cell sequencing, bulk sequencing measures the average expression values of several cells. Therefore, resolution is lost, but bulk sequencing is usually cheaper, less laborious and faster to analyze.
 
 Cell
-cells
     The fundamental unit of life, consisting of cytoplasm enclosed within a membrane, containing biomolecules such as proteins and nucleic acids.
     Cells acquire specific functions, transition into different types, divide, and communicate to sustain an organism.
     Studying cell structure, activity, and interactions enables insights into gene expression dynamics, cellular trajectories, developmental lineages, and disease mechanisms.
 
 Cell type annotation
-    The process of labeling groups of {term}`clusters` of cells by {term}`cell type`.
-    Commonly done based on {term}`cell type` specific markers, automatically with classifiers or by mapping against a reference.
+    The process of labeling groups of {term}`clusters <Cluster>` of cells by {term}`cell type <Cell type>`.
+    Commonly done based on cell-type-specific markers, automatically with classifiers, or by mapping against a reference.
 
 Cell type
     Cells that share common morphological or phenotypic features.
 
 Cell state
-    Cells can be annotated according to {term}`cell type` or other cell states as defined by the cell-cycle, perturbational state or other features.
+    Cells can be annotated according to {term}`cell type <Cell type>` or other cell states as defined by the cell-cycle, perturbational state or other features.
 
 Chromatin
     The complex of DNA and proteins efficiently packaging the DNA inside the nucleus and involved in regulating gene expression.
@@ -74,17 +63,15 @@ CpG
     Unmethylated CpG sites are associated with gene activation, while methylated CpG sites can lead to gene inhibition.
 
 Cluster
-Clusters
     A group of a population or data points that share similarities.
-    In single-cell, clusters usually share a common function or marker gene expression that is used for annotation (see {term}`cell type annotation`).
+    In single-cell, clusters usually share a common function or marker gene expression that is used for annotation (see {term}`cell type annotation <Cell type annotation>`).
 
 Complementary DNA (cDNA)
-cDNA
     DNA synthesized from an RNA template by the enzyme reverse transcriptase.
     cDNA is commonly used in RNA-seq library preparation because it is more stable than RNA and allows the captured transcripts to be amplified and sequenced for gene expression analysis.
 
 Demultiplexing
-    The process of determining which sequencing reads belong to which cell using {term}`barcodes`.
+    The process of determining which sequencing reads belong to which cell using {term}`barcodes <Barcode>`.
 
 directed graph
     A directed graph (or digraph) is a graph consisting of a set of nodes (vertices) connected by edges (arcs), where each edge has a direction indicating a one-way relationship between nodes.
@@ -98,12 +85,11 @@ Doublets
     Reads obtained from droplet based assays might be mistakenly associated to a single cell while the RNA expression origins from two or more cells (a doublet).
 
 Downstream analysis
-downstream analyses
     A phase of data analysis that follows the initial processing of raw data.
     In the context of scRNA-seq, this includes tasks such as normalization, integration, filtering, cell type identification, trajectory inference, and studying expression dynamics.
 
 Dropout
-    A gene with low expression that is observed in one cell, but not in other cells of the same {term}`cell type`.
+    A gene with low expression that is observed in one cell, but not in other cells of the same {term}`cell type <Cell type>`.
     The reason for dropouts are commonly low amounts of mRNA expression in cells and the general stochasticity of mRNA expression.
     Dropouts are one of the reasons why scRNA-seq data is sparse.
 
@@ -114,13 +100,11 @@ Edit distance
     Edit distance (often referred to as Levenshtein distance) measures the minimum number of operations (Substitution, Insertion, Deletion) required to transform one string into another.
 
 FASTQ
-FASTQ reads
     Sequencing reads that are saved in the FASTQ format.
     A FASTQ file stores DNA/RNA sequences and their corresponding quality scores in a 4-line format: identifier, sequence, optional description, and quality scores encoded in ASCII characters.
     FASTQ files are then used to map against the reference genome of interest to obtain gene counts for cells.
 
 Flowcell
-flowcell
     A consumable device used in sequencing platforms where DNA or RNA fragments are sequenced.
     It consists of a glass or polymer surface with lanes or channels coated with oligonucleotides, which capture and anchor DNA or RNA fragments.
     During sequencing, these fragments are amplified into clusters, and their sequences are determined by detecting fluorescent signals emitted during nucleotide incorporation.
@@ -143,14 +127,11 @@ Library
     Also known as sequencing library. A pool of DNA fragments with attached sequencing adapters.
 
 Modalities
-Multimodal
     Different types of biological information measured at the single-cell level.
     These include gene expression, chromatin accessibility, surface proteins, immune receptor sequences, and spatial organization.
     Combining these modalities provides a more complete understanding of cell identity, function, and interactions.
 
 Locus
-Loci
-loci
     Specific position or region on a genome or transcriptome where a particular sequence or genetic feature is located.
     In sequencing, loci refer to the potential origins of a read or fragment, such as a gene, exon, or intergenic region.
     Accurate identification of loci is critical for mapping reads and understanding the genomic or transcriptomic context of the data.
@@ -160,7 +141,6 @@ MuData
     The primary data structure in the scverse ecosystem for multimodal data.
 
 Muon
-muon
     A Python package for multi-modal single-cell analysis in Python by scverse.
 
 Negative binomial distribution
@@ -198,7 +178,6 @@ RT-qPCR
     Quantitative reverse transcription {term}`PCR` (RT-qPCR) monitors the amplification of a targeted {term}`DNA` molecule during the PCR.
 
 SAM
-SAM files
     SAM (Sequence Alignment/Map) files are tab-delimited text files that store sequencing alignment data, showing how sequencing reads map to a reference genome.
     Each line in a SAM file contains information about a single read alignment, including the read sequence, base quality scores, mapping position, and mapping quality.
 
@@ -219,7 +198,6 @@ Spike-in RNA
     RNA transcripts of known sequence and quantity to calibrate measurements in RNA hybridization steps for RNA-seq.
 
 Splice Junctions
-splice junctions
     Locations where introns are removed, and exons are joined together in a mature RNA transcript during RNA splicing.
     These junctions occur at specific nucleotide sequences and are critical for the proper assembly of functional mRNA.
 
@@ -228,12 +206,10 @@ Trajectory inference
     The computational recovery of dynamic processes by ordering cells by similarity or other means.
 
 Unique Molecular Identifier (UMI)
-UMI
     A special type of molecular barcode that uniquely tags each molecule in a sample library.
-    This, for example, enables the estimation of PCR duplication rates (see {term}`amplification bias`), which leads to error correction and increases accuracy.
+    This, for example, enables the estimation of PCR duplication rates (see {term}`amplification bias <Amplification bias>`), which leads to error correction and increases accuracy.
 
 Untranslated Region (UTR)
-UTR
     A segment of an mRNA transcript that is transcribed but not translated into protein.
     UTRs are located at both ends of the coding sequence.
 ```
diff --git a/jupyter-book/introduction/raw_data_processing.md b/jupyter-book/introduction/raw_data_processing.md
@@ -16,15 +16,15 @@ An overview of the topics discussed in this chapter. In the plot, "txome" stands
 :::
 
 The count matrix is the foundation for a wide range of scRNA-seq analyses {cite}`Zappia2021_raw`, including cell type identification or developmental trajectory inference.
-A robust and accurate count matrix is essential for reliable {term}`downstream analyses`.
+A robust and accurate count matrix is essential for reliable {term}`downstream analyses <Downstream analysis>`.
 Errors at this stage can lead to invalid conclusions and discoveries based on missed insights, or distorted signals in the data.
 Despite the straightforward nature of the input (FASTQ files) and the desired output (count matrix), raw data processing presents several technical challenges.
 
 In this section, we focus on key steps of raw data processing:
 
 1. Read alignment/mapping
 2. Cell barcode (CB) identification and correction
-3. Estimation of molecule counts through {term}`unique molecular identifiers (UMIs)`
+3. Estimation of molecule counts through {term}`unique molecular identifiers (UMIs) <Unique Molecular Identifier (UMI)>`
 
 We also discuss the challenges and trade-offs involved in each step.
 
@@ -103,7 +103,7 @@ A good (left) and a bad (right) per-read sequence quality graph.
 
 **3. Per tile sequence quality**
 
-Using an Illumina library, the per-tile sequence quality plot highlights deviations from the average quality for reads across each {term}`flowcell` [tile](https://www.biostars.org/p/9461090/)(miniature imaging areas of the {term}`flowcell`).
+Using an Illumina library, the per-tile sequence quality plot highlights deviations from the average quality for reads across each {term}`flowcell <Flowcell>` [tile](https://www.biostars.org/p/9461090/) (miniature imaging areas of the {term}`flowcell <Flowcell>`).
 The plot uses a color gradient to represent deviations, where warmer colors indicate larger deviations.
 High-quality data typically display a uniform blue color across the plot, indicating consistent quality across all tiles of the flowcell.
 
@@ -221,7 +221,7 @@ An overrepresented sequence table.
 
 **11. Adapter content**
 
-The adapter content module displays the cumulative percentage of reads containing {term}`adapter sequences` at each base position.
+The adapter content module displays the cumulative percentage of reads containing {term}`adapter sequences <Adapter sequences>` at each base position.
 High levels of adapter sequences indicate incomplete removal of adapters during library preparation, which can interfere with downstream analyses.
 Ideally, no significant adapter content should be present in the data.
 If adapter sequences are abundant, additional trimming may be necessary to improve data quality.
@@ -241,14 +241,14 @@ Multiple FastQC reports can be combined into a single report using the tool [`Mu
 ## Alignment and mapping
 
 Mapping or Alignment is a critical step in single-cell raw data processing.
-It involves determining the potential {term}`loci` of origin for each sequenced fragment, such as the genomic or transcriptomic locations that closely match the read sequence.
+It involves determining the potential {term}`loci <Locus>` of origin for each sequenced fragment, such as the genomic or transcriptomic locations that closely match the read sequence.
 This step is essential for correctly assigning reads to their source regions.
 
 In single-cell sequencing protocols, the raw sequence files typically include:
 
-- Cell {term}`Barcodes` (CB): Unique identifiers for individual cells.
+- Cell {term}`barcodes <Barcode>` (CB): Unique identifiers for individual cells.
 - Unique Molecular Identifiers (UMIs): Tags that distinguish individual molecules to account for amplification bias.
-- Raw {term}`cDNA` Sequences: The actual read sequences generated from the molecules.
+- Raw {term}`cDNA <Complementary DNA (cDNA)>` Sequences: The actual read sequences generated from the molecules.
 
 As the first step ({numref}`raw-proc-fig-overview`), accurate mapping or alignment is crucial for reliable downstream analyses.
 Errors during this step, such as incorrect mapping of reads to transcripts or genes, can result in inaccurate or misleading count matrices.
@@ -298,7 +298,7 @@ Recent advances, such as wavefront alignment {cite}`marco2021fast`, marco2022opt
 Additionally, much work has focused on optimizing data layout and computation to leverage instruction-level parallelism {cite}`wozniak1997using, rognes2000six, farrar2007striped`, and expressing dynamic programming recurrences in ways that facilitate data parallelism and vectorization, such as through difference encoding {cite:t}`Suzuki2018`.
 Most widely-used alignment tools incorporate these highly optimized, vectorized implementations.
 
-In addition to the alignment score, the {term}`backtrace` of the actual alignment that produces this score is often encoded as a `CIGAR` string (short for "Concise Idiosyncratic Gapped Alignment Report").
+In addition to the alignment score, the backtrace of the actual alignment that produces this score is often encoded as a `CIGAR` string (short for "Concise Idiosyncratic Gapped Alignment Report").
 This alphanumeric representation is typically stored in the SAM or BAM file output.
 For example, the `CIGAR` string `3M2D4M` indicates that the alignment has three matches or mismatches, followed by a deletion of length two (representing bases present in the reference but not the read), and then four more matches or mismatches.
 Extended `CIGAR` strings can provide additional details, such as distinguishing between matches, mismatches, or insertions.
@@ -319,7 +319,7 @@ Alignment-based approaches can be categorized into spliced-alignment and contigu
 
 ```{dropdown} Spliced-alignment methods
 Spliced-alignment methods allow a sequence read to align across multiple distinct segments of a reference, allowing potentially large gaps between aligned regions.
-These approaches are particularly useful for aligning RNA-seq reads to the genome, where reads may span {term}`splice junctions`.
+These approaches are particularly useful for aligning RNA-seq reads to the genome, where reads may span {term}`splice junctions <Splice Junctions>`.
 In such cases, a contiguous sequence in the read may be separated by intron and exon subsequence in the reference, potentially spanning kilobases of sequence.
 Spliced alignment is especially challenging when only a small portion of a read overlaps a splice junction, as limited sequence information is available to accurately place the overhanging segment.
 ```
diff --git a/jupyter-book/introduction/scrna_seq.md b/jupyter-book/introduction/scrna_seq.md
@@ -310,7 +310,7 @@ This allows several hundred cells to be analyzed in a single experiment with 500
 Plate-based sequencing protocols include but are not limited to, SMART-seq2, MARS-seq, QUARTZ-seq, and SRCB-seq. Generally speaking, the protocols differ in their multiplexing ability.
 For example, MARS-seq allows for three barcode levels, namely molecular, cellular, and plate-level tags, for robust multiplexing capabilities. SMART-seq2, on the contrary, does not allow for early multiplexing, limiting cell numbers.
 A systematic comparison of protocols by Mereu et al. in 2020 revealed that QUARTZ-seq2 can capture more genes than SMART-seq2, MARS-seq, or SRCB-seq per cell {cite}`Mereu2020`.
-This means QUARTZ-seq2 can capture cell-type specific marker genes well, allowing for confident cell-type {term}`annotation`.
+This means QUARTZ-seq2 can capture cell-type specific marker genes well, allowing for confident cell-type annotation.
 
 Strengths:
 
@@ -322,7 +322,7 @@ Limitations:
 
 - The scale of plate-based experiments is limited by the lower throughput of their individual processing units.
 - Fragmentation step eliminates strand-specific information {cite}`Hrdlickova2017`.
-- Depending on the protocol, plate-based protocols might be labor-intensive with many required pipetting steps, leading to potential technical noise and {term}`batch effects`.
+- Depending on the protocol, plate-based protocols might be labor-intensive with many required pipetting steps, leading to potential technical noise and batch effects.
 
 #### Fluidigm C1
 
@@ -373,7 +373,7 @@ In this case, one of the plate-based methods may be more suitable.
 On the contrary, droplet-based assays will capture heterogeneous mixtures better, allowing for a broader characterization of the sequenced cells.
 Moreover, if the budget is a limiting factor, the protocol of choice should be more cost-effective and robust.
 When analyzing the data, be aware of the sequencing assay-specific biases.
-For an extensive comparison of all single-cell sequencing protocols, we recommend the "{term}`Benchmarking` single-cell RNA-sequencing protocols for cell atlas projects" paper by Mereu et al. {cite}`Mereu2020`.
+For an extensive comparison of all single-cell sequencing protocols, we recommend the "Benchmarking single-cell RNA-sequencing protocols for cell atlas projects" paper by Mereu et al. {cite}`Mereu2020`.
 
 (introduction-scrna-seq-key-takeaway-5)=