Commit 76b904c

Fix missed glossary links in scRNA sequencing and raw data processing (#361)
* fix glossary links in raw data processing
* fix glossary links in scrna seq
* fix glossary links in glossary
* remove multiple entries in glossary
1 parent 00f2d9a commit 76b904c
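
The link form this commit moves to, ``{term}`text <Entry>` ``, names the glossary entry explicitly instead of relying on an alias line in the glossary. A minimal sketch of how such references could be checked against the glossary headings is shown below. The helper names `term_targets` and `unresolved_terms` are hypothetical, and the exact, case-sensitive comparison is an assumption that may be stricter than what Sphinx itself accepts.

```python
import re

# Matches {term}`text` and {term}`text <Target>`.
# Group 1 is the display text, group 2 the explicit target (empty if absent).
TERM_ROLE = re.compile(r"\{term\}`([^`<]*?)(?:\s*<([^`>]+)>)?`")


def term_targets(markdown: str) -> list[str]:
    """Return the glossary entry each {term} role points at, in document order."""
    targets = []
    for text, target in TERM_ROLE.findall(markdown):
        # Without an explicit <Target>, the display text itself is the target.
        targets.append((target or text).strip())
    return targets


def unresolved_terms(markdown: str, glossary_terms: list[str]) -> list[str]:
    """Return targets that do not exactly match any glossary heading (assumed rule)."""
    known = set(glossary_terms)
    return [t for t in term_targets(markdown) if t not in known]


doc = "reliable {term}`downstream analyses` and {term}`loci <Locus>`"
print(unresolved_terms(doc, ["Locus", "Downstream analysis"]))
```

In this sketch the first reference is reported because no heading is literally named `downstream analyses`, which is the kind of link the commit rewrites to `downstream analyses <Downstream analysis>`.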

3 files changed, 19 additions and 43 deletions

jupyter-book/glossary.md

Lines changed: 7 additions & 31 deletions
@@ -2,34 +2,26 @@
22

33
```{glossary}
44
Adapter sequences
5-
adapter sequences
65
Short, synthetic DNA or RNA sequences that are ligated to the ends of DNA or RNA fragments during library preparation for sequencing.
76
These adapters are essential for binding the fragments to the flowcell and enabling amplification and sequencing.
87
However, if adapters are not trimmed after sequencing, they can appear in the reads, potentially interfering with alignment and downstream analyses.
98
109
Algorithm
11-
Algorithms
1210
A pre-defined set of instructions to solve a problem.
1311
1412
AnnData
15-
AnnDatas
1613
A Python package for handling annotated data matrices, commonly used in single-cell and other omics analyses.
1714
It provides an efficient way to store data as a matrix where rows (observations) and columns (features) can have associated metadata.
1815
[AnnData](https://anndata.readthedocs.io/en/latest/index.html) supports slicing, subsetting, and saving to disk in formats like H5AD and Zarr.
1916
2017
BAM
21-
BAM files
2218
BAM files are binary, compressed versions of SAM (Sequence Alignment/Map) files that store sequencing read alignments to a reference genome.
2319
They contain the same information as {term}`SAM` files - including read sequences, quality scores, and alignment positions - but in a more space-efficient format that enables faster processing and reduced storage requirements.
2420
2521
Amplification bias
2622
A distortion that occurs during DNA or RNA amplification (e.g., PCR), where certain sequences are copied more efficiently than others. This can lead to uneven or inaccurate representation of the original genetic material, affecting results in experiments like sequencing or gene expression analysis.
2723
2824
Barcode
29-
Barcodes
30-
Bar code
31-
Bar code
32-
Cell barcode
3325
Short DNA barcode fragments ("tags") that are used to identify reads originating from the same cell.
3426
Reads are later grouped by their barcode during raw data processing steps.
3527
@@ -41,25 +33,22 @@ Benchmark
4133
An (independent) comparison of performance of several tools with respect to pre-defined metrics.
4234
4335
Bulk RNA sequencing
44-
bulk RNA-Seq
45-
bulk sequencing
4636
Contrary to single-cell sequencing, bulk sequencing measures the average expression values of several cells. Therefore, resolution is lost, but bulk sequencing is usually cheaper, less laborious and faster to analyze.
4737
4838
Cell
49-
cells
5039
The fundamental unit of life, consisting of cytoplasm enclosed within a membrane, containing biomolecules such as proteins and nucleic acids.
5140
Cells acquire specific functions, transition into different types, divide, and communicate to sustain an organism.
5241
Studying cell structure, activity, and interactions enables insights into gene expression dynamics, cellular trajectories, developmental lineages, and disease mechanisms.
5342
5443
Cell type annotation
55-
The process of labeling groups of {term}`clusters` of cells by {term}`cell type`.
56-
Commonly done based on {term}`cell type` specific markers, automatically with classifiers or by mapping against a reference.
44+
The process of labeling groups of {term}`clusters <Cluster>` of cells by {term}`cell type <Cell type>`.
45+
Commonly done based on cell type specific markers, automatically with classifiers or by mapping against a reference.
5746
5847
Cell type
5948
Cells that share common morphological or phenotypic features.
6049
6150
Cell state
62-
Cells can be annotated according to {term}`cell type` or other cell states as defined by the cell-cycle, perturbational state or other features.
51+
Cells can be annotated according to {term}`cell type <Cell type>` or other cell states as defined by the cell-cycle, perturbational state or other features.
6352
6453
Chromatin
6554
The complex of DNA and proteins efficiently packaging the DNA inside the nucleus and involved in regulating gene expression.
@@ -74,17 +63,15 @@ CpG
7463
Unmethylated CpG sites are associated with gene activation, while methylated CpG sites can lead to gene inhibition.
7564
7665
Cluster
77-
Clusters
7866
A group of a population or data points that share similarities.
79-
In single-cell, clusters usually share a common function or marker gene expression that is used for annotation (see {term}`cell type annotation`).
67+
In single-cell, clusters usually share a common function or marker gene expression that is used for annotation (see {term}`cell type annotation <Cell type annotation>`).
8068
8169
Complementary DNA (cDNA)
82-
cDNA
8370
DNA synthesized from an RNA template by the enzyme reverse transcriptase.
8471
cDNA is commonly used in RNA-seq library preparation because it is more stable than RNA and allows the captured transcripts to be amplified and sequenced for gene expression analysis.
8572
8673
Demultiplexing
87-
The process of determining which sequencing reads belong to which cell using {term}`barcodes`.
74+
The process of determining which sequencing reads belong to which cell using {term}`barcodes <Barcode>`.
8875
8976
directed graph
9077
A directed graph (or digraph) is a graph consisting of a set of nodes (vertices) connected by edges (arcs), where each edge has a direction indicating a one-way relationship between nodes.
@@ -98,12 +85,11 @@ Doublets
9885
Reads obtained from droplet based assays might be mistakenly associated to a single cell while the RNA expression origins from two or more cells (a doublet).
9986
10087
Downstream analysis
101-
downstream analyses
10288
A phase of data analysis that follows the initial processing of raw data.
10389
In the context of scRNA-seq, this includes tasks such as normalization, integration, filtering, cell type identification, trajectory inference, and studying expression dynamics.
10490
10591
Dropout
106-
A gene with low expression that is observed in one cell, but not in other cells of the same {term}`cell type`.
92+
A gene with low expression that is observed in one cell, but not in other cells of the same {term}`cell type <Cell type>`.
10793
The reason for dropouts are commonly low amounts of mRNA expression in cells and the general stochasticity of mRNA expression.
10894
Dropouts are one of the reasons why scRNA-seq data is sparse.
10995
@@ -114,13 +100,11 @@ Edit distance
114100
Edit distance (often referred to as Levenshtein distance) measures the minimum number of operations (Substitution, Insertion, Deletion) required to transform one string into another.
115101
116102
FASTQ
117-
FASTQ reads
118103
Sequencing reads that are saved in the FASTQ format.
119104
A FASTQ file stores DNA/RNA sequences and their corresponding quality scores in a 4-line format: identifier, sequence, optional description, and quality scores encoded in ASCII characters.
120105
FASTQ files are then used to map against the reference genome of interest to obtain gene counts for cells.
121106
122107
Flowcell
123-
flowcell
124108
A consumable device used in sequencing platforms where DNA or RNA fragments are sequenced.
125109
It consists of a glass or polymer surface with lanes or channels coated with oligonucleotides, which capture and anchor DNA or RNA fragments.
126110
During sequencing, these fragments are amplified into clusters, and their sequences are determined by detecting fluorescent signals emitted during nucleotide incorporation.
@@ -143,14 +127,11 @@ Library
143127
Also known as sequencing library. A pool of DNA fragments with attached sequencing adapters.
144128
145129
Modalities
146-
Multimodal
147130
Different types of biological information measured at the single-cell level.
148131
These include gene expression, chromatin accessibility, surface proteins, immune receptor sequences, and spatial organization.
149132
Combining these modalities provides a more complete understanding of cell identity, function, and interactions.
150133
151134
Locus
152-
Loci
153-
loci
154135
Specific position or region on a genome or transcriptome where a particular sequence or genetic feature is located.
155136
In sequencing, loci refer to the potential origins of a read or fragment, such as a gene, exon, or intergenic region.
156137
Accurate identification of loci is critical for mapping reads and understanding the genomic or transcriptomic context of the data.
@@ -160,7 +141,6 @@ MuData
160141
The primary data structure in the scverse ecosystem for multimodal data.
161142
162143
Muon
163-
muon
164144
A Python package for multi-modal single-cell analysis in Python by scverse.
165145
166146
Negative binomial distribution
@@ -198,7 +178,6 @@ RT-qPCR
198178
Quantitative reverse transcription {term}`PCR` (RT-qPCR) monitors the amplification of a targeted {term}`DNA` molecule during the PCR.
199179
200180
SAM
201-
SAM files
202181
SAM (Sequence Alignment/Map) files are tab-delimited text files that store sequencing alignment data, showing how sequencing reads map to a reference genome.
203182
Each line in a SAM file contains information about a single read alignment, including the read sequence, base quality scores, mapping position, and mapping quality.
204183
@@ -219,7 +198,6 @@ Spike-in RNA
219198
RNA transcripts of known sequence and quantity to calibrate measurements in RNA hybridization steps for RNA-seq.
220199
221200
Splice Junctions
222-
splice junctions
223201
Locations where introns are removed, and exons are joined together in a mature RNA transcript during RNA splicing.
224202
These junctions occur at specific nucleotide sequences and are critical for the proper assembly of functional mRNA.
225203
@@ -228,12 +206,10 @@ Trajectory inference
228206
The computational recovery of dynamic processes by ordering cells by similarity or other means.
229207
230208
Unique Molecular Identifier (UMI)
231-
UMI
232209
A special type of molecular barcode that uniquely tags each molecule in a sample library.
233-
This, for example, enables the estimation of PCR duplication rates (see {term}`amplification bias`), which leads to error correction and increases accuracy.
210+
This, for example, enables the estimation of PCR duplication rates (see {term}`amplification bias <Amplification bias>`), which leads to error correction and increases accuracy.
234211
235212
Untranslated Region (UTR)
236-
UTR
237213
A segment of an mRNA transcript that is transcribed but not translated into protein.
238214
UTRs are located at both ends of the coding sequence.
239215
```
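
The glossary entry for edit distance above describes counting substitutions, insertions, and deletions. As an illustration only, the textbook dynamic-programming formulation of Levenshtein distance can be sketched as follows. The function name `edit_distance` is hypothetical and is not taken from any tool referenced in the book.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            curr.append(
                min(
                    prev[j] + 1,  # delete char_a
                    curr[j - 1] + 1,  # insert char_b
                    prev[j - 1] + substitution_cost,  # match or substitute
                )
            )
        prev = curr
    return prev[-1]


print(edit_distance("ACGT", "AGT"))
```

Keeping only two rows of the table keeps memory proportional to the length of the second string, which is sufficient when only the distance, and not the alignment itself, is needed.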

jupyter-book/introduction/raw_data_processing.md

Lines changed: 9 additions & 9 deletions
@@ -16,15 +16,15 @@ An overview of the topics discussed in this chapter. In the plot, "txome" stands
1616
:::
1717

1818
The count matrix is the foundation for a wide range of scRNA-seq analyses {cite}`Zappia2021_raw`, including cell type identification or developmental trajectory inference.
19-
A robust and accurate count matrix is essential for reliable {term}`downstream analyses`.
19+
A robust and accurate count matrix is essential for reliable {term}`downstream analyses <Downstream analysis>`.
2020
Errors at this stage can lead to invalid conclusions and discoveries based on missed insights, or distorted signals in the data.
2121
Despite the straightforward nature of the input (FASTQ files) and the desired output (count matrix), raw data processing presents several technical challenges.
2222

2323
In this section, we focus on key steps of raw data processing:
2424

2525
1. Read alignment/mapping
2626
2. Cell barcode (CB) identification and correction
27-
3. Estimation of molecule counts through {term}`unique molecular identifiers (UMIs)`
27+
3. Estimation of molecule counts through {term}`unique molecular identifiers (UMIs) <Unique Molecular Identifier (UMI)>`
2828

2929
We also discuss the challenges and trade-offs involved in each step.
3030

@@ -103,7 +103,7 @@ A good (left) and a bad (right) per-read sequence quality graph.
103103
104104
**3. Per tile sequence quality**
105105
106-
Using an Illumina library, the per-tile sequence quality plot highlights deviations from the average quality for reads across each {term}`flowcell` [tile](https://www.biostars.org/p/9461090/)(miniature imaging areas of the {term}`flowcell`).
106+
Using an Illumina library, the per-tile sequence quality plot highlights deviations from the average quality for reads across each {term}` <Flowcell>` [tile](https://www.biostars.org/p/9461090/)(miniature imaging areas of the {term}`flowcell <Flowcell>`).
107107
The plot uses a color gradient to represent deviations, where warmer colors indicate larger deviations.
108108
High-quality data typically display a uniform blue color across the plot, indicating consistent quality across all tiles of the flowcell.
109109
@@ -221,7 +221,7 @@ An overrepresented sequence table.
221221
222222
**11. Adapter content**
223223
224-
The adapter content module displays the cumulative percentage of reads containing {term}`adapter sequences` at each base position.
224+
The adapter content module displays the cumulative percentage of reads containing {term}`adapter sequences <Adapter sequences>` at each base position.
225225
High levels of adapter sequences indicate incomplete removal of adapters during library preparation, which can interfere with downstream analyses.
226226
Ideally, no significant adapter content should be present in the data.
227227
If adapter sequences are abundant, additional trimming may be necessary to improve data quality.
@@ -241,14 +241,14 @@ Multiple FastQC reports can be combined into a single report using the tool [`Mu
241241
## Alignment and mapping
242242

243243
Mapping or Alignment is a critical step in single-cell raw data processing.
244-
It involves determining the potential {term}`loci` of origin for each sequenced fragment, such as the genomic or transcriptomic locations that closely match the read sequence.
244+
It involves determining the potential {term}`loci <Locus>` of origin for each sequenced fragment, such as the genomic or transcriptomic locations that closely match the read sequence.
245245
This step is essential for correctly assigning reads to their source regions.
246246

247247
In single-cell sequencing protocols, the raw sequence files typically include:
248248

249-
- Cell {term}`Barcodes` (CB): Unique identifiers for individual cells.
249+
- Cell {term}`Barcodes <Barcode>` (CB): Unique identifiers for individual cells.
250250
- Unique Molecular Identifiers (UMIs): Tags that distinguish individual molecules to account for amplification bias.
251-
- Raw {term}`cDNA` Sequences: The actual read sequences generated from the molecules.
251+
- Raw {term}`cDNA <Complementary DNA (cDNA)>` Sequences: The actual read sequences generated from the molecules.
252252

253253
As the first step ({numref}`raw-proc-fig-overview`), accurate mapping or alignment is crucial for reliable downstream analyses.
254254
Errors during this step, such as incorrect mapping of reads to transcripts or genes, can result in inaccurate or misleading count matrices.
@@ -298,7 +298,7 @@ Recent advances, such as wavefront alignment {cite}`marco2021fast`, marco2022opt
298298
Additionally, much work has focused on optimizing data layout and computation to leverage instruction-level parallelism {cite}`wozniak1997using, rognes2000six, farrar2007striped`, and expressing dynamic programming recurrences in ways that facilitate data parallelism and vectorization, such as through difference encoding {cite:t}`Suzuki2018`.
299299
Most widely-used alignment tools incorporate these highly optimized, vectorized implementations.
300300

301-
In addition to the alignment score, the {term}`backtrace` of the actual alignment that produces this score is often encoded as a `CIGAR` string (short for "Concise Idiosyncratic Gapped Alignment Report").
301+
In addition to the alignment score, the backtrace of the actual alignment that produces this score is often encoded as a `CIGAR` string (short for "Concise Idiosyncratic Gapped Alignment Report").
302302
This alphanumeric representation is typically stored in the SAM or BAM file output.
303303
For example, the `CIGAR` string `3M2D4M` indicates that the alignment has three matches or mismatches, followed by a deletion of length two (representing bases present in the reference but not the read), and then four more matches or mismatches.
304304
Extended `CIGAR` strings can provide additional details, such as distinguishing between matches, mismatches, or insertions.
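
The `3M2D4M` example in the context lines above can be made concrete with a small parser. This is a sketch under stated assumptions: the helper names `parse_cigar`, `reference_span`, and `read_span` are hypothetical, and the operation letters and the rules for which operations consume read or reference bases follow the SAM specification rather than anything stated in this chapter.

```python
import re

# One CIGAR element: a positive length followed by a single operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar: str) -> list[tuple[int, str]]:
    """Split a CIGAR string into (length, operation) pairs."""
    ops = [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]
    # Reject input that contains anything besides well-formed elements.
    if "".join(f"{length}{op}" for length, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return ops


def reference_span(cigar: str) -> int:
    """Number of reference bases covered (M, D, N, = and X consume reference)."""
    return sum(length for length, op in parse_cigar(cigar) if op in "MDN=X")


def read_span(cigar: str) -> int:
    """Number of read bases covered (M, I, S, = and X consume the read)."""
    return sum(length for length, op in parse_cigar(cigar) if op in "MIS=X")


print(parse_cigar("3M2D4M"), reference_span("3M2D4M"), read_span("3M2D4M"))
```

For `3M2D4M` the deletion contributes to the reference span but not to the read span, so the alignment covers nine reference bases using seven read bases.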
@@ -319,7 +319,7 @@ Alignment-based approaches can be categorized into spliced-alignment and contigu
319319

320320
```{dropdown} Spliced-alignment methods
321321
Spliced-alignment methods allow a sequence read to align across multiple distinct segments of a reference, allowing potentially large gaps between aligned regions.
322-
These approaches are particularly useful for aligning RNA-seq reads to the genome, where reads may span {term}`splice junctions`.
322+
These approaches are particularly useful for aligning RNA-seq reads to the genome, where reads may span {term}`splice junctions <Splice Junctions>`.
323323
In such cases, a contiguous sequence in the read may be separated by intron and exon subsequence in the reference, potentially spanning kilobases of sequence.
324324
Spliced alignment is especially challenging when only a small portion of a read overlaps a splice junction, as limited sequence information is available to accurately place the overhanging segment.
325325
```

jupyter-book/introduction/scrna_seq.md

Lines changed: 3 additions & 3 deletions
@@ -310,7 +310,7 @@ This allows several hundred cells to be analyzed in a single experiment with 500
310310
Plate-based sequencing protocols include but are not limited to, SMART-seq2, MARS-seq, QUARTZ-seq, and SRCB-seq. Generally speaking, the protocols differ in their multiplexing ability.
311311
For example, MARS-seq allows for three barcode levels, namely molecular, cellular, and plate-level tags, for robust multiplexing capabilities. SMART-seq2, on the contrary, does not allow for early multiplexing, limiting cell numbers.
312312
A systematic comparison of protocols by Mereu et al. in 2020 revealed that QUARTZ-seq2 can capture more genes than SMART-seq2, MARS-seq, or SRCB-seq per cell {cite}`Mereu2020`.
313-
This means QUARTZ-seq2 can capture cell-type specific marker genes well, allowing for confident cell-type {term}`annotation`.
313+
This means QUARTZ-seq2 can capture cell-type specific marker genes well, allowing for confident cell-type annotation.
314314

315315
Strengths:
316316

@@ -322,7 +322,7 @@ Limitations:
322322

323323
- The scale of plate-based experiments is limited by the lower throughput of their individual processing units.
324324
- Fragmentation step eliminates strand-specific information {cite}`Hrdlickova2017`.
325-
- Depending on the protocol, plate-based protocols might be labor-intensive with many required pipetting steps, leading to potential technical noise and {term}`batch effects`.
325+
- Depending on the protocol, plate-based protocols might be labor-intensive with many required pipetting steps, leading to potential technical noise and batch effects.
326326

327327
#### Fluidigm C1
328328

@@ -373,7 +373,7 @@ In this case, one of the plate-based methods may be more suitable.
373373
On the contrary, droplet-based assays will capture heterogeneous mixtures better, allowing for a broader characterization of the sequenced cells.
374374
Moreover, if the budget is a limiting factor, the protocol of choice should be more cost-effective and robust.
375375
When analyzing the data, be aware of the sequencing assay-specific biases.
376-
For an extensive comparison of all single-cell sequencing protocols, we recommend the "{term}`Benchmarking` single-cell RNA-sequencing protocols for cell atlas projects" paper by Mereu et al. {cite}`Mereu2020`.
376+
For an extensive comparison of all single-cell sequencing protocols, we recommend the "Benchmarking single-cell RNA-sequencing protocols for cell atlas projects" paper by Mereu et al. {cite}`Mereu2020`.
377377

378378
(introduction-scrna-seq-key-takeaway-5)=
379379
