jupyter-book/glossary.md
Lines changed: 7 additions & 31 deletions
Original file line number
Diff line number
Diff line change
@@ -2,34 +2,26 @@
2
2
3
3
```{glossary}
4
4
Adapter sequences
5
-
adapter sequences
6
5
Short, synthetic DNA or RNA sequences that are ligated to the ends of DNA or RNA fragments during library preparation for sequencing.
7
6
These adapters are essential for binding the fragments to the flowcell and enabling amplification and sequencing.
8
7
However, if adapters are not trimmed after sequencing, they can appear in the reads, potentially interfering with alignment and downstream analyses.
9
8
10
9
Algorithm
11
-
Algorithms
12
10
A pre-defined set of instructions to solve a problem.
13
11
14
12
AnnData
15
-
AnnDatas
16
13
A Python package for handling annotated data matrices, commonly used in single-cell and other omics analyses.
17
14
It provides an efficient way to store data as a matrix where rows (observations) and columns (features) can have associated metadata.
18
15
[AnnData](https://anndata.readthedocs.io/en/latest/index.html) supports slicing, subsetting, and saving to disk in formats like H5AD and Zarr.
19
16
20
17
BAM
21
-
BAM files
22
18
BAM files are binary, compressed versions of SAM (Sequence Alignment/Map) files that store sequencing read alignments to a reference genome.
23
19
They contain the same information as {term}`SAM` files - including read sequences, quality scores, and alignment positions - but in a more space-efficient format that enables faster processing and reduced storage requirements.
24
20
25
21
Amplification bias
26
22
A distortion that occurs during DNA or RNA amplification (e.g., PCR), where certain sequences are copied more efficiently than others. This can lead to uneven or inaccurate representation of the original genetic material, affecting results in experiments like sequencing or gene expression analysis.
27
23
28
24
Barcode
29
-
Barcodes
30
-
Bar code
31
-
Bar code
32
-
Cell barcode
33
25
Short DNA barcode fragments ("tags") that are used to identify reads originating from the same cell.
34
26
Reads are later grouped by their barcode during raw data processing steps.
35
27
@@ -41,25 +33,22 @@ Benchmark
41
33
An (independent) comparison of performance of several tools with respect to pre-defined metrics.
42
34
43
35
Bulk RNA sequencing
44
-
bulk RNA-Seq
45
-
bulk sequencing
46
36
Contrary to single-cell sequencing, bulk sequencing measures the average expression values of several cells. Therefore, resolution is lost, but bulk sequencing is usually cheaper, less laborious and faster to analyze.
47
37
48
38
Cell
49
-
cells
50
39
The fundamental unit of life, consisting of cytoplasm enclosed within a membrane, containing biomolecules such as proteins and nucleic acids.
51
40
Cells acquire specific functions, transition into different types, divide, and communicate to sustain an organism.
52
41
Studying cell structure, activity, and interactions enables insights into gene expression dynamics, cellular trajectories, developmental lineages, and disease mechanisms.
53
42
54
43
Cell type annotation
55
-
The process of labeling groups of {term}`clusters` of cells by {term}`cell type`.
56
-
Commonly done based on {term}`cell type` specific markers, automatically with classifiers or by mapping against a reference.
44
+
The process of labeling groups of {term}`clusters <Cluster>` of cells by {term}`cell type <Cell type>`.
45
+
Commonly done based on cell type specific markers, automatically with classifiers or by mapping against a reference.
57
46
58
47
Cell type
59
48
Cells that share common morphological or phenotypic features.
60
49
61
50
Cell state
62
-
Cells can be annotated according to {term}`cell type` or other cell states as defined by the cell-cycle, perturbational state or other features.
51
+
Cells can be annotated according to {term}`cell type <Cell type>` or other cell states as defined by the cell-cycle, perturbational state or other features.
63
52
64
53
Chromatin
65
54
The complex of DNA and proteins efficiently packaging the DNA inside the nucleus and involved in regulating gene expression.
@@ -74,17 +63,15 @@ CpG
74
63
Unmethylated CpG sites are associated with gene activation, while methylated CpG sites can lead to gene inhibition.
75
64
76
65
Cluster
77
-
Clusters
78
66
A group of a population or data points that share similarities.
79
-
In single-cell, clusters usually share a common function or marker gene expression that is used for annotation (see {term}`cell type annotation`).
67
+
In single-cell, clusters usually share a common function or marker gene expression that is used for annotation (see {term}`cell type annotation <Cell type annotation>`).
80
68
81
69
Complementary DNA (cDNA)
82
-
cDNA
83
70
DNA synthesized from an RNA template by the enzyme reverse transcriptase.
84
71
cDNA is commonly used in RNA-seq library preparation because it is more stable than RNA and allows the captured transcripts to be amplified and sequenced for gene expression analysis.
85
72
86
73
Demultiplexing
87
-
The process of determining which sequencing reads belong to which cell using {term}`barcodes`.
74
+
The process of determining which sequencing reads belong to which cell using {term}`barcodes <Barcode>`.
88
75
89
76
directed graph
90
77
A directed graph (or digraph) is a graph consisting of a set of nodes (vertices) connected by edges (arcs), where each edge has a direction indicating a one-way relationship between nodes.
@@ -98,12 +85,11 @@ Doublets
98
85
Reads obtained from droplet-based assays might be mistakenly associated with a single cell while the RNA expression originates from two or more cells (a doublet).
99
86
100
87
Downstream analysis
101
-
downstream analyses
102
88
A phase of data analysis that follows the initial processing of raw data.
103
89
In the context of scRNA-seq, this includes tasks such as normalization, integration, filtering, cell type identification, trajectory inference, and studying expression dynamics.
104
90
105
91
Dropout
106
-
A gene with low expression that is observed in one cell, but not in other cells of the same {term}`cell type`.
92
+
A gene with low expression that is observed in one cell, but not in other cells of the same {term}`cell type <Cell type>`.
107
93
Dropouts are commonly caused by low amounts of mRNA expression in cells and the general stochasticity of mRNA expression.
108
94
Dropouts are one of the reasons why scRNA-seq data is sparse.
109
95
@@ -114,13 +100,11 @@ Edit distance
114
100
Edit distance (often referred to as Levenshtein distance) measures the minimum number of operations (Substitution, Insertion, Deletion) required to transform one string into another.
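
The definition above can be illustrated with a minimal sketch of the classic dynamic-programming computation of Levenshtein distance. The function name `levenshtein` is chosen here for illustration only and is not part of the glossary or of any package it mentions; each of the three operations is assumed to cost 1.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of substitutions, insertions and
    deletions needed to turn string a into string b (unit costs assumed)."""
    # previous[j] is the distance between the already processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        # Turning a[:i] into the empty string takes i deletions.
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    # One deleted base separates the two sequences.
    print(levenshtein("ACGT", "AGT"))
```

Keeping only two rows of the table keeps memory proportional to the length of `b`, which is enough when only the distance, and not the alignment itself, is needed.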
115
101
116
102
FASTQ
117
-
FASTQ reads
118
103
Sequencing reads that are saved in the FASTQ format.
119
104
A FASTQ file stores DNA/RNA sequences and their corresponding quality scores in a 4-line format: identifier, sequence, optional description, and quality scores encoded in ASCII characters.
120
105
FASTQ files are then used to map against the reference genome of interest to obtain gene counts for cells.
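
To make the 4-line record layout concrete, here is a small sketch of a FASTQ reader. The names `FastqRecord` and `read_fastq` are hypothetical, and the quality offset of 33 (Phred+33) is an assumption that holds for most current data but is not stated in the glossary entry. Records are assumed to use exactly four lines each, without wrapped sequences.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    identifier: str
    sequence: str
    qualities: List[int]


def read_fastq(handle: TextIO, offset: int = 33) -> Iterator[FastqRecord]:
    """Yield one record per 4-line block: identifier, sequence,
    separator/optional description, and ASCII-encoded quality scores."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        # Assumed encoding: quality score = ASCII code minus the offset.
        scores = [ord(symbol) - offset for symbol in quality]
        yield FastqRecord(header[1:], sequence, scores)


if __name__ == "__main__":
    example = io.StringIO("@read1\nACGT\n+\nII5!\n")
    for record in read_fastq(example):
        print(record.identifier, record.sequence, record.qualities)
```

For real data, an established parser is usually preferable; the sketch only shows how the four lines relate to each other.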
121
106
122
107
Flowcell
123
-
flowcell
124
108
A consumable device used in sequencing platforms where DNA or RNA fragments are sequenced.
125
109
It consists of a glass or polymer surface with lanes or channels coated with oligonucleotides, which capture and anchor DNA or RNA fragments.
126
110
During sequencing, these fragments are amplified into clusters, and their sequences are determined by detecting fluorescent signals emitted during nucleotide incorporation.
@@ -143,14 +127,11 @@ Library
143
127
Also known as sequencing library. A pool of DNA fragments with attached sequencing adapters.
144
128
145
129
Modalities
146
-
Multimodal
147
130
Different types of biological information measured at the single-cell level.
148
131
These include gene expression, chromatin accessibility, surface proteins, immune receptor sequences, and spatial organization.
149
132
Combining these modalities provides a more complete understanding of cell identity, function, and interactions.
150
133
151
134
Locus
152
-
Loci
153
-
loci
154
135
Specific position or region on a genome or transcriptome where a particular sequence or genetic feature is located.
155
136
In sequencing, loci refer to the potential origins of a read or fragment, such as a gene, exon, or intergenic region.
156
137
Accurate identification of loci is critical for mapping reads and understanding the genomic or transcriptomic context of the data.
@@ -160,7 +141,6 @@ MuData
160
141
The primary data structure in the scverse ecosystem for multimodal data.
161
142
162
143
Muon
163
-
muon
164
144
A Python package by scverse for multimodal single-cell analysis.
165
145
166
146
Negative binomial distribution
@@ -198,7 +178,6 @@ RT-qPCR
198
178
Quantitative reverse transcription {term}`PCR` (RT-qPCR) monitors the amplification of a targeted {term}`DNA` molecule during the PCR.
199
179
200
180
SAM
201
-
SAM files
202
181
SAM (Sequence Alignment/Map) files are tab-delimited text files that store sequencing alignment data, showing how sequencing reads map to a reference genome.
203
182
Each line in a SAM file contains information about a single read alignment, including the read sequence, base quality scores, mapping position, and mapping quality.
204
183
@@ -219,7 +198,6 @@ Spike-in RNA
219
198
RNA transcripts of known sequence and quantity to calibrate measurements in RNA hybridization steps for RNA-seq.
220
199
221
200
Splice Junctions
222
-
splice junctions
223
201
Locations where introns are removed, and exons are joined together in a mature RNA transcript during RNA splicing.
224
202
These junctions occur at specific nucleotide sequences and are critical for the proper assembly of functional mRNA.
225
203
@@ -228,12 +206,10 @@ Trajectory inference
228
206
The computational recovery of dynamic processes by ordering cells by similarity or other means.
229
207
230
208
Unique Molecular Identifier (UMI)
231
-
UMI
232
209
A special type of molecular barcode that uniquely tags each molecule in a sample library.
233
-
This, for example, enables the estimation of PCR duplication rates (see {term}`amplification bias`), which leads to error correction and increases accuracy.
210
+
This, for example, enables the estimation of PCR duplication rates (see {term}`amplification bias <Amplification bias>`), which leads to error correction and increases accuracy.
234
211
235
212
Untranslated Region (UTR)
236
-
UTR
237
213
A segment of an mRNA transcript that is transcribed but not translated into protein.
238
214
UTRs are located at both ends of the coding sequence.
jupyter-book/introduction/raw_data_processing.md
Lines changed: 9 additions & 9 deletions
Original file line number
Diff line number
Diff line change
@@ -16,15 +16,15 @@ An overview of the topics discussed in this chapter. In the plot, "txome" stands
16
16
:::
17
17
18
18
The count matrix is the foundation for a wide range of scRNA-seq analyses {cite}`Zappia2021_raw`, including cell type identification or developmental trajectory inference.
19
-
A robust and accurate count matrix is essential for reliable {term}`downstream analyses`.
19
+
A robust and accurate count matrix is essential for reliable {term}`downstream analyses <Downstream analysis>`.
20
20
Errors at this stage can lead to missed insights, distorted signals in the data, and ultimately invalid conclusions.
21
21
Despite the straightforward nature of the input (FASTQ files) and the desired output (count matrix), raw data processing presents several technical challenges.
22
22
23
23
In this section, we focus on key steps of raw data processing:
24
24
25
25
1. Read alignment/mapping
26
26
2. Cell barcode (CB) identification and correction
27
-
3. Estimation of molecule counts through {term}`unique molecular identifiers (UMIs)`
27
+
3. Estimation of molecule counts through {term}`unique molecular identifiers (UMIs) <Unique Molecular Identifier (UMI)>`
28
28
29
29
We also discuss the challenges and trade-offs involved in each step.
30
30
@@ -103,7 +103,7 @@ A good (left) and a bad (right) per-read sequence quality graph.
103
103
104
104
**3. Per tile sequence quality**
105
105
106
-
Using an Illumina library, the per-tile sequence quality plot highlights deviations from the average quality for reads across each {term}`flowcell` [tile](https://www.biostars.org/p/9461090/)(miniature imaging areas of the {term}`flowcell`).
106
+
Using an Illumina library, the per-tile sequence quality plot highlights deviations from the average quality for reads across each {term}`flowcell <Flowcell>` [tile](https://www.biostars.org/p/9461090/)(miniature imaging areas of the {term}`flowcell <Flowcell>`).
107
107
The plot uses a color gradient to represent deviations, where warmer colors indicate larger deviations.
108
108
High-quality data typically display a uniform blue color across the plot, indicating consistent quality across all tiles of the flowcell.
109
109
@@ -221,7 +221,7 @@ An overrepresented sequence table.
221
221
222
222
**11. Adapter content**
223
223
224
-
The adapter content module displays the cumulative percentage of reads containing {term}`adapter sequences` at each base position.
224
+
The adapter content module displays the cumulative percentage of reads containing {term}`adapter sequences <Adapter sequences>` at each base position.
225
225
High levels of adapter sequences indicate incomplete removal of adapters during library preparation, which can interfere with downstream analyses.
226
226
Ideally, no significant adapter content should be present in the data.
227
227
If adapter sequences are abundant, additional trimming may be necessary to improve data quality.
@@ -241,14 +241,14 @@ Multiple FastQC reports can be combined into a single report using the tool [`Mu
241
241
## Alignment and mapping
242
242
243
243
Mapping or alignment is a critical step in single-cell raw data processing.
244
-
It involves determining the potential {term}`loci` of origin for each sequenced fragment, such as the genomic or transcriptomic locations that closely match the read sequence.
244
+
It involves determining the potential {term}`loci <Locus>` of origin for each sequenced fragment, such as the genomic or transcriptomic locations that closely match the read sequence.
245
245
This step is essential for correctly assigning reads to their source regions.
246
246
247
247
In single-cell sequencing protocols, the raw sequence files typically include:
248
248
249
-
- Cell {term}`Barcodes` (CB): Unique identifiers for individual cells.
249
+
- Cell {term}`Barcodes <Barcode>` (CB): Unique identifiers for individual cells.
250
250
- Unique Molecular Identifiers (UMIs): Tags that distinguish individual molecules to account for amplification bias.
251
-
- Raw {term}`cDNA` Sequences: The actual read sequences generated from the molecules.
251
+
- Raw {term}`cDNA <Complementary DNA (cDNA)>` Sequences: The actual read sequences generated from the molecules.
252
252
253
253
As the first step ({numref}`raw-proc-fig-overview`), accurate mapping or alignment is crucial for reliable downstream analyses.
254
254
Errors during this step, such as incorrect mapping of reads to transcripts or genes, can result in inaccurate or misleading count matrices.
@@ -298,7 +298,7 @@ Recent advances, such as wavefront alignment {cite}`marco2021fast`, marco2022opt
298
298
Additionally, much work has focused on optimizing data layout and computation to leverage instruction-level parallelism {cite}`wozniak1997using, rognes2000six, farrar2007striped`, and expressing dynamic programming recurrences in ways that facilitate data parallelism and vectorization, such as through difference encoding {cite:t}`Suzuki2018`.
299
299
Most widely-used alignment tools incorporate these highly optimized, vectorized implementations.
300
300
301
-
In addition to the alignment score, the {term}`backtrace` of the actual alignment that produces this score is often encoded as a `CIGAR` string (short for "Concise Idiosyncratic Gapped Alignment Report").
301
+
In addition to the alignment score, the backtrace of the actual alignment that produces this score is often encoded as a `CIGAR` string (short for "Concise Idiosyncratic Gapped Alignment Report").
302
302
This alphanumeric representation is typically stored in the SAM or BAM file output.
303
303
For example, the `CIGAR` string `3M2D4M` indicates that the alignment has three matches or mismatches, followed by a deletion of length two (representing bases present in the reference but not the read), and then four more matches or mismatches.
304
304
Extended `CIGAR` strings can provide additional details, such as distinguishing between matches, mismatches, or insertions.
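
As an illustration of how a `CIGAR` string such as `3M2D4M` can be read, the following sketch splits it into (length, operation) pairs and counts how many reference and read bases the alignment covers. The helper names `parse_cigar` and `aligned_lengths` are hypothetical; the sets of operations that consume reference or read bases follow the SAM specification.

```python
import re
from typing import List, Tuple

# One CIGAR element is a run length followed by a single operation code.
CIGAR_ELEMENT = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_REFERENCE = set("MDN=X")
CONSUMES_READ = set("MIS=X")


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split a CIGAR string into (length, operation) pairs."""
    elements = [(int(length), op) for length, op in CIGAR_ELEMENT.findall(cigar)]
    # Reject input that contains anything besides well-formed elements.
    if "".join(f"{length}{op}" for length, op in elements) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return elements


def aligned_lengths(cigar: str) -> Tuple[int, int]:
    """Return (reference bases covered, read bases covered)."""
    elements = parse_cigar(cigar)
    reference = sum(n for n, op in elements if op in CONSUMES_REFERENCE)
    read = sum(n for n, op in elements if op in CONSUMES_READ)
    return reference, read


if __name__ == "__main__":
    # The two deleted bases exist only in the reference.
    print(parse_cigar("3M2D4M"), aligned_lengths("3M2D4M"))
```

For `3M2D4M`, the deletion contributes to the reference span but not to the read, so the alignment covers nine reference bases with seven read bases.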
@@ -319,7 +319,7 @@ Alignment-based approaches can be categorized into spliced-alignment and contigu
319
319
320
320
```{dropdown} Spliced-alignment methods
321
321
Spliced-alignment methods allow a sequence read to align across multiple distinct segments of a reference, allowing potentially large gaps between aligned regions.
322
-
These approaches are particularly useful for aligning RNA-seq reads to the genome, where reads may span {term}`splice junctions`.
322
+
These approaches are particularly useful for aligning RNA-seq reads to the genome, where reads may span {term}`splice junctions <Splice Junctions>`.
323
323
In such cases, a contiguous sequence in the read may be separated by intron and exon subsequence in the reference, potentially spanning kilobases of sequence.
324
324
Spliced alignment is especially challenging when only a small portion of a read overlaps a splice junction, as limited sequence information is available to accurately place the overhanging segment.
jupyter-book/introduction/scrna_seq.md
Lines changed: 3 additions & 3 deletions
Original file line number
Diff line number
Diff line change
@@ -310,7 +310,7 @@ This allows several hundred cells to be analyzed in a single experiment with 500
310
310
Plate-based sequencing protocols include but are not limited to, SMART-seq2, MARS-seq, QUARTZ-seq, and SRCB-seq. Generally speaking, the protocols differ in their multiplexing ability.
311
311
For example, MARS-seq allows for three barcode levels, namely molecular, cellular, and plate-level tags, for robust multiplexing capabilities. SMART-seq2, on the contrary, does not allow for early multiplexing, limiting cell numbers.
312
312
A systematic comparison of protocols by Mereu et al. in 2020 revealed that QUARTZ-seq2 can capture more genes than SMART-seq2, MARS-seq, or SRCB-seq per cell {cite}`Mereu2020`.
313
-
This means QUARTZ-seq2 can capture cell-type specific marker genes well, allowing for confident cell-type {term}`annotation`.
313
+
This means QUARTZ-seq2 can capture cell-type specific marker genes well, allowing for confident cell-type annotation.
314
314
315
315
Strengths:
316
316
@@ -322,7 +322,7 @@ Limitations:
322
322
323
323
- The scale of plate-based experiments is limited by the lower throughput of their individual processing units.
324
324
- Fragmentation step eliminates strand-specific information {cite}`Hrdlickova2017`.
325
-
- Depending on the protocol, plate-based protocols might be labor-intensive with many required pipetting steps, leading to potential technical noise and {term}`batch effects`.
325
+
- Depending on the protocol, plate-based protocols might be labor-intensive with many required pipetting steps, leading to potential technical noise and batch effects.
326
326
327
327
#### Fluidigm C1
328
328
@@ -373,7 +373,7 @@ In this case, one of the plate-based methods may be more suitable.
373
373
On the contrary, droplet-based assays will capture heterogeneous mixtures better, allowing for a broader characterization of the sequenced cells.
374
374
Moreover, if the budget is a limiting factor, the protocol of choice should be more cost-effective and robust.
375
375
When analyzing the data, be aware of the sequencing assay-specific biases.
376
-
For an extensive comparison of all single-cell sequencing protocols, we recommend the "{term}`Benchmarking` single-cell RNA-sequencing protocols for cell atlas projects" paper by Mereu et al. {cite}`Mereu2020`.
376
+
For an extensive comparison of all single-cell sequencing protocols, we recommend the "Benchmarking single-cell RNA-sequencing protocols for cell atlas projects" paper by Mereu et al. {cite}`Mereu2020`.