Update scRNA-seq chapter and add a paragraph on quantification of gene expression (#371)
* add dropdowns for Strength and limitations
* shorten the first paragraph
* links for new developments in sequencing
* some notes for new developments in scRNA-Seq
* notes for new developments
* bullet points
* first fact for bio paragraph
* first text for bio chapter
* ideas and first sentences for bio-chapter
* continue writing the text
* safe process
* the current state of my ideas and the rough structure of the paragraph
* adding text from splicing to protein/RNA ratio
* missing changes to .bib
* before the finalization of the paragraph
* improve flow and wording of 'central dogma in numbers'
* save current state of table progress
* change to version pre-commit/action@v3.0.0 and insert new chapter to flow of the text
* fix pdfhtml error
* restructure scRNA-Seq protocols
* add PIP-seq dropdown
* add some more sentences and the questions
* add figure for central dogma in numbers
* correctly link batch effects
* update key takeaways
* first improvements
* add a video reference and new terms to glossary
* properties of negative binomial distribution
* remove central dogma and improve figure
* link seealso and new terms in the other chapters
* add url and doi to bibliography
* add a few new lines
* transform to .ipynb
* transform to .ipynb
* small typo in .png
* remove .key files
* make key terms bold in key takeaways
* add batch effect to chapters
* add changelog
* add link to next chapter
---------
Co-authored-by: Luis <ge34lah@mytum.de>
Update scRNA-seq chapter and add a paragraph on quantification of gene expression ([#371](https://github.com/theislab/single-cell-best-practices/pull/371)) <sub>@LuisHeinzlmeier</sub>
jupyter-book/cellular_structure/annotation.ipynb
Lines changed: 2 additions & 2 deletions
@@ -19,7 +19,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
-"To understand your data better and make use of existing knowledge, it is important to figure out the \"cellular identity\" of each of the cells in your data. The process of labeling groups of cells in your data based on known (or sometimes unknown) cellular phenotypes is called \"cell annotation\". Whereas there are many ways to annotate your cells (e.g. based on batch, disease, sex and more), in this notebook we will focus on the annotation of \"cell types\".<br>\n",
+"To understand your data better and make use of existing knowledge, it is important to figure out the \"cellular identity\" of each of the cells in your data. The process of labeling groups of cells in your data based on known (or sometimes unknown) cellular phenotypes is called \"cell annotation\". Whereas there are many ways to annotate your cells (e.g. based on {term}`batch <batch effect>`, disease, sex and more), in this notebook we will focus on the annotation of \"cell types\".<br>\n",
"So what is a cell type? Biologists use the term cell type to denote a cellular phenotype that is robust across datasets, identifiable based on expression of specific markers (i.e. proteins or gene transcripts), and often linked to specific functions. For example, a plasma B cell is a type of white blood cell that secretes antibodies used to fight pathogens and it can be identified using specific markers. Knowing which cell types are in your sample is essential in understanding your data. For example, knowing that there are specific immune cell types in a tumor or unusual hematopoietic stem cells in your bone marrow sample can be a valuable insight into the disease you might be studying.<br>\n",
"However, like with any categorization the size of categories and the borders drawn between them are partly subjective and can change over time, e.g. because new technologies allow for a higher resolution view of cells, or because specific \"sub-phenotypes\" that were not considered biologically meaningful are found to have important biological implications (see e.g. {cite}`anno:KadurLakshminarasimhaMurthy2022`). Cell types are therefore often further classified into \"subtypes\" or \"cell states\" (e.g. activated versus resting) and some researchers use the term \"cell identity\" to avoid this sometimes arbitrary distinction of cell types, cell subtypes and cell states. For a more detailed discussion of this topic, we recommend the review by Wagner et al. {cite}`anno:Wagner2016` and the recently published review by Zeng {cite}`anno:ZENG20222739`.<br>\n",
"Similarly, multiple cell types can be part of a single continuum, where one cell type might transition or differentiate into another. For example, in hematopoiesis cells differentiate from a stem cell into a specific immune cell type. Although hard borders between early and late stages of this differentiation are often drawn, the state of these cells can more accurately be described by the differentiation coordinate between the less and more differentiated cellular phenotypes. We will discuss differentiation and cellular trajectories in subsequent chapters.<br>\n",
@@ -807,7 +807,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
-"You might notice that the annotation of B1 B cells is difficult, with none of the clusters expressing all the B1 B markers and several clusters expressing some of the markers. We often see that markers that work for one dataset do not work as well for others. This can be due to differences in sequencing depth, but also due to other sources of variation between datasets or samples. "
+"You might notice that the annotation of B1 B cells is difficult, with none of the clusters expressing all the B1 B markers and several clusters expressing some of the markers. We often see that markers that work for one dataset do not work as well for others. This can be due to differences in {term}`sequencing` depth, but also due to other sources of variation between datasets or samples. "
jupyter-book/cellular_structure/clustering.ipynb
Lines changed: 1 addition & 1 deletion
@@ -24,7 +24,7 @@
"id": "99163c2d",
"metadata": {},
"source": [
-"Preprocessing and visualization enabled us to describe our scRNA-seq dataset and reduce its dimensionality. Up to this point, we embedded and visualized cells to understand the underlying properties of our dataset. However, they are still rather abstractly defined. The next natural step in single-cell analysis is the identification of cellular structure in the dataset. \n",
+"Preprocessing and visualization enabled us to describe our scRNA-{term}`seq <sequencing>` dataset and reduce its dimensionality. Up to this point, we embedded and visualized cells to understand the underlying properties of our dataset. However, they are still rather abstractly defined. The next natural step in single-cell analysis is the identification of cellular structure in the dataset. \n",
"\n",
"In scRNA-seq data analysis, we describe cellular structure in our dataset with finding cell identities that relate to known cell states or cell cycle stages. This process is usually called cell identity annotation. For this purpose, we structure cells into clusters to infer the identity of similar cells. Clustering itself is a common unsupervised machine learning problem. \n",
"We can derive clusters by minimizing the intra-cluster distance in the reduced expression space. In this case, the expression space determines the gene expression similarity of cells with respect to a dimensionality-reduced representation. This lower dimensional representation is, for example, determined with a principal-component analysis and the similarity scoring is then based on Euclidean distances. \n",
-"A central challenge in most scRNA-seq data analyses is presented by batch effects. Batch effects are changes in measured expression levels that are the result of handling cells in distinct groups or “batches”. For example, a batch effect can arise if two labs have taken samples from the same cohort, but these samples are dissociated differently. If Lab A optimizes its dissociation protocol to dissociate cells in the sample while minimizing the stress on them, and Lab B does not, then it is likely that the cells in the data from the group B will express more stress-linked genes (JUN, JUNB, FOS, etc. see {cite}`Van_den_Brink2017-si`) even if the cells had the same profile in the original tissue. In general, the origins of batch effects are diverse and difficult to pin down. Some batch effect sources might be technical such as differences in sample handling, experimental protocols, or sequencing depths, but biological effects such as donor variation, tissue, or sampling location are also often interpreted as a batch effect {cite}`Luecken2021-jo`. Whether or not biological factors should be considered batch effects can depend on the experimental design and the question being asked. Removing batch effects is crucial to enable joint analysis that can focus on finding common structure in the data across batches and enable us to perform queries across datasets. Often it is only after removing these effects that rare cell populations can be identified that were previously obscured by differences between batches. Enabling queries across datasets allows us to ask questions that could not be answered by analysing individual datasets, such as _Which cell types express SARS-CoV-2 entry factors and how does this expression differ between individuals?_ {cite}`Muus2021-ti`."
+"A central challenge in most scRNA-seq data analyses is presented by batch effects. Batch effects are changes in measured expression levels that are the result of handling cells in distinct groups or “batches”. For example, a batch effect can arise if two labs have taken samples from the same cohort, but these samples are dissociated differently. If Lab A optimizes its dissociation protocol to dissociate cells in the sample while minimizing the stress on them, and Lab B does not, then it is likely that the cells in the data from the group B will express more stress-linked genes (JUN, JUNB, FOS, etc. see {cite}`Van_den_Brink2017-si`) even if the cells had the same profile in the original tissue. In general, the origins of batch effects are diverse and difficult to pin down. Some batch effect sources might be technical such as differences in sample handling, experimental protocols, or {term}`sequencing` depths, but biological effects such as donor variation, tissue, or sampling location are also often interpreted as a batch effect {cite}`Luecken2021-jo`. Whether or not biological factors should be considered batch effects can depend on the experimental design and the question being asked. Removing batch effects is crucial to enable joint analysis that can focus on finding common structure in the data across batches and enable us to perform queries across datasets. Often it is only after removing these effects that rare cell populations can be identified that were previously obscured by differences between batches. Enabling queries across datasets allows us to ask questions that could not be answered by analysing individual datasets, such as _Which cell types express SARS-CoV-2 entry factors and how does this expression differ between individuals?_ {cite}`Muus2021-ti`."
-"Every cell of an organism shares the same DNA with the same set of functional units referred to as genes. With this in mind, what determines the tremendous diversity of cells reaching from natural killer cells of the immune system to neurons transmitting electrochemical signals throughout the body? In the previous chapters, we saw that cell identity and function can be inferred from gene expression profiles in each cell. The control of gene expression is driven by a complex interplay of regulatory mechanisms such as DNA methylation, histone modifications, and transcription factor activity. {term}`Chromatin` accessibility largely reflects the combined regulatory state of a cell, serving as an orthogonal layer of information to mRNA levels describing cell identity. Furthermore, exploring the chromatin accessibility profile enables additional insights into gene regulatory mechanisms and cell differentiation processes that might not be captured by scRNA-seq data."
+"Every cell of an organism shares the same DNA with the same set of functional units referred to as genes. With this in mind, what determines the tremendous diversity of cells reaching from natural killer cells of the immune system to neurons transmitting electrochemical signals throughout the body? In the previous chapters, we saw that cell identity and function can be inferred from gene expression profiles in each cell. The control of gene expression is driven by a complex interplay of regulatory mechanisms such as DNA methylation, histone modifications, and transcription factor activity. {term}`Chromatin` accessibility largely reflects the combined regulatory state of a cell, serving as an orthogonal layer of information to {term}`mRNA <Messenger RNA (mRNA)>` levels describing cell identity. Furthermore, exploring the chromatin accessibility profile enables additional insights into gene regulatory mechanisms and cell differentiation processes that might not be captured by scRNA-seq data."
-"ECCITE-seq overview. mRNA is measured together with surface protein expression using antibody derived tags. Biological replicates are resolved through hashtag-derived oligonucleotides. The assignment of the guide RNAs is done using guide-derived oligonucleotides. Image obtained from (https://cite-seq.com/eccite-seq).\n",
+"ECCITE-seq overview. {term}`mRNA <Messenger RNA (mRNA)>` is measured together with surface protein expression using antibody derived tags. Biological replicates are resolved through hashtag-derived oligonucleotides. The assignment of the guide RNAs is done using guide-derived oligonucleotides. Image obtained from (https://cite-seq.com/eccite-seq).\n",
jupyter-book/glossary.md
Lines changed: 16 additions & 13 deletions
@@ -73,7 +73,7 @@ Complementary DNA (cDNA)
Demultiplexing
The process of determining which sequencing reads belong to which cell using {term}`barcodes <Barcode>`.

-directed graph
+Directed graph
A directed graph (or digraph) is a graph consisting of a set of nodes (vertices) connected by edges (arcs), where each edge has a direction indicating a one-way relationship between nodes.

DNA
@@ -90,7 +90,7 @@ Downstream analysis

Dropout
A gene with low expression that is observed in one cell, but not in other cells of the same {term}`cell type <Cell type>`.
-The reason for dropouts are commonly low amounts of mRNA expression in cells and the general stochasticity of mRNA expression.
+The reason for dropouts are commonly low amounts of {term}`mRNA <Messenger RNA (mRNA)>` expression in cells and the general stochasticity of mRNA expression.
Dropouts are one of the reasons why scRNA-seq data is sparse.

Drop-seq
@@ -136,6 +136,9 @@ Locus
In sequencing, loci refer to the potential origins of a read or fragment, such as a gene, exon, or intergenic region.
Accurate identification of loci is critical for mapping reads and understanding the genomic or transcriptomic context of the data.

+Messenger RNA (mRNA)
+A nucleotide sequence that has been read from a gene and serves as a blueprint for a protein.
+
MuData
A Python package for multimodal annotated data matrices that builds on {term}`AnnData`.
The primary data structure in the scverse ecosystem for multimodal data.
@@ -158,37 +161,37 @@ Poisson distribution
Discrete probability distribution denoting the probability of a specified number of events occurring in a fixed interval of time or space with the events occurring independently at a known constant mean rate.

Promoter
-Sequence of DNA to which proteins bind to initiate and control transcription.
+Sequence of DNA to which proteins bind (e.g. RNA polymerase and transcription factors) to initiate and control transcription.

Pseudotime
Latent and therefore unobserved dimension reflecting cells' progression through transitions.
Pseudotime is usually related to real time events, but not necessarily the same.

RNA
Ribonucleic acid (RNA) is a single-stranded nucleic acid present in all living cells that encodes and regulates gene expression.
-Unlike DNA, RNA can be highly dynamic, acting as a messenger (mRNA) to carry genetic instructions, a structural or catalytic component (rRNA, snRNA), or a regulator of gene expression (miRNA, siRNA, lncRNA).
+Unlike DNA, RNA can be highly dynamic, acting as a messenger ({term}`mRNA <Messenger RNA (mRNA)>`) to carry genetic instructions, a structural or catalytic component (rRNA, snRNA), or a regulator of gene expression (miRNA, siRNA, lncRNA).
RNA plays a central role in transcription, translation, and cellular responses, making it essential for understanding gene regulation, development, and disease.

RNA velocity
-RNA velocity measures the rate of change in gene expression by comparing the ratio of unspliced (pre-mRNA) to spliced (mature) mRNA transcripts in single-cell RNA sequencing data.
+RNA velocity measures the rate of change in gene expression by comparing the ratio of unspliced (pre-{term}`mRNA <Messenger RNA (mRNA)>`) to spliced (mature) mRNA transcripts in single-cell RNA sequencing data.
This ratio provides insight into whether genes are being actively transcribed (increasing expression) or degraded (decreasing expression), allowing researchers to predict the future state of cells.
The concept leverages the fact that pre-mRNA signals indicate new transcription while mature mRNA levels reflect steady-state expression, enabling inference of cellular trajectory and developmental dynamics.

-RT-qPCR
-Quantitative reverse transcription {term}`PCR` (RT-qPCR) monitors the amplification of a targeted {term}`DNA` molecule during the PCR.
-
SAM
SAM (Sequence Alignment/Map) files are tab-delimited text files that store sequencing alignment data, showing how sequencing reads map to a reference genome.
Each line in a SAM file contains information about a single read alignment, including the read sequence, base quality scores, mapping position, and mapping quality.

-scanpy
+Scanpy
A Python package for single-cell analysis in Python by scverse.

-scverse
+Scverse
A consortium for fundamental single-cell tools in the life sciences that are maintaining computational analysis tools like scanpy, muon and scvi-tools.
See: https://scverse.org/

-signal-to-noise ratio
+Sequencing
+Sequencing is the process of deciphering the order of DNA nucleotides.
+
+Signal-to-noise ratio
A measure of the clarity of a signal relative to background noise.
In sequencing, the signal represents the detectable information derived from the DNA or RNA molecules being sequenced, while the noise includes random errors or unwanted signals that can obscure or distort the true data.
A high signal-to-noise ratio (SNR) indicates that the signal is strong and reliable compared to the noise, resulting in better data quality.
@@ -199,7 +202,7 @@ Spike-in RNA

Splice Junctions
Locations where introns are removed, and exons are joined together in a mature RNA transcript during RNA splicing.
-These junctions occur at specific nucleotide sequences and are critical for the proper assembly of functional mRNA.
+These junctions occur at specific nucleotide sequences and are critical for the proper assembly of functional {term}`mRNA <Messenger RNA (mRNA)>`.
This, for example, enables the estimation of PCR duplication rates (see {term}`amplification bias <Amplification bias>`), which leads to error correction and increases accuracy.

Untranslated Region (UTR)
-A segment of an mRNA transcript that is transcribed but not translated into protein.
+A segment of an {term}`mRNA <Messenger RNA (mRNA)>` transcript that is transcribed but not translated into protein.
UTRs are located at both ends of the coding sequence.