Commit 08886ad

LuisHeinzlmeier and Luis authored
Update scRNA-seq chapter and add a paragraph on quantification of gene expression (#371)
* add dropdowns for Strength and limitations
* shorten the first paragraph
* links for new developments in sequencing
* some notes for new developments in scRNA-Seq
* notes for new developments
* bullet points
* first fact for bio paragraph
* first text for bio chapter
* ideas and first sentences for bio-chapter
* continue writing the text
* safe process
* the current state of my ideas and the rough structure of the paragraph
* adding text from splicing to protein/RNA ratio
* missing changes to .bib
* before the finalization of the paragraph
* improve flow and wording of 'central dogma in numbers'
* save current state of table progress
* change to version pre-commit/action@v3.0.0 and insert new chapter to flow of the text
* fix pdfhtml error
* restructure scRNA-Seq protocols
* add PIP-seq dropdown
* add some more sentences and the questions
* add figure for central dogma in numbers
* correctly link batch effects
* update key takeaways
* first improvements
* add a video reference and new terms to glossary
* properties of negative binomial distribution
* remove central dogma and improve figure
* link seealso and new terms in the other chapters
* add url and doi to bibliography
* add a few new lines
* transform to .ipynb
* transform to .ipynb
* small typo in .png
* remove .key files
* make key terms bold in key takeaways
* add batch effect to chapters
* add changelog
* add link to next chapter

---------

Co-authored-by: Luis <ge34lah@mytum.de>
1 parent 20b4397 commit 08886ad

24 files changed: +1613 -498 lines changed

changelog.d/371.added.md

Lines changed: 1 addition & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -0,0 +1 @@
1+
Update scRNA-seq chapter and add a paragraph on quantification of gene expression ([#371](https://github.com/theislab/single-cell-best-practices/pull/371)) <sub>@LuisHeinzlmeier</sub>
Binary file not shown.
Binary file not shown.
383 KB

jupyter-book/cellular_structure/annotation.ipynb

Lines changed: 2 additions & 2 deletions
@@ -19,7 +19,7 @@
1919
"cell_type": "markdown",
2020
"metadata": {},
2121
"source": [
22-
"To understand your data better and make use of existing knowledge, it is important to figure out the \"cellular identity\" of each of the cells in your data. The process of labeling groups of cells in your data based on known (or sometimes unknown) cellular phenotypes is called \"cell annotation\". Whereas there are many ways to annotate your cells (e.g. based on batch, disease, sex and more), in this notebook we will focus on the annotation of \"cell types\".<br>\n",
22+
"To understand your data better and make use of existing knowledge, it is important to figure out the \"cellular identity\" of each of the cells in your data. The process of labeling groups of cells in your data based on known (or sometimes unknown) cellular phenotypes is called \"cell annotation\". Whereas there are many ways to annotate your cells (e.g. based on {term}`batch <batch effect>`, disease, sex and more), in this notebook we will focus on the annotation of \"cell types\".<br>\n",
2323
"So what is a cell type? Biologists use the term cell type to denote a cellular phenotype that is robust across datasets, identifiable based on expression of specific markers (i.e. proteins or gene transcripts), and often linked to specific functions. For example, a plasma B cell is a type of white blood cell that secretes antibodies used to fight pathogens and it can be identified using specific markers. Knowing which cell types are in your sample is essential in understanding your data. For example, knowing that there are specific immune cell types in a tumor or unusual hematopoietic stem cells in your bone marrow sample can be a valuable insight into the disease you might be studying.<br>\n",
2424
"However, like with any categorization the size of categories and the borders drawn between them are partly subjective and can change over time, e.g. because new technologies allow for a higher resolution view of cells, or because specific \"sub-phenotypes\" that were not considered biologically meaningful are found to have important biological implications (see e.g. {cite}`anno:KadurLakshminarasimhaMurthy2022`). Cell types are therefore often further classified into \"subtypes\" or \"cell states\" (e.g. activated versus resting) and some researchers use the term \"cell identity\" to avoid this sometimes arbitrary distinction of cell types, cell subtypes and cell states. For a more detailed discussion of this topic, we recommend the review by Wagner et al. {cite}`anno:Wagner2016` and the recently published review by Zeng {cite}`anno:ZENG20222739`.<br>\n",
2525
"Similarly, multiple cell types can be part of a single continuum, where one cell type might transition or differentiate into another. For example, in hematopoiesis cells differentiate from a stem cell into a specific immune cell type. Although hard borders between early and late stages of this differentiation are often drawn, the state of these cells can more accurately be described by the differentiation coordinate between the less and more differentiated cellular phenotypes. We will discuss differentiation and cellular trajectories in subsequent chapters.<br>\n",
@@ -807,7 +807,7 @@
807807
"cell_type": "markdown",
808808
"metadata": {},
809809
"source": [
810-
"You might notice that the annotation of B1 B cells is difficult, with none of the clusters expressing all the B1 B markers and several clusters expressing some of the markers. We often see that markers that work for one dataset do not work as well for others. This can be due to differences in sequencing depth, but also due to other sources of variation between datasets or samples. "
810+
"You might notice that the annotation of B1 B cells is difficult, with none of the clusters expressing all the B1 B markers and several clusters expressing some of the markers. We often see that markers that work for one dataset do not work as well for others. This can be due to differences in {term}`sequencing` depth, but also due to other sources of variation between datasets or samples. "
811811
]
812812
},
813813
{

jupyter-book/cellular_structure/clustering.ipynb

Lines changed: 1 addition & 1 deletion
@@ -24,7 +24,7 @@
2424
"id": "99163c2d",
2525
"metadata": {},
2626
"source": [
27-
"Preprocessing and visualization enabled us to describe our scRNA-seq dataset and reduce its dimensionality. Up to this point, we embedded and visualized cells to understand the underlying properties of our dataset. However, they are still rather abstractly defined. The next natural step in single-cell analysis is the identification of cellular structure in the dataset. \n",
27+
"Preprocessing and visualization enabled us to describe our scRNA-{term}`seq <sequencing>` dataset and reduce its dimensionality. Up to this point, we embedded and visualized cells to understand the underlying properties of our dataset. However, they are still rather abstractly defined. The next natural step in single-cell analysis is the identification of cellular structure in the dataset. \n",
2828
"\n",
2929
"In scRNA-seq data analysis, we describe cellular structure in our dataset with finding cell identities that relate to known cell states or cell cycle stages. This process is usually called cell identity annotation. For this purpose, we structure cells into clusters to infer the identity of similar cells. Clustering itself is a common unsupervised machine learning problem. \n",
3030
"We can derive clusters by minimizing the intra-cluster distance in the reduced expression space. In this case, the expression space determines the gene expression similarity of cells with respect to a dimensionality-reduced representation. This lower dimensional representation is, for example, determined with a principal-component analysis and the similarity scoring is then based on Euclidean distances. \n",
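
The context lines of this hunk describe deriving clusters by minimizing the intra-cluster distance in a reduced expression space, with similarity based on Euclidean distances. As a rough illustration only, the sketch below scores two candidate cluster assignments by their summed squared Euclidean distance to the cluster centroids. The 2-D coordinates stand in for a hypothetical dimensionality-reduced embedding (for example, two principal components), and the function names are assumptions made for this example; they are not taken from the chapter, which may use a different clustering method.

```python
from collections import defaultdict


def squared_euclidean(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def intra_cluster_distance(embedding, labels):
    """Sum of squared distances from each cell to the centroid of its cluster."""
    clusters = defaultdict(list)
    for point, label in zip(embedding, labels):
        clusters[label].append(point)
    total = 0.0
    for points in clusters.values():
        center = centroid(points)
        total += sum(squared_euclidean(p, center) for p in points)
    return total


# Hypothetical embedding of four cells (illustrative values only).
embedding = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]

good_labels = [0, 0, 1, 1]  # groups nearby cells together
bad_labels = [0, 1, 0, 1]   # mixes distant cells in each cluster

good_score = intra_cluster_distance(embedding, good_labels)
bad_score = intra_cluster_distance(embedding, bad_labels)
print(good_score < bad_score)
```

The assignment that groups neighbouring cells yields the smaller score, which is the quantity a distance-minimizing clustering would try to reduce.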

jupyter-book/cellular_structure/integration.ipynb

Lines changed: 1 addition & 1 deletion
@@ -17,7 +17,7 @@
1717
"(cellular-structure-integration-key-takeaway-1)=\n",
1818
"## Motivation\n",
1919
"\n",
20-
"A central challenge in most scRNA-seq data analyses is presented by batch effects. Batch effects are changes in measured expression levels that are the result of handling cells in distinct groups or “batches”. For example, a batch effect can arise if two labs have taken samples from the same cohort, but these samples are dissociated differently. If Lab A optimizes its dissociation protocol to dissociate cells in the sample while minimizing the stress on them, and Lab B does not, then it is likely that the cells in the data from the group B will express more stress-linked genes (JUN, JUNB, FOS, etc. see {cite}`Van_den_Brink2017-si`) even if the cells had the same profile in the original tissue. In general, the origins of batch effects are diverse and difficult to pin down. Some batch effect sources might be technical such as differences in sample handling, experimental protocols, or sequencing depths, but biological effects such as donor variation, tissue, or sampling location are also often interpreted as a batch effect {cite}`Luecken2021-jo`. Whether or not biological factors should be considered batch effects can depend on the experimental design and the question being asked. Removing batch effects is crucial to enable joint analysis that can focus on finding common structure in the data across batches and enable us to perform queries across datasets. Often it is only after removing these effects that rare cell populations can be identified that were previously obscured by differences between batches. Enabling queries across datasets allows us to ask questions that could not be answered by analysing individual datasets, such as _Which cell types express SARS-CoV-2 entry factors and how does this expression differ between individuals?_ {cite}`Muus2021-ti`."
20+
"A central challenge in most scRNA-seq data analyses is presented by batch effects. Batch effects are changes in measured expression levels that are the result of handling cells in distinct groups or “batches”. For example, a batch effect can arise if two labs have taken samples from the same cohort, but these samples are dissociated differently. If Lab A optimizes its dissociation protocol to dissociate cells in the sample while minimizing the stress on them, and Lab B does not, then it is likely that the cells in the data from the group B will express more stress-linked genes (JUN, JUNB, FOS, etc. see {cite}`Van_den_Brink2017-si`) even if the cells had the same profile in the original tissue. In general, the origins of batch effects are diverse and difficult to pin down. Some batch effect sources might be technical such as differences in sample handling, experimental protocols, or {term}`sequencing` depths, but biological effects such as donor variation, tissue, or sampling location are also often interpreted as a batch effect {cite}`Luecken2021-jo`. Whether or not biological factors should be considered batch effects can depend on the experimental design and the question being asked. Removing batch effects is crucial to enable joint analysis that can focus on finding common structure in the data across batches and enable us to perform queries across datasets. Often it is only after removing these effects that rare cell populations can be identified that were previously obscured by differences between batches. Enabling queries across datasets allows us to ask questions that could not be answered by analysing individual datasets, such as _Which cell types express SARS-CoV-2 entry factors and how does this expression differ between individuals?_ {cite}`Muus2021-ti`."
2121
]
2222
},
2323
{

jupyter-book/chromatin_accessibility/introduction.ipynb

Lines changed: 1 addition & 1 deletion
@@ -18,7 +18,7 @@
1818
"(chromatin-accessibility-introduction-key-takeaway-1)=\n",
1919
"## Motivation\n",
2020
"\n",
21-
"Every cell of an organism shares the same DNA with the same set of functional units referred to as genes. With this in mind, what determines the tremendous diversity of cells reaching from natural killer cells of the immune system to neurons transmitting electrochemical signals throughout the body? In the previous chapters, we saw that cell identity and function can be inferred from gene expression profiles in each cell. The control of gene expression is driven by a complex interplay of regulatory mechanisms such as DNA methylation, histone modifications, and transcription factor activity. {term}`Chromatin` accessibility largely reflects the combined regulatory state of a cell, serving as an orthogonal layer of information to mRNA levels describing cell identity. Furthermore, exploring the chromatin accessibility profile enables additional insights into gene regulatory mechanisms and cell differentiation processes that might not be captured by scRNA-seq data."
21+
"Every cell of an organism shares the same DNA with the same set of functional units referred to as genes. With this in mind, what determines the tremendous diversity of cells reaching from natural killer cells of the immune system to neurons transmitting electrochemical signals throughout the body? In the previous chapters, we saw that cell identity and function can be inferred from gene expression profiles in each cell. The control of gene expression is driven by a complex interplay of regulatory mechanisms such as DNA methylation, histone modifications, and transcription factor activity. {term}`Chromatin` accessibility largely reflects the combined regulatory state of a cell, serving as an orthogonal layer of information to {term}`mRNA <Messenger RNA (mRNA)>` levels describing cell identity. Furthermore, exploring the chromatin accessibility profile enables additional insights into gene regulatory mechanisms and cell differentiation processes that might not be captured by scRNA-seq data."
2222
]
2323
},
2424
{

jupyter-book/conditions/perturbation_modeling.ipynb

Lines changed: 1 addition & 1 deletion
@@ -1990,7 +1990,7 @@
19901990
"\n",
19911991
"<img src=\"../_static/images/conditions/eccite.png\" alt=\"ECCITE-seq Overview\" class=\"bg-primary mb-1\" width=\"800px\">\n",
19921992
"\n",
1993-
"ECCITE-seq overview. mRNA is measured together with surface protein expression using antibody derived tags. Biological replicates are resolved through hashtag-derived oligonucleotides. The assignment of the guide RNAs is done using guide-derived oligonucleotides. Image obtained from (https://cite-seq.com/eccite-seq).\n",
1993+
"ECCITE-seq overview. {term}`mRNA <Messenger RNA (mRNA)>` is measured together with surface protein expression using antibody derived tags. Biological replicates are resolved through hashtag-derived oligonucleotides. The assignment of the guide RNAs is done using guide-derived oligonucleotides. Image obtained from (https://cite-seq.com/eccite-seq).\n",
19941994
"\n",
19951995
":::"
19961996
]

jupyter-book/glossary.md

Lines changed: 16 additions & 13 deletions
@@ -73,7 +73,7 @@ Complementary DNA (cDNA)
7373
Demultiplexing
7474
The process of determining which sequencing reads belong to which cell using {term}`barcodes <Barcode>`.
7575
76-
directed graph
76+
Directed graph
7777
A directed graph (or digraph) is a graph consisting of a set of nodes (vertices) connected by edges (arcs), where each edge has a direction indicating a one-way relationship between nodes.
7878
7979
DNA
@@ -90,7 +90,7 @@ Downstream analysis
9090
9191
Dropout
9292
A gene with low expression that is observed in one cell, but not in other cells of the same {term}`cell type <Cell type>`.
93-
The reason for dropouts are commonly low amounts of mRNA expression in cells and the general stochasticity of mRNA expression.
93+
The reason for dropouts are commonly low amounts of {term}`mRNA <Messenger RNA (mRNA)>` expression in cells and the general stochasticity of mRNA expression.
9494
Dropouts are one of the reasons why scRNA-seq data is sparse.
9595
9696
Drop-seq
@@ -136,6 +136,9 @@ Locus
136136
In sequencing, loci refer to the potential origins of a read or fragment, such as a gene, exon, or intergenic region.
137137
Accurate identification of loci is critical for mapping reads and understanding the genomic or transcriptomic context of the data.
138138
139+
Messenger RNA (mRNA)
140+
A nucleotide sequence that has been read from a gene and serves as a blueprint for a protein.
141+
139142
MuData
140143
A Python package for multimodal annotated data matrices that builds on {term}`AnnData`.
141144
The primary data structure in the scverse ecosystem for multimodal data.
@@ -158,37 +161,37 @@ Poisson distribution
158161
Discrete probability distribution denoting the probability of a specified number of events occurring in a fixed interval of time or space with the events occurring independently at a known constant mean rate.
159162
160163
Promoter
161-
Sequence of DNA to which proteins bind to initiate and control transcription.
164+
Sequence of DNA to which proteins bind (e.g. RNA polymerase and transcription factors) to initiate and control transcription.
162165
163166
Pseudotime
164167
Latent and therefore unobserved dimension reflecting cells' progression through transitions.
165168
Pseudotime is usually related to real time events, but not necessarily the same.
166169
167170
RNA
168171
Ribonucleic acid (RNA) is a single-stranded nucleic acid present in all living cells that encodes and regulates gene expression.
169-
Unlike DNA, RNA can be highly dynamic, acting as a messenger (mRNA) to carry genetic instructions, a structural or catalytic component (rRNA, snRNA), or a regulator of gene expression (miRNA, siRNA, lncRNA).
172+
Unlike DNA, RNA can be highly dynamic, acting as a messenger ({term}`mRNA <Messenger RNA (mRNA)>`) to carry genetic instructions, a structural or catalytic component (rRNA, snRNA), or a regulator of gene expression (miRNA, siRNA, lncRNA).
170173
RNA plays a central role in transcription, translation, and cellular responses, making it essential for understanding gene regulation, development, and disease.
171174
172175
RNA velocity
173-
RNA velocity measures the rate of change in gene expression by comparing the ratio of unspliced (pre-mRNA) to spliced (mature) mRNA transcripts in single-cell RNA sequencing data.
176+
RNA velocity measures the rate of change in gene expression by comparing the ratio of unspliced (pre-{term}`mRNA <Messenger RNA (mRNA)>`) to spliced (mature) mRNA transcripts in single-cell RNA sequencing data.
174177
This ratio provides insight into whether genes are being actively transcribed (increasing expression) or degraded (decreasing expression), allowing researchers to predict the future state of cells.
175178
The concept leverages the fact that pre-mRNA signals indicate new transcription while mature mRNA levels reflect steady-state expression, enabling inference of cellular trajectory and developmental dynamics.
176179
177-
RT-qPCR
178-
Quantitative reverse transcription {term}`PCR` (RT-qPCR) monitors the amplification of a targeted {term}`DNA` molecule during the PCR.
179-
180180
SAM
181181
SAM (Sequence Alignment/Map) files are tab-delimited text files that store sequencing alignment data, showing how sequencing reads map to a reference genome.
182182
Each line in a SAM file contains information about a single read alignment, including the read sequence, base quality scores, mapping position, and mapping quality.
183183
184-
scanpy
184+
Scanpy
185185
A Python package for single-cell analysis in Python by scverse.
186186
187-
scverse
187+
Scverse
188188
A consortium for fundamental single-cell tools in the life sciences that are maintaining computational analysis tools like scanpy, muon and scvi-tools.
189189
See: https://scverse.org/
190190
191-
signal-to-noise ratio
191+
Sequencing
192+
Sequencing is the process of deciphering the order of DNA nucleotides.
193+
194+
Signal-to-noise ratio
192195
A measure of the clarity of a signal relative to background noise.
193196
In sequencing, the signal represents the detectable information derived from the DNA or RNA molecules being sequenced, while the noise includes random errors or unwanted signals that can obscure or distort the true data.
194197
A high signal-to-noise ratio (SNR) indicates that the signal is strong and reliable compared to the noise, resulting in better data quality.
@@ -199,7 +202,7 @@ Spike-in RNA
199202
200203
Splice Junctions
201204
Locations where introns are removed, and exons are joined together in a mature RNA transcript during RNA splicing.
202-
These junctions occur at specific nucleotide sequences and are critical for the proper assembly of functional mRNA.
205+
These junctions occur at specific nucleotide sequences and are critical for the proper assembly of functional {term}`mRNA <Messenger RNA (mRNA)>`.
203206
204207
Trajectory inference
205208
Also known as pseudotemporal ordering.
@@ -210,6 +213,6 @@ Unique Molecular Identifier (UMI)
210213
This, for example, enables the estimation of PCR duplication rates (see {term}`amplification bias <Amplification bias>`), which leads to error correction and increases accuracy.
211214
212215
Untranslated Region (UTR)
213-
A segment of an mRNA transcript that is transcribed but not translated into protein.
216+
A segment of an {term}`mRNA <Messenger RNA (mRNA)>` transcript that is transcribed but not translated into protein.
214217
UTRs are located at both ends of the coding sequence.
215218
```
