theislab
diff --git a/changelog.d/371.added.md b/changelog.d/371.added.md
Lines changed: 1 addition & 0 deletions
diff --git a/jupyter-book/_static/images/raw_data_processing/alignment_vs_mapping.key b/jupyter-book/_static/images/raw_data_processing/alignment_vs_mapping.key
-658 KB
diff --git a/jupyter-book/_static/images/raw_data_processing/overview_raw_data_processing.key b/jupyter-book/_static/images/raw_data_processing/overview_raw_data_processing.key
-2.34 MB
diff --git a/jupyter-book/_static/images/scrna_seq/quantifying_gene_expression.png b/jupyter-book/_static/images/scrna_seq/quantifying_gene_expression.png
383 KB
diff --git a/jupyter-book/cellular_structure/annotation.ipynb b/jupyter-book/cellular_structure/annotation.ipynb
Lines changed: 2 additions & 2 deletions
diff --git a/jupyter-book/cellular_structure/clustering.ipynb b/jupyter-book/cellular_structure/clustering.ipynb
Lines changed: 1 addition & 1 deletion
diff --git a/jupyter-book/cellular_structure/integration.ipynb b/jupyter-book/cellular_structure/integration.ipynb
Lines changed: 1 addition & 1 deletion
diff --git a/jupyter-book/chromatin_accessibility/introduction.ipynb b/jupyter-book/chromatin_accessibility/introduction.ipynb
Lines changed: 1 addition & 1 deletion
diff --git a/jupyter-book/conditions/perturbation_modeling.ipynb b/jupyter-book/conditions/perturbation_modeling.ipynb
Lines changed: 1 addition & 1 deletion
diff --git a/jupyter-book/glossary.md b/jupyter-book/glossary.md
Lines changed: 16 additions & 13 deletions
@@ -0,0 +1 @@
+Update scRNA-seq chapter and add a paragraph on quantification of gene expression ([#371](https://github.com/theislab/single-cell-best-practices/pull/371)) <sub>@LuisHeinzlmeier</sub>
@@ -19,7 +19,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "To understand your data better and make use of existing knowledge, it is important to figure out the \"cellular identity\" of each of the cells in your data. The process of labeling groups of cells in your data based on known (or sometimes unknown) cellular phenotypes is called \"cell annotation\". Whereas there are many ways to annotate your cells (e.g. based on batch, disease, sex and more), in this notebook we will focus on the annotation of \"cell types\".<br>\n",
+    "To understand your data better and make use of existing knowledge, it is important to figure out the \"cellular identity\" of each of the cells in your data. The process of labeling groups of cells in your data based on known (or sometimes unknown) cellular phenotypes is called \"cell annotation\". Whereas there are many ways to annotate your cells (e.g. based on {term}`batch <batch effect>`, disease, sex and more), in this notebook we will focus on the annotation of \"cell types\".<br>\n",
     "So what is a cell type? Biologists use the term cell type to denote a cellular phenotype that is robust across datasets, identifiable based on expression of specific markers (i.e. proteins or gene transcripts), and often linked to specific functions. For example, a plasma B cell is a type of white blood cell that secretes antibodies used to fight pathogens and it can be identified using specific markers. Knowing which cell types are in your sample is essential in understanding your data. For example, knowing that there are specific immune cell types in a tumor or unusual hematopoietic stem cells in your bone marrow sample can be a valuable insight into the disease you might be studying.<br>\n",
     "However, like with any categorization the size of categories and the borders drawn between them are partly subjective and can change over time, e.g. because new technologies allow for a higher resolution view of cells, or because specific \"sub-phenotypes\" that were not considered biologically meaningful are found to have important biological implications (see e.g. {cite}`anno:KadurLakshminarasimhaMurthy2022`). Cell types are therefore often further classified into \"subtypes\" or \"cell states\" (e.g. activated versus resting) and some researchers use the term \"cell identity\" to avoid this sometimes arbitrary distinction of cell types, cell subtypes and cell states. For a more detailed discussion of this topic, we recommend the review by Wagner et al. {cite}`anno:Wagner2016` and the recently published review by Zeng {cite}`anno:ZENG20222739`.<br>\n",
     "Similarly, multiple cell types can be part of a single continuum, where one cell type might transition or differentiate into another. For example, in hematopoiesis cells differentiate from a stem cell into a specific immune cell type. Although hard borders between early and late stages of this differentiation are often drawn, the state of these cells can more accurately be described by the differentiation coordinate between the less and more differentiated cellular phenotypes. We will discuss differentiation and cellular trajectories in subsequent chapters.<br>\n",
@@ -807,7 +807,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "You might notice that the annotation of B1 B cells is difficult, with none of the clusters expressing all the B1 B markers and several clusters expressing some of the markers. We often see that markers that work for one dataset do not work as well for others. This can be due to differences in sequencing depth, but also due to other sources of variation between datasets or samples. "
+    "You might notice that the annotation of B1 B cells is difficult, with none of the clusters expressing all the B1 B markers and several clusters expressing some of the markers. We often see that markers that work for one dataset do not work as well for others. This can be due to differences in {term}`sequencing` depth, but also due to other sources of variation between datasets or samples. "
    ]
   },
   {
 
@@ -24,7 +24,7 @@
    "id": "99163c2d",
    "metadata": {},
    "source": [
-    "Preprocessing and visualization enabled us to describe our scRNA-seq dataset and reduce its dimensionality. Up to this point, we embedded and visualized cells to understand the underlying properties of our dataset. However, they are still rather abstractly defined. The next natural step in single-cell analysis is the identification of cellular structure in the dataset. \n",
+    "Preprocessing and visualization enabled us to describe our scRNA-{term}`seq <sequencing>` dataset and reduce its dimensionality. Up to this point, we embedded and visualized cells to understand the underlying properties of our dataset. However, they are still rather abstractly defined. The next natural step in single-cell analysis is the identification of cellular structure in the dataset. \n",
     "\n",
     "In scRNA-seq data analysis, we describe cellular structure in our dataset with finding cell identities that relate to known cell states or cell cycle stages. This process is usually called cell identity annotation. For this purpose, we structure cells into clusters to infer the identity of similar cells. Clustering itself is a common unsupervised machine learning problem. \n",
     "We can derive clusters by minimizing the intra-cluster distance in the reduced expression space. In this case, the expression space determines the gene expression similarity of cells with respect to a dimensionality-reduced representation. This lower dimensional representation is, for example, determined with a principal-component analysis and the similarity scoring is then based on Euclidean distances. \n",
 
@@ -17,7 +17,7 @@
     "(cellular-structure-integration-key-takeaway-1)=\n",
     "## Motivation\n",
     "\n",
-    "A central challenge in most scRNA-seq data analyses is presented by batch effects. Batch effects are changes in measured expression levels that are the result of handling cells in distinct groups or “batches”. For example, a batch effect can arise if two labs have taken samples from the same cohort, but these samples are dissociated differently. If Lab A optimizes its dissociation protocol to dissociate cells in the sample while minimizing the stress on them, and Lab B does not, then it is likely that the cells in the data from the group B will express more stress-linked genes (JUN, JUNB, FOS, etc. see {cite}`Van_den_Brink2017-si`) even if the cells had the same profile in the original tissue. In general, the origins of batch effects are diverse and difficult to pin down. Some batch effect sources might be technical such as differences in sample handling, experimental protocols, or sequencing depths, but biological effects such as donor variation, tissue, or sampling location are also often interpreted as a batch effect {cite}`Luecken2021-jo`. Whether or not biological factors should be considered batch effects can depend on the experimental design and the question being asked. Removing batch effects is crucial to enable joint analysis that can focus on finding common structure in the data across batches and enable us to perform queries across datasets. Often it is only after removing these effects that rare cell populations can be identified that were previously obscured by differences between batches. Enabling queries across datasets allows us to ask questions that could not be answered by analysing individual datasets, such as _Which cell types express SARS-CoV-2 entry factors and how does this expression differ between individuals?_ {cite}`Muus2021-ti`."
+    "A central challenge in most scRNA-seq data analyses is presented by batch effects. Batch effects are changes in measured expression levels that are the result of handling cells in distinct groups or “batches”. For example, a batch effect can arise if two labs have taken samples from the same cohort, but these samples are dissociated differently. If Lab A optimizes its dissociation protocol to dissociate cells in the sample while minimizing the stress on them, and Lab B does not, then it is likely that the cells in the data from the group B will express more stress-linked genes (JUN, JUNB, FOS, etc. see {cite}`Van_den_Brink2017-si`) even if the cells had the same profile in the original tissue. In general, the origins of batch effects are diverse and difficult to pin down. Some batch effect sources might be technical such as differences in sample handling, experimental protocols, or {term}`sequencing` depths, but biological effects such as donor variation, tissue, or sampling location are also often interpreted as a batch effect {cite}`Luecken2021-jo`. Whether or not biological factors should be considered batch effects can depend on the experimental design and the question being asked. Removing batch effects is crucial to enable joint analysis that can focus on finding common structure in the data across batches and enable us to perform queries across datasets. Often it is only after removing these effects that rare cell populations can be identified that were previously obscured by differences between batches. Enabling queries across datasets allows us to ask questions that could not be answered by analysing individual datasets, such as _Which cell types express SARS-CoV-2 entry factors and how does this expression differ between individuals?_ {cite}`Muus2021-ti`."
    ]
   },
   {
 
@@ -18,7 +18,7 @@
         "(chromatin-accessibility-introduction-key-takeaway-1)=\n",
         "## Motivation\n",
         "\n",
-        "Every cell of an organism shares the same DNA with the same set of functional units referred to as genes. With this in mind, what determines the tremendous diversity of cells reaching from natural killer cells of the immune system to neurons transmitting electrochemical signals throughout the body? In the previous chapters, we saw that cell identity and function can be inferred from gene expression profiles in each cell. The control of gene expression is driven by a complex interplay of regulatory mechanisms such as DNA methylation, histone modifications, and transcription factor activity. {term}`Chromatin` accessibility largely reflects the combined regulatory state of a cell, serving as an orthogonal layer of information to mRNA levels describing cell identity. Furthermore, exploring the chromatin accessibility profile enables additional insights into gene regulatory mechanisms and cell differentiation processes that might not be captured by scRNA-seq data."
+        "Every cell of an organism shares the same DNA with the same set of functional units referred to as genes. With this in mind, what determines the tremendous diversity of cells reaching from natural killer cells of the immune system to neurons transmitting electrochemical signals throughout the body? In the previous chapters, we saw that cell identity and function can be inferred from gene expression profiles in each cell. The control of gene expression is driven by a complex interplay of regulatory mechanisms such as DNA methylation, histone modifications, and transcription factor activity. {term}`Chromatin` accessibility largely reflects the combined regulatory state of a cell, serving as an orthogonal layer of information to {term}`mRNA <Messenger RNA (mRNA)>` levels describing cell identity. Furthermore, exploring the chromatin accessibility profile enables additional insights into gene regulatory mechanisms and cell differentiation processes that might not be captured by scRNA-seq data."
       ]
     },
     {
 
@@ -1990,7 +1990,7 @@
     "\n",
     "<img src=\"../_static/images/conditions/eccite.png\" alt=\"ECCITE-seq Overview\" class=\"bg-primary mb-1\" width=\"800px\">\n",
     "\n",
-    "ECCITE-seq overview. mRNA is measured together with surface protein expression using antibody derived tags. Biological replicates are resolved through hashtag-derived oligonucleotides. The assignment of the guide RNAs is done using guide-derived oligonucleotides. Image obtained from (https://cite-seq.com/eccite-seq).\n",
+    "ECCITE-seq overview. {term}`mRNA <Messenger RNA (mRNA)>` is measured together with surface protein expression using antibody derived tags. Biological replicates are resolved through hashtag-derived oligonucleotides. The assignment of the guide RNAs is done using guide-derived oligonucleotides. Image obtained from (https://cite-seq.com/eccite-seq).\n",
     "\n",
     ":::"
    ]
 
@@ -73,7 +73,7 @@ Complementary DNA (cDNA)
 Demultiplexing
     The process of determining which sequencing reads belong to which cell using {term}`barcodes <Barcode>`.
 
-directed graph
+Directed graph
     A directed graph (or digraph) is a graph consisting of a set of nodes (vertices) connected by edges (arcs), where each edge has a direction indicating a one-way relationship between nodes.
 
 DNA
@@ -90,7 +90,7 @@ Downstream analysis
 
 Dropout
     A gene with low expression that is observed in one cell, but not in other cells of the same {term}`cell type <Cell type>`.
-    The reason for dropouts are commonly low amounts of mRNA expression in cells and the general stochasticity of mRNA expression.
+    Dropouts are commonly caused by low amounts of {term}`mRNA <Messenger RNA (mRNA)>` expression in cells and the general stochasticity of mRNA expression.
     Dropouts are one of the reasons why scRNA-seq data is sparse.
 
 Drop-seq
@@ -136,6 +136,9 @@ Locus
     In sequencing, loci refer to the potential origins of a read or fragment, such as a gene, exon, or intergenic region.
     Accurate identification of loci is critical for mapping reads and understanding the genomic or transcriptomic context of the data.
 
+Messenger RNA (mRNA)
+    An RNA molecule that is transcribed from a gene and serves as a blueprint for a protein.
+
 MuData
     A Python package for multimodal annotated data matrices that builds on {term}`AnnData`.
     The primary data structure in the scverse ecosystem for multimodal data.
@@ -158,37 +161,37 @@ Poisson distribution
     Discrete probability distribution denoting the probability of a specified number of events occurring in a fixed interval of time or space with the events occurring independently at a known constant mean rate.
 
 Promoter
-    Sequence of DNA to which proteins bind to initiate and control transcription.
+    Sequence of DNA to which proteins (e.g. RNA polymerase and transcription factors) bind to initiate and control transcription.
 
 Pseudotime
     Latent and therefore unobserved dimension reflecting cells' progression through transitions.
     Pseudotime is usually related to real time events, but not necessarily the same.
 
 RNA
     Ribonucleic acid (RNA) is a single-stranded nucleic acid present in all living cells that encodes and regulates gene expression.
-    Unlike DNA, RNA can be highly dynamic, acting as a messenger (mRNA) to carry genetic instructions, a structural or catalytic component (rRNA, snRNA), or a regulator of gene expression (miRNA, siRNA, lncRNA).
+    Unlike DNA, RNA can be highly dynamic, acting as a messenger ({term}`mRNA <Messenger RNA (mRNA)>`) to carry genetic instructions, a structural or catalytic component (rRNA, snRNA), or a regulator of gene expression (miRNA, siRNA, lncRNA).
     RNA plays a central role in transcription, translation, and cellular responses, making it essential for understanding gene regulation, development, and disease.
 
 RNA velocity
-    RNA velocity measures the rate of change in gene expression by comparing the ratio of unspliced (pre-mRNA) to spliced (mature) mRNA transcripts in single-cell RNA sequencing data.
+    RNA velocity measures the rate of change in gene expression by comparing the ratio of unspliced (pre-{term}`mRNA <Messenger RNA (mRNA)>`) to spliced (mature) mRNA transcripts in single-cell RNA sequencing data.
     This ratio provides insight into whether genes are being actively transcribed (increasing expression) or degraded (decreasing expression), allowing researchers to predict the future state of cells.
     The concept leverages the fact that pre-mRNA signals indicate new transcription while mature mRNA levels reflect steady-state expression, enabling inference of cellular trajectory and developmental dynamics.
 
-RT-qPCR
-    Quantitative reverse transcription {term}`PCR` (RT-qPCR) monitors the amplification of a targeted {term}`DNA` molecule during the PCR.
-
 SAM
     SAM (Sequence Alignment/Map) files are tab-delimited text files that store sequencing alignment data, showing how sequencing reads map to a reference genome.
     Each line in a SAM file contains information about a single read alignment, including the read sequence, base quality scores, mapping position, and mapping quality.
 
-scanpy
+Scanpy
     A Python package for single-cell analysis in Python by scverse.
 
-scverse
+Scverse
     A consortium for fundamental single-cell tools in the life sciences that are maintaining computational analysis tools like scanpy, muon and scvi-tools.
     See: https://scverse.org/
 
-signal-to-noise ratio
+Sequencing
+    Sequencing is the process of deciphering the order of DNA nucleotides.
+
+Signal-to-noise ratio
     A measure of the clarity of a signal relative to background noise.
     In sequencing, the signal represents the detectable information derived from the DNA or RNA molecules being sequenced, while the noise includes random errors or unwanted signals that can obscure or distort the true data.
     A high signal-to-noise ratio (SNR) indicates that the signal is strong and reliable compared to the noise, resulting in better data quality.
@@ -199,7 +202,7 @@ Spike-in RNA
 
 Splice Junctions
     Locations where introns are removed, and exons are joined together in a mature RNA transcript during RNA splicing.
-    These junctions occur at specific nucleotide sequences and are critical for the proper assembly of functional mRNA.
+    These junctions occur at specific nucleotide sequences and are critical for the proper assembly of functional {term}`mRNA <Messenger RNA (mRNA)>`.
 
 Trajectory inference
     Also known as pseudotemporal ordering.
@@ -210,6 +213,6 @@ Unique Molecular Identifier (UMI)
     This, for example, enables the estimation of PCR duplication rates (see {term}`amplification bias <Amplification bias>`), which leads to error correction and increases accuracy.
 
 Untranslated Region (UTR)
-    A segment of an mRNA transcript that is transcribed but not translated into protein.
+    A segment of an {term}`mRNA <Messenger RNA (mRNA)>` transcript that is transcribed but not translated into protein.
     UTRs are located at both ends of the coding sequence.
 ```