Skip to content

Commit c05bfa0

Browse files
committed
Updates to RNAseq pipeline document
- Add inputs and outputs for Step 9 sections - fix annotations table lines
1 parent cf6bde3 commit c05bfa0

File tree

1 file changed

+99
-16
lines changed

1 file changed

+99
-16
lines changed

RNAseq/Pipeline_GL-DPPD-7101_Versions/GL-DPPD-7101-G.md

Lines changed: 99 additions & 16 deletions
Original file line numberDiff line numberDiff line change
@@ -1102,7 +1102,7 @@ echo "*: ${rRNA_count} rRNA entries removed." > *_rRNA_counts.txt
11021102
## 9. Normalize Read Counts and Perform Differential Gene Expression Analysis
11031103

11041104
> Note: DGE Analysis is performed twice with different sets of input files:
1105-
> 1. Using RSEM genes.results files (*genes.results, output from [Step 8a](#8a-count-aligned-reads-with-rsem)))
1105+
> 1. Using RSEM genes.results files (*genes.results, output from [Step 8a](#8a-count-aligned-reads-with-rsem))
11061106
> 2. Using rRNA-removed RSEM genes.results files (*rRNA_removed.genes.results, output from [Step 8dii](#8dii-filter-rrna-genes-from-rsem-genes-results))
11071107
11081108
<br>
@@ -1195,14 +1195,14 @@ DGE_output="/path/to/DGE/output/directory"
11951195

11961196
### Pull in the GeneLab annotation table (GL-DPPD-7110-A_annotations.csv) file ###
11971197

1198-
org_table_link <- "https://raw.githubusercontent.com/nasa/GeneLab_Data_Processing/master/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110/GL-DPPD-7110_annotations.csv"
1198+
org_table_link <- "https://raw.githubusercontent.com/nasa/GeneLab_Data_Processing/master/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv"
11991199

12001200
org_table <- read.table(org_table_link, sep = ",", header = TRUE)
12011201

12021202

12031203
### Define the link to the GeneLab annotation table for the organism of interest ###
12041204

1205-
annotations_link <- org_table[org_table$name == organism, "genelab_annots_link"]
1205+
annotations_link <- org_table[org_table$species == organism, "genelab_annots_link"]
12061206

12071207

12081208
### Set your working directory to the directory where you will execute your DESeq2 script from ###
@@ -1211,6 +1211,17 @@ setwd(file.path(work_dir))
12111211

12121212
```
12131213

1214+
**Input Data:**
1215+
1216+
1217+
* {GLDS-Accession-ID}_bulkRNASeq_v{version}_runsheet.csv (runsheet, output from [Step 9a](#9a-create-sample-runsheet))
1218+
* `organism` (name of organism samples were derived from, found in the species column of the [GL-DPPD-7110-A_annotations.csv](../../GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) file)
1219+
1220+
**Output Data:**
1221+
1222+
* `runsheet_path` (variable containing path to runsheet created in [Step 9a](#9a-create-sample-runsheet))
1223+
* `annotations_link` (variable containing URL to GeneLab gene annotation table for the organism)
1224+
12141225
<br>
12151226

12161227
### 9c. Configure Metadata, Sample Grouping, and Group Comparisons
@@ -1265,6 +1276,16 @@ rm(contrast.names)
12651276

12661277
```
12671278

1279+
**Input Data:**
1280+
1281+
* `runsheet_path` (variable containing path to runsheet created in [Step 9a](#9a-create-sample-runsheet))
1282+
1283+
**Output Data:**
1284+
1285+
* `study` (data frame containing sample condition values)
1286+
* `group` (named vector indicating group membership for each sample)
1287+
* `contrasts` (matrix defining pairwise comparisons between groups)
1288+
12681289
<br>
12691290

12701291
### 9d. Import RSEM GeneCounts
@@ -1296,9 +1317,17 @@ if (dim(txi.rsem$counts)[2] != nrow(study)) {
12961317

12971318
### Add 1 to genes with lengths of zero - needed to make DESeqDataSet object ###
12981319
txi.rsem$length[txi.rsem$length == 0] <- 1
1299-
13001320
```
13011321

1322+
**Input Data:**
1323+
1324+
* `study` (data frame containing sample condition values, output from [Step 9c](#9c-create-study-group-and-contrasts))
1325+
1326+
**Output Data:**
1327+
1328+
* `txi.rsem` (list containing imported RSEM data including counts, transcript lengths, and gene lengths)
1329+
* `samples` (vector of sample names from study data frame)
1330+
13021331
<br>
13031332

13041333
### 9e. Perform DGE Analysis
@@ -1388,7 +1417,7 @@ if (length(grep("ERCC-", rownames(dds))) != 0) {
13881417
}
13891418

13901419
### Perform DESeq analysis ###
1391-
dds <- DESeq(dds, parallel = TRUE, BPPARAM = BPPARAM)
1420+
dds <- DESeq(dds, parallel = TRUE, BPPARAM = MulticoreParam(workers = 4))
13921421

13931422
### Generate normalized counts ###
13941423
normCounts <- as.data.frame(counts(dds, normalized = TRUE))
@@ -1403,6 +1432,20 @@ res_lrt <- results(dds_lrt)
14031432

14041433
```
14051434

1435+
**Input Data:**
1436+
1437+
* `group` (named vector indicating group membership for each sample, output from [Step 9c](#9c-configure-metadata-sample-grouping-and-group-comparisons))
1438+
* `txi.rsem` (imported RSEM data containing counts matrix, output from [Step 9d](#9d-import-rsem-genecounts))
1439+
1440+
**Output Data:**
1441+
1442+
* `dds` (DESeq2 data object containing normalized counts and experimental design)
1443+
* `normCounts` (data frame of normalized count values + 1)
1444+
* `VSTCounts` (data frame of variance stabilized transformed counts)
1445+
* `dds_lrt` (DESeq2 data object from likelihood ratio test)
1446+
* `res_lrt` (results object from likelihood ratio test)
1447+
* `sampleTable` (data frame containing sample condition information, with technical replicates handled)
1448+
14061449
<br>
14071450

14081451
### 9f. Add Statistics and Gene Annotations to DGE Results
@@ -1432,7 +1475,7 @@ compute_contrast <- function(i) {
14321475
}
14331476

14341477
### Use bplapply to compute results in parallel ###
1435-
res_list <- bplapply(1:dim(contrasts)[2], compute_contrast, BPPARAM = BPPARAM)
1478+
res_list <- bplapply(1:dim(contrasts)[2], compute_contrast, BPPARAM = MulticoreParam(workers = 4))
14361479

14371480
### Combine the list of data frames into a single data frame ###
14381481
res_df <- do.call(cbind, res_list)
@@ -1483,6 +1526,29 @@ output_table <- output_table %>%
14831526

14841527
```
14851528

1529+
**Input Data:**
1530+
1531+
* `normCounts` (data frame of normalized count values, output from [Step 9e](#9e-perform-dge-analysis))
1532+
* `res_lrt` (results object from likelihood ratio test, output from [Step 9e](#9e-perform-dge-analysis))
1533+
* `contrasts` (matrix defining pairwise comparisons, output from [Step 9c](#9c-configure-metadata-sample-grouping-and-group-comparisons))
1534+
* `annotations_link` (variable containing URL to GeneLab annotation table, output from [Step 9b](#9b-environment-set-up))
1535+
1536+
**Output Data:**
1537+
1538+
* `output_table` (data frame containing the following columns:
1539+
- Gene identifier column (ENSEMBL or TAIR for plant studies)
1540+
- Normalized counts for each sample
1541+
- Log2 fold change, test statistic, p-value and adjusted p-value for each pairwise comparison
1542+
- All.mean (mean across all samples)
1543+
- All.stdev (standard deviation across all samples)
1544+
- LRT.p.value (likelihood ratio test adjusted p-value)
1545+
- For each experimental group:
1546+
- Group.Mean_(group) (mean within group)
1547+
- Group.Stdev_(group) (standard deviation within group))
1548+
- Additional organism-specific gene annotations columns
1549+
1550+
<br>
1551+
14861552
### 9g. Export DGE Tables
14871553

14881554
```R
@@ -1517,19 +1583,36 @@ sessionInfo()
15171583

15181584

15191585
**Input Data:**
1520-
1521-
- *runsheet.csv file (table containing metadata required for analysis, output from [step 9a](#9a-create-sample-runsheet))
1522-
- [GL-DPPD-7110_annotations.csv](../../GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110/GL-DPPD-7110_annotations.csv) (csv file containing link to GeneLab annotations)
1523-
- *genes.results (RSEM counts per gene, output from [Step 8a](#8a-count-aligned-reads-with-rsem)) or *_rRNA_removed.genes.results (RSEM counts per gene with rRNA entries removed, output from [Step 8d](#8d-remove-rrna-from-rsem-counts))
1586+
* `contrasts` (matrix defining pairwise comparisons between experimental groups from [Step 9c](#9c-create-study-group-and-contrasts))
1587+
* `txi.rsem` (imported RSEM count data from [Step 9d](#9d-import-rsem-genecounts))
1588+
* `normCounts` (normalized count values from [Step 9e](#9e-perform-dge-analysis))
1589+
* `VSTCounts` (variance stabilized transformed counts from [Step 9e](#9e-perform-dge-analysis))
1590+
* `sampleTable` (data frame mapping samples to experimental conditions from [Step 9e](#9e-perform-dge-analysis))
1591+
* `output_table` (DGE output table from [Step 9f](#9f-add-statistics-and-gene-annotations-to-dge-results))
15241592

15251593
**Output Data:**
15261594

1527-
- **RSEM_Unnormalized_Counts_GLbulkRNAseq.csv** (table containing raw RSEM gene counts for each sample)
1528-
- **Normalized_Counts_GLbulkRNAseq.csv** (table containing normalized gene counts for each sample)
1529-
- **VST_Counts_GLbulkRNAseq.csv** (table containing VST normalized gene counts for each sample)
1530-
- **SampleTable_GLbulkRNAseq.csv** (table containing samples and their respective groups)
1531-
- **differential_expression_GLbulkRNAseq.csv** (table containing normalized counts for each sample, group statistics, DESeq2 DGE results for each pairwise comparison, and gene annotations)
1532-
- **contrasts_GLbulkRNAseq.csv** (table containing all pairwise comparisons)
1595+
* **RSEM_Unnormalized_Counts_GLbulkRNAseq.csv** (raw RSEM gene counts for all samples and technical replicates)
1596+
* **Normalized_Counts_GLbulkRNAseq.csv** (normalized gene counts)
1597+
* **VST_Counts_GLbulkRNAseq.csv** (variance stabilized transformed counts)
1598+
* **SampleTable_GLbulkRNAseq.csv** (1-column table defining sample-to-experimental-condition mapping:
1599+
- Row names: Sample names
1600+
- Column 'condition': Factor indicating experimental group/condition)
1601+
* **contrasts_GLbulkRNAseq.csv** (2-row table defining pairwise group comparisons:
1602+
- Column names: One pairwise group comparison (e.g., "GroupAvGroupB")
1603+
- Row 1 contains the numerator group
1604+
- Row 2 contains the denominator group)
1605+
* **differential_expression_GLbulkRNAseq.csv** (complete DGE results table containing the following columns:
1606+
- Gene identifier column (ENSEMBL or TAIR for plant studies)
1607+
- Normalized counts for each sample
1608+
- Log2 fold change, test statistic, p-value and adjusted p-value for each pairwise comparison
1609+
- All.mean (mean across all samples)
1610+
- All.stdev (standard deviation across all samples)
1611+
- LRT.p.value (likelihood ratio test adjusted p-value)
1612+
- For each experimental group:
1613+
- Group.Mean_(group) (mean within group)
1614+
- Group.Stdev_(group) (standard deviation within group)
1615+
- Additional organism-specific gene annotations columns)
15331616

15341617
> Note: Datasets with technical replicates are handled by collapsing them such that the minimum number of equal technical replicates is retained across all samples. Before normalization, the counts of technical replicates are summed to combine them into a single sample representing the biological replicate.
15351618

0 commit comments

Comments
 (0)