
Commit c773c39

Merge pull request #139 from torres-alexis/DEV_RNAseq_vG_CURRENT
RNAseq Updates
2 parents dd68d0c + 2f444cf commit c773c39


59 files changed: +2386 -1107 lines changed

RNAseq/Pipeline_GL-DPPD-7101_Versions/GL-DPPD-7101-G.md

Lines changed: 9 additions & 8 deletions
Original file line number | Diff line number | Diff line change
@@ -4,7 +4,7 @@
44
55
---
66

7-
**Date:** January 28, 2025 [CHANGE TO BASELINE DATE]
7+
**Date:** February 19, 2025
88
**Revision:** G
99
**Document Number:** GL-DPPD-7101-G
1010

@@ -1113,10 +1113,10 @@ echo "*: ${rRNA_count} rRNA entries removed." > *_rRNA_counts.txt
11131113

11141114
### 9a. Create Sample RunSheet
11151115

1116-
> Note: Rather than running the command below to create the runsheet needed for processing, the runsheet may also be created manually by following the [file specification](../Workflow_Documentation/NF_RCP-F/examples/runsheet/README.md).
1116+
> Note: Rather than running the command below to create the runsheet needed for processing, the runsheet may also be created manually by following the [file specification](../Workflow_Documentation/NF_RCP/examples/runsheet/README.md).
11171117
11181118
```bash
1119-
### Download the *ISA.zip file from the GeneLab Repository ###
1119+
### Download the *ISA.zip file from the Open Science Data Repository ###
11201120

11211121
dpt-get-isa-archive \
11221122
--accession GLDS-###
@@ -1144,7 +1144,7 @@ dpt-isa-to-runsheet --accession GLDS-### \
11441144

11451145
**Output Data:**
11461146

1147-
- *ISA.zip (compressed ISA directory containing Investigation, Study, and Assay (ISA) metadata files for the respective GLDS dataset, used to define sample groups - the *ISA.zip file is located in the [OSDR repository]([https://genelab-data.ndc.nasa.gov/genelab/projects](https://osdr.nasa.gov/bio/repo/)) under 'Files' -> 'Study Metadata Files')
1147+
- *ISA.zip (compressed ISA directory containing Investigation, Study, and Assay (ISA) metadata files for the respective GLDS dataset, used to define sample groups - the *ISA.zip file is located in the [OSDR repository](https://osdr.nasa.gov/bio/repo/) under 'Files' -> 'Study Metadata Files')
11481148

11491149
- **{GLDS-Accession-ID}_bulkRNASeq_v{version}_runsheet.csv** (table containing metadata required for processing, version denotes the dp_tools schema used to specify the metadata to extract from the ISA archive)
11501150

@@ -1192,7 +1192,7 @@ organism <- "organism_that_samples_were_derived_from"
11921192

11931193
runsheet_path="/path/to/directory/containing/runsheet.csv/file" ## This is the runsheet created in Step 9a above
11941194
work_dir="/path/to/working/directory/where/script/is/executed/from"
1195-
counts_dir="/path/to/directory/containing/RSEM/counts/files"
1195+
input_counts="/path/to/directory/containing/RSEM/counts/files"
11961196
norm_output="/path/to/normalized/counts/output/directory"
11971197
DGE_output="/path/to/DGE/output/directory"
11981198

@@ -1297,7 +1297,7 @@ rm(contrast.names)
12971297
```R
12981298
### Import RSEM gene count data ###
12991299
files <- list.files(
1300-
path = counts_dir,
1300+
path = input_counts,
13011301
pattern = ".genes.results",
13021302
full.names = TRUE
13031303
)
@@ -1592,9 +1592,10 @@ sessionInfo()
15921592

15931593

15941594
**Input Data:**
1595-
* `sampleTable` (data frame mapping samples to groups, output from [Step 9e](#9e-perform-dge-analysis))
1596-
* `contrasts` (matrix defining pairwise comparisons between groups, output from [Step 9c](#9c-create-study-group-and-contrasts))
1595+
1596+
* `contrasts` (matrix defining pairwise comparisons between groups, output from [Step 9c](#9c-configure-metadata-sample-grouping-and-group-comparisons))
15971597
* `txi.rsem` (imported RSEM count data, output from [Step 9d](#9d-import-rsem-genecounts))
1598+
* `sampleTable` (data frame mapping samples to groups, output from [Step 9e](#9e-perform-dge-analysis))
15981599
* `normCounts` (normalized counts, output from [Step 9e](#9e-perform-dge-analysis))
15991600
* `VSTCounts` (variance stabilized transformed counts, output from [Step 9e](#9e-perform-dge-analysis))
16001601
* `output_table` (DGE output table, output from [Step 9f](#9f-add-statistics-and-gene-annotations-to-dge-results))

RNAseq/Pipeline_GL-DPPD-7115_Versions/GL-DPPD-7115.md

Lines changed: 437 additions & 282 deletions
Large diffs are not rendered by default.

RNAseq/Workflow_Documentation/NF_RCP/README.md

Lines changed: 25 additions & 28 deletions
@@ -144,23 +144,23 @@ While in the location containing the `NF_RCP-G_2.0.0` directory that was downloa
144144
```bash
145145
nextflow run NF_RCP-G_2.0.0/main.nf \
146146
-profile singularity \
147-
--gldsAccession OSD-194
147+
--accession OSD-194
148148
```
149149
150150
<br>
151151
152152
#### 4b. Approach 2: Run the workflow on a GeneLab RNAseq dataset using local reference fasta and gtf files
153153
154-
> Note: The `--ref_source` and `--ensemblVersion` parameters should match the reference source and version number of the local reference fasta and gtf files used
154+
> Note: The `--reference_source` and `--reference_version` parameters should match the reference source and version number of the local reference fasta and gtf files used
155155
156156
```bash
157157
nextflow run NF_RCP-G_2.0.0/main.nf \
158158
-profile singularity \
159-
--gldsAccession OSD-194 \
160-
--ensemblVersion 107 \
161-
--ref_source ensembl \
162-
--ref_fasta </path/to/fasta> \
163-
--ref_gtf </path/to/gtf>
159+
--accession OSD-194 \
160+
--reference_version 107 \
161+
--reference_source ensembl \
162+
--reference_fasta </path/to/fasta> \
163+
--reference_gtf </path/to/gtf>
164164
```
165165
166166
<br>
@@ -172,8 +172,8 @@ nextflow run NF_RCP-G_2.0.0/main.nf \
172172
```bash
173173
nextflow run NF_RCP-G_2.0.0/main.nf \
174174
-profile singularity \
175-
--gldsAccession output_directory \
176-
--runsheetPath </path/to/runsheet>
175+
--accession output_directory \
176+
--runsheet_path </path/to/runsheet>
177177
```
178178
179179
<br>
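
The examples in this README pass `--accession OSD-194`, and the updated parameter description accepts either `OSD-###` or `GLDS-###`. A minimal sketch of a pre-flight check for that format follows. The function name `is_osdr_accession` is hypothetical and not part of the workflow. It assumes the accession is a prefix, a hyphen, and digits only. It does not apply to Approach 3, where `--accession` carries an output directory name instead.

```shell
#!/bin/sh
# Hypothetical helper, not part of NF_RCP: checks that a value looks like
# OSD-<digits> or GLDS-<digits> before it is passed to --accession.
is_osdr_accession() {
  printf '%s\n' "$1" | grep -Eq '^(OSD|GLDS)-[0-9]+$'
}

# Example: classify a few candidate values.
for acc in OSD-194 GLDS-194 output_directory; do
  if is_osdr_accession "$acc"; then
    echo "$acc: looks like an OSDR accession"
  else
    echo "$acc: not an OSDR accession"
  fi
done
```

A value that fails the check is not necessarily wrong: in Approach 3 it is simply the name used for the output directory.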
@@ -184,43 +184,39 @@ nextflow run NF_RCP-G_2.0.0/main.nf \
184184
185185
* `-profile` - Specifies the configuration profile(s) to load, `singularity` instructs Nextflow to setup and use singularity for all software called in the workflow
186186
187-
* `--gldsAccession OSD-###` – specifies the OSD dataset to process through the RCP workflow (replace ### with the OSD number)
188-
> Note: The primary output directory will be titled "OSD-###"
189-
190-
* `--gldsAccession output_directory` – specifies the output directory name to use when processing a non-OSD dataset, as indicated in [Approach 3 above](#4c-approach-3-run-the-workflow-on-a-non-glds-dataset-using-a-user-created-runsheet)
187+
* `--accession [OSD-###|GLDS-###]` – specifies the OSDR dataset to process through the RCP workflow (replace ### with the OSD or GLDS number)
188+
> Note: The primary output directory will be named after the accession input, e.g. "OSD-194" or "GLDS-194"
191189
192190
193191
<br>
194192
195193
**Additional Required Parameters For [Approach 2](#4b-approach-2-run-the-workflow-on-a-genelab-rnaseq-dataset-using-local-ensembl-reference-fasta-and-gtf-files):**
196194
197-
* `--ensemblVersion` - specifies the Ensembl version to use for the reference genome (Ensembl release `107` is used in this example)
195+
* `--reference_version` - specifies the Ensembl version to use for the reference genome (Ensembl release `107` is used in this example)
198196
199-
* `--ref_source` - specifies the source of the reference files used (the source indicated in the Approach 2 example is `ensembl`)
197+
* `--reference_source` - specifies the source of the reference files used (the source indicated in the Approach 2 example is `ensembl`)
200198
201-
* `--ref_fasta` - specifices the path to a local fasta file
199+
* `--reference_fasta` - specifices the path to a local fasta file
202200
203-
* `--ref_gtf` - specifices the path to a local gtf file
201+
* `--reference_gtf` - specifices the path to a local gtf file
204202
205-
> Note: If the local reference files specified are different than the Ensembl reference files used to create the [GeneLab annotations table](https://github.com/nasa/GeneLab_Data_Processing/blob/master/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110/GL-DPPD-7110_annotations.csv), additional gene annotations associated with any Ensembl/TAIR IDs from the specified files that are not shared in the GeneLab annotations will not be added to the DGE output table(s).
203+
> Note: If the local reference files specified are different than the reference files used to create the [GeneLab annotations table](https://github.com/nasa/GeneLab_Data_Processing/blob/master/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110/GL-DPPD-7110_annotations.csv), additional gene annotations associated with any gene IDs from the specified files that are not shared in the GeneLab annotations will not be added to the DGE output table(s).
206204
207205
<br>
208206
209207
**Optional Parameters:**
210208
211-
* `--skipVV` - skip the automated V&V processes (Default: the automated V&V processes are active)
209+
* `--skip_vv` - skip the automated V&V processes (Default: the automated V&V processes are active)
212210
213-
* `--outputDir` - specifies the directory to save the raw and processed data files (Default: files are saved in the launch directory)
211+
* `--outdir` - specifies the directory to save the raw and processed data files (Default: files are saved in a folder named `results` created in the launch directory)
214212
215213
* `--force_single_end` - forces the analysis to use single end processing; for paired end datasets, this means only R1 is used; for single end datasets, this should have no effect
216214
217-
* `--stageLocal TRUE|FALSE` - TRUE = download the raw reads files for the OSD dataset indicated, FALSE = disable raw reads download and processing (Default: TRUE)
218-
219-
* `--referenceStorePath` - specifies the directory to store the Ensembl fasta and gtf files (Default: within the directory structure created by default in the launch directory)
215+
* `--reference_store_path` - specifies the directory to store the Ensembl fasta and gtf files (Default: within the directory structure created by default in the launch directory)
220216
221-
* `--derivedStorePath` - specifies the directory to store the tool-specific indices created during processing (Default: within the directory structure created by default in the launch directory)
217+
* `--derived_store_path` - specifies the directory to store the tool-specific indices created during processing (Default: within the directory structure created by default in the launch directory)
222218
223-
* `--runsheetPath` - specifies the path to a local runsheet (Default: a runsheet is automatically generated using the metadata on the GeneLab Repository for the OSD dataset being processed)
219+
* `--runsheet_path` - specifies the path to a local runsheet (Default: a runsheet is automatically generated using the metadata on the GeneLab Repository for the OSD dataset being processed)
224220
> This is required when prcessing a non-OSD dataset as indicated in [Approach 3 above](#4c-approach-3-run-the-workflow-on-a-non-glds-dataset-using-a-user-created-runsheet)
225221
226222
<br>
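
The parameter renames in this hunk and the preceding ones can be summarized as a small migration sketch for existing launch scripts. The function name `migrate_rcp_flags` is hypothetical and not part of the workflow. The mapping covers only the renames visible in this diff. `--stageLocal` is removed here without a replacement, so the sketch leaves it untouched for manual review.

```shell
#!/bin/sh
# Hypothetical helper, not part of NF_RCP: rewrites the pre-rename parameter
# names shown on the "-" lines of this diff to the names on the "+" lines.
# Reads text on stdin and writes the rewritten text to stdout.
migrate_rcp_flags() {
  sed \
    -e 's/--gldsAccession/--accession/g' \
    -e 's/--ensemblVersion/--reference_version/g' \
    -e 's/--ref_source/--reference_source/g' \
    -e 's/--ref_fasta/--reference_fasta/g' \
    -e 's/--ref_gtf/--reference_gtf/g' \
    -e 's/--runsheetPath/--runsheet_path/g' \
    -e 's/--skipVV/--skip_vv/g' \
    -e 's/--outputDir/--outdir/g' \
    -e 's/--referenceStorePath/--reference_store_path/g' \
    -e 's/--derivedStorePath/--derived_store_path/g'
}

# Example: convert an old-style argument string.
printf '%s\n' '--gldsAccession OSD-194 --ensemblVersion 107 --ref_source ensembl' \
  | migrate_rcp_flags
```

Because the function filters stdin, an existing script can be previewed with `migrate_rcp_flags < old_launch.sh` (file name illustrative) before anything is overwritten. Note that the default behaviour of `--outdir` also changed, with files now saved in a `results` folder in the launch directory, so a renamed flag alone may not reproduce the old output layout.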
@@ -272,8 +268,9 @@ Standard Nextflow resource usage logs are also produced as follows:
272268
**Nextflow Resource Usage Logs**
273269
274270
- Output:
275-
- Resource_Usage/execution_report_{timestamp}.html (an html report that includes metrics about the workflow execution including computational resources and exact workflow process commands)
276-
- Resource_Usage/execution_timeline_{timestamp}.html (an html timeline for all processes executed in the workflow)
277-
- Resource_Usage/execution_trace_{timestamp}.txt (an execution tracing file that contains information about each process executed in the workflow, including: submission time, start time, completion time, cpu and memory used, machine-readable output)
271+
- nextflow_logs/execution_report_{timestamp}.html (an html report that includes metrics about the workflow execution including computational resources and exact workflow process commands)
272+
- nextflow_logs/execution_timeline_{timestamp}.html (an html timeline for all processes executed in the workflow)
273+
- nextflow_logs/execution_trace_{timestamp}.txt (an execution tracing file that contains information about each process executed in the workflow, including: submission time, start time, completion time, cpu and memory used, machine-readable output)
274+
- nextflow_info/pipeline_dag_{timestamp}.html (a visualization of the workflow process DAG)
278275
279276
<br>

RNAseq/Workflow_Documentation/NF_RCP/examples/runsheet/README.md

Lines changed: 1 addition & 1 deletion
@@ -18,7 +18,7 @@
1818
| Sample Name | string | Sample Name, added as a prefix to sample-specific processed data output files. Should not include spaces or weird characters. | Mmus_BAL-TAL_LRTN_BSL_Rep1_B7 |
1919
| has_ERCC | bool | Set to True if ERCC spike-ins are included in the samples. This ensures ERCC normalized DGE is performed in addition to standard DGE. | True |
2020
| paired_end | bool | Set to True if the samples were sequenced as paired-end. If set to False, samples are assumed to be single-end. | False |
21-
| organism | string | Species name used to map to the appropriate gene annotations file. Supported species can be found in the `species` column of the [GL-DPPD-7110_annotations.csv](../../../../../GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110/GL-DPPD-7110_annotations.csv) file. | Mus musculus |
21+
| organism | string | Species name used to map to the appropriate gene annotations file. Supported species can be found in the `species` column of the [GL-DPPD-7110-A_annotations.csv](../../../../../GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) file. | Mus musculus |
2222
| read1_path | string (url or local path) | Location of the raw reads file. For paired-end data, this specifies the forward reads fastq.gz file. | /my/data/sample_1.fastq.gz |
2323
| read2_path | string (url or local path) | Location of the raw reads file. For paired-end data, this specifies the reverse reads fastq.gz file. For single-end data, this column should be omitted. | /my/data/sample_2.fastq.gz |
2424
| Factor Value[<name, e.g. Spaceflight>] | string | A set of one or more columns specifying the experimental group the sample belongs to. In the simplest form, a column named 'Factor Value[group]' is sufficient. | Space Flight |

RNAseq/Workflow_Documentation/NF_RCP/workflow_code/bin/assess_strandedness.py

File mode changed: 100644 → 100755.
