Skip to content

Commit cf3450b

Browse files
[NF_RCP] Miscellaneous fixes for NF_RCP 2.0.0 (#144)
* fix typos/bugs * fix vst norm counts file name * fix prokaryotes pipeline image * set align_star to 8 cpu default * add org version to protocol * [NF_RCP] Prokaryotes - update unmapped reads se/pe output in docs * Add docker notes post-processing: * update assay table updates * add raw multiqc col, remove non existent STAR rrnaRM file * fix duplicate col issue in update assay table * publish rrnarm counts * add rrnarm counts to assay table * change unmapped reads col name * fix pending columns * update assay table typo fix - featurecounts multiqc V&V: * feed trimming reports into trimmed reads multiqc, update vv * finish reads vv updates * fix adapter content check in vv * fix vv rsem script * fix vv dge log2fc check Output organization: * update docs to reflect trimmed multiqc update * fix dge script - account for rrnarma changes * add fixes for RNAseq outputs * chmod +x script * remove rrnarm specific output folders, move duplicate sample table and contrasts table * move raw and trimmed multiqcs to dedicated folder * move alignment multiqc reports to dedicated folder * move counts multiqc reports to dedicated output folder * move rseqc multiqcs to dedicated folder * fix typo on trimmed data sequence / fastq * add md5sum file filters
1 parent f646141 commit cf3450b

25 files changed

+1012
-403
lines changed

RNAseq/Pipeline_GL-DPPD-7101_Versions/GL-DPPD-7101-G.md

Lines changed: 9 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -89,7 +89,7 @@ DESeq2 Analysis Workflow
8989

9090
- Added parallel rRNA-removed DGE analysis:
9191
- Create filtered RSEM count files with rRNA features removed:
92-
- {sample}_rRNA_removed.genes.results
92+
- {sample}_rRNArm.genes.results
9393
- Normalize rRNA-removed counts
9494
- Perform DGE analysis using rRNA-removed counts
9595
- Output additional set of rRNA-removed counts and DGE results
@@ -310,7 +310,7 @@ fastqc -o /path/to/trimmed_fastqc/output/directory *.fastq.gz
310310
### 2c. Compile Trimmed Data QC
311311

312312
```bash
313-
multiqc --interactive -n trimmed_multiqc_GLbulkRNAseq -o /path/to/trimmed_multiqc/output/directory /path/to/directory/containing/trimmed_fastqc/files
313+
multiqc --interactive -n trimmed_multiqc_GLbulkRNAseq -o /path/to/trimmed_multiqc/output/directory /path/to/directory/containing/trimmed_fastqc/files /path/to/directory/containing/trimming_reports
314314

315315
zip -r trimmed_multiqc_GLbulkRNAseq_data.zip trimmed_multiqc_GLbulkRNAseq_data
316316
```
@@ -320,11 +320,13 @@ zip -r trimmed_multiqc_GLbulkRNAseq_data.zip trimmed_multiqc_GLbulkRNAseq_data
320320
- `--interactive` – force reports to use interactive plots
321321
- `-n` – prefix name for output files
322322
- `-o` – the output directory to store results
323-
- `/path/to/directory/containing/trimmed_fastqc/files` – the directory holding the output data from the FastQC run, provided as a positional argument
323+
- `/path/to/directory/containing/trimmed_fastqc/files` – the directory holding the output data from the fastqc run, provided as a positional argument
324+
- `/path/to/directory/containing/trimming_reports` – the directory containing the trimming reports from the trim/filter step, provided as a positional argument
324325

325326
**Input Data:**
326327

327328
- *fastqc.zip (FastQC data, output from [Step 2b](#2b-trimmed-data-qc))
329+
- *trimming_report.txt (trimming report, output from [Step 2a](#2a-trimfilter-raw-data))
328330

329331
**Output Data:**
330332

@@ -471,7 +473,7 @@ STAR --twopassMode Basic \
471473
- SJ.out.tab
472474
- *_STARtmp (directory containing the following:)
473475
- BAMsort (directory containing subdirectories that are empty – this was the location for temp files that were automatically removed after successful completion)
474-
- **\*Unmapped.out.mate1.fastq.gz, \*Unmapped.out.mate2.fastq.gz** (unmapped and partially mapped reads in fastq format)
476+
- **\*Unmapped.out.mate1, \*Unmapped.out.mate2** (unmapped and partially mapped reads in fastq format)
475477

476478
<br>
477479

@@ -1074,7 +1076,7 @@ grep -E 'gene_biotype "rRNA"|gene_type "rRNA"|gbkey "rRNA"' /path/to/annotation/
10741076
### Filter out rRNA entries ###
10751077
awk 'NR==FNR {ids[$1]=1; next} !($1 in ids)' \
10761078
*rrna_ensembl_ids.txt \
1077-
*.genes.results > *_rRNA_removed.genes.results
1079+
*.genes.results > *_rRNArm.genes.results
10781080

10791081
### Count removed rRNA entries ###
10801082
rRNA_count=$(awk 'NR==FNR {ids[$1]=1; next} $1 in ids' \
@@ -1088,7 +1090,7 @@ echo "*: ${rRNA_count} rRNA entries removed." > *_rRNA_counts.txt
10881090
- *rrna_ensembl_ids.txt (file containing list of gene IDs with rRNA features, output from [Step 8di](#8di-extract-rrna-gene-ids-from-gtf))
10891091

10901092
**Output Data:**
1091-
- **\*rRNA_removed.genes.results** (RSEM gene counts with rRNA entries removed)
1093+
- **\*rRNArm.genes.results** (RSEM gene counts with rRNA entries removed)
10921094
- *rRNA_counts.txt (Summary of number of rRNA entries removed)
10931095

10941096
<br>
@@ -1099,7 +1101,7 @@ echo "*: ${rRNA_count} rRNA entries removed." > *_rRNA_counts.txt
10991101

11001102
> Note: DGE Analysis is performed twice with different sets of input files:
11011103
> 1. Using RSEM genes.results files (*genes.results, output from [Step 8a](#8a-count-aligned-reads-with-rsem))
1102-
> 2. Using rRNA-removed RSEM genes.results files (*rRNA_removed.genes.results, output from [Step 8dii](#8dii-filter-rrna-genes-from-rsem-genes-results))
1104+
> 2. Using rRNA-removed RSEM genes.results files (*rRNArm.genes.results, output from [Step 8dii](#8dii-filter-rrna-genes-from-rsem-genes-results))
11031105
11041106
<br>
11051107

RNAseq/Pipeline_GL-DPPD-7115_Versions/GL-DPPD-7115.md

Lines changed: 6 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -246,7 +246,7 @@ fastqc -o /path/to/trimmed_fastqc/output/directory *.fastq.gz
246246
### 2c. Compile Trimmed Data QC
247247

248248
```bash
249-
multiqc --interactive -n trimmed_multiqc_GLbulkRNAseq -o /path/to/trimmed_multiqc/output/directory /path/to/directory/containing/trimmed_fastqc/files
249+
multiqc --interactive -n trimmed_multiqc_GLbulkRNAseq -o /path/to/trimmed_multiqc/output/directory /path/to/directory/containing/trimmed_fastqc/files /path/to/directory/containing/trimming_reports
250250

251251
zip -r trimmed_multiqc_GLbulkRNAseq_data.zip trimmed_multiqc_GLbulkRNAseq_data
252252
```
@@ -257,10 +257,12 @@ zip -r trimmed_multiqc_GLbulkRNAseq_data.zip trimmed_multiqc_GLbulkRNAseq_data
257257
- `-n` – prefix name for output files
258258
- `-o` – the output directory to store results
259259
- `/path/to/directory/containing/trimmed_fastqc/files` – the directory holding the output data from the fastqc run, provided as a positional argument
260+
- `/path/to/directory/containing/trimming_reports` – the directory containing the trimming reports from the trim/filter step, provided as a positional argument
260261

261262
**Input Data:**
262263

263264
- *fastqc.zip (FastQC data, output from [Step 2b](#2b-trimmed-data-qc))
265+
- *trimming_report.txt (trimming report, output from [Step 2a](#2a-trimfilter-raw-data))
264266

265267
**Output Data:**
266268

@@ -350,7 +352,9 @@ bowtie2 -x /path/to/bowtie2/index \
350352

351353
- *\.sam (alignments in SAM format)
352354
- **\*.bowtie2.log** (log file containing alignment statistics)
353-
- **\*.unmapped.fastq.gz** (unmapped reads in FASTQ format)
355+
- Unmapped reads (unmapped reads in FASTQ format)
356+
- **\*.unmapped.fastq.gz** (single-end)
357+
- **\*.unmapped.fastq.1.gz, .unmapped.fastq.2.gz** (paired-end)
354358

355359
<br>
356360

RNAseq/Workflow_Documentation/NF_RCP/README.md

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -104,6 +104,8 @@ We recommend installing Singularity on a system wide level as per the associated
104104
105105
> Note: Singularity is also available through [Anaconda](https://anaconda.org/conda-forge/singularity).
106106
107+
> Note: Alternatively, Docker can be used in place of Singularity. See the [Docker CE installation documentation](https://docs.docker.com/engine/install/).
108+
107109
<br>
108110
109111
---
@@ -150,6 +152,8 @@ export NXF_SINGULARITY_CACHEDIR=$(pwd)/singularity
150152
While in the location containing the `NF_RCP_2.0.0` directory that was downloaded in [step 2](#2-download-the-workflow-files), you are now able to run the workflow. Below are four examples of how to run the NF_RCP workflow:
151153
> Note: Nextflow commands use both single hyphen arguments (e.g. -help) that denote general nextflow arguments and double hyphen arguments (e.g. --reference_version) that denote workflow specific parameters. Take care to use the proper number of hyphens for each argument.
152154
155+
> Note: To use Docker instead of Singularity, use `-profile docker` in the Nextflow run command. Nextflow will automatically pull images as needed.
156+
153157
<br>
154158
155159
#### 4a. Approach 1: Run the workflow on a GeneLab RNAseq dataset with automatic retrieval of reference fasta and gtf files

RNAseq/Workflow_Documentation/NF_RCP/workflow_code/bin/dge_deseq2.Rmd

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -137,8 +137,8 @@ if (params$microbes) {
137137
full.names = TRUE
138138
)
139139
140-
# Remove "_rRNA_removed" from filenames for matching
141-
clean_filenames <- sub("_rRNA_removed", "", basename(files))
140+
# Remove "_rRNArm" from filenames for matching
141+
clean_filenames <- sub("_rRNArm", "", basename(files))
142142
143143
samples <- rownames(study)
144144
@@ -331,7 +331,7 @@ write.csv(
331331
write.csv(
332332
VSTCounts,
333333
file = file.path(params$output_directory,
334-
paste0("VST_Normalized_Counts",
334+
paste0("VST_Counts",
335335
params$output_filename_label, params$output_filename_suffix, ".csv"))
336336
)
337337
```

RNAseq/Workflow_Documentation/NF_RCP/workflow_code/bin/generate_md5sums.py

Lines changed: 66 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -24,12 +24,12 @@ def calculate_md5(filepath):
2424
return "ERROR"
2525

2626
def is_raw_file(filepath):
27-
"""Check if file is a raw FASTQ or raw multiqc file."""
27+
"""Check if file is a raw FASTQ or raw multiqc file (zip or html)."""
2828
# Match raw fastq files but not trimming reports
2929
if "/Fastq/" in filepath and filepath.endswith("raw.fastq.gz") and "_raw.fastq.gz_" not in filepath:
3030
return True
31-
# Match raw multiqc reports
32-
if "raw_multiqc" in filepath and filepath.endswith(".zip"):
31+
# Match raw multiqc reports (zip or html)
32+
if "raw_multiqc" in filepath and (filepath.endswith(".zip") or filepath.endswith(".html")):
3333
return True
3434
return False
3535

@@ -38,11 +38,57 @@ def should_include(filepath, outdir):
3838
# Skip files in VV_Logs
3939
if "/VV_Logs/" in filepath:
4040
return False
41-
4241
# Skip files in GeneLab except for qc_metrics
4342
if "/GeneLab/" in filepath and not filepath.endswith("qc_metrics" + args.assay_suffix + ".csv"):
4443
return False
45-
44+
# Skip any files with 'fastqc' in the path or filename (case-insensitive)
45+
if 'fastqc' in filepath.lower():
46+
return False
47+
# STAR: Only keep specific outputs
48+
star_keep_patterns = [
49+
re.compile(r"_Aligned\.toTranscriptome\.out\.bam$"),
50+
re.compile(r"_Log\.final\.out$"),
51+
re.compile(r"_SJ\.out\.tab$"),
52+
re.compile(r"_Unmapped\.out\.mate1$"),
53+
re.compile(r"_Unmapped\.out\.mate2$")
54+
]
55+
basename = os.path.basename(filepath)
56+
if any(pat.search(basename) for pat in star_keep_patterns):
57+
return True
58+
# If it's a STAR output but not in the keep list, filter it out
59+
star_output_keywords = [
60+
"Aligned.sortedByCoord.out.bam", "ReadsPerGene.out.tab", "Log.out", "Log.progress.out"
61+
]
62+
if any(keyword in basename for keyword in star_output_keywords):
63+
return False
64+
# Skip fixed STAR output files (case-insensitive)
65+
fixed_star_files = [
66+
"Log.final.out", "SJ.out.tab", "sjdbList.out.tab", "sjdbInfo.txt"
67+
]
68+
if basename.lower() in [f.lower() for f in fixed_star_files]:
69+
return False
70+
# RSeQC: Only keep MultiQC reports (zip or html)
71+
if "RSeQC_Analyses" in filepath:
72+
if "MultiQC_Reports" not in filepath:
73+
return False
74+
# In MultiQC_Reports, only allow .zip or .html
75+
if not (filepath.endswith('.zip') or filepath.endswith('.html')):
76+
return False
77+
# RSEM: Only keep .genes.results and .isoforms.results (including _rRNArm variants)
78+
if basename.endswith('.genes.results') or basename.endswith('.isoforms.results'):
79+
return True
80+
rsem_other_patterns = ['.cnt', '.model', '.theta']
81+
if any(basename.endswith(ext) for ext in rsem_other_patterns):
82+
return False
83+
# Skip ISA.zip
84+
if filepath.endswith("ISA.zip"):
85+
return False
86+
# Skip STAR_NumNonZeroGenes_GLbulkRNAseq.csv and RSEM_NumNonZeroGenes_GLbulkRNAseq.csv
87+
if os.path.basename(filepath) in [
88+
"STAR_NumNonZeroGenes_GLbulkRNAseq.csv",
89+
"RSEM_NumNonZeroGenes_GLbulkRNAseq.csv"
90+
]:
91+
return False
4692
return True
4793

4894
def main():
@@ -100,5 +146,20 @@ def main():
100146
print(f"Added {raw_count} files to {raw_md5_file}")
101147
print(f"Added {processed_count} files to {processed_md5_file}")
102148

149+
def dedup_file(filename):
150+
seen = set()
151+
lines = []
152+
with open(filename, 'r') as f:
153+
for line in f:
154+
key = line.split('\t', 1)[0] # dedup by basename
155+
if key not in seen:
156+
seen.add(key)
157+
lines.append(line)
158+
with open(filename, 'w') as f:
159+
f.writelines(lines)
160+
161+
dedup_file(raw_md5_file)
162+
dedup_file(processed_md5_file)
163+
103164
if __name__ == "__main__":
104165
main()

RNAseq/Workflow_Documentation/NF_RCP/workflow_code/bin/generate_protocol.py

Lines changed: 9 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -277,14 +277,14 @@ def generate_protocol_content(args, software_versions):
277277

278278
# Define organism to annotation package mapping using scientific names
279279
organism_annotation_packages = {
280-
"Arabidopsis thaliana": ("org.At.tair.db", "3.19.1"),
281-
"Caenorhabditis elegans": ("org.Ce.eg.db", "3.19.1"),
282-
"Drosophila melanogaster": ("org.Dm.eg.db", "3.19.1"),
283-
"Danio rerio": ("org.Dr.eg.db", "3.19.1"),
284-
"Homo sapiens": ("org.Hs.eg.db", "3.19.1"),
285-
"Mus musculus": ("org.Mm.eg.db", "3.19.1"),
286-
"Rattus norvegicus": ("org.Rn.eg.db", "3.19.1"),
287-
"Saccharomyces cerevisiae": ("org.Sc.sgd.db", "3.19.1")
280+
"arabidopsis_thaliana": ("org.At.tair.db", "3.19.1"),
281+
"caenorhabditis_elegans": ("org.Ce.eg.db", "3.19.1"),
282+
"drosophila_melanogaster": ("org.Dm.eg.db", "3.19.1"),
283+
"danio_rerio": ("org.Dr.eg.db", "3.19.1"),
284+
"homo_sapiens": ("org.Hs.eg.db", "3.19.1"),
285+
"mus_musculus": ("org.Mm.eg.db", "3.19.1"),
286+
"rattus_norvegicus": ("org.Rn.eg.db", "3.19.1"),
287+
"saccharomyces_cerevisiae": ("org.Sc.sgd.db", "3.19.1")
288288
}
289289

290290
# List of organisms that use custom annotation packages via AnnotationForge
@@ -318,7 +318,7 @@ def generate_protocol_content(args, software_versions):
318318
# Format the organism name for lookup
319319
organism_formatted = ""
320320
if hasattr(args, 'organism') and args.organism:
321-
organism_formatted = args.organism.replace('_', ' ').title()
321+
organism_formatted = args.organism.replace(' ', '_').replace('-', '_').lower()
322322

323323
# Build gene annotations sentence
324324
gene_annotations_text = f"Gene annotations were assigned using the custom annotation tables generated in-house as detailed in GL-DPPD-7110-A (https://github.com/nasa/GeneLab_Data_Processing/blob/master/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md), with STRINGdb (version {stringdb_version}) and PANTHER.db (version {pantherdb_version})"

RNAseq/Workflow_Documentation/NF_RCP/workflow_code/bin/parse_multiqc.py

Lines changed: 18 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -86,15 +86,30 @@ def main(osd_num, paired_end, assay_suffix, mode):
8686
'num_uniquely_aligned', 'pct_uniquely_aligned', 'pct_multi_aligned', 'pct_filtered', 'pct_unalignable'
8787
]
8888

89+
# Make a set of fieldnames for fast lookup
90+
fieldnames_set = set(fieldnames)
91+
8992
output_filename = f'qc_metrics{assay_suffix}.csv'
9093
with open(output_filename, mode='w', newline='') as f:
9194
writer = csv.DictWriter(f, fieldnames=fieldnames)
9295
writer.writeheader()
9396

9497
for sample in samples:
95-
fields = {k: v for d in [i[sample] for i in multiqc_data if sample in i] for k, v in d.items()}
96-
97-
writer.writerow({'osd_num': 'OSD-' + osd_num, 'sample': sample, **metadata, **fields})
98+
# Collect all fields for this sample
99+
all_fields = {}
100+
for data_source in multiqc_data:
101+
if sample in data_source:
102+
for k, v in data_source[sample].items():
103+
# Only keep fields that are in the fieldnames list
104+
if k in fieldnames_set:
105+
all_fields[k] = v
106+
else:
107+
# Optionally add debug output to see which fields are being skipped
108+
# print(f"Skipping field not in fieldnames: {k}")
109+
pass
110+
111+
# Write the row with filtered fields
112+
writer.writerow({'osd_num': 'OSD-' + osd_num, 'sample': sample, **metadata, **all_fields})
98113

99114

100115
def get_metadata(osd_num):

RNAseq/Workflow_Documentation/NF_RCP/workflow_code/bin/sort_into_subdirectories_by_sample.py

100644100755
File mode changed.

0 commit comments

Comments
 (0)