Commit e2a7ff2

Merge pull request #148 from torres-alexis/null_annotations_bug_fix
Misc fixes
2 parents d5dc4ed + d178627 commit e2a7ff2

30 files changed, +675 -486 lines changed

RNAseq/Pipeline_GL-DPPD-7101_Versions/GL-DPPD-7101-G.md

Lines changed: 9 additions & 5 deletions
@@ -59,9 +59,7 @@ Software Updates:
 | scipy | 1.9.1 | 1.15.1 |
 
 STAR Alignment
-- Added unaligned reads FASTQ output file(s) via STAR `-outReadsUnmapped Fastq`:
-- {sample}_Unmapped.out.mate1
-- {sample}_Unmapped.out.mate2
+- Added unaligned reads FASTQ output file(s) via STAR `--outReadsUnmapped Fastx`
 
 RSeQC Analysis
 - Updated inner_distance.py invocation to use a lower minimum value to account for longer read lengths
@@ -420,9 +418,14 @@ STAR --twopassMode Basic \
 --outSAMheaderHD @HD VN:1.4 SO:coordinate \
 --outFileNamePrefix /path/to/STAR/output/directory/<sample_id> \
 --outReadsUnmapped Fastx \
+--genomeLoad NoSharedMemory \
 --readFilesIn /path/to/trimmed_forward_reads \
 /path/to/trimmed_reverse_reads # only needed for PE studies
 
+mv <sample_id>_Unmapped.out.mate1 <sample_id>_R1_unmapped.fastq # Only needed for PE studies
+mv <sample_id>_Unmapped.out.mate2 <sample_id>_R2_unmapped.fastq # Only needed for PE studies
+# mv <sample_id>_Unmapped.out.mate1 <sample_id>_unmapped.fastq # Only needed for SE studies
+gzip *_unmapped.fastq
 ```
 
 **Parameter Definitions:**
@@ -448,7 +451,8 @@ STAR --twopassMode Basic \
 - `--quantMode` – specifies the type(s) of quantification desired; the `TranscriptomeSAM` option instructs STAR to output a separate sam/bam file containing alignments to the transcriptome and the `GeneCounts` option instructs STAR to output a tab delimited file containing the number of reads per gene
 - `--outSAMheaderHD` – indicates a header line for the sam/bam file
 - `--outFileNamePrefix` – specifies the path to and prefix for the output file names; for GeneLab the prefix is the sample id
-- `outReadsUnmapped` - specifies how to output unmapped and partially mapped reads (where only one mate of a paired-end read is mapped); the `Fastx` option outputs unmapped reads in separate fasta/fastq files named Unmapped.out.mate1 and Unmapped.out.mate2
+- `--outReadsUnmapped` – specifies how to output unmapped and partially mapped reads (where only one mate of a paired-end read is mapped); the `Fastx` option outputs unmapped reads in separate fastq files
+- `--genomeLoad` – controls how the genome index is loaded into memory; `NoSharedMemory` specifies that each job will have its own private copy of the genome rather than using shared memory. This is the only option compatible with `--twopassMode Basic`.
 - `--readFilesIn` – path to input read 1 (forward read) and read 2 (reverse read); for paired-end reads, read 1 and read 2 should be separated by a space; for single-end reads only read 1 should be indicated
 
 **Input Data:**
@@ -473,7 +477,7 @@ STAR --twopassMode Basic \
 - SJ.out.tab
 - *_STARtmp (directory containing the following:)
 - BAMsort (directory containing subdirectories that are empty – this was the location for temp files that were automatically removed after successful completion)
-- **\*Unmapped.out.mate1, \*Unmapped.out.mate2** (unmapped and partially mapped reads in fastq format)
+- **\*unmapped.fastq.gz** (unmapped and partially mapped reads)
 
 <br>
 
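The rename-and-compress step added to the STAR command block above can also be expressed in Python. The following is a minimal sketch, not part of the commit: the helper name `rename_and_compress_unmapped` and its arguments are hypothetical, and it assumes STAR wrote `<sample_id>_Unmapped.out.mate1` (and `mate2` for paired-end data) into a single output directory.

```python
import gzip
import shutil
from pathlib import Path


def rename_and_compress_unmapped(out_dir, sample_id, paired_end=True):
    """Rename STAR's Unmapped.out.mate* files and gzip them.

    Hypothetical helper that mirrors the mv/gzip commands in the diff.
    Returns the sorted names of the .gz files it produced.
    """
    out_dir = Path(out_dir)
    if paired_end:
        mapping = {
            f"{sample_id}_Unmapped.out.mate1": f"{sample_id}_R1_unmapped.fastq",
            f"{sample_id}_Unmapped.out.mate2": f"{sample_id}_R2_unmapped.fastq",
        }
    else:
        mapping = {
            f"{sample_id}_Unmapped.out.mate1": f"{sample_id}_unmapped.fastq",
        }

    compressed = []
    for old_name, new_name in mapping.items():
        src = out_dir / old_name
        if not src.exists():
            continue  # skip files that are absent from out_dir
        renamed = out_dir / new_name
        src.rename(renamed)  # equivalent of `mv`
        gz_path = renamed.with_name(renamed.name + ".gz")
        with renamed.open("rb") as fin, gzip.open(gz_path, "wb") as fout:
            shutil.copyfileobj(fin, fout)  # equivalent of `gzip`
        renamed.unlink()  # gzip removes the uncompressed file by default
        compressed.append(gz_path.name)
    return sorted(compressed)
```

For a paired-end sample `S1`, calling `rename_and_compress_unmapped("/path/to/STAR/output/directory", "S1")` would leave `S1_R1_unmapped.fastq.gz` and `S1_R2_unmapped.fastq.gz`, matching the `*unmapped.fastq.gz` pattern listed in the output data.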

RNAseq/Pipeline_GL-DPPD-7115_Versions/GL-DPPD-7115.md

Lines changed: 6 additions & 3 deletions
@@ -327,6 +327,11 @@ bowtie2 -x /path/to/bowtie2/index \
 # --un-gz <sample_id>.unmapped.fastq.gz \ # For single-end data
 -S /path/to/bowtie2/output/directory/<sample_id>.sam \
 2> /path/to/bowtie2/output/directory/<sample_id>.bowtie2.log
+
+# Rename unmapped reads
+mv <sample_id>.unmapped.fastq.1.gz <sample_id>_R1_unmapped.fastq.gz # For paired-end data
+mv <sample_id>.unmapped.fastq.2.gz <sample_id>_R2_unmapped.fastq.gz # For paired-end data
+# mv <sample_id>.unmapped.fastq.gz <sample_id>_unmapped.fastq.gz # For single-end data
 ```
 
 **Parameter Definitions:**
@@ -352,9 +357,7 @@ bowtie2 -x /path/to/bowtie2/index \
 
 - *\.sam (alignments in SAM format)
 - **\*.bowtie2.log** (log file containing alignment statistics)
-- Unmapped reads (unmapped reads in FASTQ format)
-- **\*.unmapped.fastq.gz** (single-end)
-- **\*.unmapped.fastq.1.gz, .unmapped.fastq.2.gz** (paired-end)
+- **\*unmapped.fastq.gz** (unmapped and partially mapped reads)
 
 <br>
 

RNAseq/Workflow_Documentation/NF_RCP/CHANGELOG.md

Lines changed: 12 additions & 4 deletions
@@ -20,6 +20,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - Separate results are generated for rRNA-removed DGE analysis, with new output directories:
 - `04-DESeq2_NormCounts_rRNArm/`
 - `05-DESeq2_DGE_rRNArm/`
+- Added reference table support for Pseudomonas aeruginosa [#37](https://github.com/nasa/GeneLab_Data_Processing/issues/37)
+- Added V&V check for adapter content removal using FastQC/MultiQC reports from trimmed reads [#42](https://github.com/nasa/GeneLab_Data_Processing/issues/42)
+- Added generation of a CSV file summarizing parsed metrics from tool logs and MultiQC reports [#84](https://github.com/nasa/GeneLab_Data_Processing/issues/84)
 
 ### Changed
 

@@ -55,14 +58,19 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - Bacteria: Ensembl bacteria release 59
 - Added "_GLbulkRNAseq" suffix to output files
 - RSeQC inner_distance minimum value now dynamically set based on read length
-- DESeq2 analysis now handles technical replicates
+- DESeq2 analysis now handles technical replicates [#32](https://github.com/nasa/GeneLab_Data_Processing/issues/32)
 - MultiQC reports replaced with separate data zip and html files
+- Increased default memory allocation for the STAR alignment process to 40GB [#36](https://github.com/nasa/GeneLab_Data_Processing/issues/36)
+
+### Fixed
+
+- DGE validation script (`vv_dge_deseq2.py`) error with all-integer sample names [#112](https://github.com/nasa/GeneLab_Data_Processing/issues/112)
+- The `--accession` parameter (formerly `--gldsAccession`) is now optional for runsheet-based workflows; if omitted, outputs default to the 'results' directory [#35](https://github.com/nasa/GeneLab_Data_Processing/issues/35)
 
 ### Removed
 
 - ERCC-normalized DGE analysis and associated output files
-- GeneLab visualization output tables
-
+- GeneLab visualization output tables [#41](https://github.com/nasa/GeneLab_Data_Processing/issues/41)
 
 ## [1.0.4](https://github.com/nasa/GeneLab_Data_Processing/tree/NF_RCP-F_1.0.4/RNAseq/Workflow_Documentation/NF_RCP-F) - 2024-02-08
 

@@ -76,7 +84,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ### Changed
 
-- TrimGalore! will now use autodetect for adaptor type
+- TrimGalore! will now use autodetect for adaptor type [#20](https://github.com/nasa/GeneLab_Data_Processing/issues/20)
 - V&V migrated from dp_tools version 1.1.8 to 1.3.4 including:
 - Migration of V&V protocol code to this codebase instead of dp_tools
 - Fix for sample wise checks reusing same sample

RNAseq/Workflow_Documentation/NF_RCP/workflow_code/bin/add_gene_annotations.Rmd

Lines changed: 11 additions & 7 deletions
@@ -35,13 +35,17 @@ suppressMessages(library(tibble))
 ```{r, load-annotation-table}
 ### Read in annotation table for the appropriate organism ###
 
-annot <- read.table(
-    params$annotation_file_path,
-    sep = "\t",
-    header = TRUE,
-    quote = "",
-    comment.char = "",
-)
+if (is.null(params$annotation_file_path) || params$annotation_file_path == "" || params$annotation_file_path == "null") {
+    annot <- tibble::tibble()
+} else {
+    annot <- read.table(
+        params$annotation_file_path,
+        sep = "\t",
+        header = TRUE,
+        quote = "",
+        comment.char = ""
+    )
+}
 ```
 
 ```{r, load-table-to-annotate}

RNAseq/Workflow_Documentation/NF_RCP/workflow_code/bin/generate_protocol.py

Lines changed: 76 additions & 9 deletions
@@ -1,4 +1,4 @@
-#!/usr/bin/env python3
+#!/usr/bin/env python
 """
 This script generates a protocol text file for GeneLab RNA-seq data processing.
 It reads software versions from a YAML file and incorporates other parameters.
@@ -9,6 +9,8 @@
 import os
 import sys
 from datetime import datetime
+import pandas as pd
+import re
 
 def parse_args():
     parser = argparse.ArgumentParser(description='Generate protocol file for GeneLab RNA-seq pipeline')
@@ -38,6 +40,8 @@ def parse_args():
                         help='Path to the reference genome FASTA file')
     parser.add_argument('--reference_gtf', required=True,
                         help='Path to the reference genome GTF file')
+    parser.add_argument('--runsheet', required=False,
+                        help='Path to the runsheet CSV file')
     return parser.parse_args()
 
 def read_software_versions(yaml_file):
@@ -267,8 +271,54 @@ def generate_protocol_content(args, software_versions):
         # For standard workflow (include tximport)
         description += f"The runsheet was generated with dp_tools (version {dp_tools_version}) and the runsheet and quantification data were imported to R (version {r_version}) with tximport (version {tximport_version}) and normalized with DESeq2 (version {deseq2_version}) median of ratios method. "
 
+    # Parse runsheet for technical replicate handling
+    tech_rep_sentence = ""
+    if hasattr(args, 'runsheet') and args.runsheet and os.path.exists(args.runsheet):
+        try:
+            runsheet_df = pd.read_csv(args.runsheet)
+            if 'Sample Name' in runsheet_df.columns:
+                # Remove whitespace and NA
+                sample_names = runsheet_df['Sample Name'].dropna().astype(str).tolist()
+                # Remove trailing/leading whitespace
+                sample_names = [s.strip() for s in sample_names]
+                # Remove empty
+                sample_names = [s for s in sample_names if s]
+                # Find base names (remove _techrepN if present)
+                base_names = [re.sub(r'_techrep\d+$', '', s) for s in sample_names]
+                from collections import Counter
+                base_counts = Counter(base_names)
+                n_reps = list(base_counts.values())
+                unique_n = set(n_reps)
+                if all(x == 1 for x in n_reps):
+                    # No technical replicates at all
+                    tech_rep_sentence = ""
+                elif len(unique_n) == 1 and list(unique_n)[0] > 1:
+                    # All samples have the same number of tech reps
+                    tech_rep_sentence = ("Counts from all technical replicates for each sample were summed using DESeq2's collapseReplicates function. "
+                        "These collapsed counts were then used for count normalization and differential expression analysis. ")
+                elif len(unique_n) > 1 and min(unique_n) > 1:
+                    # All samples have tech reps, but unequal number
+                    tech_rep_sentence = ("For each sample, counts from the first n technical replicates were summed using DESeq2's collapseReplicates function. "
+                        "These collapsed counts were then used for count normalization and differential expression analysis. ")
+                else:
+                    # Some samples have tech reps, some don't
+                    tech_rep_sentence = ("For samples with technical replicates, only the first replicate was used for count normalization and differential expression analysis. ")
+        except Exception as e:
+            tech_rep_sentence = ""
+    # If no runsheet, leave tech_rep_sentence as empty
+
+    # Add ERCC normalization sentence if ERCC spike-ins were used
+    if args.has_ercc.lower() == "true":
+        description += ("The data were normalized twice, each time using a different size factor. "
+            "The first used non-ERCC genes for size factor estimation, and the second used only ERCC group B genes to estimate the size factor. "
+            "Both sets of normalized gene counts were subject to differential expression analysis. ")
+    else:
+        description += "Normalized gene counts were subject to differential expression analysis. "
+    # Add tech rep sentence
+    if tech_rep_sentence:
+        description += tech_rep_sentence
     # Add differential expression analysis sentence
-    description += f"Normalized gene counts were subject to differential expression analysis. Differential expression analysis was performed in R (version {r_version}) using DESeq2 (version {deseq2_version}); all groups were compared pairwise using the Wald test and the likelihood ratio test was used to generate the F statistic p-value. "
+    description += f"Differential expression analysis was performed in R (version {r_version}) using DESeq2 (version {deseq2_version}); all groups were compared pairwise using the Wald test and the likelihood ratio test was used to generate the F statistic p-value. "
 
     # Add gene annotations section
     # Define versions for annotation packages
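The technical-replicate branching in this hunk reduces to four cases keyed on how many `_techrepN` entries share a base sample name. The following standalone sketch isolates that decision so the cases are easy to check. The function name `classify_tech_reps` and the labels it returns are hypothetical and do not appear in the commit, which builds sentences directly instead.

```python
import re
from collections import Counter


def classify_tech_reps(sample_names):
    """Classify the technical-replicate layout of a list of sample names.

    Hypothetical helper; the labels map onto the branches in the diff:
      "none"    -> no sample has technical replicates
      "equal"   -> every sample has the same number (>1) of replicates
      "unequal" -> every sample has replicates, but the counts differ
      "partial" -> only some samples have replicates
    """
    # Drop missing and empty names, trimming surrounding whitespace
    cleaned = [str(s).strip() for s in sample_names if s is not None]
    cleaned = [s for s in cleaned if s]
    # Strip a trailing _techrepN suffix to recover the base sample name
    base_names = [re.sub(r"_techrep\d+$", "", s) for s in cleaned]
    counts = set(Counter(base_names).values())
    if not counts or counts == {1}:
        return "none"
    if len(counts) == 1:
        return "equal"
    if min(counts) > 1:
        return "unequal"
    return "partial"
```

For instance, `["A_techrep1", "A_techrep2", "B"]` falls into the "partial" case, which corresponds to the sentence stating that only the first replicate was used.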
@@ -321,15 +371,32 @@ def generate_protocol_content(args, software_versions):
     organism_formatted = args.organism.replace(' ', '_').replace('-', '_').lower()
 
     # Build gene annotations sentence
-    gene_annotations_text = f"Gene annotations were assigned using the custom annotation tables generated in-house as detailed in GL-DPPD-7110-A (https://github.com/nasa/GeneLab_Data_Processing/blob/master/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md), with STRINGdb (version {stringdb_version}) and PANTHER.db (version {pantherdb_version})"
-
-    # Add organism-specific annotation package if it's an officially supported organism
+    annotation_sources = []
+    # Only include STRINGdb if not in organisms_without_string
+    if organism_formatted not in organisms_without_string:
+        annotation_sources.append(f"STRINGdb (version {stringdb_version})")
+    # Only include PANTHER.db if not in organisms_without_panther
+    if organism_formatted not in organisms_without_panther:
+        annotation_sources.append(f"PANTHER.db (version {pantherdb_version})")
+    # Custom annotation package
+    custom_pkg = None
+    if hasattr(args, 'organism') and args.organism and args.organism in organisms_with_custom_annotations:
+        custom_pkg = "a custom annotation package generated in-house using AnnotationForge"
+        annotation_sources.append(custom_pkg)
+    # org.*.eg.db package
     if organism_formatted and organism_formatted in organism_annotation_packages:
         package_name, package_version = organism_annotation_packages[organism_formatted]
-        gene_annotations_text += f", and {package_name} (version {package_version})"
-
-    # Complete the gene annotations sentence
-    gene_annotations_text += "."
+        annotation_sources.append(f"{package_name} (version {package_version})")
+    # Build the sentence
+    base_text = ("Gene annotations were assigned using the custom annotation tables generated in-house as detailed in GL-DPPD-7110-A "
+        "(https://github.com/nasa/GeneLab_Data_Processing/blob/GL_RefAnnotTable-A_1.1.0/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md)")
+    if annotation_sources:
+        if len(annotation_sources) == 1:
+            gene_annotations_text = f"{base_text}, with {annotation_sources[0]}."
+        else:
+            gene_annotations_text = f"{base_text}, with {', '.join(annotation_sources[:-1])}, and {annotation_sources[-1]}."
+    else:
+        gene_annotations_text = f"{base_text}."
     description += gene_annotations_text
 
     # Add ERCC assessment sentence if ERCC spike-ins were used
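The sentence-building logic in this hunk can be exercised on its own to see how zero, one, or several annotation sources are joined. This is a minimal sketch under the assumption that the behavior matches the added lines; the function name `build_annotation_sentence` is hypothetical and is not defined in the commit.

```python
def build_annotation_sentence(base_text, annotation_sources):
    """Join annotation sources onto a base sentence.

    Hypothetical helper that reproduces the three branches in the diff:
    no sources, exactly one source, or a comma-separated list ending in ", and".
    """
    if not annotation_sources:
        return f"{base_text}."
    if len(annotation_sources) == 1:
        return f"{base_text}, with {annotation_sources[0]}."
    leading = ", ".join(annotation_sources[:-1])
    return f"{base_text}, with {leading}, and {annotation_sources[-1]}."


# Example with placeholder source names
print(build_annotation_sentence("Annotations were assigned", ["X", "Y", "Z"]))
```

Note that with exactly two sources this pattern yields "with X, and Y.", so a comma precedes "and" even in the two-item case.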

RNAseq/Workflow_Documentation/NF_RCP/workflow_code/bin/software_versions.py

Lines changed: 10 additions & 3 deletions
@@ -22,6 +22,7 @@
 
 CONFIG = {
     "rnaseq": [
+        ["NF_RCP", "https://github.com/nasa/GeneLab_Data_Processing/tree/master/RNAseq"],
         ["Nextflow", "https://github.com/nextflow-io/nextflow"],
         ["dp_tools", "https://github.com/J-81/dp_tools"],
         ["FastQC", "https://www.bioinformatics.babraham.ac.uk/projects/fastqc/"],
@@ -74,12 +75,16 @@ def compare_versions(v1, v2):
     # Fallback for non-standard version strings
     return str(v1) > str(v2)
 
-def main(versions_json_path: Path, output_path: Path, assay: str = 'rnaseq'):
+def main(versions_json_path: Path, output_path: Path, assay: str = 'rnaseq', workflow: str = None, workflow_version: str = None):
     software_urls = {name: url for name, url in CONFIG[assay]}
     known_names = CONFIG[assay]
 
     processed_versions = {}
 
+    # Add workflow version if provided
+    if workflow and workflow_version:
+        processed_versions[workflow] = workflow_version
+
     with versions_json_path.open() as f:
         data = yaml.safe_load(f)
     # Flatten nested structure
@@ -155,7 +160,9 @@ def main(versions_json_path: Path, output_path: Path, assay: str = 'rnaseq'):
 @click.argument('input', type=click.Path(exists=True))
 @click.argument('output', type=click.Path())
 @click.option('--assay', type=click.Choice(['rnaseq']), default='rnaseq')
-def cli(input, output, assay):
-    main(Path(input), Path(output), assay)
+@click.option('--workflow', type=str, help='Workflow name')
+@click.option('--workflow_version', type=str, help='Workflow version')
+def cli(input, output, assay, workflow, workflow_version):
+    main(Path(input), Path(output), assay, workflow, workflow_version)
 
 cli()

RNAseq/Workflow_Documentation/NF_RCP/workflow_code/bin/sort_into_subdirectories.py

Lines changed: 2 additions & 1 deletion
@@ -24,13 +24,14 @@
 # For a given directory, sort all files into {sample: str, [files: str]}
 files_by_sample = dict()
 for sample in samples:
+    sample = str(sample) # Add this line before the path is constructed
     pattern = f"{sample}{args.glob_suffix}"
     print(f"Looking for files matching: {pattern}")
     files_for_this_sample = list(Path(args.from_dir).glob(pattern))
 
     # Move files
     for file in files_for_this_sample:
-        dest = Path(args.to_dir) / sample / file.name
+        dest = Path(args.to_dir) / str(sample) / file.name
         print(f"Moving {file} to {dest}")
         dest.parent.mkdir( parents=True, exist_ok=True )
         shutil.move(file, dest)
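The `str(sample)` cast matters because `pathlib.Path` only joins with strings or path-like objects. The sketch below illustrates the failure and the fix, assuming that sample names can arrive as integers (for example, when an all-numeric column is parsed from a runsheet); the helper name `destination_for` is hypothetical.

```python
from pathlib import Path


def destination_for(to_dir, sample, file_name):
    """Build the destination path, casting the sample name to str first.

    Hypothetical helper mirroring the fixed line in the diff.
    """
    return Path(to_dir) / str(sample) / file_name


# Without the cast, joining a Path with an int raises TypeError
try:
    Path("out") / 123 / "reads.fastq.gz"
    joined_without_cast = True
except TypeError:
    joined_without_cast = False

print(joined_without_cast)                              # False
print(destination_for("out", 123, "reads.fastq.gz"))
```

Casting once at the top of the loop, as the commit does, also keeps the glob pattern and the destination directory consistent with each other.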

RNAseq/Workflow_Documentation/NF_RCP/workflow_code/bin/sort_into_subdirectories_by_sample.py

Lines changed: 0 additions & 29 deletions
This file was deleted.
