Skip to content

Commit 1d3fa66

Browse files
authored
Merge pull request #165 from torres-alexis/DEV_RNAseq_vG
- Fixed fastqc parsing in parse_multiqc.py - Updated documentation with new version number - Added validation output file for parse_multiqc.py
2 parents 36ead3d + cbe694c commit 1d3fa66

File tree

8 files changed

+247
-23
lines changed

8 files changed

+247
-23
lines changed

RNAseq/Workflow_Documentation/NF_RCP/CHANGELOG.md

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -5,6 +5,12 @@ All notable changes to this project will be documented in this file.
55
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
66
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
77

8+
## [2.0.1] - (https://github.com/nasa/GeneLab_Data_Processing/tree/NF_RCP_2.0.1/RNAseq/Workflow_Documentation/NF_RCP) - 2025-06-26
9+
10+
### Fixed
11+
12+
- Fixed trimming metrics extraction in `parse_multiqc.py` script
13+
814
## [2.0.0](https://github.com/nasa/GeneLab_Data_Processing/tree/NF_RCP_2.0.0/RNAseq/Workflow_Documentation/NF_RCP) - 2025-04-10
915

1016
### Added

RNAseq/Workflow_Documentation/NF_RCP/README.md

Lines changed: 11 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -128,9 +128,9 @@ All files required for utilizing the NF_RCP GeneLab workflow for processing RNAs
128128
copy of latest NF_RCP version on to your system, the code can be downloaded as a zip file from the release page then unzipped after downloading by running the following commands:
129129
130130
```bash
131-
wget https://github.com/nasa/GeneLab_Data_Processing/releases/download/NF_RCP_2.0.0/NF_RCP_2.0.0.zip
131+
wget https://github.com/nasa/GeneLab_Data_Processing/releases/download/NF_RCP_2.0.1/NF_RCP_2.0.1.zip
132132
133-
unzip NF_RCP_2.0.0.zip
133+
unzip NF_RCP_2.0.1.zip
134134
```
135135
136136
<br>
@@ -142,10 +142,10 @@ unzip NF_RCP_2.0.0.zip
142142
Although Nextflow can fetch Singularity images from a url, doing so may cause issues as detailed [here](https://github.com/nextflow-io/nextflow/issues/1210).
143143
144144
To avoid this issue, run the following command to fetch the Singularity images prior to running the NF_RCP workflow:
145-
> Note: This command should be run in the location containing the `NF_RCP_2.0.0` directory that was downloaded in [step 2](#2-download-the-workflow-files) above. Depending on your network speed, fetching the images will take ~20 minutes. Approximately 8GB of RAM is needed to download and build the Singularity images.
145+
> Note: This command should be run in the location containing the `NF_RCP_2.0.1` directory that was downloaded in [step 2](#2-download-the-workflow-files) above. Depending on your network speed, fetching the images will take ~20 minutes. Approximately 8GB of RAM is needed to download and build the Singularity images.
146146
147147
```bash
148-
bash NF_RCP_2.0.0/bin/prepull_singularity.sh NF_RCP_2.0.0/config/software/by_docker_image.config
148+
bash NF_RCP_2.0.1/bin/prepull_singularity.sh NF_RCP_2.0.1/config/software/by_docker_image.config
149149
```
150150
151151
@@ -161,7 +161,7 @@ export NXF_SINGULARITY_CACHEDIR=$(pwd)/singularity
161161
162162
### 4. Run the Workflow
163163
164-
While in the location containing the `NF_RCP_2.0.0` directory that was downloaded in [step 2](#2-download-the-workflow-files), you are now able to run the workflow.
164+
While in the location containing the `NF_RCP_2.0.1` directory that was downloaded in [step 2](#2-download-the-workflow-files), you are now able to run the workflow.
165165
166166
Both workflows automatically load reference files and organism-specific gene annotation files from the [GeneLab annotations table](https://github.com/nasa/GeneLab_Data_Processing/blob/master/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv). For organisms not listed in the table or to use alternative reference files, additional workflow parameters can be specified.
167167
@@ -175,7 +175,7 @@ Both workflows automatically load reference files and organism-specific gene ann
175175
#### 4a. Approach 1: Run the workflow on a GeneLab RNAseq dataset with automatic retrieval of reference fasta and gtf files
176176
177177
```bash
178-
nextflow run NF_RCP_2.0.0/main.nf \
178+
nextflow run NF_RCP_2.0.1/main.nf \
179179
-profile singularity,local \
180180
--accession OSD-194
181181
```
@@ -187,7 +187,7 @@ nextflow run NF_RCP_2.0.0/main.nf \
187187
#### 4b. Approach 2: Run the workflow on a GeneLab RNAseq dataset with custom reference fasta and gtf files
188188
189189
```bash
190-
nextflow run NF_RCP_2.0.0/main.nf \
190+
nextflow run NF_RCP_2.0.1/main.nf \
191191
-profile singularity,local \
192192
--accession OSD-194 \
193193
--reference_version 112 \
@@ -205,7 +205,7 @@ nextflow run NF_RCP_2.0.0/main.nf \
205205
#### 4c. Approach 3: Run the workflow on a non-GeneLab dataset using a user-created runsheet with automatic retrieval of reference fasta and gtf files
206206
207207
```bash
208-
nextflow run NF_RCP_2.0.0/main.nf \
208+
nextflow run NF_RCP_2.0.1/main.nf \
209209
-profile singularity,local \
210210
--runsheet_path </path/to/runsheet>
211211
```
@@ -217,7 +217,7 @@ nextflow run NF_RCP_2.0.0/main.nf \
217217
#### 4d. Approach 4: Run the workflow on a non-GeneLab dataset using a user-created runsheet with custom reference fasta and gtf files
218218
219219
```bash
220-
nextflow run NF_RCP_2.0.0/main.nf \
220+
nextflow run NF_RCP_2.0.1/main.nf \
221221
-profile singularity \
222222
--accession OSD-194 \
223223
--reference_version 112 \
@@ -235,7 +235,7 @@ nextflow run NF_RCP_2.0.0/main.nf \
235235
236236
#### Required Parameters For All Approaches:
237237
238-
* `NF_RCP_2.0.0/main.nf` - Instructs Nextflow to run the NF_RCP workflow
238+
* `NF_RCP_2.0.1/main.nf` - Instructs Nextflow to run the NF_RCP workflow
239239
240240
* `-profile` - Specifies the configuration profile(s) to load, `singularity` instructs Nextflow to setup and use singularity for all software called in the workflow; use `local` for local execution ([local.config](workflow_code/conf/local.config)) or `slurm` for SLURM cluster execution ([slurm.config](workflow_code/conf/slurm.config))
241241
> Note: The output directory will be named `GLDS-#` when using a OSD or GLDS accession as input, or `results` when running the workflow with only a runsheet as input.
@@ -313,7 +313,7 @@ nextflow run NF_RCP_2.0.0/main.nf \
313313
All parameters listed above and additional optional arguments for the RCP workflow, including debug related options that may not be immediately useful for most users, can be viewed by running the following command:
314314
315315
```bash
316-
nextflow run NF_RCP_2.0.0/main.nf --help
316+
nextflow run NF_RCP_2.0.1/main.nf --help
317317
```
318318
319319
See `nextflow run -h` and [Nextflow's CLI run command documentation](https://nextflow.io/docs/latest/cli.html#run) for more options and details common to all nextflow workflows.

RNAseq/Workflow_Documentation/NF_RCP/workflow_code/bin/parse_multiqc.py

Lines changed: 221 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -14,7 +14,180 @@
1414
import os
1515

1616

17-
def main(osd_num, paired_end, assay_suffix, mode):
17+
def get_runsheet_order(runsheet_path):
18+
"""Read runsheet and return ordered list of sample names"""
19+
if not runsheet_path or not os.path.exists(runsheet_path):
20+
return None
21+
try:
22+
df = pd.read_csv(runsheet_path)
23+
if 'Sample Name' in df.columns:
24+
return df['Sample Name'].tolist()
25+
except Exception as e:
26+
print(f"WARNING: Error reading runsheet: {str(e)}")
27+
return None
28+
29+
30+
def generate_validation_report(fieldnames, populated_fields, mode, assay_suffix, paired_end):
31+
"""Generate a validation report showing which columns are missing data"""
32+
33+
# Define field categories
34+
metadata_fields = [
35+
'osd_num', 'sample', 'organism', 'tissue', 'sequencing_instrument',
36+
'library_selection', 'library_layout', 'strandedness', 'read_depth',
37+
'read_length', 'rrna_contamination', 'rin', 'organism_part', 'cell_line',
38+
'cell_type', 'secondary_organism', 'strain', 'animal_source', 'seed_source',
39+
'source_accession', 'mix'
40+
]
41+
42+
gene_count_fields = ['gene_detected_gt10', 'gene_total', 'gene_detected_gt10_pct']
43+
44+
fastqc_raw_fields = [f for f in fieldnames if f.startswith('raw_')]
45+
fastqc_trimmed_fields = [f for f in fieldnames if f.startswith('trimmed_')]
46+
47+
# For single-end data, exclude _r fields from FastQC sections
48+
if not paired_end:
49+
fastqc_raw_fields = [f for f in fastqc_raw_fields if not f.endswith('_r')]
50+
fastqc_trimmed_fields = [f for f in fastqc_trimmed_fields if not f.endswith('_r')]
51+
52+
star_fields = [
53+
'uniquely_mapped_percent', 'multimapped_percent', 'multimapped_toomany_percent',
54+
'unmapped_tooshort_percent', 'unmapped_other_percent'
55+
]
56+
57+
bowtie2_fields = [
58+
'total_reads', 'overall_alignment_rate', 'aligned_none', 'aligned_one', 'aligned_multi'
59+
]
60+
61+
featurecounts_fields = [
62+
'total_count', 'num_assigned', 'pct_assigned', 'num_unassigned_nofeatures',
63+
'num_unassigned_ambiguity', 'pct_unassigned_nofeatures', 'pct_unassigned_ambiguity'
64+
]
65+
66+
rseqc_fields = [
67+
'mean_genebody_cov_5_20', 'mean_genebody_cov_40_60', 'mean_genebody_cov_80_95',
68+
'ratio_genebody_cov_3_to_5', 'pct_sense', 'pct_antisense', 'pct_undetermined',
69+
'cds_exons_pct', '5_utr_exons_pct', '3_utr_exons_pct', 'introns_pct',
70+
'tss_up_1kb_pct', 'tss_up_1kb_5kb_pct', 'tss_up_5kb_10kb_pct', 'tes_down_1kb_pct',
71+
'tss_down_1kb_5kb_pct', 'tss_down_5kb_10kb_pct', 'other_intergenic_pct'
72+
]
73+
74+
# Add inner distance fields only for paired-end data
75+
if paired_end:
76+
rseqc_fields.extend(['peak_inner_dist', 'peak_inner_dist_pct_reads'])
77+
78+
rsem_fields = [
79+
'num_uniquely_aligned', 'pct_uniquely_aligned', 'pct_multi_aligned',
80+
'pct_filtered', 'pct_unalignable'
81+
]
82+
83+
def get_missing_fields(field_list):
84+
return [f for f in field_list if f in fieldnames and f not in populated_fields]
85+
86+
report_filename = f'qc_validation{assay_suffix}.txt'
87+
with open(report_filename, 'w') as f:
88+
# Header
89+
f.write("QC metrics validation report\n\n")
90+
mode_display = "Microbes" if mode == 'microbes' else "Default"
91+
data_type = "Paired end" if paired_end else "Single end"
92+
f.write(f"Mode: {mode_display}\n")
93+
f.write(f"Data type: {data_type}\n\n")
94+
f.write("-" * 40 + "\n\n")
95+
96+
# Calculate summary stats
97+
missing_metadata_count = len([f for f in metadata_fields if f in fieldnames and f not in populated_fields])
98+
total_metadata_count = len([f for f in metadata_fields if f in fieldnames])
99+
100+
# Count expected MultiQC fields based on mode (excluding metadata)
101+
expected_multiqc_fields = gene_count_fields + fastqc_raw_fields + fastqc_trimmed_fields + rseqc_fields
102+
if mode == 'microbes':
103+
expected_multiqc_fields += bowtie2_fields + featurecounts_fields
104+
else:
105+
expected_multiqc_fields += star_fields + rsem_fields
106+
107+
missing_multiqc_count = len([f for f in expected_multiqc_fields if f in fieldnames and f not in populated_fields])
108+
total_multiqc_count = len([f for f in expected_multiqc_fields if f in fieldnames])
109+
110+
# Summary section
111+
f.write("Summary:\n")
112+
f.write(f"* {missing_metadata_count}/{total_metadata_count} Metadata fields are empty\n")
113+
f.write(f"* {missing_multiqc_count}/{total_multiqc_count} expected MultiQC fields are empty\n\n")
114+
115+
f.write("Categories missing entries:\n\n")
116+
117+
# Metadata
118+
missing_metadata = get_missing_fields(metadata_fields)
119+
if missing_metadata:
120+
f.write("* Metadata:\n")
121+
for field in missing_metadata:
122+
f.write(f"** {field}\n")
123+
f.write("\n")
124+
125+
# Gene count
126+
missing_gene_count = get_missing_fields(gene_count_fields)
127+
if missing_gene_count:
128+
f.write("* Gene Count:\n")
129+
for field in missing_gene_count:
130+
f.write(f"** {field}\n")
131+
f.write("\n")
132+
133+
# FastQC Raw
134+
missing_fastqc_raw = get_missing_fields(fastqc_raw_fields)
135+
if missing_fastqc_raw:
136+
f.write("* FastQC Raw:\n")
137+
for field in missing_fastqc_raw:
138+
f.write(f"** {field}\n")
139+
f.write("\n")
140+
141+
# FastQC Trimmed
142+
missing_fastqc_trimmed = get_missing_fields(fastqc_trimmed_fields)
143+
if missing_fastqc_trimmed:
144+
f.write("* FastQC Trimmed:\n")
145+
for field in missing_fastqc_trimmed:
146+
f.write(f"** {field}\n")
147+
f.write("\n")
148+
149+
# Mode-specific sections
150+
if mode == 'microbes':
151+
# For microbes mode, check Bowtie2 and FeatureCounts (STAR/RSEM expected to be empty)
152+
missing_bowtie2 = get_missing_fields(bowtie2_fields)
153+
if missing_bowtie2:
154+
f.write("* Bowtie2 (Prokaryotes):\n")
155+
for field in missing_bowtie2:
156+
f.write(f"** {field}\n")
157+
f.write("\n")
158+
159+
missing_featurecounts = get_missing_fields(featurecounts_fields)
160+
if missing_featurecounts:
161+
f.write("* FeatureCounts (Prokaryotes):\n")
162+
for field in missing_featurecounts:
163+
f.write(f"** {field}\n")
164+
f.write("\n")
165+
else:
166+
# For eukaryotic mode, check STAR and RSEM (Bowtie2/FeatureCounts expected to be empty)
167+
missing_star = get_missing_fields(star_fields)
168+
if missing_star:
169+
f.write("* STAR (Eukaryotes):\n")
170+
for field in missing_star:
171+
f.write(f"** {field}\n")
172+
f.write("\n")
173+
174+
missing_rsem = get_missing_fields(rsem_fields)
175+
if missing_rsem:
176+
f.write("* RSEM (Eukaryotes):\n")
177+
for field in missing_rsem:
178+
f.write(f"** {field}\n")
179+
f.write("\n")
180+
181+
# RSeQC (relevant for both modes)
182+
missing_rseqc = get_missing_fields(rseqc_fields)
183+
if missing_rseqc:
184+
f.write("* RSeQC:\n")
185+
for field in missing_rseqc:
186+
f.write(f"** {field}\n")
187+
f.write("\n")
188+
189+
190+
def main(osd_num, paired_end, assay_suffix, mode, runsheet=None):
18191

19192
osd_num = osd_num.split('-')[1]
20193

@@ -47,6 +220,15 @@ def main(osd_num, paired_end, assay_suffix, mode):
47220

48221
samples = set([s for ss in multiqc_data for s in ss])
49222

223+
# Order samples according to runsheet if provided
224+
runsheet_order = get_runsheet_order(runsheet)
225+
if runsheet_order:
226+
ordered_samples = [s for s in runsheet_order if s in samples]
227+
extra_samples = [s for s in samples if s not in runsheet_order]
228+
samples = ordered_samples + extra_samples
229+
else:
230+
samples = sorted(samples) # Fallback to alphabetical
231+
50232
metadata = get_metadata(osd_num)
51233

52234
fieldnames = [
@@ -90,6 +272,10 @@ def main(osd_num, paired_end, assay_suffix, mode):
90272
fieldnames_set = set(fieldnames)
91273

92274
output_filename = f'qc_metrics{assay_suffix}.csv'
275+
276+
# Track which fields have data for validation report
277+
populated_fields = set()
278+
93279
with open(output_filename, mode='w', newline='') as f:
94280
writer = csv.DictWriter(f, fieldnames=fieldnames)
95281
writer.writeheader()
@@ -103,13 +289,30 @@ def main(osd_num, paired_end, assay_suffix, mode):
103289
# Only keep fields that are in the fieldnames list
104290
if k in fieldnames_set:
105291
all_fields[k] = v
292+
if v is not None and v != '': # Track populated fields
293+
populated_fields.add(k)
106294
else:
107295
# Optionally add debug output to see which fields are being skipped
108296
# print(f"Skipping field not in fieldnames: {k}")
109297
pass
110298

111-
# Write the row with filtered fields
299+
# Track fields that are always populated
300+
populated_fields.add('osd_num')
301+
populated_fields.add('sample')
302+
303+
# Track populated metadata fields
304+
for k, v in metadata.items():
305+
if v is not None and v != '':
306+
populated_fields.add(k)
307+
308+
# Write rows with osd_num and sample fields
112309
writer.writerow({'osd_num': 'OSD-' + osd_num, 'sample': sample, **metadata, **all_fields})
310+
311+
# Generate validation report
312+
try:
313+
generate_validation_report(fieldnames, populated_fields, mode, assay_suffix, paired_end)
314+
except Exception as e:
315+
print(f"WARNING: Failed to generate validation report: {str(e)}")
113316

114317

115318
def get_metadata(osd_num):
@@ -187,9 +390,20 @@ def parse_fastqc(prefix, assay_suffix):
187390
with open(f'{prefix}_multiqc{assay_suffix}_data/multiqc_data.json') as f:
188391
j = json.loads(f.read())
189392

393+
# Find FastQC section by looking for FastQC-specific fields
394+
fastqc_section = None
395+
for section in j['report_general_stats_data']:
396+
sample_data = next(iter(section.values()), {})
397+
if 'total_sequences' in sample_data and 'percent_gc' in sample_data:
398+
fastqc_section = section
399+
break
400+
401+
if not fastqc_section:
402+
return {}
403+
190404
# Group the samples by base name for paired end data
191405
sample_groups = {}
192-
for sample in j['report_general_stats_data'][-1].keys():
406+
for sample in fastqc_section.keys():
193407
# Handle various naming patterns
194408
if ' Read 1' in sample:
195409
base_name = sample.replace(' Read 1', '')
@@ -225,13 +439,13 @@ def parse_fastqc(prefix, assay_suffix):
225439

226440
# Process forward read
227441
if reads['f']:
228-
for k, v in j['report_general_stats_data'][-1][reads['f']].items():
442+
for k, v in fastqc_section[reads['f']].items():
229443
if k != 'percent_fails':
230444
data[base_name][prefix + '_' + k + '_f'] = v
231445

232446
# Process reverse read
233447
if reads['r']:
234-
for k, v in j['report_general_stats_data'][-1][reads['r']].items():
448+
for k, v in fastqc_section[reads['r']].items():
235449
if k != 'percent_fails':
236450
data[base_name][prefix + '_' + k + '_r'] = v
237451

@@ -539,6 +753,7 @@ def get_genecount(assay_suffix, mode):
539753
parser.add_argument('--paired', action='store_true')
540754
parser.add_argument('--assay_suffix', default='_GLbulkRNAseq')
541755
parser.add_argument('--mode', default='default')
756+
parser.add_argument('--runsheet', default=None)
542757
args = parser.parse_args()
543758

544-
main(args.osd_num, args.paired, args.assay_suffix, args.mode)
759+
main(args.osd_num, args.paired, args.assay_suffix, args.mode, args.runsheet)

0 commit comments

Comments
 (0)