nasa
diff --git a/‎RNAseq/Workflow_Documentation/NF_RCP/CHANGELOG.md
Lines changed: 6 additions & 0 deletions b/‎RNAseq/Workflow_Documentation/NF_RCP/CHANGELOG.md
Lines changed: 6 additions & 0 deletions
diff --git a/‎RNAseq/Workflow_Documentation/NF_RCP/README.md
Lines changed: 11 additions & 11 deletions b/‎RNAseq/Workflow_Documentation/NF_RCP/README.md
Lines changed: 11 additions & 11 deletions
diff --git a/‎RNAseq/Workflow_Documentation/NF_RCP/workflow_code/bin/parse_multiqc.py
Lines changed: 221 additions & 6 deletions b/‎RNAseq/Workflow_Documentation/NF_RCP/workflow_code/bin/parse_multiqc.py
Lines changed: 221 additions & 6 deletions
@@ -5,6 +5,12 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [2.0.1] - (https://github.com/nasa/GeneLab_Data_Processing/tree/NF_RCP_2.0.1/RNAseq/Workflow_Documentation/NF_RCP) - 2025-06-26
+
+### Fixed
+
+- Fixed trimming metrics extraction in `parse_multiqc.py` script 
+
 ## [2.0.0](https://github.com/nasa/GeneLab_Data_Processing/tree/NF_RCP_2.0.0/RNAseq/Workflow_Documentation/NF_RCP) - 2025-04-10
 
 ### Added
 
@@ -128,9 +128,9 @@ All files required for utilizing the NF_RCP GeneLab workflow for processing RNAs
 copy of latest NF_RCP version on to your system, the code can be downloaded as a zip file from the release page then unzipped after downloading by running the following commands: 
 
 ```bash
-wget https://github.com/nasa/GeneLab_Data_Processing/releases/download/NF_RCP_2.0.0/NF_RCP_2.0.0.zip
+wget https://github.com/nasa/GeneLab_Data_Processing/releases/download/NF_RCP_2.0.1/NF_RCP_2.0.1.zip
 
-unzip NF_RCP_2.0.0.zip
+unzip NF_RCP_2.0.1.zip
 ```
 
 <br>
@@ -142,10 +142,10 @@ unzip NF_RCP_2.0.0.zip
 Although Nextflow can fetch Singularity images from a url, doing so may cause issues as detailed [here](https://github.com/nextflow-io/nextflow/issues/1210).
 
 To avoid this issue, run the following command to fetch the Singularity images prior to running the NF_RCP workflow:
-> Note: This command should be run in the location containing the `NF_RCP_2.0.0` directory that was downloaded in [step 2](#2-download-the-workflow-files) above. Depending on your network speed, fetching the images will take ~20 minutes. Approximately 8GB of RAM is needed to download and build the Singularity images.
+> Note: This command should be run in the location containing the `NF_RCP_2.0.1` directory that was downloaded in [step 2](#2-download-the-workflow-files) above. Depending on your network speed, fetching the images will take ~20 minutes. Approximately 8GB of RAM is needed to download and build the Singularity images.
 
 ```bash
-bash NF_RCP_2.0.0/bin/prepull_singularity.sh NF_RCP_2.0.0/config/software/by_docker_image.config
+bash NF_RCP_2.0.1/bin/prepull_singularity.sh NF_RCP_2.0.1/config/software/by_docker_image.config
 ```
 
 
@@ -161,7 +161,7 @@ export NXF_SINGULARITY_CACHEDIR=$(pwd)/singularity
 
 ### 4. Run the Workflow
 
-While in the location containing the `NF_RCP_2.0.0` directory that was downloaded in [step 2](#2-download-the-workflow-files), you are now able to run the workflow.
+While in the location containing the `NF_RCP_2.0.1` directory that was downloaded in [step 2](#2-download-the-workflow-files), you are now able to run the workflow.
 
 Both workflows automatically load reference files and organism-specific gene annotation files from the [GeneLab annotations table](https://github.com/nasa/GeneLab_Data_Processing/blob/master/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv). For organisms not listed in the table or to use alternative reference files, additional workflow parameters can be specified.
 
@@ -175,7 +175,7 @@ Both workflows automatically load reference files and organism-specific gene ann
 #### 4a. Approach 1: Run the workflow on a GeneLab RNAseq dataset with automatic retrieval of reference fasta and gtf files
 
 ```bash
-nextflow run NF_RCP_2.0.0/main.nf \ 
+nextflow run NF_RCP_2.0.1/main.nf \ 
    -profile singularity,local \
    --accession OSD-194 
 ```
@@ -187,7 +187,7 @@ nextflow run NF_RCP_2.0.0/main.nf \
 #### 4b. Approach 2: Run the workflow on a GeneLab RNAseq dataset with custom reference fasta and gtf files
 
 ```bash
-nextflow run NF_RCP_2.0.0/main.nf \ 
+nextflow run NF_RCP_2.0.1/main.nf \ 
    -profile singularity,local \
    --accession OSD-194 \
    --reference_version 112 \
@@ -205,7 +205,7 @@ nextflow run NF_RCP_2.0.0/main.nf \
 #### 4c. Approach 3: Run the workflow on a non-GeneLab dataset using a user-created runsheet with automatic retrieval of reference fasta and gtf files
 
 ```bash
-nextflow run NF_RCP_2.0.0/main.nf \ 
+nextflow run NF_RCP_2.0.1/main.nf \ 
    -profile singularity,local \
    --runsheet_path </path/to/runsheet> 
 ```
@@ -217,7 +217,7 @@ nextflow run NF_RCP_2.0.0/main.nf \
 #### 4d. Approach 4: Run the workflow on a non-GeneLab dataset using a user-created runsheet with custom reference fasta and gtf files
 
 ```bash
-nextflow run NF_RCP_2.0.0/main.nf \ 
+nextflow run NF_RCP_2.0.1/main.nf \ 
    -profile singularity \
    --accession OSD-194 \
    --reference_version 112 \
@@ -235,7 +235,7 @@ nextflow run NF_RCP_2.0.0/main.nf \
 
 #### Required Parameters For All Approaches:
 
-* `NF_RCP_2.0.0/main.nf` - Instructs Nextflow to run the NF_RCP workflow 
+* `NF_RCP_2.0.1/main.nf` - Instructs Nextflow to run the NF_RCP workflow 
 
 * `-profile` - Specifies the configuration profile(s) to load, `singularity` instructs Nextflow to setup and use singularity for all software called in the workflow; use `local` for local execution ([local.config](workflow_code/conf/local.config)) or `slurm` for SLURM cluster execution ([slurm.config](workflow_code/conf/slurm.config))
   > Note: The output directory will be named `GLDS-#` when using a OSD or GLDS accession as input, or `results` when running the workflow with only a runsheet as input.
@@ -313,7 +313,7 @@ nextflow run NF_RCP_2.0.0/main.nf \
 All parameters listed above and additional optional arguments for the RCP workflow, including debug related options that may not be immediately useful for most users, can be viewed by running the following command:
 
 ```bash
-nextflow run NF_RCP_2.0.0/main.nf --help
+nextflow run NF_RCP_2.0.1/main.nf --help
 ```
 
 See `nextflow run -h` and [Nextflow's CLI run command documentation](https://nextflow.io/docs/latest/cli.html#run) for more options and details common to all nextflow workflows.
 
@@ -14,7 +14,180 @@
 import os
 
 
-def main(osd_num, paired_end, assay_suffix, mode):
+def get_runsheet_order(runsheet_path):
+    """Read runsheet and return ordered list of sample names"""
+    if not runsheet_path or not os.path.exists(runsheet_path):
+        return None
+    try:
+        df = pd.read_csv(runsheet_path)
+        if 'Sample Name' in df.columns:
+            return df['Sample Name'].tolist()
+    except Exception as e:
+        print(f"WARNING: Error reading runsheet: {str(e)}")
+    return None
+
+
+def generate_validation_report(fieldnames, populated_fields, mode, assay_suffix, paired_end):
+    """Generate a validation report showing which columns are missing data"""
+    
+    # Define field categories
+    metadata_fields = [
+        'osd_num', 'sample', 'organism', 'tissue', 'sequencing_instrument', 
+        'library_selection', 'library_layout', 'strandedness', 'read_depth', 
+        'read_length', 'rrna_contamination', 'rin', 'organism_part', 'cell_line', 
+        'cell_type', 'secondary_organism', 'strain', 'animal_source', 'seed_source', 
+        'source_accession', 'mix'
+    ]
+    
+    gene_count_fields = ['gene_detected_gt10', 'gene_total', 'gene_detected_gt10_pct']
+    
+    fastqc_raw_fields = [f for f in fieldnames if f.startswith('raw_')]
+    fastqc_trimmed_fields = [f for f in fieldnames if f.startswith('trimmed_')]
+    
+    # For single-end data, exclude _r fields from FastQC sections
+    if not paired_end:
+        fastqc_raw_fields = [f for f in fastqc_raw_fields if not f.endswith('_r')]
+        fastqc_trimmed_fields = [f for f in fastqc_trimmed_fields if not f.endswith('_r')]
+    
+    star_fields = [
+        'uniquely_mapped_percent', 'multimapped_percent', 'multimapped_toomany_percent',
+        'unmapped_tooshort_percent', 'unmapped_other_percent'
+    ]
+    
+    bowtie2_fields = [
+        'total_reads', 'overall_alignment_rate', 'aligned_none', 'aligned_one', 'aligned_multi'
+    ]
+    
+    featurecounts_fields = [
+        'total_count', 'num_assigned', 'pct_assigned', 'num_unassigned_nofeatures',
+        'num_unassigned_ambiguity', 'pct_unassigned_nofeatures', 'pct_unassigned_ambiguity'
+    ]
+    
+    rseqc_fields = [
+        'mean_genebody_cov_5_20', 'mean_genebody_cov_40_60', 'mean_genebody_cov_80_95',
+        'ratio_genebody_cov_3_to_5', 'pct_sense', 'pct_antisense', 'pct_undetermined',
+        'cds_exons_pct', '5_utr_exons_pct', '3_utr_exons_pct', 'introns_pct', 
+        'tss_up_1kb_pct', 'tss_up_1kb_5kb_pct', 'tss_up_5kb_10kb_pct', 'tes_down_1kb_pct', 
+        'tss_down_1kb_5kb_pct', 'tss_down_5kb_10kb_pct', 'other_intergenic_pct'
+    ]
+    
+    # Add inner distance fields only for paired-end data
+    if paired_end:
+        rseqc_fields.extend(['peak_inner_dist', 'peak_inner_dist_pct_reads'])
+    
+    rsem_fields = [
+        'num_uniquely_aligned', 'pct_uniquely_aligned', 'pct_multi_aligned',
+        'pct_filtered', 'pct_unalignable'
+    ]
+    
+    def get_missing_fields(field_list):
+        return [f for f in field_list if f in fieldnames and f not in populated_fields]
+    
+    report_filename = f'qc_validation{assay_suffix}.txt'
+    with open(report_filename, 'w') as f:
+        # Header
+        f.write("QC metrics validation report\n\n")
+        mode_display = "Microbes" if mode == 'microbes' else "Default"
+        data_type = "Paired end" if paired_end else "Single end"
+        f.write(f"Mode: {mode_display}\n")
+        f.write(f"Data type: {data_type}\n\n")
+        f.write("-" * 40 + "\n\n")
+        
+        # Calculate summary stats
+        missing_metadata_count = len([f for f in metadata_fields if f in fieldnames and f not in populated_fields])
+        total_metadata_count = len([f for f in metadata_fields if f in fieldnames])
+        
+        # Count expected MultiQC fields based on mode (excluding metadata)
+        expected_multiqc_fields = gene_count_fields + fastqc_raw_fields + fastqc_trimmed_fields + rseqc_fields
+        if mode == 'microbes':
+            expected_multiqc_fields += bowtie2_fields + featurecounts_fields
+        else:
+            expected_multiqc_fields += star_fields + rsem_fields
+        
+        missing_multiqc_count = len([f for f in expected_multiqc_fields if f in fieldnames and f not in populated_fields])
+        total_multiqc_count = len([f for f in expected_multiqc_fields if f in fieldnames])
+        
+        # Summary section
+        f.write("Summary:\n")
+        f.write(f"* {missing_metadata_count}/{total_metadata_count} Metadata fields are empty\n")
+        f.write(f"* {missing_multiqc_count}/{total_multiqc_count} expected MultiQC fields are empty\n\n")
+        
+        f.write("Categories missing entries:\n\n")
+        
+        # Metadata
+        missing_metadata = get_missing_fields(metadata_fields)
+        if missing_metadata:
+            f.write("* Metadata:\n")
+            for field in missing_metadata:
+                f.write(f"** {field}\n")
+            f.write("\n")
+        
+        # Gene count
+        missing_gene_count = get_missing_fields(gene_count_fields)
+        if missing_gene_count:
+            f.write("* Gene Count:\n")
+            for field in missing_gene_count:
+                f.write(f"** {field}\n")
+            f.write("\n")
+        
+        # FastQC Raw
+        missing_fastqc_raw = get_missing_fields(fastqc_raw_fields)
+        if missing_fastqc_raw:
+            f.write("* FastQC Raw:\n")
+            for field in missing_fastqc_raw:
+                f.write(f"** {field}\n")
+            f.write("\n")
+        
+        # FastQC Trimmed
+        missing_fastqc_trimmed = get_missing_fields(fastqc_trimmed_fields)
+        if missing_fastqc_trimmed:
+            f.write("* FastQC Trimmed:\n")
+            for field in missing_fastqc_trimmed:
+                f.write(f"** {field}\n")
+            f.write("\n")
+        
+        # Mode-specific sections
+        if mode == 'microbes':
+            # For microbes mode, check Bowtie2 and FeatureCounts (STAR/RSEM expected to be empty)
+            missing_bowtie2 = get_missing_fields(bowtie2_fields)
+            if missing_bowtie2:
+                f.write("* Bowtie2 (Prokaryotes):\n")
+                for field in missing_bowtie2:
+                    f.write(f"** {field}\n")
+                f.write("\n")
+            
+            missing_featurecounts = get_missing_fields(featurecounts_fields)
+            if missing_featurecounts:
+                f.write("* FeatureCounts (Prokaryotes):\n")
+                for field in missing_featurecounts:
+                    f.write(f"** {field}\n")
+                f.write("\n")
+        else:
+            # For eukaryotic mode, check STAR and RSEM (Bowtie2/FeatureCounts expected to be empty)
+            missing_star = get_missing_fields(star_fields)
+            if missing_star:
+                f.write("* STAR (Eukaryotes):\n")
+                for field in missing_star:
+                    f.write(f"** {field}\n")
+                f.write("\n")
+            
+            missing_rsem = get_missing_fields(rsem_fields)
+            if missing_rsem:
+                f.write("* RSEM (Eukaryotes):\n")
+                for field in missing_rsem:
+                    f.write(f"** {field}\n")
+                f.write("\n")
+        
+        # RSeQC (relevant for both modes)
+        missing_rseqc = get_missing_fields(rseqc_fields)
+        if missing_rseqc:
+            f.write("* RSeQC:\n")
+            for field in missing_rseqc:
+                f.write(f"** {field}\n")
+            f.write("\n")
+
+
+def main(osd_num, paired_end, assay_suffix, mode, runsheet=None):
 
     osd_num = osd_num.split('-')[1]
 
@@ -47,6 +220,15 @@ def main(osd_num, paired_end, assay_suffix, mode):
 
     samples = set([s for ss in multiqc_data for s in ss])
 
+    # Order samples according to runsheet if provided
+    runsheet_order = get_runsheet_order(runsheet)
+    if runsheet_order:
+        ordered_samples = [s for s in runsheet_order if s in samples]
+        extra_samples = [s for s in samples if s not in runsheet_order]
+        samples = ordered_samples + extra_samples
+    else:
+        samples = sorted(samples)  # Fallback to alphabetical
+
     metadata = get_metadata(osd_num)
 
     fieldnames = [
@@ -90,6 +272,10 @@ def main(osd_num, paired_end, assay_suffix, mode):
     fieldnames_set = set(fieldnames)
 
     output_filename = f'qc_metrics{assay_suffix}.csv'
+    
+    # Track which fields have data for validation report
+    populated_fields = set()
+    
     with open(output_filename, mode='w', newline='') as f:
         writer = csv.DictWriter(f, fieldnames=fieldnames)
         writer.writeheader()
@@ -103,13 +289,30 @@ def main(osd_num, paired_end, assay_suffix, mode):
                         # Only keep fields that are in the fieldnames list
                         if k in fieldnames_set:
                             all_fields[k] = v
+                            if v is not None and v != '':  # Track populated fields
+                                populated_fields.add(k)
                         else:
                             # Optionally add debug output to see which fields are being skipped
                             # print(f"Skipping field not in fieldnames: {k}")
                             pass
 
-            # Write the row with filtered fields
+            # Track fields that are always populated
+            populated_fields.add('osd_num')
+            populated_fields.add('sample')
+            
+            # Track populated metadata fields
+            for k, v in metadata.items():
+                if v is not None and v != '':
+                    populated_fields.add(k)
+            
+            # Write rows with osd_num and sample fields
             writer.writerow({'osd_num': 'OSD-' + osd_num, 'sample': sample, **metadata, **all_fields})
+    
+    # Generate validation report
+    try:
+        generate_validation_report(fieldnames, populated_fields, mode, assay_suffix, paired_end)
+    except Exception as e:
+        print(f"WARNING: Failed to generate validation report: {str(e)}")
 
 
 def get_metadata(osd_num):
@@ -187,9 +390,20 @@ def parse_fastqc(prefix, assay_suffix):
         with open(f'{prefix}_multiqc{assay_suffix}_data/multiqc_data.json') as f:
             j = json.loads(f.read())
 
+        # Find FastQC section by looking for FastQC-specific fields
+        fastqc_section = None
+        for section in j['report_general_stats_data']:
+            sample_data = next(iter(section.values()), {})
+            if 'total_sequences' in sample_data and 'percent_gc' in sample_data:
+                fastqc_section = section
+                break
+
+        if not fastqc_section:
+            return {}
+
         # Group the samples by base name for paired end data
         sample_groups = {}
-        for sample in j['report_general_stats_data'][-1].keys():
+        for sample in fastqc_section.keys():
             # Handle various naming patterns
             if ' Read 1' in sample:
                 base_name = sample.replace(' Read 1', '')
@@ -225,13 +439,13 @@ def parse_fastqc(prefix, assay_suffix):
 
             # Process forward read
             if reads['f']:
-                for k, v in j['report_general_stats_data'][-1][reads['f']].items():
+                for k, v in fastqc_section[reads['f']].items(): 
                     if k != 'percent_fails':
                         data[base_name][prefix + '_' + k + '_f'] = v
 
             # Process reverse read
             if reads['r']:
-                for k, v in j['report_general_stats_data'][-1][reads['r']].items():
+                for k, v in fastqc_section[reads['r']].items(): 
                     if k != 'percent_fails':
                         data[base_name][prefix + '_' + k + '_r'] = v
 
@@ -539,6 +753,7 @@ def get_genecount(assay_suffix, mode):
     parser.add_argument('--paired', action='store_true')
     parser.add_argument('--assay_suffix', default='_GLbulkRNAseq')
     parser.add_argument('--mode', default='default')
+    parser.add_argument('--runsheet', default=None)
     args = parser.parse_args()
 
-    main(args.osd_num, args.paired, args.assay_suffix, args.mode)
+    main(args.osd_num, args.paired, args.assay_suffix, args.mode, args.runsheet)