High-Level Diagrams of SomaticSeq's codebase #144


Open: wants to merge 1 commit into base `master`
145 changes: 145 additions & 0 deletions .codeboarding/External_Tool_Execution.md
@@ -0,0 +1,145 @@
```mermaid

graph LR

External_Tool_Orchestration["External Tool Orchestration"]

Alignment_Tool_Execution_Modules["Alignment Tool Execution Modules"]

Somatic_Caller_Execution_Modules["Somatic Caller Execution Modules"]

Container_Configuration["Container Configuration"]

External_Tool_Orchestration -- "orchestrates" --> Alignment_Tool_Execution_Modules

External_Tool_Orchestration -- "orchestrates" --> Somatic_Caller_Execution_Modules

External_Tool_Orchestration -- "uses" --> Container_Configuration

Alignment_Tool_Execution_Modules -- "uses" --> Container_Configuration

Somatic_Caller_Execution_Modules -- "uses" --> Container_Configuration

```



[![CodeBoarding](https://img.shields.io/badge/Generated%20by-CodeBoarding-9cf?style=flat-square)](https://github.com/CodeBoarding/GeneratedOnBoardings)[![Demo](https://img.shields.io/badge/Try%20our-Demo-blue?style=flat-square)](https://www.codeboarding.org/demo)[![Contact](https://img.shields.io/badge/Contact%20us%20-%20contact@codeboarding.org-lightgrey?style=flat-square)](mailto:contact@codeboarding.org)



## Details



This subsystem generates and runs scripts that execute external bioinformatics tools inside containers (Docker or Singularity). Its goal is to produce the initial raw alignment (BAM) and variant (VCF) files by orchestrating a series of tool-specific executions.



### External Tool Orchestration

This component is the central orchestrator of the External Tool Execution subsystem. It generates and manages the execution scripts for both the alignment and the somatic variant calling pipelines, and it starts and oversees the workflows that run external bioinformatics tools in containers to produce the raw alignment and variant files.





**Related Classes/Methods**:



- <a href="https://github.com/bioinform/somaticseq/somaticseq/utilities/dockered_pipelines/makeAlignmentScripts.py#L1-L1" target="_blank" rel="noopener noreferrer">`somaticseq/utilities/dockered_pipelines/makeAlignmentScripts.py` (1:1)</a>

- <a href="https://github.com/bioinform/somaticseq/somaticseq/utilities/dockered_pipelines/makeSomaticScripts.py#L1-L1" target="_blank" rel="noopener noreferrer">`somaticseq/utilities/dockered_pipelines/makeSomaticScripts.py` (1:1)</a>

- <a href="https://github.com/bioinform/somaticseq/somaticseq/utilities/dockered_pipelines/run_workflows.py#L1-L1" target="_blank" rel="noopener noreferrer">`somaticseq/utilities/dockered_pipelines/run_workflows.py` (1:1)</a>





### Alignment Tool Execution Modules

This component is a collection of modules, each of which encapsulates the logic and commands needed to run one external tool for an alignment-related task (e.g., BWA for alignment, Picard for duplicate marking, merging BAMs or FASTQs, trimming). The External Tool Orchestration component invokes these modules to perform the alignment steps of the pipeline.





**Related Classes/Methods**:



- <a href="https://github.com/bioinform/somaticseq/somaticseq/utilities/dockered_pipelines/alignments/align.py#L1-L1" target="_blank" rel="noopener noreferrer">`somaticseq/utilities/dockered_pipelines/alignments/align.py` (1:1)</a>

- <a href="https://github.com/bioinform/somaticseq/somaticseq/utilities/dockered_pipelines/alignments/markdup.py#L1-L1" target="_blank" rel="noopener noreferrer">`somaticseq/utilities/dockered_pipelines/alignments/markdup.py` (1:1)</a>

- <a href="https://github.com/bioinform/somaticseq/somaticseq/utilities/dockered_pipelines/alignments/mergeBams.py#L1-L1" target="_blank" rel="noopener noreferrer">`somaticseq/utilities/dockered_pipelines/alignments/mergeBams.py` (1:1)</a>

- <a href="https://github.com/bioinform/somaticseq/somaticseq/utilities/dockered_pipelines/alignments/mergeFastqs.py#L1-L1" target="_blank" rel="noopener noreferrer">`somaticseq/utilities/dockered_pipelines/alignments/mergeFastqs.py` (1:1)</a>

- <a href="https://github.com/bioinform/somaticseq/somaticseq/utilities/dockered_pipelines/alignments/spreadFastq.py#L1-L1" target="_blank" rel="noopener noreferrer">`somaticseq/utilities/dockered_pipelines/alignments/spreadFastq.py` (1:1)</a>

- <a href="https://github.com/bioinform/somaticseq/somaticseq/utilities/dockered_pipelines/alignments/trim.py#L1-L1" target="_blank" rel="noopener noreferrer">`somaticseq/utilities/dockered_pipelines/alignments/trim.py` (1:1)</a>
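
As a rough illustration of what one of these modules might do, the sketch below builds the text of a shell script that runs `bwa mem` and pipes the result into `samtools sort`. The function name `build_align_script`, its parameters, and the script layout are assumptions made for illustration; they are not taken from `align.py`.

```python
import shlex


def build_align_script(ref_fasta, fastq_1, fastq_2, out_bam, threads=1):
    """Return shell-script text that aligns paired-end reads.

    Hypothetical helper: the real module may structure its scripts differently.
    """
    # bwa mem writes SAM to stdout; samtools sort reads it from stdin ("-").
    bwa = " ".join([
        "bwa", "mem", "-t", str(threads),
        shlex.quote(ref_fasta), shlex.quote(fastq_1), shlex.quote(fastq_2),
    ])
    sort = " ".join(["samtools", "sort", "-o", shlex.quote(out_bam), "-"])
    return "\n".join([
        "#!/bin/bash",
        "set -euo pipefail",  # stop the script on the first failing command
        f"{bwa} | {sort}",
        "",
    ])


if __name__ == "__main__":
    print(build_align_script("ref.fa", "r1.fq", "r2.fq", "out.bam", threads=4))
```

Returning the script as text, rather than running the tools directly, mirrors the description above: the orchestrator can write the script to disk and decide separately when and where to execute it.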





### Somatic Caller Execution Modules

Like the alignment modules, this component consists of separate modules, each dedicated to running one external somatic variant caller (e.g., MuTect2, VarDict, Strelka2, SomaticSniper). Each module contains the commands and configuration needed to run its caller in a container and produce raw variant call files. The External Tool Orchestration component orchestrates these modules.





**Related Classes/Methods**:



- <a href="https://github.com/bioinform/somaticseq/somaticseq/utilities/dockered_pipelines/somatic_mutations/JointSNVMix2.py#L1-L1" target="_blank" rel="noopener noreferrer">`somaticseq/utilities/dockered_pipelines/somatic_mutations/JointSNVMix2.py` (1:1)</a>

- <a href="https://github.com/bioinform/somaticseq/somaticseq/utilities/dockered_pipelines/somatic_mutations/LoFreq.py#L1-L1" target="_blank" rel="noopener noreferrer">`somaticseq/utilities/dockered_pipelines/somatic_mutations/LoFreq.py` (1:1)</a>

- <a href="https://github.com/bioinform/somaticseq/somaticseq/utilities/dockered_pipelines/somatic_mutations/MuSE.py#L1-L1" target="_blank" rel="noopener noreferrer">`somaticseq/utilities/dockered_pipelines/somatic_mutations/MuSE.py` (1:1)</a>

- <a href="https://github.com/bioinform/somaticseq/somaticseq/utilities/dockered_pipelines/somatic_mutations/MuTect2.py#L1-L1" target="_blank" rel="noopener noreferrer">`somaticseq/utilities/dockered_pipelines/somatic_mutations/MuTect2.py` (1:1)</a>

- <a href="https://github.com/bioinform/somaticseq/somaticseq/utilities/dockered_pipelines/somatic_mutations/Scalpel.py#L1-L1" target="_blank" rel="noopener noreferrer">`somaticseq/utilities/dockered_pipelines/somatic_mutations/Scalpel.py` (1:1)</a>

- <a href="https://github.com/bioinform/somaticseq/somaticseq/utilities/dockered_pipelines/somatic_mutations/SomaticSniper.py#L1-L1" target="_blank" rel="noopener noreferrer">`somaticseq/utilities/dockered_pipelines/somatic_mutations/SomaticSniper.py` (1:1)</a>

- <a href="https://github.com/bioinform/somaticseq/somaticseq/utilities/dockered_pipelines/somatic_mutations/Strelka2.py#L1-L1" target="_blank" rel="noopener noreferrer">`somaticseq/utilities/dockered_pipelines/somatic_mutations/Strelka2.py` (1:1)</a>

- <a href="https://github.com/bioinform/somaticseq/somaticseq/utilities/dockered_pipelines/somatic_mutations/VarDict.py#L1-L1" target="_blank" rel="noopener noreferrer">`somaticseq/utilities/dockered_pipelines/somatic_mutations/VarDict.py` (1:1)</a>

- <a href="https://github.com/bioinform/somaticseq/somaticseq/utilities/dockered_pipelines/somatic_mutations/VarScan2.py#L1-L1" target="_blank" rel="noopener noreferrer">`somaticseq/utilities/dockered_pipelines/somatic_mutations/VarScan2.py` (1:1)</a>
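
One way to picture "one module per caller, selected by the orchestrator" is a small registry that maps a caller name to a function describing its job. This is a sketch under assumptions: `CALLER_BUILDERS`, `register`, `plan_jobs`, and the output file names are hypothetical and do not reflect how `makeSomaticScripts.py` is written.

```python
from typing import Callable, Dict, List

# Hypothetical registry: caller name -> function that describes one job.
CALLER_BUILDERS: Dict[str, Callable[[str, str, str], dict]] = {}


def register(name: str):
    """Decorator that adds a job-building function to the registry."""
    def decorator(func):
        CALLER_BUILDERS[name] = func
        return func
    return decorator


@register("mutect2")
def build_mutect2(tumor_bam: str, normal_bam: str, outdir: str) -> dict:
    return {"caller": "mutect2", "inputs": [tumor_bam, normal_bam],
            "output": f"{outdir}/MuTect2.vcf"}


@register("vardict")
def build_vardict(tumor_bam: str, normal_bam: str, outdir: str) -> dict:
    return {"caller": "vardict", "inputs": [tumor_bam, normal_bam],
            "output": f"{outdir}/VarDict.vcf"}


def plan_jobs(callers: List[str], tumor_bam: str, normal_bam: str,
              outdir: str) -> List[dict]:
    """Build one job description per requested caller, in the order given."""
    unknown = [c for c in callers if c not in CALLER_BUILDERS]
    if unknown:
        raise ValueError(f"unknown callers: {unknown}")
    return [CALLER_BUILDERS[c](tumor_bam, normal_bam, outdir) for c in callers]


if __name__ == "__main__":
    for job in plan_jobs(["mutect2", "vardict"], "tumor.bam", "normal.bam", "out"):
        print(job)
```

With this layout, supporting another caller means adding one decorated function; the planning code does not change.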





### Container Configuration

This utility component provides shared functions and options for managing the container environments (Docker or Singularity) in which the external tools run. It keeps container usage consistent across the tool execution modules and hides the details of container setup and execution parameters.





**Related Classes/Methods**:



- <a href="https://github.com/bioinform/somaticseq/somaticseq/utilities/dockered_pipelines/container_option.py#L1-L1" target="_blank" rel="noopener noreferrer">`somaticseq/utilities/dockered_pipelines/container_option.py` (1:1)</a>
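
The idea of hiding container details behind one helper can be sketched as follows. The function `container_command`, its parameters, and the image name `example/tool:1.0` are hypothetical; only the general shapes of `docker run` (with `-v` mounts) and `singularity exec` (with `--bind` mounts and a `docker://` image URI) are relied on, and `container_option.py` may build its commands quite differently.

```python
import shlex
from typing import Dict, List


def container_command(tech: str, image: str, mounts: Dict[str, str],
                      tool_command: List[str]) -> str:
    """Wrap a tool command so that it runs inside a container.

    tech:   "docker" or "singularity"
    mounts: host path -> path inside the container
    """
    if tech == "docker":
        prefix = ["docker", "run", "--rm"]
        for host, inside in mounts.items():
            prefix += ["-v", f"{host}:{inside}"]
        prefix.append(image)
    elif tech == "singularity":
        prefix = ["singularity", "exec"]
        for host, inside in mounts.items():
            prefix += ["--bind", f"{host}:{inside}"]
        # Singularity can run an image pulled from a Docker registry.
        prefix.append(f"docker://{image}")
    else:
        raise ValueError(f"unsupported container technology: {tech}")
    return " ".join(shlex.quote(part) for part in prefix + list(tool_command))


if __name__ == "__main__":
    for tech in ("docker", "singularity"):
        print(container_command(tech, "example/tool:1.0",
                                {"/data": "/mnt"}, ["tool", "--version"]))
```

Because every tool module would call the same helper, switching between Docker and Singularity becomes a single argument rather than a change in each module.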









### [FAQ](https://github.com/CodeBoarding/GeneratedOnBoardings/tree/main?tab=readme-ov-file#faq)
129 changes: 129 additions & 0 deletions .codeboarding/Machine_Learning_Output.md
@@ -0,0 +1,129 @@
```mermaid

graph LR

XGBoost_Model_Core["XGBoost Model Core"]

TSV_to_VCF_Converter["TSV to VCF Converter"]

Nucleotide_Change_Feature_Generator["Nucleotide Change Feature Generator"]

SomaticSeq_Pipeline_Orchestrator["SomaticSeq Pipeline Orchestrator"]

Genomic_File_Utilities["Genomic File Utilities"]

SomaticSeq_Pipeline_Orchestrator -- "orchestrates" --> XGBoost_Model_Core

SomaticSeq_Pipeline_Orchestrator -- "orchestrates" --> TSV_to_VCF_Converter

XGBoost_Model_Core -- "uses" --> Nucleotide_Change_Feature_Generator

TSV_to_VCF_Converter -- "uses" --> Genomic_File_Utilities

```



[![CodeBoarding](https://img.shields.io/badge/Generated%20by-CodeBoarding-9cf?style=flat-square)](https://github.com/CodeBoarding/GeneratedOnBoardings)[![Demo](https://img.shields.io/badge/Try%20our-Demo-blue?style=flat-square)](https://www.codeboarding.org/demo)[![Contact](https://img.shields.io/badge/Contact%20us%20-%20contact@codeboarding.org-lightgrey?style=flat-square)](mailto:contact@codeboarding.org)



## Details



This subsystem contains the core machine learning functionality of `somaticseq`: classifying somatic variants with an XGBoost model and converting the results to the standard VCF format. Its components cover the path from feature engineering to final output generation.



### XGBoost Model Core

This component encapsulates the machine learning logic: training and prediction with the XGBoost algorithm. It takes feature-rich TSV data as input and outputs classification results, including prediction scores and feature importance. This classification is the primary purpose of the subsystem.





**Related Classes/Methods**:



- <a href="https://github.com/bioinform/somaticseq/somaticseq/somatic_xgboost.py#L1-L1" target="_blank" rel="noopener noreferrer">`somaticseq/somatic_xgboost.py` (1:1)</a>
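
Before any model can be trained, the feature-rich TSV has to be split into a numeric feature matrix and a label vector. The sketch below shows that preparation step using only the standard library; it does not call XGBoost itself. The column names (`DP`, `MQ`, `TrueVariant`) and the function `load_features` are assumptions for illustration, not the columns or code used in `somatic_xgboost.py`.

```python
import csv
import io
import math
from typing import List, Tuple


def to_float(value: str) -> float:
    """Convert a TSV cell to float; non-numeric cells become NaN."""
    try:
        return float(value)
    except ValueError:
        return math.nan


def load_features(tsv_text: str, label_column: str,
                  feature_columns: List[str]) -> Tuple[List[List[float]], List[int]]:
    """Split TSV text into a feature matrix X and a label vector y."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    X, y = [], []
    for row in reader:
        X.append([to_float(row[name]) for name in feature_columns])
        y.append(int(row[label_column]))
    return X, y


if __name__ == "__main__":
    text = ("CHROM\tPOS\tDP\tMQ\tTrueVariant\n"
            "1\t100\t30\t60\t1\n"
            "1\t200\t.\t50\t0\n")
    X, y = load_features(text, "TrueVariant", ["DP", "MQ"])
    print(X, y)
```

Missing values are kept as NaN rather than dropped, which suits gradient-boosted tree libraries that can treat NaN as a missing value.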





### TSV to VCF Converter

This component transforms the classified TSV output of the XGBoost Model Core into the standard VCF format. It parses the TSV data, processes the variant information, and formats it into VCF-compliant fields, including quality scores and filter details, so that the pipeline's internal results are available in a widely used genomic format. It relies on the Genomic File Utilities component.





**Related Classes/Methods**:



- <a href="https://github.com/bioinform/somaticseq/somaticseq/somatic_tsv2vcf.py#L1-L1" target="_blank" rel="noopener noreferrer">`somaticseq/somatic_tsv2vcf.py` (1:1)</a>

- <a href="https://github.com/bioinform/somaticseq/somaticseq/tsv2vcf.py#L1-L1" target="_blank" rel="noopener noreferrer">`somaticseq/tsv2vcf.py` (1:1)</a>
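
To make the "quality scores and filter details" concrete, the sketch below turns one classified record into a VCF data line: the classifier score becomes a Phred-scaled QUAL value, and thresholds on the score choose the FILTER value. The thresholds (0.5 and 0.1), the filter labels, the `SCORE` INFO key, and the function names are assumptions for illustration and are not read from `somatic_tsv2vcf.py`.

```python
import math


def phred(prob_error: float, cap: float = 255.0) -> float:
    """Phred-scale an error probability: -10 * log10(p), capped at `cap`."""
    if prob_error <= 0:
        return cap
    return min(cap, -10.0 * math.log10(prob_error))


def to_vcf_line(chrom: str, pos: int, ref: str, alt: str, score: float,
                pass_threshold: float = 0.5,
                lowqual_threshold: float = 0.1) -> str:
    """Format one variant as a tab-separated VCF data line (8 fixed columns)."""
    if score >= pass_threshold:
        filt = "PASS"
    elif score >= lowqual_threshold:
        filt = "LowQual"
    else:
        filt = "REJECT"
    # The error probability is the chance the call is wrong: 1 - score.
    qual = phred(1.0 - score)
    info = f"SCORE={score:.4f}"  # hypothetical INFO key
    return "\t".join([chrom, str(pos), ".", ref, alt, f"{qual:.2f}", filt, info])


if __name__ == "__main__":
    print(to_vcf_line("1", 100, "A", "T", 0.99))
    print(to_vcf_line("1", 200, "G", "GA", 0.30))
```

A complete VCF would also need header lines that declare each FILTER and INFO key; the sketch covers only the per-record formatting.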





### Nucleotide Change Feature Generator

This component identifies and categorizes types of nucleotide changes (e.g., single nucleotide variants (SNVs), insertions, deletions). The categorization is a feature engineering step that supplies input features to the XGBoost Model Core for classifying somatic variants.





**Related Classes/Methods**:



- <a href="https://github.com/bioinform/somaticseq/somaticseq/ntchange_type.py#L1-L1" target="_blank" rel="noopener noreferrer">`somaticseq/ntchange_type.py` (1:1)</a>
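
A minimal sketch of this kind of categorization, based only on the lengths of the reference and alternate alleles, is shown below. The function name `classify_change` and the category labels are hypothetical; `ntchange_type.py` may encode changes in a different, more detailed way.

```python
def classify_change(ref: str, alt: str) -> str:
    """Categorize a variant by comparing REF and ALT allele lengths."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    if len(ref) < len(alt):
        return "insertion"
    if len(ref) > len(alt):
        return "deletion"
    # Same length but longer than one base: a multi-nucleotide change.
    return "MNV"


if __name__ == "__main__":
    for ref, alt in [("A", "T"), ("A", "ATG"), ("ATG", "A"), ("AT", "GC")]:
        print(ref, alt, classify_change(ref, alt))
```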





### SomaticSeq Pipeline Orchestrator

This component is the high-level coordinator of the entire `somaticseq` pipeline. Within the `Machine Learning & Output` subsystem, it runs the XGBoost Model Core for classification and then the TSV to VCF Converter for output formatting, which defines the workflow and enforces the execution order of these steps.





**Related Classes/Methods**:



- <a href="https://github.com/bioinform/somaticseq/somaticseq/run_somaticseq.py#L1-L1" target="_blank" rel="noopener noreferrer">`somaticseq/run_somaticseq.py` (1:1)</a>
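
The ordering guarantee described above can be sketched as a loop that feeds each step's output to the next step. The names `run_steps` and `Step`, and the stand-in step functions in the usage example, are hypothetical; `run_somaticseq.py` is not claimed to be organized this way.

```python
from typing import Callable, List, Tuple

# A step is a (name, function) pair; the function maps an input path
# to the path of the file it produced.
Step = Tuple[str, Callable[[str], str]]


def run_steps(first_input: str, steps: List[Step]) -> List[Tuple[str, str]]:
    """Run steps in order, passing each output to the next step.

    Returns (step name, output path) pairs. An exception raised by any
    step propagates, so later steps never run on a failed input.
    """
    history = []
    current = first_input
    for name, func in steps:
        current = func(current)
        history.append((name, current))
    return history


if __name__ == "__main__":
    # Stand-in functions; real steps would classify variants and write a VCF.
    steps = [
        ("classify", lambda path: path + ".classified"),
        ("convert", lambda path: path + ".vcf"),
    ]
    print(run_steps("ensemble.tsv", steps))
```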





### Genomic File Utilities

General utility functions for parsing and handling genomic file formats.





**Related Classes/Methods**: _None_







### [FAQ](https://github.com/CodeBoarding/GeneratedOnBoardings/tree/main?tab=readme-ov-file#faq)
107 changes: 107 additions & 0 deletions .codeboarding/Variant_Data_Processing.md
@@ -0,0 +1,107 @@
```mermaid

graph LR

Genomic_File_Parsing_Read_Information_Extraction["Genomic File Parsing & Read Information Extraction"]

Feature_Calculation_Annotation["Feature Calculation & Annotation"]

VCF_to_TSV_Transformation["VCF to TSV Transformation"]

Feature_Calculation_Annotation -- "uses" --> Genomic_File_Parsing_Read_Information_Extraction

VCF_to_TSV_Transformation -- "uses" --> Genomic_File_Parsing_Read_Information_Extraction

VCF_to_TSV_Transformation -- "uses" --> Feature_Calculation_Annotation

```



[![CodeBoarding](https://img.shields.io/badge/Generated%20by-CodeBoarding-9cf?style=flat-square)](https://github.com/CodeBoarding/GeneratedOnBoardings)[![Demo](https://img.shields.io/badge/Try%20our-Demo-blue?style=flat-square)](https://www.codeboarding.org/demo)[![Contact](https://img.shields.io/badge/Contact%20us%20-%20contact@codeboarding.org-lightgrey?style=flat-square)](mailto:contact@codeboarding.org)



## Details



The `Variant Data Processing` subsystem of `somaticseq` prepares genomic variant data for machine learning. It parses raw genomic files, extracts read-level information, calculates quantitative features, and transforms the data into a machine-learning-ready format.



### Genomic File Parsing & Read Information Extraction

This component is the entry point for genomic data. It parses genomic file formats (e.g., VCF, BAM, pileup) and extracts the read-level information needed for downstream feature calculation.





**Related Classes/Methods**:



- <a href="https://github.com/bioinform/somaticseq/somaticseq/genomic_file_parsers/genomic_file_handlers.py#L0-L0" target="_blank" rel="noopener noreferrer">`somaticseq.genomic_file_parsers.genomic_file_handlers` (0:0)</a>

- <a href="https://github.com/bioinform/somaticseq/somaticseq/genomic_file_parsers/read_info_extractor.py#L0-L0" target="_blank" rel="noopener noreferrer">`somaticseq.genomic_file_parsers.read_info_extractor` (0:0)</a>

- <a href="https://github.com/bioinform/somaticseq/somaticseq/genomic_file_parsers/pileup_reader.py#L0-L0" target="_blank" rel="noopener noreferrer">`somaticseq.genomic_file_parsers.pileup_reader` (0:0)</a>

- <a href="https://github.com/bioinform/somaticseq/somaticseq/genomic_file_parsers/pileup_reader.py#L163-L313" target="_blank" rel="noopener noreferrer">`somaticseq.genomic_file_parsers.pileup_reader:Base_calls` (163:313)</a>

- <a href="https://github.com/bioinform/somaticseq/somaticseq/genomic_file_parsers/pileup_reader.py#L13-L160" target="_blank" rel="noopener noreferrer">`somaticseq.genomic_file_parsers.pileup_reader:Pileup_line` (13:160)</a>
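
As an illustration of the parsing this component performs, the sketch below reads one line of samtools pileup output and counts reads that match the reference on each strand. It relies on the pileup conventions that `.` and `,` mark forward- and reverse-strand matches, `^` is followed by a mapping-quality character, `$` marks a read end, and `+n`/`-n` introduce indels of `n` bases. The names `PileupRecord`, `parse_pileup_line`, and `count_ref_matches` are hypothetical and are not the `Pileup_line` or `Base_calls` classes listed above.

```python
from typing import NamedTuple, Tuple


class PileupRecord(NamedTuple):
    chrom: str
    pos: int
    ref: str
    depth: int
    bases: str
    quals: str


def parse_pileup_line(line: str) -> PileupRecord:
    """Parse the first six tab-separated columns of a pileup line."""
    chrom, pos, ref, depth, bases, quals = line.rstrip("\n").split("\t")[:6]
    return PileupRecord(chrom, int(pos), ref, int(depth), bases, quals)


def count_ref_matches(bases: str) -> Tuple[int, int]:
    """Count (forward, reverse) reads that match the reference base."""
    forward = reverse = 0
    i = 0
    while i < len(bases):
        ch = bases[i]
        if ch == "^":
            i += 2  # skip the marker and the mapping-quality character
            continue
        if ch in "+-":
            # Indel: sign, decimal length, then that many inserted/deleted bases.
            j = i + 1
            while j < len(bases) and bases[j].isdigit():
                j += 1
            i = j + int(bases[i + 1:j])
            continue
        if ch == ".":
            forward += 1
        elif ch == ",":
            reverse += 1
        i += 1  # also steps over '$' and mismatch bases
    return forward, reverse


if __name__ == "__main__":
    record = parse_pileup_line("1\t100\tA\t7\t..,^F.,+2AG.$T\tIIIIIII")
    print(record.depth, count_ref_matches(record.bases))
```

Skipping the character after `^` matters because a mapping-quality character can itself be `.` or `,` and would otherwise be miscounted as a read.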





### Feature Calculation & Annotation

This component derives quantitative features from genomic data, including read-level metrics calculated from BAM alignment files and contextual information. It also annotates variants with these features, which serve as inputs to the machine learning models.





**Related Classes/Methods**:



- <a href="https://github.com/bioinform/somaticseq/somaticseq/bam_features.py#L0-L0" target="_blank" rel="noopener noreferrer">`somaticseq.bam_features` (0:0)</a>

- <a href="https://github.com/bioinform/somaticseq/somaticseq/sequencing_features.py#L0-L0" target="_blank" rel="noopener noreferrer">`somaticseq.sequencing_features` (0:0)</a>

- <a href="https://github.com/bioinform/somaticseq/somaticseq/annotate_caller.py#L0-L0" target="_blank" rel="noopener noreferrer">`somaticseq.annotate_caller` (0:0)</a>

- <a href="https://github.com/bioinform/somaticseq/somaticseq/ntchange_type.py#L0-L0" target="_blank" rel="noopener noreferrer">`somaticseq.ntchange_type` (0:0)</a>
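
A small example of read-level feature calculation is sketched below: given the mapping qualities of the reads supporting the reference and alternate alleles at a site, it derives the alternate allele fraction and the mean mapping quality per allele. The function name and the feature names (`alt_fraction`, `ref_mean_mapq`, `alt_mean_mapq`) are assumptions for illustration, not the features defined in `bam_features.py` or `sequencing_features.py`.

```python
import math
from typing import Dict, List


def mean_or_nan(values: List[float]) -> float:
    """Arithmetic mean, or NaN when there are no values."""
    return sum(values) / len(values) if values else math.nan


def read_level_features(ref_mapqs: List[int], alt_mapqs: List[int]) -> Dict[str, float]:
    """Derive simple per-site features from per-read mapping qualities."""
    n_ref, n_alt = len(ref_mapqs), len(alt_mapqs)
    total = n_ref + n_alt
    return {
        # Fraction of reads supporting the alternate allele.
        "alt_fraction": n_alt / total if total else math.nan,
        "ref_mean_mapq": mean_or_nan(ref_mapqs),
        "alt_mean_mapq": mean_or_nan(alt_mapqs),
    }


if __name__ == "__main__":
    print(read_level_features([60, 60, 40], [20]))
```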





### VCF to TSV Transformation

This component converts standard VCF (Variant Call Format) files into a custom tab-separated values (TSV) format. During the conversion it integrates the features calculated by the "Feature Calculation & Annotation" component, producing a dataset ready for model training or prediction.





**Related Classes/Methods**:



- <a href="https://github.com/bioinform/somaticseq/somaticseq/somatic_vcf2tsv.py#L0-L0" target="_blank" rel="noopener noreferrer">`somaticseq.somatic_vcf2tsv` (0:0)</a>

- <a href="https://github.com/bioinform/somaticseq/somaticseq/single_sample_vcf2tsv.py#L0-L0" target="_blank" rel="noopener noreferrer">`somaticseq.single_sample_vcf2tsv` (0:0)</a>
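
The conversion can be pictured as taking the identifying columns of a VCF data line and appending the calculated features in a fixed column order. The sketch below assumes a hypothetical function `vcf_line_to_tsv_row` and hypothetical feature names (`DP`, `VAF`, `MQ`); the actual TSV layout produced by `somatic_vcf2tsv.py` is not reproduced here.

```python
from typing import Dict, List


def vcf_line_to_tsv_row(vcf_line: str, features: Dict[str, float],
                        feature_order: List[str]) -> str:
    """Combine VCF identifying columns with feature values into one TSV row.

    Features missing from `features` are written as "nan" so that every
    row has the same number of columns.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    chrom, pos, variant_id, ref, alt = fields[:5]
    values = [str(features.get(name, float("nan"))) for name in feature_order]
    return "\t".join([chrom, pos, variant_id, ref, alt] + values)


if __name__ == "__main__":
    order = ["DP", "VAF", "MQ"]
    print("\t".join(["CHROM", "POS", "ID", "REF", "ALT"] + order))
    print(vcf_line_to_tsv_row("1\t100\t.\tA\tT\t.\tPASS\t.",
                              {"DP": 30, "VAF": 0.25}, order))
```

Fixing the column order in one list keeps training and prediction inputs aligned, since both would be produced with the same `feature_order`.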









### [FAQ](https://github.com/CodeBoarding/GeneratedOnBoardings/tree/main?tab=readme-ov-file#faq)