SIGNAL: A high-throughput pipeline for large-scale analysis of microbial signal transduction systems
🚧 This project is under active development. Expect frequent changes.
SIGNAL (Systematic Investigation of Genomic Networks for Analysis of Logic-based signaling) is a pipeline designed to perform high-throughput analysis of bacterial and archaeal genomes to uncover patterns in signal transduction systems across genomes, taxonomic groups, and functional architectures. The current dataset includes 26,221 representative genomes.
A standard computer with sufficient RAM for in-memory processing should be adequate.
- Python 3.6 or higher
- Ubuntu 20.04
- Linux Mint 20.2
Clone the repository using Git:
git clone https://github.com/ToshkaDev/signal-transduction.git
This downloads the repository, including the pipeline scripts, into a new signal-transduction directory.
To launch the pipeline, use the master script:
cd signal-transduction
./analyze.sh
The script first checks whether the initial long-running step has already completed by looking for files in results/obtain_and_process_st/. If files are present, it skips the completed steps; otherwise it runs the entire pipeline.
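The completion check described above can be sketched in Python. This is a minimal illustration, not the logic in analyze.sh itself: the function name `first_step_done` is hypothetical, and it assumes that "completed" means the directory exists and holds at least one regular file.

```python
from pathlib import Path


def first_step_done(results_dir="results/obtain_and_process_st"):
    """Return True if results_dir exists and contains at least one file.

    Hypothetical helper: the real script may apply a stricter test.
    """
    path = Path(results_dir)
    if not path.is_dir():
        return False
    # Any regular file counts as evidence that the step has run.
    return any(entry.is_file() for entry in path.iterdir())


if __name__ == "__main__":
    if first_step_done():
        print("Initial step already completed; skipping it.")
    else:
        print("No results found; running the entire pipeline.")
```

A check like this only confirms that some output exists. It cannot tell a finished run from an interrupted one, so a partially written results directory may need to be cleared by hand before restarting.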
- Unpacks archived input files
- Extracts genome lists (bacterial and archaeal)
- Assigns genome sources (MiST genomes or MiST MAGs databases)
- Creates all necessary input and output directories
Input files include:
- A dataset of 26,221 bacterial and archaeal genomes. You can also supply your own genome list, provided it follows the same format.
- Signal transduction domain definitions from the MiST (Microbial Signal Transduction) database
- A metadata file from the Genome Taxonomy Database (GTDB, release r214)
- Fetches signal transduction systems (two-component and one-component) from the MiST database using its API
- Analyzes protein domain compositions and architectures
- Outputs tabulated results listing:
  - Genomes
  - Histidine kinases (HKs), response regulators (RRs), and one-component systems (OCPs)
  - Their protein domain compositions and architectures
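To illustrate the difference between a domain composition and a domain architecture, here is a minimal sketch. The input format (a list of `(domain_name, start)` tuples), the function names, and the coordinates are assumptions made for this example; they are not the MiST API response format or the pipeline's own data model.

```python
from collections import Counter


def domain_architecture(hits):
    """Join domain names in N- to C-terminal order (sorted by start position)."""
    ordered = sorted(hits, key=lambda hit: hit[1])
    return "-".join(name for name, _ in ordered)


def domain_composition(hits):
    """Count how many times each domain occurs, ignoring order."""
    return Counter(name for name, _ in hits)


# Illustrative histidine kinase with made-up coordinates.
hk_hits = [("HATPase_c", 420), ("PAS", 35), ("HisKA", 300), ("PAS", 160)]

print(domain_architecture(hk_hits))  # PAS-PAS-HisKA-HATPase_c
print(domain_composition(hk_hits)["PAS"])  # 2
```

The architecture keeps the order of domains along the protein, while the composition only records which domains are present and how often.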
- Analyzes and reports domain composition statistics for HKs, RRs, and OCPs per genome
- Reports:
  - Number and type of input domains in HKs and OCPs
  - Additional domains in RRs
- Normalizes statistics by genome size and total number of encoded proteins
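The normalization step can be sketched as follows. The units chosen here (counts per megabase pair and per 1,000 encoded proteins), the function name `normalize_counts`, and the example numbers are assumptions for illustration; the pipeline may scale its values differently.

```python
def normalize_counts(domain_counts, genome_size_bp, protein_count):
    """Scale raw per-genome domain counts by genome size and proteome size.

    domain_counts: dict mapping domain name to raw count in one genome.
    Returns a dict mapping domain name to both normalized values.
    """
    if genome_size_bp <= 0 or protein_count <= 0:
        raise ValueError("genome size and protein count must be positive")
    size_mbp = genome_size_bp / 1e6
    return {
        domain: {
            "per_mbp": count / size_mbp,
            "per_1000_proteins": 1000 * count / protein_count,
        }
        for domain, count in domain_counts.items()
    }


# Made-up genome: 4 Mbp, 4,000 proteins.
example = normalize_counts(
    {"PAS": 12, "GAF": 3}, genome_size_bp=4000000, protein_count=4000
)
print(example["PAS"])
```

Normalizing both ways makes genomes of very different sizes comparable, since larger genomes tend to encode more signaling proteins in absolute terms.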
- Analyzes domain composition statistics for HKs, RRs, and OCPs at each taxonomic level:
  - Species
  - Genus
  - Family
  - Order
  - Class
  - Phylum
  - Kingdom
- Normalizes results by the number of genomes per taxonomic level
The GTDB taxonomy is used.
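Per-taxon normalization can be sketched as a mean count per genome within each taxon. The sketch assumes each genome comes with a GTDB-style lineage string, which uses the prefixes d__, p__, c__, o__, f__, g__, and s__ (the d__ rank corresponds to the top level listed above). The function names, the record format, and the example counts are hypothetical.

```python
from collections import defaultdict

# GTDB lineage prefixes, from the top rank down to species.
RANK_PREFIXES = {
    "domain": "d__", "phylum": "p__", "class": "c__", "order": "o__",
    "family": "f__", "genus": "g__", "species": "s__",
}


def taxon_at_rank(lineage, rank):
    """Extract the taxon name at the given rank from a GTDB lineage string."""
    prefix = RANK_PREFIXES[rank]
    for part in lineage.split(";"):
        part = part.strip()
        if part.startswith(prefix):
            return part[len(prefix):]
    return None


def mean_per_genome(records, rank):
    """Average a per-genome count over all genomes in each taxon.

    records: iterable of (lineage, count) pairs, one pair per genome.
    """
    totals = defaultdict(int)
    genomes = defaultdict(int)
    for lineage, count in records:
        taxon = taxon_at_rank(lineage, rank)
        if taxon is None:
            continue  # lineage lacks this rank; skip the genome
        totals[taxon] += count
        genomes[taxon] += 1
    return {taxon: totals[taxon] / genomes[taxon] for taxon in totals}


# Made-up HK counts for three genomes.
records = [
    ("d__Bacteria;p__Bacillota;c__Bacilli;o__Bacillales;f__Bacillaceae;g__Bacillus;s__Bacillus subtilis", 10),
    ("d__Bacteria;p__Bacillota;c__Bacilli;o__Bacillales;f__Bacillaceae;g__Bacillus;s__Bacillus cereus", 20),
    ("d__Archaea;p__Halobacteriota;c__Halobacteria;o__Halobacteriales;f__Haloferacaceae;g__Haloferax;s__Haloferax volcanii", 6),
]
print(mean_per_genome(records, "phylum"))
```

Dividing by the number of genomes keeps heavily sampled taxa from dominating the comparison simply because they contribute more genomes.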
- Visualization modules for domain architecture patterns