This Colab-based tool automates the process of identifying the phylogenetic distribution of one or more genes across a user-defined set of genomes. It generates a presence/absence matrix and visualizes it alongside a taxonomic tree as a heatmap.
- Accepts a set of protein sequences
- Accepts a list of target organisms (or falls back to a short default list)
- Automatically retrieves proteomes (.faa files) from NCBI RefSeq
- Runs BLASTP searches to detect gene presence
- Builds a presence/absence matrix
- Constructs a taxonomic tree from NCBI
- Plots a heatmap aligned to the tree
All of this runs entirely in Google Colab; no local installation is required.
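As a rough illustration of the presence/absence step, the sketch below turns a list of BLASTP hits into a 0/1 matrix. The function name `build_presence_matrix`, the `(gene, organism, e-value)` hit format, and the `1e-5` e-value cutoff are assumptions made for this example; the notebook may use different criteria (for instance identity or coverage thresholds).

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical hit record: (query gene, organism, best e-value for that pair)
Hit = Tuple[str, str, float]


def build_presence_matrix(
    hits: Iterable[Hit],
    genes: List[str],
    organisms: List[str],
    evalue_cutoff: float = 1e-5,  # assumed threshold, not taken from the notebook
) -> Dict[str, Dict[str, int]]:
    """Return {organism: {gene: 0 or 1}} from a collection of BLASTP hits."""
    # Start with every gene marked absent in every organism.
    matrix = {org: {gene: 0 for gene in genes} for org in organisms}
    for gene, org, evalue in hits:
        # Mark a gene present when at least one hit passes the cutoff.
        if org in matrix and gene in matrix[org] and evalue <= evalue_cutoff:
            matrix[org][gene] = 1
    return matrix


example_hits = [
    ("geneA", "Homo sapiens", 1e-50),
    ("geneA", "Escherichia coli", 0.5),    # too weak, counted as absent
    ("geneB", "Escherichia coli", 1e-20),
]
example_matrix = build_presence_matrix(
    example_hits, ["geneA", "geneB"], ["Homo sapiens", "Escherichia coli"]
)
```

Rows are organisms and columns are genes here, which is a convenient orientation for aligning the matrix to a tree; the actual layout of the notebook's output may differ.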
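Before running the notebook, upload the input files to the Colab environment: a query gene file (`query_genes.txt`) and, optionally, an organism list (`organism_list.txt`).
The query gene file can be provided in either of the following formats: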
A standard FASTA file containing one or more amino acid sequences. Each entry should have:
- A header line beginning with `>gene_name`
- The sequence on one or more lines
Alternatively, a plain text file with one UniProt ID per line. The notebook will automatically download the corresponding protein sequences.
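The two accepted query formats can be told apart by looking at the first non-empty line. The sketch below is one possible way to do that, not the notebook's actual code: `parse_query_text` and `uniprot_fasta_url` are hypothetical helper names, and the URL pattern assumes the public UniProt REST endpoint for per-entry FASTA downloads.

```python
from typing import Dict, List, Tuple, Union


def parse_query_text(text: str) -> Tuple[str, Union[Dict[str, str], List[str]]]:
    """Classify query file content as FASTA or a list of UniProt IDs."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        raise ValueError("query file is empty")

    if lines[0].startswith(">"):
        # FASTA: collect sequence lines under the most recent header.
        sequences: Dict[str, str] = {}
        name = ""
        for ln in lines:
            if ln.startswith(">"):
                parts = ln[1:].split()
                if not parts:
                    raise ValueError("FASTA header without a gene name")
                name = parts[0]  # first token after '>' is used as the gene name
                sequences[name] = ""
            else:
                sequences[name] += ln
        return "fasta", sequences

    # Otherwise treat each line as one UniProt ID.
    return "uniprot_ids", lines


def uniprot_fasta_url(accession: str) -> str:
    """Build a download URL for one entry (assumed UniProt REST pattern)."""
    return f"https://rest.uniprot.org/uniprotkb/{accession}.fasta"
```

For example, `parse_query_text(">geneA\nMKT\nAAL\n")` yields the `"fasta"` label with the two sequence lines joined, while a file of bare identifiers yields `"uniprot_ids"` and the list of lines.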
The organism list (`organism_list.txt`) is a plain text file with one organism name per line. Each name should be a species name recognized by NCBI.
Don't have a list?
If no file is uploaded, the notebook will fall back to a default list of representative organisms, which includes a few common model species across different branches of life.
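The fallback behaviour can be pictured with a small loader like the one below. This is a sketch under assumptions: `load_organisms` is a hypothetical helper, and `PLACEHOLDER_DEFAULTS` is an illustrative list of model species, not the notebook's actual default list.

```python
from pathlib import Path
from typing import List, Sequence

# Illustrative placeholders only; the notebook ships its own default list.
PLACEHOLDER_DEFAULTS = [
    "Homo sapiens",
    "Saccharomyces cerevisiae",
    "Arabidopsis thaliana",
    "Escherichia coli",
]


def load_organisms(
    path: str = "organism_list.txt",
    defaults: Sequence[str] = PLACEHOLDER_DEFAULTS,
) -> List[str]:
    """Read one organism name per line, or fall back to the defaults."""
    file_path = Path(path)
    if not file_path.is_file():
        # No upload found: use the default organisms.
        return list(defaults)
    names = [ln.strip() for ln in file_path.read_text(encoding="utf-8").splitlines()]
    names = [n for n in names if n]  # drop blank lines
    # An empty file is treated the same as a missing one.
    return names or list(defaults)
```

Treating an empty file like a missing one is a design choice made for this sketch; it avoids running the pipeline against zero organisms.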
Open the notebook in Colab
Steps:
- Upload `query_genes.txt` and `organism_list.txt`
- Run the notebook cells
- Collect outputs:
  - CSV matrix: `profile.csv`
  - Heatmap figure: `tree_and_heatmap.png`
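Once `profile.csv` has been downloaded, it can be summarized with a few lines of Python. The sketch below assumes a layout that is not confirmed by this README: a header row of gene names, one row per organism with the organism name in the first column, and `1`/`0` cells for presence/absence. `count_genes_present` is a hypothetical helper; adjust it if the real file is oriented differently.

```python
import csv
from typing import Dict


def count_genes_present(path: str = "profile.csv") -> Dict[str, int]:
    """Count how many query genes are marked present for each organism.

    Assumes rows are organisms and columns are genes (unverified layout).
    """
    counts: Dict[str, int] = {}
    with open(path, newline="", encoding="utf-8") as fh:
        reader = csv.reader(fh)
        next(reader, None)  # skip the header row of gene names
        for row in reader:
            if not row:
                continue  # ignore blank lines
            organism, cells = row[0], row[1:]
            counts[organism] = sum(1 for c in cells if c.strip() in {"1", "1.0"})
    return counts
```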