This pipeline tries to find putative protein functions for unknown phage proteins using a combination of sequence- and structure-based inference methods.
MMseqs2 clustering and InterProScan are used for sequence-based inference. ColabFold and Foldseek are used for structure-based inference.
The pipeline is build using Nextflow and Python. All scripts can found in bin/
.
The pipeline can be run using the two Docker images provided in dockerfiles
.

- Input proteins are filtered based on their annotation. Proteins annotated with hypothetical, putative or postulated are labelled as unknown. The rest as known.
- Proteins are clusters twice using
MMseqs2
. First to reduce redundance, and second to group related proteins. The cluster representatives are used for further analysis.- Unknown proteins that are clustered together with known proteins are assigned the function of the known protein.
- Unknown proteins get assigned sequence domains using
InterProScan
. - All sequence representatives are input to
ColabFold
for structure prediction. - Protein structures from Unknown proteins are aligned against a structure database using
Foldseek
. - Optionally use structures of known proteins to align against a reference database of known protein structures to verify accuracy of predicted structures.
- Generate a JSON-report of findings for each input phage.
- Re-annotate GFF files with putative functions and domains.
The standard command to run the pipeline is as follows:
nextflow run main.nf \
--foldseekdb=databaseDirectory/db/db_name \
--colabdb=databaseDirectory/colabfolddb \
--email=valid-E-mail-adress
To run without the colabfold structure database and with the Foldseek database downloaded like instructed below:
nextflow run main.nf \
--foldseekdb=databases/af_prot/db \
--wsl \
--email=valid-E-mail-adress
For further information run:
nextflow run main.nf --help
Results are stored in the results
directory. Files for each major step are stored in sepereate sub-folders.
Docker images can be installed with:
docker build -t phagomics dockerfiles/docker_phage_image
and
docker build -t colabfold dockerfiles/docker_colabfold
The Foldseek database (Alphafold/Proteome) can be downloaded via the docker image using the following commands:
mkdir -p databases/af_prot
docker run --rm \
-v ./databases/af_prot:/data \
phagomics \
foldseek databases Alphafold/Proteome /data/db /data/tmp
For information on how to install the colabfold search database (~2TB!) used for structure prediction, visit: https://colabfold.mmseqs.com/
bin/analysisScripts/analysis.py
, bin/analysisScripts/visualizationScripts.py
and bin/analysisScripts/getPdbs.py
are the main scipts and can be used with default pipeline execution (fingers crossed).
analysis.py
calculates statistics about the results. Also must be run before getPdbs.py
.
getPdbs.py
gets the .pdb files (protein structure files) from the colabfold subfolder, for structures that had a putative function assigned through Foldseek
. These can then be clustered using Foldseek to identify similar predicted structures.
For more details and an example workflow see bin/analysisScripts/analysis_workflow.md
visualizationScripts.py
creates three figures:
- Barplot showing the ratio of known to unknown proteins per phage.
- Barplot showing the cluster sizes and makeup.
- Scatterplot showing cluster heterogeneity.