
protein-detective


Python package to detect proteins in EM density maps.

It uses

  • the UniProt SPARQL endpoint to search for proteins and their measured or predicted 3D structures.
  • powerfit to fit protein structures into an Electron Microscopy (EM) density map.

An example workflow:

graph LR;
    search{Search UniprotKB} --> |uniprot_accessions|fetchpdbe{Retrieve PDBe}
    search{Search UniprotKB} --> |uniprot_accessions|fetchad{Retrieve AlphaFold}
    fetchpdbe -->|mmcif_files| residuefilter{Filter on nr residues + write chain A}
    fetchad -->|pdb_files| densityfilter{Filter out low confidence}
    residuefilter -->|pdb_files| powerfit
    densityfilter -->|pdb_files| powerfit
    powerfit -->|*/solutions.out| solutions{Best scoring solutions}
    solutions -->|dataframe| fitmodels{Fit models}

Install

pip install protein-detective

Or to use the latest development version:

pip install git+https://github.com/haddocking/protein-detective.git

Usage

The main entry point is the protein-detective command line tool which has multiple subcommands to perform actions.

To use it programmatically, see the notebooks and the API documentation.

Search UniProt for structures

protein-detective search \
    --taxon-id 9606 \
    --reviewed \
    --subcellular-location-uniprot nucleus \
    --subcellular-location-go GO:0005634 \
    --molecular-function-go GO:0003677 \
    --limit 100 \
    ./mysession

(GO:0005634 is "Nucleus" and GO:0003677 is "DNA binding")

In the ./mysession directory, you will find a session.db file, which is a DuckDB database with the search results.
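You can inspect this database directly with the duckdb Python package. Below is a minimal sketch, assuming duckdb is installed; the table names depend on the protein-detective version, so list them first:

import duckdb

# Open the session database read-only so protein-detective can keep using it
con = duckdb.connect("mysession/session.db", read_only=True)

# List the tables created by the search step
print(con.execute("SHOW TABLES").fetchall())

# Peek at one of the listed tables (replace <table> with a name from the list above)
# print(con.execute("SELECT * FROM <table> LIMIT 5").fetchall())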

To retrieve the structures found by the search

protein-detective retrieve ./mysession

In the ./mysession directory, you will find mmCIF files from PDBe and PDB files from the AlphaFold database.

To filter AlphaFold structures on confidence

Filter AlphaFold DB structures on the per-residue confidence score. An entry is kept when the number of residues with a confidence score above the threshold falls within the requested range. PDB files containing only those confident residues are also written.

protein-detective density-filter \
    --confidence-threshold 50 \
    --min-residues 100 \
    --max-residues 1000 \
    ./mysession
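The confidence score used here is AlphaFold's per-residue pLDDT, which the downloaded PDB files store in the B-factor column. The snippet below is only a minimal sketch of that idea with Biopython (not the package's own implementation), counting the residues above a threshold for a hypothetical AlphaFold model file:

from Bio.PDB import PDBParser

# AlphaFold PDB files store the per-residue pLDDT in the B-factor column
structure = PDBParser(QUIET=True).get_structure("af", "AF-P12345-F1-model_v4.pdb")

confident = [
    residue
    for residue in structure.get_residues()
    # every atom of a residue carries the same pLDDT, so checking the first atom is enough
    if next(residue.get_atoms()).get_bfactor() > 50
]
print(f"{len(confident)} residues have pLDDT above 50")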

To prune PDBe files

Make the PDBe files smaller by keeping only the first chain belonging to the found UniProt entry and renaming it to chain A.

protein-detective prune-pdbs \
    --min-residues 100 \
    --max-residues 1000 \
    ./mysession
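A minimal sketch of the same operation with Biopython, assuming you already know which chain corresponds to the UniProt entry (chain "B" in this hypothetical example); it is not the package's own implementation:

from Bio.PDB import MMCIFParser, PDBIO

structure = MMCIFParser(QUIET=True).get_structure("entry", "1abc.cif")  # hypothetical mmCIF file
model = structure[0]

# Drop every chain except the one matching the UniProt entry
for chain in list(model):
    if chain.id != "B":
        model.detach_child(chain.id)

# Rename the remaining chain to A and write it out as a PDB file
model["B"].id = "A"
io = PDBIO()
io.set_structure(model)
io.save("1abc_chainA.pdb")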

Powerfit

Use powerfit to rotate and translate the prepared structures, fitting and scoring them against the EM density map.

protein-detective powerfit run ../powerfit-tutorial/ribosome-KsgA.map 13 docs/session1

This will use dask-distributed to run powerfit for each structure in parallel on multiple CPU cores or GPUs.

Run powerfits on Slurm

You can use dask-jobqueue to run the powerfits on a Slurm deployment, across multiple machines with a shared filesystem.

In one terminal start the Dask cluster with

pip install dask-jobqueue
python3
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(cores=8,
                       processes=4,
                       memory="16GB",
                       queue="normal")
# Submit Slurm jobs that start the Dask workers; adjust the number of jobs as needed
cluster.scale(jobs=2)
print(cluster.scheduler_address)
# Prints something like: 'tcp://192.168.1.1:34059'
# Keep this Python process running until the powerfits are done

In a second terminal, run the powerfits on the Dask cluster with

protein-detective powerfit run ../powerfit-tutorial/ribosome-KsgA.map 13 docs/session1 --scheduler-address tcp://192.168.1.1:34059

How to run efficiently

Powerfit is quickest on a GPU, but it can also run on a CPU.

To run powerfits on a GPU, use the --gpu <workers_per_gpu> flag. The value of workers_per_gpu should be high enough that the GPU is fully utilized. Start with 1 (the default) and monitor the GPU usage with nvtop; if the GPU is not 100% loaded, increase the number until there are no more valleys in the GPU usage graph.

If you have multiple GPUs, then --gpu 2 will use all GPUs, with 2 powerfits running concurrently on each GPU.

If you do not use the --gpu flag, powerfit will run on the CPU. By default each powerfit uses 1 CPU core, and multiple powerfits run in parallel, one per physical CPU core available on the machine (so excluding hyperthreaded cores).

You can set --nproc <int> so each powerfit uses that many CPU cores. This is useful if you have more CPU cores available than there are structures to fit. If the number of structures to fit is greater than the number of available CPU cores, the default (1 core per powerfit) is recommended.

Alternatively, run powerfit yourself

You can use the protein-detective powerfit commands subcommand to print the powerfit commands.

The commands can then be run in whatever way you prefer, like sequentially, with GNU parallel, or as a Slurm array job.

For example, to run them with GNU parallel and 4 jobs at a time:

protein-detective powerfit commands ../powerfit-tutorial/ribosome-KsgA.map 13 docs/session1 > commands.txt
parallel --jobs 4 < commands.txt
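If you prefer to stay in Python instead of using GNU parallel, a minimal sketch that runs the generated commands with at most 4 at a time (assuming commands.txt was written as above):

import subprocess
from concurrent.futures import ThreadPoolExecutor

# Read the commands written by `protein-detective powerfit commands`
with open("commands.txt") as fh:
    commands = [line.strip() for line in fh if line.strip()]

# Run at most 4 powerfit commands concurrently, like `parallel --jobs 4`
with ThreadPoolExecutor(max_workers=4) as pool:
    for completed in pool.map(lambda cmd: subprocess.run(cmd, shell=True), commands):
        completed.check_returncode()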

To print the top 10 solutions to the terminal, you can use:

protein-detective powerfit report docs/session1

Outputs something like:

powerfit_run_id,structure,rank,cc,fishz,relz,translation,rotation,pdb_id,pdb_file,uniprot_acc
10,A8MT69_pdb4e45.ent_B2A,1,0.432,0.463,10.091,227.18:242.53:211.83,0.0:1.0:1.0:0.0:0.0:1.0:1.0:0.0:0.0,4E45,docs/session1/single_chain/A8MT69_pdb4e45.ent_B2A.pdb,A8MT69
10,A8MT69_pdb4ne5.ent_B2A,1,0.423,0.452,10.053,227.18:242.53:214.9,0.0:-0.0:-0.0:-0.604:0.797:0.0:0.797:0.604:0.0,4NE5,docs/session1/single_chain/A8MT69_pdb4ne5.ent_B2A.pdb,A8MT69
...
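The report is printed as CSV, so you can also redirect it to a file and analyse it as a dataframe. A minimal sketch with pandas, assuming the output was redirected to report.csv (the column names match the example output above):

import pandas as pd

# Created with: protein-detective powerfit report docs/session1 > report.csv
solutions = pd.read_csv("report.csv")

# Highest cross-correlation (cc) solution per structure
best = solutions.sort_values("cc", ascending=False).groupby("structure").head(1)
print(best[["structure", "cc", "pdb_id", "uniprot_acc"]])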

To generate model PDB files rotated and translated according to the PowerFit solutions, you can use:

protein-detective powerfit fit-models docs/session1

Contributing

For development information and contribution guidelines, please see CONTRIBUTING.md.
