Latest release | |
---|---|
Status | |
Community |
Important
This package relies on a feature branch of MDAnalysis to parse mmCIF files and therefore should be considered experimental. In the future, if this feature is merged, this package will use the official release. For now, please carefully validate your selections and see tests to see what is currently validated.
AlphaFold3 outputs a set of confidence metrics that are useful for i.e. protein binding predictions, however, making use of these metrics requires careful and time consuming parsing.
MDAnalysis provides an atom selection language (think SQL for molecular topologies) that makes associating confidence metrics with molecular positions, amino acid/atom type, and other topological information easy.
This package seeks to expose AF3 outputs (including confidence metrics) and predicted topologies via an easy-to-use interface.
Get a python interface into my AF3 output?
# example inference output
from mdaf3.data.files import UNCOMPRESSED_AF3_OUTPUT_PATH
from mdaf3.AF3OutputParser import AF3Output
# Equivalent to AF3Output("/path/to/inference/output")
af3_output = AF3Output(UNCOMPRESSED_AF3_OUTPUT_PATH)
Get summary information about the AF3 inference job?
summary_dict = af3_output.get_summary_metrics()
chain_pair_iptm_ndarr = summary_dict["chain_pair_iptm"]
ranking_score = summary_dict["ranking_score"]
# all 'get_' methods take an optional "seed" and "sample_num" argument
# if not provided, the best model (by AF3 ranking score)
# is returned
summary_dict_seed_1 = af3_output.get_summary_metrics(seed=1)
Calculate the mean pLDDT of atoms in a particular protein?
u = af3_output.get_mda_universe()
# segid in MDAnalysis corresponds to protein ID in AF3 input JSON
# https://github.com/google-deepmind/alphafold3/blob/main/docs/input.md
selection = u.select_atoms("segid A")
# pLDDT is stored in the "tempfactors" attribute of an
# MDAnalysis AtomGroup
mean_pLDDT_chain = selection.tempfactors.mean()
Find the minimum PAE among all residue pairs between two protein chains?
u = af3_output.get_mda_universe()
# move up the topology heirarchy
# from atom -> amino acid residue
# using AtomGroup.residues
protein_1_all_res = u.select_atoms("segid A").residues
protein_2_all_res = u.select_atoms("segid B").residues
pae_ndarr = af3_output.get_pae_ndarr()
# resindices are 0-indexed amino acid residue indices
# that correspond to AF3 token indices
min_pae_p1_p2 = pae_ndarr[protein_1_all_res.resindices][
:, protein_2_all_res.resindices
].min()
# alternatively:
min_pae_p1_p2 = af3_output.get_summary_metrics()["chain_pair_pae_min"][0][1]
Find the max contact probability between a single residue of one protein chain and any residue in another protein chain?
u = af3_output.get_mda_universe()
protein_res_1 = u.select_atoms("segid A").residues[0]
protein_2_all_res = u.select_atoms("segid B").residues
contact_prob_ndarr = af3_output.get_contact_prob_ndarr()
max_contact_prob_res1_p2 = contact_prob_ndarr[protein_res_1.resindex][
protein_2_all_res.resindices
].max()
Find the mean pLDDT of all atoms that are within 5 angstroms of a particular residue?
u = af3_output.get_mda_universe(seed=1)
particular_residue = u.select_atoms("segid A").residues[0].atoms
atoms_around_particular_res = u.select_atoms(
"around 5 group pr", pr=particular_residue
)
mean_pLDDT_around_pr = atoms_around_particular_res.tempfactors.mean()
Batch apply a feature extraction method to all my AF3 jobs (with job names stored in a Polars DataFrame)?
from pathlib import Path
import polars as pl
from mdaf3.AF3OutputParser import AF3Output
from mdaf3.FeatureExtraction import serial_apply, split_apply_combine
def extract_protein1_mean_pLDDT(row, af3_parent_dir):
job_dir = Path(af3_parent_dir) / row["job_name"]
af3_output = AF3Output(job_dir)
u = af3_output.get_mda_universe()
protein1_mean_pLDDT = u.select_atoms("segid A").tempfactors.mean()
row["protein1_mean_pLDDT"] = protein1_mean_pLDDT
return row
all_jobs = pl.DataFrame({"job_name": ["93f0240a1d2c15da9551841d22239d41"]})
af3_parent_dir = "mdaf3/data"
# use split_apply_combine for process-parallel execution.
# these methods will convert each row into a dict,
# pass it to your extraction method,
# and then concat the resulting pl.DataFrames
all_job_with_feat = serial_apply(
all_jobs, extract_protein1_mean_pLDDT, af3_parent_dir
)
feature_np = (
all_job_with_feat.select("protein1_mean_pLDDT").to_series().to_numpy()
)
Compress my AF3 output directory without losing confidence metric precision?
Note
This will delete 'TERMS_OF_USE.md' as well as the input JSON for the AF3 job ('<job_name>_data.json') among other things. This feature is designed with large HPC batches in mind, so if you aren't sure, read the compression code!
af3_output = AF3Output(UNCOMPRESSED_AF3_OUTPUT_PATH)
af3_output.compress()
Below we provide instructions both for conda
and
for pip
.
First, clone the repo locally:
git clone https://github.com/ljwoods2/mdaf3.git
cd mdaf3
Ensure that you have conda installed.
Create a virtual environment and activate it:
conda create --name mdaf3
conda activate mdaf3
Install the dependencies:
conda env update --name mdaf3 --file devtools/conda-envs/test_env.yaml
Build this package from source:
pip install -e .
If you want to update your dependencies (which can be risky!), run:
conda update --all
And when you are finished, you can exit the virtual environment with:
conda deactivate
To build the package from source, run:
pip install .
If you want to create a development environment, install the dependencies required for tests and docs with:
pip install ".[test]"
mdaf3 is bound by a Code of Conduct.
The mdaf3 source code is hosted at https://github.com/ljwoods2/mdaf3 and is available under the GNU General Public License, version 2 (see the file LICENSE).
Copyright (c) 2025, Lawson Woods
Project based on the MDAnalysis Cookiecutter version 0.1. Please cite MDAnalysis when using mdaf3 in published work.