This repository contains code for "StruCloze: A Unified Framework for Backmapping and Inpainting of Biomolecular Structures", which raise a unified framework for back-mapping coarse-grained structures and completing missing residues, nucleotides in existing experimental structures.
- Installation
- Data downloading
- Usage
- Back-mapping coarse-grained structures
- Completing missing tokens
- Analyzing predicted structures
- Train
- Contat
Conda is recommended for setting up dependencies. To quickly set up an environment, run:
# Clone this repository and set up virtual environment
git clone https://github.com/Junjie-Zhu/StruCloze.git
cd StruCloze
# Create and activate environment
conda env create -f environment.yaml
source activate strucloze
# Install strucloze as a package.
pip install -e .
We have uploaded the checkpoints and test dataset to Zenodo.
Test set contains processed entries from PDB (in format .pkl.gz
), in which features are organized as:
# atom features
"atom_positions": torch.FloatTensor # shape: (N_atom, 3)
"atom_to_token_index": torch.LongTensor # shape: (N_atom, ), recording token index for each atom
# residue features
"aatype": torch.LongTensor # shape: (N_residue, ), recording residue type as numbers (range from 0 to 30)
"moltype": torch.IntTensor # shape: (N_residue, ), recording molecule type as numbers (0 for protein, 1 for rna, 2 for dna)
"residue_index": torch.IntTensor # shape: (N_residue, ), residue index in each chain
"token_index": torch.IntTensor # shape: (N_residue, ), token index in the whole bioassembly
"chain_index": torch.IntTensor # shape: (N_residue, )
# CCD reference features
"ref_positions": torch.FloatTensor # shape: (N_residue, 3), reference coordinates
"ref_element": torch.IntTensor # shape: (N_residue, 32), reference element number converted into one-hot
"ref_atom_name_chars": torch.IntTensor # shape: (N_residue, 4, 64), reference atom name converted into one-hot
# CG representations
"atom_ca": torch.FloatTensor # shape: (N_residue, 3), coordinates of CA atoms
"atom_com": torch.FloatTensor # shape: (N_residue, 3), coordinates of center of mass
"ref_ca": torch.FloatTensor # shape: (N_residue, 3), CA atoms in CCD
"ref_com": torch.FloatTensor # shape: (N_residue, 3), center of mass in CCD
"calv_positions": torch.FloatTensor # shape: (N_residue, 3), CCD reference positions aligned to atom positions
Please note that these data are saved as numpy arrays and are processed into tensors in src/data/transform.py
in iterating dataset.
By default, calv_positions
are aligned with Martini for proteins and CALVADOS RNA for nucleic acids.
Extracting CG models from all-atom structures can be done with scripts/get_cg_repr.py
:
python scripts/get_cg_repr.py \
-i ./example/cif \
-o ./example/pkl
Currently, CG models are extracted as numpy arrays, we will update pdb format output in the future.
To backmap CG models with StruCloze, you may refer to the following command:
python src/inference.py \
ckpt_dir=/path/to/checkpoint.ckpt \
data.path_to_dataset=./example/pkl \
data.repr=COM # or CA, or calv_RNA
If data.path_to_dataset
is set as a directory, it will automatically search for all .pdb
files in the directory and back-map them. Metadata file is also supported by setting data.path_to_dataset
as a file path (.csv
).
Predicting missing residues use exactly the same model, but with a finetuned checkpoint. Usage is the same as having described above.
We provided structure analysis scripts in scripts/
. For structures predicted by StruCloze, you may use scripts/analyze_pdb_simple.py
:
python scripts/analyze_pdb_simple.py \
-i /path/to/predicted/structure/ \
-o /path/to/output/metrics/ \
-r /path/to/reference/structure/feature/
-p /path/to/metadata.csv # optional
Here the pre-extracted structure features in .pkl.gz
format are used. If structure features are not available, you may use scripts/analyze_pdb.py
in which reference structures in mmcif
or pdb
format is required instead of features.
There may still occur some errors due to missing residues or weird indexing in reference structures in current scripts, but it handles most siuations in our analysis.
With our provided dataset (or your own dataset), you can train StruCloze with the following command:
python src/train.py \
epoch=200 \
resume.ckpt_dir=/path/to/your/checkpoint.ckpt \
data.path_to_dataset=/path/to/your/dataset.csv # a metadata file is recommended
You may alse revise arguments in configs/config_train.yaml
. For DDP, batch_size
is set to number of samples on a single card, we use torchrun
for DDP:
torchrun --nproc_per_node=8 src/train.py
Currently there are some known bugs in code listed here:
- The
OXT
atom at the C-terminus is neglected in the prediction for convenience (As all other residues do not contain this atom, we remove it from CCD dictionary). - Chain IDs in predicted structures starts from
a
but notA
.
Any suggestions or bug reports are welcome. Please open an issue in the GitHub repository.
If you have any questions, please contact me via email: shiroyuki@sjtu.edu.cn Or open an issue in the GitHub repository.
- Checkpoints will be uploaded later.
- Detailed check of the code will be done later.