This README documents the complete pipeline, setup, troubleshooting, and experimental execution history for running AlphaPulldown and AlphaFold3 workflows for protein complex structure prediction. All content is designed for reproducibility and future reference.
## Contents

- TODO
- Reference Links
- File Transfer & Utilities
- Expected Data from AlphaFold3
- AlphaPulldown Overview
- Environment Setup
- Feature Generation
- Understanding `create_individual_features.py`
- Accessing Databases
- Multimer Job Execution
- Result Analysis
- AlphaFold3 Notes
- Troubleshooting & Exceptions
## TODO

- Check `missing_proteins.txt` for Feature Database completeness.
- Investigate MMseqs2 errors and output inconsistencies.
## Reference Links

- Protein sequences: https://www.uniprot.org
- UniProt REST API example: https://www.uniprot.org/uniprotkb/Q04740/entry
- AlphaPulldown paper: https://academic.oup.com/bioinformatics/article/39/1/btac749/6839971
- AlphaPulldown GitHub: https://github.com/KosinskiLab/AlphaPulldown
- SBGrid AlphaFold2 example: https://sbgrid.org/wiki/examples/alphafold2
- AlphaFold GitHub: https://github.com/google-deepmind/alphafold
## File Transfer & Utilities

```bash
# Transfer files (adjust paths as needed):
scp /local/path/to/file jzho349@kilimanjaro.biochem.emory.edu:/remote/path
scp -r jzho349@kilimanjaro.biochem.emory.edu:/remote/result/path ~/Downloads/

# Split the protein ID list into 8 chunks:
split -n l/8 --suffix-length=2 --additional-suffix=.txt unique_protein_ids_Jack.txt target_chunk_Jack_

# One-liner to count completed jobs per GPU output directory:
for d in gpu_*; do echo "$d: $(ls -1 "$d" | wc -l)"; done
```
## Expected Data from AlphaFold3

- mmCIF: predicted complex structure.
- pLDDT: per-residue confidence score.
- Possibly more outputs, such as ipTM or pairwise confidence (to be explored).
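Once runs finish, those confidence numbers can be pulled out of the JSON that AlphaFold3 writes alongside the mmCIF. A minimal sketch, assuming the output directory contains a `*summary_confidences.json` file with `iptm`/`ptm` keys; the helper name, file name, and key names are assumptions to verify against the installed AF3 version:

```bash
# Hedged sketch: read ipTM/pTM out of an AF3 summary-confidences JSON.
# File and key names are assumptions; check your AF3 version's real output.
af3_scores() {  # usage: af3_scores <summary_confidences.json>
  python3 - "$1" <<'EOF'
import json, sys

with open(sys.argv[1]) as fh:
    d = json.load(fh)
# "iptm"/"ptm" keys are assumed; prints "None" if a key is absent
print(d.get("iptm"), d.get("ptm"))
EOF
}

# e.g. for f in */*summary_confidences.json; do echo "$f: $(af3_scores "$f")"; done
```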
## AlphaPulldown Overview

AlphaPulldown consists of two core stages:

1. `create_individual_features.py`
   - Computes MSAs and finds templates.
   - Stores monomer features as `.pkl` files.
2. `run_multimer_jobs.py`
   - Predicts structures based on the generated features.
## Environment Setup

```bash
# Create environment
conda create -n AlphaPulldown -c omnia -c bioconda -c conda-forge \
    python==3.11 openmm==8.0 pdbfixer==1.9 kalign2 hhsuite hmmer modelcif

# Activate & install JAX for GPU acceleration
conda activate AlphaPulldown
pip install -U "jax[cuda12]"
```
## Feature Generation

Sequential run with MMseqs2; failed indexes are logged for a retry pass:

```bash
NUM_SEQ=$(grep -c "^>" protein_sequences.fasta)

for ((i=0; i<$NUM_SEQ; i++)); do
  echo "Processing index $i"
  create_individual_features.py \
    --fasta_paths=$FASTA_PATH \
    --data_dir=$DATA_DIR \
    --output_dir=$OUTPUT_DIR \
    --skip_existing=True \
    --use_mmseqs2=True \
    --max_template_date="2050-01-01" \
    --seq_index=$i || echo $i >> failed_indexes.txt
done
```
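The `failed_indexes.txt` log from the loop above can drive a retry pass. A sketch with the feature command injected as an argument, so the same helper works for both the MMseqs2 and JackHMMer variants; the helper name `retry_failed` is made up here:

```bash
# Hypothetical helper: re-run only the unique indexes recorded in a failure
# log, printing any index that fails again.
retry_failed() {  # usage: retry_failed <failed_log> <command> [args...]
  log=$1; shift
  sort -n -u "$log" | while read -r i; do
    "$@" --seq_index="$i" || echo "$i"
  done
}

# e.g. retry_failed failed_indexes.txt create_individual_features.py \
#        --fasta_paths=$FASTA_PATH --data_dir=$DATA_DIR --output_dir=$OUTPUT_DIR \
#        --skip_existing=True --use_mmseqs2=True > still_failed.txt
```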
Parallel CPU run with JackHMMer/HHblits, throttled to 16 concurrent jobs (`FAILED_LOG` added here since the original snippet used it without defining it):

```bash
MAX_JOBS=16
FAILED_LOG=failed_indexes.txt
rm -f $FAILED_LOG

for ((i=0; i<$NUM_SEQ; i++)); do
  (
    create_individual_features.py \
      --fasta_paths=$FASTA_PATH \
      --data_dir=$DATA_DIR \
      --output_dir=$OUTPUT_DIR \
      --skip_existing=True \
      --use_mmseqs2=False \
      --max_template_date="2050-01-01" \
      --seq_index=$i || echo $i >> $FAILED_LOG
  ) &
  # Throttle: wait until a slot frees up before launching the next job
  while (( $(jobs -r | wc -l) >= MAX_JOBS )); do sleep 1; done
done
wait
```
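After either variant finishes, it is worth cross-checking the FASTA against the pickles actually produced; this is essentially the `missing_proteins.txt` check from the TODO. A sketch, assuming pickles are named `<ID>.pkl` after the first token of each FASTA header (the helper name is illustrative):

```bash
# Sketch: print every FASTA ID with no corresponding .pkl in a directory.
# Assumes headers look like ">ID optional description".
list_missing_features() {  # usage: list_missing_features <fasta> <pkl_dir>
  grep "^>" "$1" | sed 's/^>//; s/ .*//' | while read -r id; do
    [ -f "$2/$id.pkl" ] || echo "$id"
  done
}

# e.g. list_missing_features protein_sequences.fasta "$OUTPUT_DIR" > missing_proteins.txt
```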
## Understanding `create_individual_features.py`

- CPU-only script
- Output format: pickle files (`.pkl`)
- Requires a pre-downloaded genetic database
- Feature sources:
  - JackHMMer/HHblits (default)
  - MMseqs2
  - Precomputed feature pickles
## Accessing Databases

- Genetic DB: https://github.com/KosinskiLab/alphafold#genetic-databases
- Feature DB: https://alphapulldown.s3.embl.de/index.html

```bash
# Download the genetic databases
bash download_all_data.sh /data7/Conny/data/AF_GeneticDB

# Install the MinIO client (mc)
curl -O https://dl.min.io/client/mc/release/linux-amd64/mc
chmod +x mc && mkdir -p $HOME/bin && mv mc $HOME/bin/
echo 'export PATH=$HOME/bin:$PATH' >> ~/.bashrc && source ~/.bashrc

# Example download
mc cp embl/alphapulldown/input_features/Saccharomyces_cerevisiae/Q01329.pkl.xz /data7/Conny/data/JackFeaturePickleDB

# Batch download
bash download_found.sh | tee download_output.log
xz -dk *.xz  # decompress while keeping the original .xz files
```
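`download_found.sh` is not reproduced here, but the shape of such a batch fetch is roughly the following. The fetch command is injected so the `mc cp` call from the example above can be dropped in; the helper names are illustrative:

```bash
# Hypothetical batch-fetch loop: try each ID, record the misses.
fetch_all() {  # usage: fetch_all <id_list> <fetch_cmd>
  : > missing_pickle_proteins.txt
  while read -r id; do
    "$2" "$id" || echo "$id" >> missing_pickle_proteins.txt
  done < "$1"
}

# e.g. fetch_one() { mc cp "embl/alphapulldown/input_features/Saccharomyces_cerevisiae/$1.pkl.xz" /data7/Conny/data/JackFeaturePickleDB/; }
#      fetch_all unique_protein_ids_Jack.txt fetch_one
```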
## Multimer Job Execution

Parallel execution across 8 GPUs (a `mkdir -p` is added per GPU, since the log redirect fails if the output directory does not yet exist):

```bash
TARGET_FILES=(
  /data7/Conny/data/target_chunk_aa
  /data7/Conny/data/target_chunk_ab
  ...
  /data7/Conny/data/target_chunk_ah
)

for i in ${!TARGET_FILES[@]}; do
  mkdir -p $BASE_OUTPUT/gpu_$i
  CUDA_VISIBLE_DEVICES=$i run_multimer_jobs.py \
    --mode=pulldown \
    --monomer_objects_dir=$MONOMER_OBJECTS_DIR \
    --protein_lists=$(realpath $BAIT_FILE),${TARGET_FILES[$i]} \
    --output_path=$BASE_OUTPUT/gpu_$i \
    --data_dir=$DATA_DIR \
    --num_cycle=3 \
    --num_predictions_per_model=1 \
    > $BASE_OUTPUT/gpu_$i/run.log 2>&1 &
done
wait
```
## Result Analysis

- Create Jupyter notebooks with:

  ```bash
  create_notebook.py --cutoff=5.0 --output_dir=/path/to/results
  ```

- Visualize outputs (e.g., PAE plots, ipTM scores, ranked PDBs)
- For full table generation, build an `apptainer` image with the CCP4 libraries, then execute:

  ```bash
  apptainer exec --no-home --bind /results:/mnt fold_analysis_final.sif /app/run_get_good_pae.sh --output_dir=/mnt --cutoff=10
  ```
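For quick triage without the notebook, the per-model scores in each complex's `ranking_debug.json` (written by the AlphaFold backend; `iptm+ptm` is the multimer ranking metric) can be read directly. A sketch; the `plddts` fallback for monomer-style outputs is an assumption:

```bash
# Sketch: print the best-ranked model and its score from a ranking_debug.json.
best_model() {  # usage: best_model <ranking_debug.json>
  python3 - "$1" <<'EOF'
import json, sys

with open(sys.argv[1]) as fh:
    d = json.load(fh)
# multimer runs rank by "iptm+ptm"; "plddts" fallback is an assumption
scores = d.get("iptm+ptm") or d.get("plddts") or {}
best = max(scores, key=scores.get)
print(best, scores[best])
EOF
}

# e.g. for r in gpu_*/*/ranking_debug.json; do echo "$r: $(best_model "$r")"; done
```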
## AlphaFold3 Notes

- GUI (server): automated MSA
- Local mode: accepts MSA input via JSON
- Additional work needed to automate the Selenium upload process (autoclicker test in progress)
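For local mode, the JSON input can carry a precomputed MSA per chain. A hedged sketch of the input shape based on the AlphaFold3 repo's documented format; the field names (e.g. `unpairedMsa`) should be verified against the installed version, and the sequences here are placeholders:

```json
{
  "name": "bait_target_pair",
  "modelSeeds": [1],
  "sequences": [
    {
      "protein": {
        "id": "A",
        "sequence": "MVLSPADKTN",
        "unpairedMsa": ">query\nMVLSPADKTN\n"
      }
    },
    {"protein": {"id": "B", "sequence": "MNDSEVNQEA"}}
  ],
  "dialect": "alphafold3",
  "version": 1
}
```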
## Troubleshooting & Exceptions

- UniRef30 issue → fix: https://github.com/google-deepmind/alphafold/pull/860/files
- MMseqs2: 5 failures → P40449, P23369, Q12754, Q22354, Q12019
- JackHMMer: 7 known failures → P40328, P48570, P32432, P38691, P48510, P53920, Q00582 (listed in /data7/Conny/result_JackHMMer/missing_pkls.txt)
- Feature DB: 20 missing pickles (listed in /data7/Conny/data/JackFeaturePickleDB/missing_pickle_proteins.txt)
- AF3: 1 job missing due to large input size → Job_999
This document is actively maintained as experimental runs progress.