Analyzing evolution as pathways through latent representation space. This repository analyzes either SARS-CoV-2 spike protein sequences or full SARS-CoV-2 genome sequences. The former uses the ESM-2 protein language model, which was trained on a large corpus of protein sequences including viral proteins; ESM-2 can optionally be fine-tuned on SARS-CoV-2 spike sequences. The latter uses a custom latent diffusion model implemented with PyTorch / Hugging Face, analogous to the Stable Diffusion strategy.
Working from https://github.com/facebookresearch/esm.
Install PyTorch (on a Mac via Homebrew; other systems will need a different installation):

```
brew install pytorch
```

Install ESM, Hugging Face and various dependencies with pip:

```
pip install -r requirements.txt
```
Using parameters in `config.yaml` under `ESM_fine_tune`, run:

```
snakemake --cores 1 -p --snakefile ESM_fine_tune.smk
```

This outputs `models/esm.bin`. Fine-tuning requires decent GPU resources; `--batch-size 8` has been tuned for a single NVIDIA 96GB H100 GPU.
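Fine-tuning here minimizes ESM-2's masked-language-model objective: cross-entropy between the model's per-position logits and the true amino acids at masked positions. A minimal pure-Python sketch of that loss, using toy logits and a toy four-token vocabulary (illustrative only, not the repository's training code):

```python
import math

def masked_lm_loss(logits, targets, masked_positions):
    """Cross-entropy averaged over masked positions only, as in a
    masked-language-model objective (pure-Python illustration).

    logits: per-position lists of scores, one score per vocabulary token
    targets: true token index at each position
    masked_positions: booleans; True where the input token was masked
    """
    total, n = 0.0, 0
    for scores, target, masked in zip(logits, targets, masked_positions):
        if not masked:
            continue
        # numerically stable log-sum-exp for the log partition function
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[target]  # -log softmax(scores)[target]
        n += 1
    return total / max(n, 1)

# Toy example: 3 positions, 4-token vocabulary, only the middle position masked
logits = [[2.0, 0.1, 0.1, 0.1], [0.5, 3.0, 0.2, 0.1], [0.1, 0.1, 2.5, 0.3]]
loss = masked_lm_loss(logits, targets=[0, 1, 2], masked_positions=[False, True, False])
```

Only the masked position contributes, so the loss is the negative log-probability the toy model assigns to token 1 at position 1.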
Using parameters in `config.yaml` under `ESM_embeddings`, run:

```
snakemake --cores 1 -p --snakefile ESM_embeddings.smk
```

This outputs:

- `results/log_likelihoods.tsv`
- `results/embeddings.tsv`
- `results/ordination.tsv`
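The ordination step presumably projects high-dimensional per-sequence embeddings down to a small number of coordinates. As an illustration of that kind of step (the repository's actual ordination method may differ), here is a pure-Python sketch that recovers the top principal axis of a set of embeddings via power iteration and reports each sequence's coordinate along it:

```python
import random

def principal_axis(embeddings, iters=200, seed=0):
    """Coordinate of each row along the top principal axis of the
    mean-centered embedding matrix, via power iteration on X^T X
    (illustrative sketch of a PCA-style ordination)."""
    n, d = len(embeddings), len(embeddings[0])
    means = [sum(row[j] for row in embeddings) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in embeddings]
    rng = random.Random(seed)
    v = [rng.gauss(0, 1) for _ in range(d)]  # random starting direction
    for _ in range(iters):
        # w = (X^T X) v, computed as X^T (X v)
        proj = [sum(x[j] * v[j] for j in range(d)) for x in centered]
        w = [sum(p * x[j] for p, x in zip(proj, centered)) for j in range(d)]
        norm = sum(c * c for c in w) ** 0.5 or 1.0
        v = [c / norm for c in w]  # renormalize each iteration
    # project each (centered) embedding onto the converged axis
    return [sum(x[j] * v[j] for j in range(d)) for x in centered]

# Toy 2-D "embeddings" that vary mostly along the first dimension
coords = principal_axis([[1.0, 0.0], [2.0, 0.1], [3.0, -0.1], [4.0, 0.05]])
```

Because the input is centered, the coordinates sum to zero, and the spread along the axis reflects the dominant direction of variation.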
Run the latent diffusion workflow stage by stage by requesting its target files:

```
snakemake --cores 1 -p --snakefile latent_diffusion.smk data/alignment.fasta data/metadata.tsv
snakemake --cores 1 -p --snakefile latent_diffusion.smk models/vae.pth models/diffusion.pth
snakemake --cores 1 -p --snakefile latent_diffusion.smk results/generated.fasta
snakemake --cores 1 -p --snakefile latent_diffusion.smk results/ordination.tsv
```
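In a Stable Diffusion-style pipeline, the generation stage samples latents by reverse diffusion and then decodes them with the VAE. A toy pure-Python sketch of a DDPM-style reverse loop on a flat latent vector (the linear beta schedule and the zero-noise stand-in model are illustrative assumptions, not the repository's implementation):

```python
import math, random

def ddpm_sample(predict_noise, steps=50, dim=8, seed=0):
    """Toy DDPM-style reverse diffusion over a flat latent vector.
    predict_noise(x, t) stands in for the trained noise-prediction model;
    a real pipeline would then decode the resulting latent with the VAE."""
    rng = random.Random(seed)
    # linear beta schedule, as in the original DDPM formulation
    betas = [1e-4 + (0.02 - 1e-4) * t / (steps - 1) for t in range(steps)]
    alphas = [1.0 - b for b in betas]
    alpha_bars, prod = [], 1.0
    for a in alphas:
        prod *= a
        alpha_bars.append(prod)  # cumulative product of alphas
    x = [rng.gauss(0, 1) for _ in range(dim)]  # start from pure noise
    for t in reversed(range(steps)):
        eps = predict_noise(x, t)
        # posterior mean: (x - beta_t / sqrt(1 - alpha_bar_t) * eps) / sqrt(alpha_t)
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        x = [(xi - coef * ei) / math.sqrt(alphas[t]) for xi, ei in zip(x, eps)]
        if t > 0:  # add fresh noise at every step except the last
            sigma = math.sqrt(betas[t])
            x = [xi + sigma * rng.gauss(0, 1) for xi in x]
    return x

# Stand-in "model" that always predicts zero noise
latent = ddpm_sample(lambda x, t: [0.0] * len(x))
```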
Working from the new harmony nodes on the Fred Hutch cluster
Log into the cluster:

```
ssh tbedford@maestro.fhcrc.org
```

Set up for ESM:

```
module load snakemake PyTorch/2.1.2-foss-2023a-CUDA-12.1.1
pip install --user fair-esm nextstrain-augur transformers
```

Set up for latent diffusion:

```
module load snakemake PyTorch/2.1.2-foss-2023a-CUDA-12.1.1
pip install --user nextstrain-augur diffusers["torch"] transformers
```
Running:

```
# Submit workflow
sbatch --partition=chorus --gpus=1 snakemake --cores 1 -p --snakefile latent_diffusion.smk

# Interactive session
srun --pty -c 6 -t "2-0" --gpus=1 -p chorus /bin/zsh -i
```
Update ESM scripts:

```
scp ESM/* tbedford@maestro.fhcrc.org:~/embedded-pathways/ESM/
```

Update latent diffusion scripts:

```
scp latent-diffusion/* tbedford@maestro.fhcrc.org:~/embedded-pathways/latent-diffusion/
```

Grab remote results:

```
scp -r tbedford@maestro.fhcrc.org:~/embedded-pathways/results/* results/
```
The VAE model has 154M parameters. The diffusion model uses a U-Net backbone and has 15M parameters.
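For reference, the core trick that makes a VAE trainable is reparameterized sampling, z = mu + sigma * eps with eps ~ N(0, 1), paired with a closed-form KL penalty pulling the latent posterior toward N(0, I). A pure-Python sketch of both pieces (illustrative only; the repository's model is implemented in PyTorch):

```python
import math, random

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1). In a real framework
    this keeps the sampling step differentiable with respect to mu and
    log_var; here it just shows the arithmetic."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0, 1)
            for m, lv in zip(mu, log_var)]

def kl_divergence(mu, log_var):
    """Closed-form KL(q(z|x) || N(0, I)) term of the VAE loss:
    -0.5 * sum(1 + log_var - mu^2 - exp(log_var))."""
    return -0.5 * sum(1 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, log_var))

rng = random.Random(0)
z = reparameterize(mu=[0.0, 1.0], log_var=[0.0, -2.0], rng=rng)
kl = kl_divergence(mu=[0.0, 1.0], log_var=[0.0, -2.0])
```

The KL term is zero exactly when mu = 0 and log_var = 0 (a standard normal posterior) and grows as the encoder's distribution drifts from it.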
This repository is licensed under the MIT License. See the LICENSE file for details.
Some portions of the code in this repository were generated with the assistance of large language models (LLMs), primarily ChatGPT. Individual scripts are commented to state their provenance. While I have reviewed, modified, and integrated these contributions, the copyright status of LLM-generated code is uncertain and may vary depending on jurisdiction.
As a result:
- Human-Authored Contributions: Code written by me (the repository owner) is explicitly licensed under the MIT License and is subject to the terms outlined in the LICENSE file.
- LLM-Generated Contributions: For any portions of the code generated by LLMs, I do not assert copyright ownership and disclaim any responsibility for the originality or copyright status of such code.
- User Responsibility: Users of this repository are encouraged to independently verify the legal status of any LLM-generated portions of the code before reuse or redistribution.