The raw extracted data from IEDB and VDJDB is already available in this repo under data/iedb-vdjdb/raw.
Alternatively, to re-run our data extraction, clone our fork of the IEDB_IMMREP data repo and run the data extraction script:
git clone https://github.com/ljwoods2/IEDB_IMMREP.git
cd IEDB_IMMREP
git checkout new-categories
python setup.py install
chmod +x run.sh
./run.sh
Unique, formatted triads are available by category (species and MHC class) in data/iedb-vdjdb/iedb and data/iedb-vdjdb/vdjdb, in the format described in tcr_format_parsers. All categories are also available in Parquet format; these files keep duplicate (non-unique) entries so that the source database metadata can be stored alongside each triad.
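As a rough illustration of how the unique and non-unique files relate, the sketch below collapses duplicate records into unique triads with pandas. The column names (`tcr`, `peptide`, `mhc`, `source_id`) and the values are hypothetical placeholders, not the actual schema, which is defined in tcr_format_parsers.

```python
import pandas as pd

# Hypothetical columns and made-up values, for illustration only.
# The real schema is the one described in tcr_format_parsers.
records = pd.DataFrame(
    {
        "tcr": ["TCR_A", "TCR_A", "TCR_B"],
        "peptide": ["PEP_1", "PEP_1", "PEP_2"],
        "mhc": ["MHC_X", "MHC_X", "MHC_Y"],
        "source_id": ["ref_1", "ref_2", "ref_3"],  # per-record DB metadata
    }
)

# A triad is identified by its TCR, peptide, and MHC; metadata may differ
# between duplicate records, so it is dropped in the unique view.
triad_key = ["tcr", "peptide", "mhc"]
unique_triads = (
    records.drop_duplicates(subset=triad_key)
    .drop(columns="source_id")
    .reset_index(drop=True)
)

print(f"{len(records)} records -> {len(unique_triads)} unique triads")
```

In practice, `records` would come from `pd.read_parquet(...)` on one of the Parquet files rather than from an in-memory table.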
If you're interested in re-running our formatting and non-cognate triad creation code, first create a conda environment containing the necessary dependencies:
conda env create -n af3-analyzer --file envs/af3-analyzer.yaml
Then, run each cell sequentially in data/iedb-vdjdb/reformat.ipynb using the af3-analyzer
environment as the kernel.
Our lab used the Nextflow pipelines in the af3-nf repo to run inference on triads. These pipelines were designed for TGen's Gemini supercomputer, but they can be adapted to run in other environments. Please contact the authors for details.
See data/iedb-vdjdb/iedb/human_I/run_af3_triad.sh for an example Slurm script that runs the pipelines.
BLAST+ for PDB alignment
conda env create -n blast --file envs/blast.yaml
cd /path/to/pdbaa/dir
update_blastdb.pl --decompress pdbaa
cd data/iedb-vdjdb
blastp -query fasta_queries/all_triads.fasta -db /path/to/pdbaa/dir/pdbaa -out pdb_blast_results/blast_result.csv -outfmt 10
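With `-outfmt 10` and no custom format specifiers, BLAST+ writes comma-separated rows without a header, using its 12 default columns. The sketch below shows one way to read such a file and keep the highest-bitscore hit per query. The `best_hits` helper and the sample rows are illustrative assumptions, not part of this repo.

```python
import csv
import io

# Default column order for BLAST+ tabular/CSV output (-outfmt 6 / 10).
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hits(handle):
    """Return {qseqid: row} keeping the highest-bitscore hit per query."""
    best = {}
    for row in csv.DictReader(handle, fieldnames=BLAST_COLUMNS):
        # Convert the numeric fields used for filtering and ranking.
        row["pident"] = float(row["pident"])
        row["evalue"] = float(row["evalue"])
        row["bitscore"] = float(row["bitscore"])
        current = best.get(row["qseqid"])
        if current is None or row["bitscore"] > current["bitscore"]:
            best[row["qseqid"]] = row
    return best


# Made-up rows standing in for pdb_blast_results/blast_result.csv.
sample = io.StringIO(
    "triad_1,1ABC_A,100.000,9,0,0,1,9,1,9,1.2e-05,25.4\n"
    "triad_1,2XYZ_B,88.889,9,1,0,1,9,3,11,3.0e-03,20.1\n"
    "triad_2,3DEF_C,95.000,20,1,0,1,20,5,24,4.5e-10,44.7\n"
)

hits = best_hits(sample)
for query, hit in hits.items():
    print(query, hit["sseqid"], hit["pident"], hit["bitscore"])
```

To run it on real output, pass an open file instead of the sample, for example `open("pdb_blast_results/blast_result.csv", newline="")`.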
Unique, formatted PDB triads are available in data/pdb/pdb_triads.csv. Non-unique, formatted triads are available in Parquet format.
Raw PDB summary files in data/pdb/raw come from STCRDab.
If you're interested in re-running our formatting code, first clone the IMGTHLA repo (this is used to identify likely MHC alleles for each sequence):
git clone https://github.com/ANHIG/IMGTHLA
Then, run each cell sequentially in data/pdb/reformat.ipynb using the af3-analyzer
environment as the kernel, making sure to modify the variable IMGT_HLA_PATH
with your own path to the cloned IMGTHLA repo.