This README describes the analysis of the deep mutational scanning experiment for H3N2 A/Moscow/10/1999 (Mos99) neuraminidase (NA).
- Raw reads in FASTQ format (NIH SRA database, BioProject PRJNA857746) should be placed in fastq/
- ./Fasta/Mos99_NA.pep: Mos99 NA protein sequence.
- ./Fasta/Human_H3N2_NA_2020.aln: Full-length NA protein sequences from human H3N2 strains (downloaded from GISAID).
- ./Fasta/Avian_N2_NA.aln: Full-length NA protein sequences from avian N2 strains (downloaded from GISAID).
- ./Fasta/NA_subtypes.aln: Full-length representative NA protein sequences from different NA subtypes (downloaded from GISAID).
- ./data/ASA.table: Amino acid solvent accessibility (ASA) from Tien et al. 2013.
- ./data/sites_info.tsv: Antigenic regions and active site residues, as defined by Colman et al. 1983 and McAuley et al. 2019, respectively.
- ./data/foldx_msa_transformer.csv: Stability effect was predicted using FoldX and natural fitness was inferred using MSA Transformer.
- ./PDB/Mos99_WT_NA_monomer.pdb
- ./PDB/Mos99_WT_NA_tetramer.pdb
- ./PDB/Mos99_WT_sialic_acid_final.pdb
- Install dependencies with conda:
conda create -n NA -c bioconda -c anaconda -c conda-forge \
python=3.9 \
seqtk \
flash \
biopython \
cutadapt \
snakemake \
prody
- Activate conda environment:
conda activate NA
- Use UMIs to correct sequencing errors:
python3 script/Dedup_UMI.py fastq NNNNNNN 0.8 2
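The idea behind UMI-based error correction can be sketched as follows. This is a minimal illustration and not the code in script/Dedup_UMI.py. It assumes that the UMI is the first 7 nucleotides of each read (matching NNNNNNN), that 0.8 is the minimum fraction of reads that must agree at each position, and that 2 is the minimum number of reads per UMI. The function name and parameters are hypothetical.

```python
from collections import Counter, defaultdict


def umi_consensus(reads, umi_len=7, min_frac=0.8, min_reads=2):
    """Group reads by their leading UMI and build a per-position consensus.

    Assumptions (not taken from the actual script): the UMI is a prefix of
    the read, and all reads sharing a UMI must have the same length.
    """
    groups = defaultdict(list)
    for read in reads:
        groups[read[:umi_len]].append(read[umi_len:])

    consensus = {}
    for umi, seqs in groups.items():
        # Skip UMIs with too few reads or with reads of unequal length.
        if len(seqs) < min_reads or len({len(s) for s in seqs}) != 1:
            continue
        bases = []
        for column in zip(*seqs):
            base, count = Counter(column).most_common(1)[0]
            if count / len(seqs) < min_frac:
                break  # no confident base at this position: drop the UMI
            bases.append(base)
        else:
            consensus[umi] = "".join(bases)
    return consensus
```

Under these assumptions, a UMI group is kept only if every position has a base supported by at least 80% of its reads.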
- Count mutations:
snakemake -s script/Mos99_pipeline.smk -j 10
- Convert counts to fitness:
python3 script/count2fitness.py
- Output file: ./result/Mos99_fit.csv
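For orientation, one common way to turn deep mutational scanning counts into a fitness score is the log enrichment of a mutant relative to wild type. The sketch below shows that convention only. The formula actually used in script/count2fitness.py may differ, and the function name, argument names, and pseudocount are assumptions.

```python
import math


def log_enrichment(mut_in, mut_out, wt_in, wt_out, pseudocount=1):
    """Log10 enrichment of a mutant relative to wild type.

    mut_in / mut_out: mutant read counts before / after selection.
    wt_in / wt_out:   wild-type read counts before / after selection.
    The pseudocount (an assumption) avoids division by zero.
    """
    mut_ratio = (mut_out + pseudocount) / (mut_in + pseudocount)
    wt_ratio = (wt_out + pseudocount) / (wt_in + pseudocount)
    return math.log10(mut_ratio / wt_ratio)
```

With this convention, a score of 0 means the mutant is enriched as much as wild type, and negative scores indicate depletion.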
- Compute mutational tolerance for each residue:
python3 script/Mean_mut_fit_per_resi.py
- Input file: ./result/Mos99_fit.csv
- Output file: ./result/Mos99_mean_mut_fit.tsv
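Mutational tolerance is summarized here as the mean fitness of the mutations at each residue. The sketch below illustrates that averaging step under stated assumptions: mutation labels look like "A100G" (wild-type amino acid, position, mutant amino acid), and every record passed in should count toward the mean. The actual script may use a different label format or exclude some mutation classes (for example silent or nonsense mutations).

```python
from collections import defaultdict
from statistics import mean


def mean_fitness_per_residue(records):
    """Average fitness over all mutations at each position.

    records: iterable of (mutation_label, fitness) pairs, where the label
    is assumed to be of the form "A100G".
    """
    by_pos = defaultdict(list)
    for mutation, fitness in records:
        pos = int(mutation[1:-1])  # strip wild-type and mutant residues
        by_pos[pos].append(fitness)
    return {pos: mean(values) for pos, values in sorted(by_pos.items())}
```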
- Assign residue type and calculate relative solvent accessibility (RSA):
python3 script/pos_type_analysis.py
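RSA is conventionally the solvent-accessible surface area of a residue divided by the maximum accessible area for that amino acid. The sketch below shows only this normalization. How script/pos_type_analysis.py obtains per-residue surface areas, and how ./data/ASA.table is laid out, are not described here, so the function name and the dictionary argument are assumptions.

```python
def relative_solvent_accessibility(sasa, residue_name, max_asa):
    """Return RSA = observed SASA / maximum ASA for the residue type.

    sasa:         observed solvent-accessible surface area (in A^2).
    residue_name: three-letter residue code, e.g. "ALA".
    max_asa:      mapping from residue code to maximum ASA (in A^2),
                  for example loaded from a table such as ./data/ASA.table.
    """
    return sasa / max_asa[residue_name]


# Illustrative call with a one-entry table (hypothetical input values).
example_max_asa = {"ALA": 129.0}
example_rsa = relative_solvent_accessibility(64.5, "ALA", example_max_asa)
```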
- Calculate distance to active site for each residue:
python3 script/Dist_analysis.py
- Input file: ./PDB/Mos99_WT_sialic_acid_final.pdb
- Output file: ./result/Dist_to_active_site.tsv
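One plausible definition of the distance to the active site is the minimum distance between any atom of a residue and any atom of the bound sialic acid. The sketch below implements that definition on plain coordinate tuples. It is an assumption that script/Dist_analysis.py uses this definition, and the data structures here are hypothetical (the actual script reads the PDB file listed above).

```python
import math


def min_distance(residue_atoms, site_atoms):
    """Smallest Euclidean distance between two sets of (x, y, z) points."""
    return min(math.dist(a, b) for a in residue_atoms for b in site_atoms)


def distance_per_residue(residues, site_atoms):
    """Map each residue number to its minimum distance to the site atoms.

    residues:   dict of residue number -> list of (x, y, z) coordinates.
    site_atoms: list of (x, y, z) coordinates, e.g. sialic acid atoms.
    """
    return {resid: min_distance(atoms, site_atoms)
            for resid, atoms in residues.items()}
```

Using the closest atom pair rather than C-alpha or centroid distances is a design choice in this sketch. It makes residues in direct contact with the ligand score close to zero regardless of side-chain size.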
- Calculate natural mutation frequency:
python3 script/natural_mut_analysis.py
- Input files:
- Output file:
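The frequency of a natural mutation can be estimated from a protein alignment as the fraction of sequences carrying a given amino acid that differs from the reference at that position. The sketch below assumes that all sequences are already aligned to the reference with equal length, that gaps ("-") and ambiguous residues ("X") are ignored as mutations, and that positions are numbered by alignment column. These are assumptions and may not match script/natural_mut_analysis.py.

```python
from collections import Counter


def mutation_frequencies(reference, aligned_seqs):
    """Frequency of each non-reference amino acid per alignment column.

    Returns a dict mapping labels such as "K2R" to the fraction of
    sequences that carry that amino acid at that column.
    """
    total = len(aligned_seqs)
    freqs = {}
    for i, wt in enumerate(reference):
        if wt == "-":
            continue  # skip columns where the reference has a gap
        counts = Counter(seq[i] for seq in aligned_seqs)
        for aa, count in counts.items():
            if aa not in (wt, "-", "X"):
                freqs[f"{wt}{i + 1}{aa}"] = count / total
    return freqs
```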
- Plots for checking data quality:
Rscript script/plot_QC.R
- Input file: ./result/Mos99_fit.csv
- Output file:
- Compare the data in this study with our previous study (Wang et al. 2021):
Rscript script/plot_cross_valid.R
- Input files:
- Output file: ./graph/DMS_cross_validate.png
- Heatmap of mutational fitness:
Rscript script/plot_fitness_heatmap.R
- Input file: ./result/Mos99_fit.csv
- Output file: ./graph/Mos99_fit_heatmap.png
- Compare RSA and fitness across residue types:
Rscript script/plot_pos_type_analysis.R
- Plot correlation between fitness and distance to active site:
Rscript script/plot_dist_to_active_site.R
- Plot correlation between fitness and natural mutation frequency:
Rscript script/plot_natural_mut_fit.R
- Input file: ./result/N2_mutation_freq.tsv
- Output file: ./graph/natural_mut_freq_fit.png
- Plot DMS fitness against the stability effect predicted by FoldX and the fitness predicted by MSA Transformer:
Rscript script/plot_fit_vs_predict.R
- Compare the distribution of fitness effects between naturally observed and unobserved mutations:
Rscript script/plot_fit_conserved.R
- Input files:
- Output file: ./graph/fit_dist_nat_vs_unnat.png
- Plot sequence logos for residues in cluster 2:
python3 script/cluster2_seqlogo.py