Genomic and Geographic Adaptations to Climate Change: A Comparative Study of Wild Yak, Takin, and High-Altitude Bovids
Wild Yak faces substantial habitat loss and altitudinal displacement by 2050, driven by climate change and limited adaptive genomic traits.
Computationally intensive analyses such as InterProScan and BLAST took an average of 2–4 days each to complete for a single species.
This project investigates how high-altitude bovids (Wild Yak, Takin, and Water Buffalo) adapt to climate change. It combines species distribution modeling (SDM) and comparative genomics to determine:
- Where their habitats are now and where they are likely to shift in the future
- Which genes and protein functions help them survive in extreme environments
It uses Python and R with machine learning (Random Forest), spatial analysis (centroid tracking, jittering), and genomic tools like Mash, InterProScan, ProteinOrtho, and GO enrichment.
Key finding for SDM:
- Wild yak habitats are shrinking and moving uphill. Takin habitats are expanding. These results help guide future conservation efforts.
Key finding for Genomic Analysis:
- Wild yaks have fewer heat shock genes and, because of their thick fur, struggle to shed body heat. They are adapted to cold, high-elevation climates (the SDM results place suitable habitat near 4,750 m) and cannot tolerate warmer temperatures.

This part of the project investigates genetic adaptations of Wild Yak, Takin, and Water Buffalo by comparing their full genomes, protein domains, and gene families.
Identify genetic traits linked to high-altitude survival using comparative genomics and functional annotations.
To perform a comprehensive comparative genomics analysis of high-altitude bovids, we curated and processed full genome assemblies and annotations for three target species: Wild Yak (Bos mutus), Takin (Budorcas taxicolor), and Water Buffalo (Bubalus bubalis). These datasets were downloaded from the NCBI Assembly and GenBank/RefSeq repositories, using the most recent high-quality assemblies.
Each genome dataset from NCBI contains multiple standard annotation files:
File Type | Format | Description |
---|---|---|
`.genomic.fna` | FASTA | Whole-genome nucleotide sequence |
`cds_from_genomic.fna` | FASTA | Coding sequences (CDS) |
`genomic.gbff` | GBFF | GenBank flat file with annotations |
`genomic.gff` / `.gtf` | GFF/GTF | Gene feature coordinates |
`*.faa` (derived) | FASTA | Translated protein sequences |
Genome file sizes (compressed/uncompressed):
Species | Compressed (GB) | Uncompressed (GB) |
---|---|---|
Wild Yak | 3.60 | 12.10 |
Takin | 3.73 | 12.60 |
Water Buffalo | 3.63 | 12.70 |
To support analysis, original files were parsed and converted into structured, readable formats.
Derived from `.gbff` files. Each row corresponds to a gene or feature entry.
Field | Description |
---|---|
Contig | ID of the chromosome/contig |
Feature_Type | Type (gene, ncRNA, source, etc.) |
Start/End | Genomic coordinates |
Strand | +1 or -1 orientation |
Locus_Tag | Unique identifier |
Gene | Gene name |
Product | RNA/protein description |
Protein_ID | Protein identifier |
Translation | Amino acid sequence (if applicable) |
Note | Comments or method used |
File size: 35–75 MB
Derived from the genomic features CSV. Used in downstream tools like InterProScan.
- FASTA format (`>XP_XXXXXX...` headers with protein sequence)
- File size: 18–46 MB
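The conversion from the features CSV to a protein FASTA could look roughly like the sketch below. It assumes the CSV has `Protein_ID` and `Translation` columns named as in the field table above; the function name, the 60-character line width, and the sample row are illustrative and not taken from the project code.

```python
import csv
import io
import textwrap


def features_to_fasta(csv_text, width=60):
    """Return one FASTA record per row that has a Protein_ID and a Translation."""
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        protein_id = (row.get("Protein_ID") or "").strip()
        sequence = (row.get("Translation") or "").strip()
        if not protein_id or not sequence:
            continue  # gene, ncRNA, and source rows carry no translation
        wrapped = "\n".join(textwrap.wrap(sequence, width))
        records.append(f">{protein_id}\n{wrapped}\n")
    return "".join(records)


# Made-up two-row example: one gene row (skipped) and one CDS row (kept).
example_csv = (
    "Contig,Feature_Type,Locus_Tag,Gene,Product,Protein_ID,Translation\n"
    "chr1,gene,TAG_1,GENE1,,,\n"
    "chr1,CDS,TAG_1,GENE1,hypothetical protein,XP_000001.1,MKTAYIAKQR\n"
)
fasta = features_to_fasta(example_csv)
print(fasta)
```

Rows without a translation are skipped rather than written as empty records, which keeps the FASTA valid for tools that reject zero-length sequences.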
Generated from InterProScan runs on protein sequences. Includes functional domains and pathway annotations.
Example:
XP_052517070.1 ... SSF144270 Eferin C-domain ... IPR037245 ... Reactome:R-HSA-432040
- File size: 3.75–10 GB
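Because these annotation files run to several gigabytes, it helps to summarize them in a streaming pass. The sketch below counts how many distinct proteins carry each InterPro entry. It assumes the standard 15-column InterProScan TSV layout (protein accession in column 1, InterPro accession in column 12); the sample rows and identifiers such as `IPR_X` are made up for illustration.

```python
import csv
import io
from collections import Counter

# 0-based column positions, assuming the standard InterProScan TSV layout.
PROTEIN_COL = 0
INTERPRO_COL = 11


def count_interpro_entries(handle):
    """Count distinct proteins per InterPro accession, reading row by row."""
    proteins_by_entry = {}
    for fields in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(fields) <= INTERPRO_COL or fields[INTERPRO_COL] in ("", "-"):
            continue  # hit without an integrated InterPro entry
        proteins_by_entry.setdefault(fields[INTERPRO_COL], set()).add(fields[PROTEIN_COL])
    return Counter({entry: len(ids) for entry, ids in proteins_by_entry.items()})


def _row(protein, analysis, interpro):
    # Hypothetical filler values for the columns this sketch does not use.
    return "\t".join([protein, "md5", "120", analysis, "SIG", "demo", "5", "80",
                      "1e-5", "T", "01-01-2024", interpro, "demo entry", "-", "-"])


sample = "\n".join([
    _row("PROT_A", "Pfam", "IPR_X"),
    _row("PROT_A", "SUPERFAMILY", "IPR_X"),  # same protein, second hit
    _row("PROT_B", "Pfam", "IPR_X"),
    _row("PROT_B", "Coils", "-"),            # no InterPro entry
    _row("PROT_C", "Pfam", "IPR_Y"),
])
counts = count_interpro_entries(io.StringIO(sample))
print(counts.most_common())
```

For a real file, pass an open file handle instead of the `io.StringIO` wrapper so the whole file is never held in memory.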
Subset of genomic features containing only `CDS`, `gene`, and `mRNA` entries.
- Extracted from the original genomic features CSV.
- File size: 22–57 MB
Concatenated amino acid alignments of orthologous proteins.
Field | Description |
---|---|
Species ID | FASTA headers like >Takin , >Yak , >Buffalo |
Amino Acid Seq | MAFFT-aligned, concatenated ortholog sequences |
Sequence Length | Total length of concatenated orthologs |
Alignment Gaps | Represented by - for alignment |
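A concatenated alignment of this kind can be assembled as sketched below, assuming each ortholog group has already been aligned (for example with MAFFT) into a mapping from species name to aligned sequence. The function name and the toy alignments are hypothetical; padding a missing species with gaps is one common convention and may differ from what the project did.

```python
def concatenate_alignments(alignments, species):
    """Join per-ortholog alignments into one long sequence per species."""
    parts = {name: [] for name in species}
    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("aligned sequences in a group must have equal length")
        (length,) = lengths
        for name in species:
            # A species missing from this group is padded with gap characters.
            parts[name].append(aln.get(name, "-" * length))
    return {name: "".join(chunks) for name, chunks in parts.items()}


toy_alignments = [
    {"Takin": "MA-F", "Yak": "MAFF", "Buffalo": "MAFF"},
    {"Takin": "KL", "Yak": "K-"},  # Buffalo absent from this group
]
supermatrix = concatenate_alignments(toy_alignments, ["Takin", "Yak", "Buffalo"])
for name, seq in supermatrix.items():
    print(f">{name}\n{seq}")
```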
- Fast estimation of genetic distance between species using k-mer sketches
- Output: `highres_distances.tsv`
- Yak and Buffalo are genetically closer to each other than either is to Takin
- Protein sequences translated from `.gbff`
- InterProScan run to detect domains, motifs, and GO terms
- Output: `.tsv` with domain annotations for each species
- Extracted biological processes and molecular functions per species
- Identified unique and shared gene functions
- All-vs-all comparison of proteomes
- Output: Ortholog clusters shared across species
- Helps detect species-specific vs. core gene families
- Grouped genes by function (e.g., immune, signaling, cytoskeleton)
- Compared counts across species to detect expansion/loss trends
- Built from MAFFT-aligned orthologous proteins
- Confirmed evolutionary distance (Takin most divergent)
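To separate core ortholog clusters from those shared by only some species, the ProteinOrtho table can be tallied by which species columns are filled. The sketch assumes ProteinOrtho's usual tab-separated layout: a `#`-prefixed header, three leading summary columns, then one column per input proteome with `*` marking absence. The file names in the sample are hypothetical.

```python
from collections import Counter


def cluster_membership(tsv_text):
    """Count ortholog clusters by the set of species they contain."""
    lines = [ln for ln in tsv_text.splitlines() if ln.strip()]
    species = lines[0].lstrip("#").strip().split("\t")[3:]
    counts = Counter()
    for line in lines[1:]:
        if line.startswith("#"):
            continue  # skip any further comment lines
        cells = line.split("\t")[3:]
        present = tuple(sp for sp, cell in zip(species, cells) if cell != "*")
        counts[present] += 1
    return species, counts


sample_table = (
    "# Species\tGenes\tAlg.-Conn.\tyak.faa\ttakin.faa\tbuffalo.faa\n"
    "3\t3\t1\ty1\tt1\tb1\n"
    "2\t3\t0.5\ty2,y3\t*\tb2\n"   # co-orthologs are comma-separated
)
species, membership = cluster_membership(sample_table)
core = membership[tuple(species)]
print(f"core clusters: {core}")
for present, n in membership.items():
    print(", ".join(present), n)
```

Clusters present in every proteome are the candidates for the concatenated phylogeny; the partial ones point to possible lineage-specific gain or loss, subject to annotation completeness.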
Species | Genetic Focus |
---|---|
Wild Yak | Cytoskeletal proteins, RNA-binding, cold response |
Takin | Immune expansion, structural and ECM genes |
Water Buffalo | Broad sensory, immune, growth & stress genes |
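One way to arrive at per-species functional counts like those summarized above is to bucket the `Product` descriptions by keyword. The sketch below is illustrative only: the keyword lists and category names are assumptions, not the mapping used in `gene_grouping.py`. It matches whole words so that, for instance, "interacting" is not mistaken for "actin".

```python
import re
from collections import Counter

# Hypothetical keyword-to-category mapping; extend as needed.
FUNCTION_KEYWORDS = {
    "immune": {"immunoglobulin", "interleukin"},
    "signaling": {"kinase", "receptor"},
    "cytoskeleton": {"actin", "tubulin", "spectrin"},
}


def group_products(products):
    """Assign each product description to the first category whose keyword it contains."""
    counts = Counter()
    for product in products:
        words = set(re.findall(r"[a-z0-9]+", product.lower()))
        for group, keywords in FUNCTION_KEYWORDS.items():
            if words & keywords:
                counts[group] += 1
                break
        else:
            counts["other"] += 1
    return counts


demo_products = [
    "immunoglobulin heavy chain",
    "serine/threonine kinase",
    "spectrin beta chain",
    "dynein-interacting protein",   # contains "actin" only as a substring
    "hypothetical protein",
]
grouped = group_products(demo_products)
print(dict(grouped))
```

Running the same function on each species' product list and comparing the resulting counters gives a first look at expansion or loss trends.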
- Yak vs Buffalo: 97.13%
- Takin vs Buffalo: 94.80%
- Yak vs Takin: 94.56%
- Yak: PDZ, RRM, Spectrin (cold/hypoxia adaptations)
- Takin: Immunoglobulin, Fibronectin, ECM, GPCR
- Buffalo: Richest domain diversity (reproduction, immunity)
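For intuition about the k-mer sketch distances reported above, here is a toy version of the Mash idea: hash all k-mers, keep the smallest hashes as a sketch, estimate the Jaccard index j from the merged sketch, and convert it with D = -ln(2j / (1 + j)) / k. This is a simplified sketch, not Mash itself: it ignores reverse complements, uses SHA-1 instead of Mash's hash function, and the values of `k` and `size` are arbitrary.

```python
import hashlib
import math


def sketch(seq, k, size):
    """Bottom-`size` MinHash sketch: the smallest hashes over all k-mers."""
    kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    hashes = {int.from_bytes(hashlib.sha1(km.encode()).digest()[:8], "big") for km in kmers}
    return sorted(hashes)[:size]


def mash_distance(sketch_a, sketch_b, k, size):
    """Estimate Jaccard from the merged sketch and apply the Mash distance formula."""
    a, b = set(sketch_a), set(sketch_b)
    merged = sorted(a | b)[:size]
    if not merged:
        return 1.0
    j = sum(1 for h in merged if h in a and h in b) / len(merged)
    if j == 0:
        return 1.0
    distance = -math.log(2 * j / (1 + j)) / k
    return min(1.0, max(0.0, distance))


K, SIZE = 4, 100
seq_a = "ACGTTGCAAGGCTTACGATCGATTGCA"
seq_b = "ACGTTGCAAGGCTAACGATCGATTGCA"  # one substitution relative to seq_a
d_same = mash_distance(sketch(seq_a, K, SIZE), sketch(seq_a, K, SIZE), K, SIZE)
d_close = mash_distance(sketch(seq_a, K, SIZE), sketch(seq_b, K, SIZE), K, SIZE)
d_far = mash_distance(sketch("AAAAAAAA", K, SIZE), sketch("CCCCCCCC", K, SIZE), K, SIZE)
print(d_same, d_close, d_far)
```

If the pairwise percentages listed above were derived from Mash output (the source does not say how they were computed), a similarity of roughly 1 - D would be the natural reading.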
This part of the project models current and future habitat suitability for Wild Yak and Takin using geospatial and climate data. It applies machine learning to predict where these animals can survive based on environmental conditions.
Predict species range shifts from 2009 to 2050 using environmental variables and occurrence records.
High-resolution environmental variables used to model habitat suitability for Wild Yak and Takin.
- TerraClimate (2009–2024): https://www.climatologylab.org/terraclimate.html
- WorldClim 2050 SSP245 & SSP585: https://www.worldclim.org/data/cmip6/cmip6_clim2.5m.html
- Google Earth Engine DEM: https://developers.google.com/earth-engine/datasets
- Natural Earth Landmask: https://www.naturalearthdata.com
- Precipitation: TerraClimate `.nc` files (2009–2024), annual sum, stacked.
- Min Temperature: TerraClimate `.nc` files (2009–2024), annual mean.
- Max Temperature: TerraClimate `.nc` files (2009–2024), annual mean.
- Future Climate: WorldClim SSP245 and SSP585 for 2050.
- Elevation: Merged `.tif` from Earth Engine, resampled with GDAL warp.
- Landmask: Rasterized from Natural Earth shapefile.
- Download monthly ppt, tmin, tmax NetCDF files.
- Aggregate annual values using `xarray`.
- Merge and resample elevation `.tif` files using GDAL warp.
- Rasterize Asia land shapefile to create landmask.
- Align all layers to the same spatial grid.
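The annual aggregation step reduces twelve monthly grids to one layer per year: a sum for precipitation and a mean for the temperature variables. The sketch below shows that reduction on plain NumPy arrays as a stand-in; with `xarray` the same operation would be a `sum` or `mean` over the time dimension. The array shapes and values are made up.

```python
import numpy as np


def annual_layer(monthly, how):
    """Collapse a (12, rows, cols) monthly stack into one annual grid."""
    monthly = np.asarray(monthly, dtype=float)
    if monthly.ndim != 3 or monthly.shape[0] != 12:
        raise ValueError("expected a (12, rows, cols) monthly stack")
    if how == "sum":      # e.g. total annual precipitation
        return monthly.sum(axis=0)
    if how == "mean":     # e.g. annual mean of monthly tmin or tmax
        return monthly.mean(axis=0)
    raise ValueError("how must be 'sum' or 'mean'")


# Toy data: 1 mm of rain per month, and tmin rising from 0 to 11 degrees.
ppt = np.ones((12, 2, 3))
tmin = np.stack([np.full((2, 3), float(month)) for month in range(12)])

annual_ppt = annual_layer(ppt, "sum")
annual_tmin = annual_layer(tmin, "mean")
print(annual_ppt[0, 0], annual_tmin[0, 0])
```

Stacking one such layer per year (2009–2024) gives the per-year predictor rasters that are later aligned to the common grid.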
Species presence data used for SDM modeling.
- Species:
  - Wild Yak: 366 records
  - Takin: 692 records
- Columns:
  - Longitude, Latitude
  - Station Name, Climate ID, Date/Time, Year, Month, Day
  - Max/Min/Mean Temp (°C)
  - Heat/Cool Degree Days (°C)
  - Total Rain (mm), Total Snow (cm), Total Precip (mm)
  - Snow on Ground (cm)
  - Wind Gust Direction and Speed
  - Data Quality Flags
- Data cleaned and spatially jittered.
- Combined with environmental layers for model input.
- Downloaded and cleaned species presence data (lat/lon, date).
- Applied spatial jittering to reduce location bias:
- Wild Yak: 10 synthetic points per record
- Takin: 2 synthetic points per record
- Climate variables:
- Total Precipitation
- Minimum Temperature
- Maximum Temperature
- Elevation (resampled to climate resolution)
- Pseudo-absence points generated randomly.
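The jittering and pseudo-absence steps above can be sketched with the standard library alone. The offset size, bounding box, and minimum separation below are assumptions chosen for illustration; the source states only the number of synthetic points per record and that pseudo-absences were drawn randomly. Setting `min_sep_deg=0` gives purely random background points.

```python
import math
import random


def jitter(records, n_per_record, max_offset_deg, rng):
    """Create n synthetic points around each (lon, lat) record."""
    return [
        (lon + rng.uniform(-max_offset_deg, max_offset_deg),
         lat + rng.uniform(-max_offset_deg, max_offset_deg))
        for lon, lat in records
        for _ in range(n_per_record)
    ]


def pseudo_absences(presences, bbox, n, min_sep_deg, rng, max_tries=100_000):
    """Draw random points in bbox, rejecting any too close to a presence."""
    min_lon, min_lat, max_lon, max_lat = bbox
    points = []
    for _ in range(max_tries):
        if len(points) == n:
            break
        lon, lat = rng.uniform(min_lon, max_lon), rng.uniform(min_lat, max_lat)
        if all(math.hypot(lon - p_lon, lat - p_lat) >= min_sep_deg
               for p_lon, p_lat in presences):
            points.append((lon, lat))
    return points


rng = random.Random(42)                      # fixed seed for reproducibility
records = [(90.0, 35.0), (91.0, 34.0)]       # made-up presence coordinates
jittered = jitter(records, 10, 0.05, rng)    # 10 per record, as for Wild Yak
absences = pseudo_absences(jittered, (80.0, 28.0, 100.0, 40.0), 50, 0.5, rng)
print(len(jittered), len(absences))
```

Distances here are measured in degrees for simplicity; a projected coordinate system would be more accurate over a region this large.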
- Algorithm Used: Random Forest Classifier (`scikit-learn`)
- Training/Test Split: 70/30
- Evaluation Metrics: ROC-AUC, confusion matrix
- Best ROC-AUC:
- Wild Yak: 0.999
- Takin: 0.98+
- Suitability scores from 0 to 1 generated for each year (2009–2024).
- Future projections mapped using SSP245 and SSP585 climate scenarios (2050).
- Threshold (0.5) used to classify presence/absence.
- Habitat centroids calculated annually to track spatial shifts.
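To make the evaluation metrics above concrete, the sketch below computes ROC-AUC as the probability that a randomly chosen presence scores higher than a randomly chosen pseudo-absence, and builds a confusion matrix at the 0.5 threshold. It is a plain-Python illustration of what the metrics mean, not the project's scikit-learn code; the labels and scores are toy values.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (presence, absence) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one presence and one absence")
    # Ties between a presence and an absence count as half a correct ranking.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def confusion_matrix(labels, scores, threshold=0.5):
    """Return (tn, fp, fn, tp) after classifying score >= threshold as presence."""
    tn = fp = fn = tp = 0
    for y, s in zip(labels, scores):
        predicted = 1 if s >= threshold else 0
        if y == 1 and predicted == 1:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
matrix = confusion_matrix(labels, scores)
print(auc, matrix)
```

The pairwise loop is quadratic in the number of points, which is fine for a few thousand records but slower than a rank-based implementation on large test sets.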
Species | Trend | Elevation Shift | Centroid Movement |
---|---|---|---|
Wild Yak | Habitat shrinks by 2050 | 4750 m → ~4810 m | NW by ~110 km |
Takin | Habitat expands by 2050 | Increase expected | W by ~121 km |
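Centroid shifts like those in the table can be derived in three steps: threshold the suitability grid, average the coordinates of the suitable cells, and measure the great-circle distance and bearing between two centroids. The sketch below assumes an unweighted mean of cell centres and a spherical Earth; the grid, coordinates, and function names are illustrative rather than taken from `SDM_Final.ipynb`.

```python
import math


def suitable_centroid(grid, lons, lats, threshold=0.5):
    """Mean (lon, lat) of cells whose suitability is at or above the threshold."""
    cells = [(lons[c], lats[r])
             for r, row in enumerate(grid)
             for c, score in enumerate(row) if score >= threshold]
    if not cells:
        return None
    n = len(cells)
    return sum(lon for lon, _ in cells) / n, sum(lat for _, lat in cells) / n


def haversine_km(lon1, lat1, lon2, lat2, radius_km=6371.0088):
    """Great-circle distance between two points on a spherical Earth."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin((phi2 - phi1) / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


def bearing_deg(lon1, lat1, lon2, lat2):
    """Initial compass bearing from point 1 to point 2 (0 = N, 270 = W)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlam = math.radians(lon2 - lon1)
    x = math.sin(dlam) * math.cos(phi2)
    y = math.cos(phi1) * math.sin(phi2) - math.sin(phi1) * math.cos(phi2) * math.cos(dlam)
    return (math.degrees(math.atan2(x, y)) + 360) % 360


grid = [[0.9, 0.2], [0.7, 0.1]]          # toy suitability scores
centroid = suitable_centroid(grid, lons=[90.0, 91.0], lats=[35.0, 34.0])
one_degree_north = haversine_km(0.0, 0.0, 0.0, 1.0)
due_west = bearing_deg(1.0, 0.0, 0.0, 0.0)
print(centroid, round(one_degree_north, 2), due_west)
```

Applying `suitable_centroid` to each year's map and then `haversine_km` and `bearing_deg` to the first and last centroids yields a shift expressed as a distance and compass direction.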
SDM/
├── Code/
│   └── SDM_Final.ipynb                       # Core modeling notebook
├── Data/
│   ├── takin_Final_cleaned.xls
│   ├── wild_yak_Final_cleaned.xls
│   └── elevation_resampled_to_climate.tif
└── Output/
    ├── sdm_takin/
    │   ├── suitability_map_20XX.png, .npy
    │   ├── centroid_shifts_takin.csv
    │   └── takin_suitability_area_trend.png
    └── sdm_yak/
        ├── suitability_map_20XX.png, .npy
        ├── centroid_shifts.csv
        └── yak_suitability_area_trend_final.png
cd SDM/Code
jupyter notebook SDM_Final.ipynb
Make sure to have the following Python packages installed:
pip install scikit-learn rasterio xarray numpy pandas matplotlib
Gene_Feature_Extraction/
├── 1_genomic_feature_extraction/        # Extract CSV from GBFF
├── 2_overview_of_features/              # Plot gene feature stats
├── 3_gene_grouping/                     # Compare gene families
├── 4_protein_translation/               # Extract & convert to FASTA
├── 5_protein_extraction_and_analysis/   # Run & analyze InterProScan
├── 6_GOandKegg_Pathways/                # Enrich and cluster GO terms
├── 7_Gene_extraction/                   # Extract CDS-only features
├── 8_gene_visualization/                # Plot gene product overlaps
├── 9-ProteinOrtho-Orthologs_analysis/   # Core ortholog clustering
└── 10-Phylogenetic_Tree_ortholog/       # Build & visualize tree
interproscan.sh -i species_proteins.fasta -o output.tsv -f TSV
python gene_grouping.py
proteinortho5.pl *.faa > myproject.proteinortho.tsv
python extract_output_heatmap.py
- Python: Data processing and visualization
- WSL / Linux: Running heavy tools like InterProScan, Mash
- R: Functional clustering, GO enrichment
- ProteinOrtho, MAFFT, IQTree: For phylogeny and orthologs