
Shows how Wild Yak, Takin, and Water Buffalo adapt to climate change using data-driven habitat modeling, spatial mapping, and genomic analysis. Combines machine learning and biology to uncover patterns in where these animals live and how their genes respond to the environment, providing insights that support real-world conservation decisions.


SathyaV99/Geospatial-Modeling-and-Genomic-Analysis-for-Climate-Adaptation


Genomic and Geographic Adaptations to Climate Change: A Comparative Study of Wild Yak, Takin, and High-Altitude Bovids

Wild Yak faces substantial habitat loss and altitudinal displacement by 2050, driven by climate change and limited adaptive genomic traits.

Computationally intensive analyses such as InterProScan and BLAST each took 2–4 days to complete for a single species.

This project investigates how high-altitude bovids (Wild Yak, Takin, and Water Buffalo) adapt to climate change. It combines species distribution modeling and comparative genomics to find out:

  • Where their habitats are now and where they'll likely shift in the future

  • What genes and protein functions help them survive in extreme environments

It uses Python and R with machine learning (Random Forest), spatial analysis (centroid tracking, jittering), and genomic tools like Mash, InterProScan, ProteinOrtho, and GO enrichment.


Key finding for SDM:

  • Wild yak habitats are shrinking and moving uphill. Takin habitats are expanding. These results help guide future conservation efforts.


Key finding for Genomic Analysis:

  • Wild yaks have fewer heat shock genes and, because of their thick fur, struggle to shed body heat. They are adapted to cold climates at elevations of roughly 3,000 m and above and cannot tolerate warmer temperatures.
[Figure: immune_genes]

🐝 Genomic Analysis

This part of the project investigates genetic adaptations of Wild Yak, Takin, and Water Buffalo by comparing their full genomes, protein domains, and gene families.


🔸 Objective

Identify genetic traits linked to high-altitude survival using comparative genomics and functional annotations.


📦 Data Sources

To perform a comprehensive comparative genomics analysis of high-altitude bovids, we curated and processed full genome assemblies and annotations for three target species: Wild Yak (Bos mutus), Takin (Budorcas taxicolor), and Water Buffalo (Bubalus bubalis). These datasets were downloaded from the NCBI Assembly and GenBank/RefSeq repositories, using the most recent high-quality assemblies.

🔗 Genome Assembly Links


📁 Genome File Types and Attributes

i) Original Datasets

Each genome dataset from NCBI contains multiple standard annotation files:

| File Type | Format | Description |
|---|---|---|
| .genomic.fna | FASTA | Whole-genome nucleotide sequence |
| cds_from_genomic.fna | FASTA | Coding sequences (CDS) |
| genomic.gbff | GBFF | GenBank flat file with annotations |
| genomic.gff / .gtf | GFF/GTF | Gene feature coordinates |
| *.faa (derived) | FASTA | Translated protein sequences |

Genome file sizes (compressed/uncompressed):

| Species | Compressed (GB) | Uncompressed (GB) |
|---|---|---|
| Wild Yak | 3.60 | 12.10 |
| Takin | 3.73 | 12.60 |
| Water Buffalo | 3.63 | 12.70 |

ii) Derived Datasets

To support analysis, original files were parsed and converted into structured, readable formats.

a) Genomic Features Dataset (*_genomic_features.csv)

Derived from .gbff files. Each row corresponds to a gene or feature entry.

| Field | Description |
|---|---|
| Contig | ID of the chromosome/contig |
| Feature_Type | Type (gene, ncRNA, source, etc.) |
| Start/End | Genomic coordinates |
| Strand | +1 or -1 orientation |
| Locus_Tag | Unique identifier |
| Gene | Gene name |
| Product | RNA/protein description |
| Protein_ID | Protein identifier |
| Translation | Amino acid sequence (if applicable) |
| Note | Comments or method used |

File size: 35–75 MB


b) Protein FASTA Dataset (*_proteins.fasta)

Derived from the genomic features CSV. Used in downstream tools like InterProScan.

  • FASTA format (>XP_XXXXXX... headers with protein sequence)
  • File size: 18–46 MB
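
As a rough illustration of this conversion, the sketch below turns feature rows into FASTA records. It assumes the CSV carries the `Protein_ID` and `Translation` columns listed above; the function name, the 60-character line width, and the sample rows are illustrative and not taken from the project code.

```python
import csv
import io


def features_csv_to_fasta(csv_text, width=60):
    """Build FASTA text from feature rows that carry a protein translation."""
    records = []
    seen = set()
    for row in csv.DictReader(io.StringIO(csv_text)):
        protein_id = (row.get("Protein_ID") or "").strip()
        sequence = (row.get("Translation") or "").strip()
        # Rows without a protein (gene, ncRNA, source, ...) and repeated IDs are skipped.
        if not protein_id or not sequence or protein_id in seen:
            continue
        seen.add(protein_id)
        lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
        records.append(">" + protein_id + "\n" + "\n".join(lines))
    return "\n".join(records) + ("\n" if records else "")


# Made-up rows for illustration only (not real project data).
example_csv = (
    "Contig,Feature_Type,Protein_ID,Translation\n"
    "contig_1,gene,,\n"
    "contig_1,CDS,XP_EXAMPLE.1,MKTAYIAKQR\n"
)
fasta_text = features_csv_to_fasta(example_csv)
print(fasta_text)
```

Skipping repeated protein IDs is a design choice in this sketch; it keeps the FASTA headers unique for downstream tools.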

c) InterProScan Annotation Dataset (*_interpro.tsv)

Generated from InterProScan runs on protein sequences. Includes functional domains and pathway annotations.

Example:

XP_052517070.1 ... SSF144270 Eferin C-domain ... IPR037245 ... Reactome:R-HSA-432040
  • File size: 3.75–10 GB
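
Because these files run to several gigabytes, it helps to stream them row by row. The sketch below reads InterProScan's headerless TSV layout and counts hits per InterPro entry. The column names are labels chosen for this example, and the sample row reuses the identifiers from the example above with placeholder values for every field that was elided there.

```python
import csv
import io
from collections import Counter

# Labels for the 15 columns of InterProScan TSV output (the file has no header row).
COLUMNS = [
    "protein", "md5", "length", "analysis", "signature_acc", "signature_desc",
    "start", "stop", "score", "status", "date",
    "interpro_acc", "interpro_desc", "go_terms", "pathways",
]


def read_interpro_hits(handle):
    """Yield one dict per TSV row without loading the whole file."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if not fields:
            continue
        # Optional trailing columns may be missing, so pad to full width.
        fields = fields + [""] * (len(COLUMNS) - len(fields))
        yield dict(zip(COLUMNS, fields))


def count_interpro_entries(handle):
    """Count hits per InterPro accession, ignoring rows without one ('-')."""
    counts = Counter()
    for hit in read_interpro_hits(handle):
        accession = hit["interpro_acc"]
        if accession and accession != "-":
            counts[accession] += 1
    return counts


# Placeholder row: only the accessions and description come from the example above.
sample_row = "\t".join([
    "XP_052517070.1", "MD5_PLACEHOLDER", "0", "SUPERFAMILY", "SSF144270",
    "Eferin C-domain", "0", "0", "-", "T", "-", "IPR037245", "-", "-",
    "Reactome:R-HSA-432040",
])
entry_counts = count_interpro_entries(io.StringIO(sample_row + "\n"))
print(entry_counts)
```

For a real file, pass an open file handle instead of the `io.StringIO` wrapper.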

d) Genomic CDS Features Dataset (*_CDS.csv)

Subset of genomic features containing only CDS, gene, and mRNA entries.

  • Extracted from the original genomic features CSV.
  • File size: 22–57 MB
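
A minimal sketch of this subsetting step, assuming the `Feature_Type` column from the genomic features CSV; the function name and sample rows are made up for illustration.

```python
import csv
import io

# Feature types retained in the CDS subset, as described above.
KEEP_TYPES = {"CDS", "gene", "mRNA"}


def filter_cds_features(csv_text):
    """Return CSV text containing only CDS, gene, and mRNA rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    output = io.StringIO()
    writer = csv.DictWriter(output, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        if row["Feature_Type"] in KEEP_TYPES:
            writer.writerow(row)
    return output.getvalue()


# Made-up rows for illustration only.
example_features = (
    "Contig,Feature_Type,Gene\n"
    "contig_1,source,\n"
    "contig_1,gene,GENE_A\n"
    "contig_1,mRNA,GENE_A\n"
    "contig_1,CDS,GENE_A\n"
    "contig_1,ncRNA,GENE_B\n"
)
cds_subset = filter_cds_features(example_features)
print(cds_subset)
```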

e) SuperMatrix Dataset (core_orthologs_supermatrix.fasta)

Concatenated amino acid alignments of orthologous proteins.

| Field | Description |
|---|---|
| Species ID | FASTA headers like >Takin, >Yak, >Buffalo |
| Amino Acid Seq | MAFFT-aligned, concatenated ortholog sequences |
| Sequence Length | Total length of concatenated orthologs |
| Alignment Gaps | Represented by - for alignment |
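
The concatenation behind a supermatrix can be sketched as follows. This assumes each ortholog group has already been aligned (for example with MAFFT) so that all species share one alignment length per group; the function names and toy sequences are illustrative only.

```python
def build_supermatrix(alignments, species):
    """Concatenate per-ortholog alignments into one sequence per species.

    alignments: list of dicts mapping species name -> aligned sequence.
    """
    parts = {name: [] for name in species}
    for index, alignment in enumerate(alignments):
        lengths = {len(alignment[name]) for name in species}
        if len(lengths) != 1:
            raise ValueError(f"alignment {index} has unequal sequence lengths")
        for name in species:
            parts[name].append(alignment[name])
    return {name: "".join(chunks) for name, chunks in parts.items()}


def supermatrix_to_fasta(matrix):
    """Write one FASTA record per species, e.g. >Takin, >Yak, >Buffalo."""
    return "".join(f">{name}\n{sequence}\n" for name, sequence in matrix.items())


# Toy alignments for illustration only; '-' marks an alignment gap.
toy_alignments = [
    {"Takin": "MK-A", "Yak": "MKTA", "Buffalo": "MKTA"},
    {"Takin": "GG", "Yak": "G-", "Buffalo": "GG"},
]
matrix = build_supermatrix(toy_alignments, ["Takin", "Yak", "Buffalo"])
print(supermatrix_to_fasta(matrix))
```

Every species ends up with the same total length, which is what tree-building tools expect from a concatenated alignment.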

🛠️ Methodology

1. Genome-Wide Similarity with Mash

  • Fast estimation of genetic distance between species using k-mer sketches
  • Output: highres_distances.tsv
  • Yak and Buffalo are genetically closer to each other than either is to Takin
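
To show what a k-mer based distance measures, the sketch below computes the Mash distance formula, D = -(1/k) · ln(2j / (1 + j)), from a Jaccard index j. It is a simplification: it compares exact k-mer sets on toy strings, whereas Mash estimates j from MinHash sketches of whole genomes. The helper names, the toy sequences, and k = 4 are assumptions for illustration.

```python
import math


def kmer_set(sequence, k):
    """All distinct k-mers in a sequence (no sketching, no reverse complements)."""
    sequence = sequence.upper()
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def jaccard(a, b):
    """Shared k-mers divided by all k-mers seen in either set."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mash_distance(j, k):
    """Mash distance from a Jaccard index j and k-mer size k."""
    if j <= 0.0:
        return 1.0
    return max(0.0, -math.log(2.0 * j / (1.0 + j)) / k)


def similarity_percent(distance):
    """Express a distance as percent similarity, 100 * (1 - D)."""
    return 100.0 * (1.0 - distance)


# Toy sequences for illustration only.
seq_a = "ACGTACGTTGCAAGCT"
seq_b = "ACGTACGTTGCTAGCT"
k = 4
d = mash_distance(jaccard(kmer_set(seq_a, k), kmer_set(seq_b, k)), k)
print(round(d, 4), round(similarity_percent(d), 2))
```

A smaller distance (higher percent similarity) means two genomes share more of their k-mers.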

2. Protein Domain Analysis with InterProScan

  • Protein sequences translated from .gbff
  • InterProScan run to detect domains, motifs, and GO terms
  • Output: .tsv with domain annotations for each species

3. Functional Enrichment (GO)

  • Extracted biological processes and molecular functions per species
  • Identified unique and shared gene functions

4. Ortholog Detection with ProteinOrtho

  • All-vs-all comparison of proteomes
  • Output: Ortholog clusters shared across species
  • Helps detect species-specific vs. core gene families
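
One way to pull core gene families out of a ProteinOrtho table is to keep groups where every species contributes exactly one protein. The sketch below assumes the usual tab-separated layout (a `# Species  Genes  Alg.-Conn.` header followed by one column per proteome, `*` for a missing species, commas between co-orthologs); the file names and protein IDs are invented for the example.

```python
def core_single_copy_groups(tsv_text):
    """Return (species columns, groups with exactly one protein per species)."""
    species = []
    groups = []
    for line in tsv_text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        if line.startswith("#"):
            # The header row names one proteome file per column after the first three.
            if fields[0].lstrip("# ").startswith("Species"):
                species = fields[3:]
            continue
        genes = fields[3:]
        if len(genes) != len(species):
            continue
        # '*' marks an absent species; a comma marks several co-orthologs.
        if all(gene != "*" and "," not in gene for gene in genes):
            groups.append(dict(zip(species, genes)))
    return species, groups


# Invented table for illustration only.
example_table = (
    "# Species\tGenes\tAlg.-Conn.\tyak.faa\ttakin.faa\tbuffalo.faa\n"
    "3\t3\t1\tyak_p1\ttakin_p1\tbuffalo_p1\n"
    "3\t4\t0.8\tyak_p2,yak_p3\ttakin_p2\tbuffalo_p2\n"
    "2\t2\t1\tyak_p4\t*\tbuffalo_p4\n"
)
species_columns, core_groups = core_single_copy_groups(example_table)
print(species_columns, core_groups)
```

Rows with a `*` indicate families missing from at least one species, which is where species-specific gains or losses would show up.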

5. Gene Grouping and Product Distribution

  • Grouped genes by function (e.g., immune, signaling, cytoskeleton)
  • Compared counts across species to detect expansion/loss trends
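
A simple way to group genes by function is keyword matching on the product description. The keyword lists below are hypothetical examples, not the ones used in `gene_grouping.py`, and the first matching group wins.

```python
from collections import Counter

# Hypothetical keyword lists; a real grouping would need curated terms.
GROUP_KEYWORDS = {
    "immune": ("immunoglobulin", "interleukin", "toll-like"),
    "signaling": ("kinase", "receptor"),
    "cytoskeleton": ("actin", "tubulin", "spectrin"),
}


def assign_group(product):
    """Return the first functional group whose keyword appears in the product name."""
    text = product.lower()
    for group, keywords in GROUP_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return group
    return "other"


def group_counts(products):
    """Count products per functional group for one species."""
    return Counter(assign_group(product) for product in products)


# Made-up product names for illustration only.
toy_products = [
    "toll-like receptor 4",
    "spectrin beta chain",
    "serine/threonine-protein kinase",
    "hypothetical protein",
]
print(group_counts(toy_products))
```

Running `group_counts` once per species gives count tables that can be compared side by side to look for expansion or loss.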

6. Phylogenetic Tree Construction

  • Built from MAFFT-aligned orthologous proteins
  • Confirmed evolutionary distance (Takin most divergent)

📈 Key Findings

| Species | Genetic Focus |
|---|---|
| Wild Yak | Cytoskeletal proteins, RNA-binding, cold response |
| Takin | Immune expansion, structural and ECM genes |
| Water Buffalo | Broad sensory, immune, growth & stress genes |


Genetic Similarity (Mash)

Values are percent similarity, so higher means more closely related.

  • Yak vs Buffalo: 97.13%
  • Takin vs Buffalo: 94.80%
  • Yak vs Takin: 94.56%

Domain Highlights

  • Yak: PDZ, RRM, Spectrin (cold/hypoxia adaptations)
  • Takin: Immunoglobulin, Fibronectin, ECM, GPCR
  • Buffalo: Richest domain diversity (reproduction, immunity)



🌍 Species Distribution Modeling (SDM)

This part of the project models current and future habitat suitability for Wild Yak and Takin using geospatial and climate data. It applies machine learning to predict where these animals can survive based on environmental conditions.


🔸 Objective

Predict species range shifts from 2009 to 2050 using environmental variables and occurrence records.


📦 Data Sources

🌦️ Climate and Environmental Data

High-resolution environmental variables used to model habitat suitability for Wild Yak and Takin.

Data Sources


Environmental Data Summary

  • Precipitation: TerraClimate .nc files (2009–2024), annual sum, stacked.
  • Min Temperature: TerraClimate .nc files (2009–2024), annual mean.
  • Max Temperature: TerraClimate .nc files (2009–2024), annual mean.
  • Future Climate: WorldClim SSP245 and SSP585 for 2050.
  • Elevation: Merged .tif from Earth Engine, resampled with GDAL warp.
  • Landmask: Rasterized from Natural Earth shapefile.

Processing Steps

  • Download monthly ppt, tmin, tmax NetCDF files.
  • Aggregate annual values using xarray.
  • Merge and resample elevation .tif files using GDAL warp.
  • Rasterize Asia land shapefile to create landmask.
  • Align all layers to the same spatial grid.

📍 Occurrence Data

Species presence data used for SDM modeling.

  • Species:

    • Wild Yak: 366 records
    • Takin: 692 records
  • Columns:

    • Longitude, Latitude
    • Station Name, Climate ID, Date/Time, Year, Month, Day
    • Max/Min/Mean Temp (°C)
    • Heat/Cool Degree Days (°C)
    • Total Rain (mm), Total Snow (cm), Total Precip (mm)
    • Snow on Ground (cm)
    • Wind Gust Direction and Speed
    • Data Quality Flags
  • Data cleaned and spatially jittered.

  • Combined with environmental layers for model input.


🛠️ Methodology

1. Data Preprocessing

  • Downloaded and cleaned species presence data (lat/lon, date).
  • Applied spatial jittering to reduce location bias:
    • Wild Yak: 10 synthetic points per record
    • Takin: 2 synthetic points per record
  • Climate variables:
    • Total Precipitation
    • Minimum Temperature
    • Maximum Temperature
    • Elevation (resampled to climate resolution)
  • Pseudo-absence points generated randomly.
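
The jittering and pseudo-absence steps can be sketched with the standard library alone. The points-per-record counts (10 for Wild Yak, 2 for Takin) come from the list above; the jitter size, the bounding box, the coordinates, and the function names are assumptions made for this example.

```python
import random


def jitter_points(records, n_per_record, max_offset_deg, rng):
    """Create synthetic presence points by adding small random offsets."""
    synthetic = []
    for lon, lat in records:
        for _ in range(n_per_record):
            synthetic.append((
                lon + rng.uniform(-max_offset_deg, max_offset_deg),
                lat + rng.uniform(-max_offset_deg, max_offset_deg),
            ))
    return synthetic


def random_pseudo_absences(n, bounds, rng):
    """Draw n random points inside (min_lon, min_lat, max_lon, max_lat)."""
    min_lon, min_lat, max_lon, max_lat = bounds
    return [
        (rng.uniform(min_lon, max_lon), rng.uniform(min_lat, max_lat))
        for _ in range(n)
    ]


rng = random.Random(42)  # fixed seed so the example is repeatable
# Made-up coordinates (lon, lat) for illustration only.
yak_records = [(88.5, 34.2), (90.1, 35.0)]
yak_synthetic = jitter_points(yak_records, n_per_record=10, max_offset_deg=0.05, rng=rng)
absences = random_pseudo_absences(20, bounds=(75.0, 27.0, 105.0, 40.0), rng=rng)
print(len(yak_synthetic), len(absences))
```

In practice, pseudo-absences would usually also be restricted to land cells and kept away from known presence points; this sketch omits both filters.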

2. Modeling

  • Algorithm Used: Random Forest Classifier (scikit-learn)
  • Training/Test Split: 70/30
  • Evaluation Metrics: ROC-AUC, confusion matrix
  • Best ROC-AUC:
    • Wild Yak: 0.999
    • Takin: 0.98+
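
For readers unfamiliar with the evaluation setup, the sketch below shows a 70/30 split and a ROC-AUC computed directly from its definition: the share of presence/absence pairs in which the presence point receives the higher score, with ties counted as half. It is a standard-library illustration, not the scikit-learn code used in the notebook, and the labels and scores are toy values.

```python
import random


def split_indices(n, test_fraction=0.3, seed=0):
    """Shuffle indices 0..n-1 and split them into (train, test)."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = round(n * test_fraction)
    return indices[n_test:], indices[:n_test]


def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        (p > n) + 0.5 * (p == n) for p in positives for n in negatives
    )
    return wins / (len(positives) * len(negatives))


train_idx, test_idx = split_indices(10)
# Toy labels (1 = presence, 0 = pseudo-absence) and model scores.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(len(train_idx), len(test_idx), roc_auc(toy_labels, toy_scores))
```

The pairwise loop is quadratic in the number of points, which is fine for a demonstration but slower than a rank-based implementation on large test sets.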

3. Prediction & Mapping

  • Suitability scores from 0 to 1 generated for each year (2009–2024).
  • Future projections mapped using SSP245 and SSP585 climate scenarios (2050).
  • Threshold (0.5) used to classify presence/absence.
  • Habitat centroids calculated annually to track spatial shifts.
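
The thresholding and centroid tracking can be illustrated on a tiny grid. This sketch assumes a suitability grid indexed as `grid[row][col]` with matching latitude and longitude lists, treats cells at or above 0.5 as suitable, and takes an unweighted mean of cell-center coordinates; those choices and all names are assumptions, not the notebook's implementation.

```python
import math


def suitable_centroid(grid, lats, lons, threshold=0.5):
    """Mean (lat, lon) of cells whose suitability is at or above the threshold."""
    cells = [
        (lats[i], lons[j])
        for i, row in enumerate(grid)
        for j, value in enumerate(row)
        if value >= threshold
    ]
    if not cells:
        return None
    mean_lat = sum(lat for lat, _ in cells) / len(cells)
    mean_lon = sum(lon for _, lon in cells) / len(cells)
    return mean_lat, mean_lon


def haversine_km(a, b, radius_km=6371.0):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * radius_km * math.asin(math.sqrt(h))


# Toy 2x2 suitability grids for two years (illustration only).
lats, lons = [30.0, 31.0], [80.0, 81.0]
grid_early = [[0.2, 0.7], [0.9, 0.1]]
grid_late = [[0.1, 0.2], [0.9, 0.3]]
c_early = suitable_centroid(grid_early, lats, lons)
c_late = suitable_centroid(grid_late, lats, lons)
print(c_early, c_late, round(haversine_km(c_early, c_late), 1))
```

Comparing each year's centroid with the first year's gives the direction and distance of the shift.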

📈 Key Results

| Species | Trend | Elevation Shift | Centroid Movement |
|---|---|---|---|
| Wild Yak | Habitat shrinks by 2050 | 4750 m → ~4810 m | NW by ~110 km |
| Takin | Habitat expands by 2050 | Increase expected | W by ~121 km |

Visualization

[Figures: Wild Yak]


🗂️ File Structure (SDM)

SDM/
├── Code/
│   └── SDM_Final.ipynb        # Core modeling notebook
├── Data/
│   ├── takin_Final_cleaned.xls
│   ├── wild_yak_Final_cleaned.xls
│   └── elevation_resampled_to_climate.tif
├── Output/
│   ├── sdm_takin/
│   │   ├── suitability_map_20XX.png, .npy
│   │   ├── centroid_shifts_takin.csv
│   │   └── takin_suitability_area_trend.png
│   └── sdm_yak/
│       ├── suitability_map_20XX.png, .npy
│       ├── centroid_shifts.csv
│       └── yak_suitability_area_trend_final.png

▢️ How to Run

cd SDM/Code
jupyter notebook SDM_Final.ipynb

Make sure to have the following Python packages installed:

pip install scikit-learn rasterio xarray numpy pandas matplotlib

🗂️ File Structure (Genomic Analysis)

Gene_Feature_Extraction/
├── 1_genomic_feature_extraction/       # Extract CSV from GBFF
├── 2_overview_of_features/             # Plot gene feature stats
├── 3_gene_grouping/                    # Compare gene families
├── 4_protein_translation/              # Extract & convert to FASTA
├── 5_protein_extraction_and_analysis/  # Run & analyze InterProScan
├── 6_GOandKegg_Pathways/               # Enrich and cluster GO terms
├── 7_Gene_extraction/                  # Extract CDS-only features
├── 8_gene_visualization/               # Plot gene product overlaps
├── 9-ProteinOrtho-Orthologs_analysis/  # Core ortholog clustering
├── 10-Phylogenetic_Tree_ortholog/      # Build & visualize tree

▢️ How to Run

InterProScan:

interproscan.sh -i species_proteins.fasta -o output.tsv -f TSV

Gene Grouping:

python gene_grouping.py

Ortholog Detection:

proteinortho5.pl *.faa > myproject.proteinortho.tsv

Plot Heatmaps:

python extract_output_heatmap.py

🔧 Tools Used

  • Python: Data processing and visualization
  • WSL / Linux: Running heavy tools like InterProScan, Mash
  • R: Functional clustering, GO enrichment
  • ProteinOrtho, MAFFT, IQ-TREE: Ortholog detection, alignment, and phylogeny
