Git-based viral taxonomy management system - track changes, migrate datasets, and cite specific versions. Website: https://shandley.github.io/ICTV-git/

ICTV-git: Git-based Viral Taxonomy Management

License: MIT | Python 3.8+ | Data: ICTV MSL | Research: Virology | Status: Active | DOI

🦠 Revolutionizing Viral Taxonomy with Version Control

ICTV-git transforms the International Committee on Taxonomy of Viruses (ICTV) classification system into a transparent, versioned, and community-driven platform built on git version control principles. It targets the reproducibility crisis in virology research by letting researchers track taxonomic changes, migrate datasets between versions, and cite specific taxonomy versions.

🚀 Current Status: Phase 3 Complete

  • ✅ Phase 1 - Family Size Analysis: Evidence-based guidelines for viral family management
  • ✅ Phase 2 - Temporal Evolution: Multi-rank growth patterns and taxonomic stability analysis
  • ✅ Phase 3 - Discovery Method Evolution: Technology-driven discovery paradigm shifts
  • ✅ Real Data Framework: Strict policy ensuring all analyses use only documented ICTV statistics
  • ✅ 12 Publication-Ready Visualizations: Comprehensive plots documenting viral taxonomy evolution

🎯 The Problem We Solve

Current viral taxonomy management suffers from:

  • Breaking changes without migration paths - The Caudovirales reclassification eliminated 50+ years of ecological data associations
  • Version incompatibility - 18 MSL releases since 2005, each incompatible with the others
  • Lost institutional knowledge - When families split, historical reasoning disappears
  • Reproducibility crisis - Papers published months apart use incompatible taxonomies

✨ Completed Research Analyses

Phase 1: Family Size Analysis

  • 📊 Optimal Size Guidelines: 50-300 species per family based on 20-year patterns
  • 🔍 Caudovirales Case Study: 1,847 species reorganization from 3 to 15 families
  • 📈 Growth Metrics: 14.8x species increase with 15.2% annual growth
  • ⚠️ Crisis Thresholds: Families >1,000 species require immediate action

Phase 2: Temporal Evolution Analysis

  • 📈 Multi-Rank Evolution: Species (14.8x), Genera (15.5x), Families (4.5x) growth
  • 🚀 5 Acceleration Periods: Technology-driven growth spikes (up to 79.7% in 2017)
  • 📊 Stability Analysis: Families most stable (CV=0.425), species least stable (CV=0.512)
  • 🔄 Technology Eras: Pre-NGS → NGS → Metagenomics → AI (2005-2024)
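
The stability figures above are coefficients of variation (CV), that is, standard deviation divided by mean. The README does not say which series the CV is computed over, so the following is only a minimal sketch assuming it is applied to a list of per-period growth rates; the function name and the sample values are hypothetical and are not ICTV data.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Return population standard deviation divided by the absolute mean."""
    values = list(values)
    if not values:
        raise ValueError("need at least one value")
    average = mean(values)
    if average == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return pstdev(values) / abs(average)


# Hypothetical growth rates in percent, for illustration only (not ICTV data).
example_rates = [5.0, 10.0, 15.0]
print(f"CV = {coefficient_of_variation(example_rates):.3f}")
```

A lower CV means the series varies less relative to its own average, which is the sense in which families are described as more stable than species.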

Phase 3: Discovery Method Evolution

  • 🔬 4 Discovery Eras: Culture → Molecular → Metagenomics → AI-assisted
  • 📈 34x Discovery Rate Increase: From 125 to 3,297 species/year
  • 🌍 Paradigm Shift: 90% pathogen-focused → 70% environmental-focused
  • 💰 200x Cost Reduction: $10,000 → $50 per genome enabling mass discovery

🚀 Quick Start

Installation

# Clone the repository
git clone https://github.com/shandley/ICTV-git.git
cd ICTV-git

# Create virtual environment (recommended)
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install matplotlib pandas numpy pyyaml

Run Research Analyses

# Phase 1: Family Size Analysis
python research/family_size_analysis/basic_analysis.py
python research/family_size_analysis/create_matplotlib_plots.py

# Phase 2: Temporal Evolution Analysis
python research/temporal_evolution_analysis/temporal_analysis.py
python research/temporal_evolution_analysis/create_temporal_plots.py

# Phase 3: Discovery Method Evolution
python research/discovery_method_evolution/discovery_method_analysis.py
python research/discovery_method_evolution/create_discovery_plots.py

Key Outputs

  • Analysis Results: JSON files in each phase's results/ directory
  • Publication Plots: 12 total plots (4 per phase) as PNG/PDF files
  • Research Reports: Comprehensive findings documents for each phase
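
To check which analysis outputs exist after running the scripts, a small helper can list the JSON files under each phase's results/ directory. This is a sketch that assumes the research/<phase>/results/ layout shown in this README; the function name is hypothetical.

```python
from pathlib import Path


def find_result_files(repo_root="."):
    """Return sorted JSON result files under research/<phase>/results/."""
    research_dir = Path(repo_root) / "research"
    if not research_dir.is_dir():
        return []  # nothing to list if the scripts have not been run here
    return sorted(research_dir.glob("*/results/*.json"))


if __name__ == "__main__":
    # Run from the repository root after the analysis scripts have finished.
    for path in find_result_files():
        print(path)
```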

📊 Research Results

Key Discoveries from Real ICTV Data (2005-2024)

Our analysis of 20 years of ICTV Master Species Lists revealed:

  • 14.8x Growth: Viral species increased from 1,950 (MSL23, 2005) to 28,911 (MSL40, 2024)
  • 15.2% Annual Growth: Consistent exponential growth driven by technology advances
  • Technology Acceleration: Major growth spurts during sequencing cost reduction (2012), metagenomics revolution (2017), and COVID-19 response (2021)
  • Caudovirales Dissolution: Largest reorganization in ICTV history - 1,847 species split from 3 families into 15 families (2021)
  • Family Size Crisis: Evidence-based framework showing families >1,000 species require immediate reorganization
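
The headline growth figures can be checked from the two species counts quoted above. The sketch below assumes the annual rate is a compound annual growth rate over the 19 year-to-year intervals between 2005 and 2024; that interval count is an assumption, and the function names are illustrative.

```python
def fold_change(start, end):
    """How many times larger the end value is than the start value."""
    return end / start


def compound_annual_growth_rate(start, end, intervals):
    """Constant yearly rate that turns start into end over the given intervals."""
    return (end / start) ** (1 / intervals) - 1


# Species counts quoted in this README: MSL23 (2005) and MSL40 (2024).
start_species, end_species = 1_950, 28_911
intervals = 2024 - 2005  # assumption: 19 annual intervals

fold = fold_change(start_species, end_species)
cagr = compound_annual_growth_rate(start_species, end_species, intervals)
print(f"{fold:.1f}x growth, {cagr:.1%} per year")
```

Under these assumptions the result is roughly 14.8x overall and about 15.2% per year, in line with the figures reported above.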

🎓 Research Applications

Family Size Management

  • Evidence-based Guidelines: Optimal family sizes (50-300 species) based on real ICTV patterns
  • Crisis Prevention: Early warning system for families approaching instability (>1,000 species)
  • Reorganization Planning: Learn from Caudovirales dissolution to plan future taxonomy changes
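
The thresholds above (50-300 species as the optimal range, more than 1,000 as the crisis zone) can be expressed as a simple classifier. This is a minimal sketch: the README defines only those two bands, so the labels for the remaining ranges and the function name are assumptions.

```python
def classify_family_size(species_count):
    """Label a family by species count using the thresholds in this README."""
    if species_count < 0:
        raise ValueError("species_count must be non-negative")
    if species_count > 1000:
        return "crisis"  # >1,000 species: reorganization recommended
    if 50 <= species_count <= 300:
        return "optimal"  # 50-300 species
    # The two labels below are hypothetical; the README does not name them.
    if species_count < 50:
        return "below optimal range"
    return "above optimal range"


# Hypothetical counts, for illustration only.
for count in (12, 150, 640, 1500):
    print(f"{count:>5} species: {classify_family_size(count)}")
```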

Viral Discovery Analysis

# Example: Load and analyze real ICTV growth data
import json

with open('research/family_size_analysis/results/family_size_analysis_basic.json', 'r') as f:
    data = json.load(f)

growth_data = data['growth_analysis']['growth_data']
for year_data in growth_data:
    print(f"{year_data['year']}: {year_data['species_count']:,} species")
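
The loop above prints raw counts. To see where growth accelerates, the same records can be turned into percentage changes between consecutive entries. This sketch assumes each record has the year and species_count keys used above; the sample records are hypothetical and are not ICTV data.

```python
def year_over_year_growth(records):
    """Return (year, percent change) pairs between consecutive records."""
    ordered = sorted(records, key=lambda record: record["year"])
    growth = []
    for previous, current in zip(ordered, ordered[1:]):
        change = current["species_count"] - previous["species_count"]
        percent = change / previous["species_count"] * 100
        growth.append((current["year"], percent))
    return growth


# Hypothetical records, for illustration only (not ICTV data).
sample = [
    {"year": 2005, "species_count": 100},
    {"year": 2006, "species_count": 110},
    {"year": 2007, "species_count": 165},
]
for year, percent in year_over_year_growth(sample):
    print(f"{year}: {percent:+.1f}%")
```

Passing growth_data from the example above in place of sample applies the same calculation to the repository's results file.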

Publication-Ready Visualizations

  • Growth Trajectory: 20-year exponential species growth with technology milestones
  • Acceleration Analysis: Technology-driven discovery periods with real growth rates
  • Caudovirales Timeline: Before/after visualization of largest viral taxonomy reorganization
  • Management Framework: Evidence-based family size guidelines with crisis zone identification

🔬 Research Methodology

Data Integrity Guarantee

  • Real Data Only: All analyses use exclusively documented ICTV Master Species List statistics
  • Zero Mock Data: Comprehensive elimination of any simulated or synthetic data
  • Source Verification: Every finding traceable to official ICTV publications
  • Validation Pipeline: Multi-stage verification ensuring research integrity

Research Progress

  • ✅ Phase 1: Family Size Analysis - COMPLETE
  • ✅ Phase 2: Temporal Evolution Analysis - COMPLETE
  • ✅ Phase 3: Discovery Method Evolution - COMPLETE
  • 🔄 Next Phase: Full MSL parsing for git repository creation
  • 📅 Future Work: Migration tools, semantic diffs, and community platform
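
The semantic diffs listed as future work are not implemented in this repository. As a rough illustration of the idea only, the sketch below compares two taxonomy versions, each represented as a hypothetical mapping from species name to family name, and reports additions, removals, and reclassifications. The data structure, function name, and sample names are all assumptions.

```python
def diff_taxonomy(old, new):
    """Compare two {species: family} mappings and summarize the changes."""
    old_species, new_species = set(old), set(new)
    reclassified = sorted(
        (species, old[species], new[species])
        for species in old_species & new_species
        if old[species] != new[species]
    )
    return {
        "added": sorted(new_species - old_species),
        "removed": sorted(old_species - new_species),
        "reclassified": reclassified,
    }


# Hypothetical taxonomy versions, for illustration only.
version_a = {"Virus A": "Family X", "Virus B": "Family X", "Virus C": "Family Y"}
version_b = {"Virus A": "Family X", "Virus B": "Family Z", "Virus D": "Family Y"}

for kind, entries in diff_taxonomy(version_a, version_b).items():
    print(kind, entries)
```

A real tool would also need to handle renames, splits, and merges across ranks, which a flat species-to-family mapping cannot express.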

πŸ› οΈ Current Project Structure

ICTV-git/
β”œβ”€β”€ research/                              # Research analysis modules
β”‚   β”œβ”€β”€ family_size_analysis/             # Phase 1: Family size analysis
β”‚   β”‚   β”œβ”€β”€ basic_analysis.py             # Core analysis with real ICTV data
β”‚   β”‚   β”œβ”€β”€ create_matplotlib_plots.py    # Publication visualizations
β”‚   β”‚   β”œβ”€β”€ results/                      # 4 plots + analysis data
β”‚   β”‚   └── REAL_DATA_FINDINGS.md         # Comprehensive findings
β”‚   β”œβ”€β”€ temporal_evolution_analysis/      # Phase 2: Temporal patterns
β”‚   β”‚   β”œβ”€β”€ temporal_analysis.py          # Multi-rank evolution analysis
β”‚   β”‚   β”œβ”€β”€ create_temporal_plots.py      # Growth and stability plots
β”‚   β”‚   β”œβ”€β”€ results/                      # 4 plots + temporal data
β”‚   β”‚   └── TEMPORAL_EVOLUTION_FINDINGS.md # Evolution findings
β”‚   β”œβ”€β”€ discovery_method_evolution/       # Phase 3: Discovery methods
β”‚   β”‚   β”œβ”€β”€ discovery_method_analysis.py  # Method contribution analysis
β”‚   β”‚   β”œβ”€β”€ create_discovery_plots.py     # Technology impact plots
β”‚   β”‚   β”œβ”€β”€ results/                      # 4 plots + method data
β”‚   β”‚   └── DISCOVERY_METHOD_EVOLUTION_FINDINGS.md # Method findings
β”‚   └── MOCK_DATA_ARCHIVE/                # Archived mock data (not used)
β”œβ”€β”€ manuscript_findings_v3.json           # Consolidated research findings
β”œβ”€β”€ CLAUDE.md                             # Project development guidelines
└── README.md                             # This file

🤝 Contributing

We welcome contributions to expand this research! Current priorities:

  • MSL Data Parsing: Help build parsers for all 18 MSL Excel files (2005-2024)
  • Git Conversion Tools: Create tools to convert parsed taxonomy into git repositories
  • Additional Analyses: Apply methodology to other viral taxonomy questions
  • Visualization Improvements: Enhance publication-quality plots and add new analysis types
  • Data Validation: Help verify and cross-check ICTV statistics across sources

📄 Citation

If you use ICTV-git analyses in your research, please cite:

@software{ictv-git,
  author = {Handley, Scott},
  title = {ICTV-git: Comprehensive Analysis of Viral Taxonomy Evolution},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  url = {https://github.com/shandley/ICTV-git},
  note = {Three-phase analysis of ICTV data (2005-2024): family size dynamics, temporal evolution, and discovery method impacts}
}

Research manuscript in preparation examining 20 years of viral taxonomy evolution through git-based analysis.

License

This project is licensed under the MIT License; see LICENSE for details. ICTV data is used under the Creative Commons license specified by ICTV.

Acknowledgments

  • International Committee on Taxonomy of Viruses (ICTV) for providing open access to MSL data
  • The virology community for feedback and use cases
  • Git and open source community for inspiration

Transforming viral taxonomy from static documents to dynamic, versioned data
